Pool unhealthy, no idea what's going on: One or more devices has experienced an error resulting in data corruption

Having issues with a pool in TrueNAS Scale running on bare metal.

System hardware:

  1. HDDs of pool in question: Western Digital Red Pro 16TB x4 (connected directly to SATA ports on motherboard)
  2. Motherboard: ASUS ProArt Z790-CREATOR WIFI
  3. Memory: Corsair Vengeance - 96GB (2x48), DDR5, 5600 MHz
  4. CPU: Intel Core i9-14900K
  5. Power: Thermaltake Toughpower GF3 (1350 Watt) on UPS

1. Dataset Error on TrueNAS Scale v25.04.2.1
I have a dataset that I rarely unlock because it contains old archive data. A few days ago, I unlocked this dataset. When I navigated to it in a file manager, it was empty. When I clicked on the dataset in the Datasets tab of the web UI, I encountered the error: CallError. [EFAULT] Failed retreiving GROUP quotas for [pool]/[dataset]. I also noticed my ACLs for this dataset were gone.
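
For what it's worth, ZFS can report group quota and usage information directly from the shell, which helps separate a middleware problem from a dataset problem. A minimal sketch, with placeholder pool/dataset names, assuming the dataset is unlocked and mounted:

  sudo zfs groupspace -o name,used,quota POOL_NAME/DATASET_NAME
  sudo zfs userspace -o name,used,quota POOL_NAME/DATASET_NAME   # per-user equivalent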

I’ve seen others report this error, which may have been caused by migrating from Core to Scale. I haven’t done this, as I’ve always run Scale, but I did have to restore Scale from a backup in March 2025 after my boot drive failed.

I then noticed this Critical error in the web UI notifications: SMB shares have path-related configuration issues that may impact service stability. I haven’t changed SMB settings in quite some time.
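
For anyone hitting the same alert, one way to sanity-check it from the shell is to dump the Samba configuration that TrueNAS generates and confirm the share paths still exist; a rough sketch, with placeholder names:

  testparm -s 2>/dev/null | grep -E '^\[|path ='   # testparm ships with Samba and validates the active config
  ls -ld /mnt/POOL_NAME/DATASET_NAME               # does the share's path actually exist?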

Possible causes I was considering:

  1. Restoring TrueNAS from backup, and not having unlocked this dataset since then
  2. There was a pending update for v25.04.2.4, which fixes some SMB issues
  3. There are pending ZFS feature upgrades for all of my pools, although these appear to be related only to Fast Deduplication (see the sketch after this list)
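
A sketch of how to see what the pending pool upgrade would actually enable (pool name is a placeholder); with no arguments, zpool upgrade only lists pools with disabled features and changes nothing:

  sudo zpool upgrade                              # lists pools whose features are not all enabled
  sudo zpool get all POOL_NAME | grep feature@    # per-feature state: disabled / enabled / active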

Stopping and restarting the SMB service appeared to fix this problem, as I was able to unlock the dataset in question, the ACLs were restored to their original state, and I could see and open its files.

2. I then updated TrueNAS to v25.04.2.4

When I rebooted, the above issue with the archive dataset was gone, but I saw two new issues:

  1. Critical: Pool <pool> state is ONLINE: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
  2. I have a dataset that’s only used by the Syncthing app. After the first reboot from updating to v25.04.2.4, it was fine. On subsequent reboots, I now have to use the Force option to unlock this dataset, no clue why (a CLI unlock sketch follows this list).
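
For reference, the ZFS-level equivalent of unlocking and mounting an encrypted dataset looks roughly like this; all names are placeholders, this assumes a passphrase-encrypted dataset, and TrueNAS normally does this through the middleware, so treat it as a diagnostic sketch rather than what the UI’s Force option actually does:

  sudo zfs get keystatus,mounted POOL_NAME/SYNCTHING_DATASET   # is the key loaded? is it mounted?
  sudo zfs load-key POOL_NAME/SYNCTHING_DATASET                # prompts for the passphrase
  sudo zfs mount POOL_NAME/SYNCTHING_DATASET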

I rebooted a couple more times and ran a scrub on the pool in question, which reported 0 errors.
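
For completeness, starting and watching a scrub from the shell looks like this (pool name is a placeholder):

  sudo zpool scrub POOL_NAME
  sudo zpool status -v POOL_NAME   # shows "scrub in progress" with percent done, then the completion line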

When I run sudo zpool status in CLI, it shows:

  1. state: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
  2. errors: 1 data errors, use '-v' for a list

When I run sudo zpool status -xv and provide the admin password, it shows the same thing except: errors: List of errors unavailable: permission denied (WHY?)
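
For reference, the commands involved; -x only summarizes unhealthy pools, so asking for the specific pool by name (placeholder below), ideally from a full root shell as suggested in a reply further down, gives the most detail:

  sudo zpool status -xv             # summary of unhealthy pools only
  sudo zpool status -v POOL_NAME    # full per-device detail plus the error list
  sudo -i                           # or drop to a root shell first
  zpool status -v POOL_NAME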

Then I ran a Long SMART test on each HDD (spinning) in the pool in question, via the web UI, which presented another issue: The web UI isn’t showing me that the tests are running. Was this always the case? I can only see that these are running via CLI. These are still running now, and I’ll report back with findings, but I have a suspicion they will show no errors.
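
In case it helps anyone else, long-test progress and results can be checked from the shell (the device node is a placeholder):

  sudo smartctl -c /dev/sdX            # "Self-test execution status" shows percent remaining while a test runs
  sudo smartctl -l selftest /dev/sdX   # log of completed self-tests and their results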

I’m not looking for an exact diagnosis at this time, and I acknowledge more info will be needed, but can anyone first help me get TrueNAS to show me the 1 data errors it refers to in the CLI but won’t list when I enter my admin password?

Is there any way to get TrueNAS to forget/re-evaluate what it believes is corrupted? I see specific data errors listed in other users’ output, but not mine, which makes me wonder whether there is actually an issue here.
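
From what I understand, ZFS only drops entries from the permanent-error list once the affected blocks are gone and a subsequent scrub no longer finds them; there’s no switch to simply forget errors while the data is still bad. A hedged sketch of the usual sequence, with placeholder names:

  sudo zfs destroy POOL_NAME/DATASET_NAME@SNAPSHOT   # remove or restore whatever holds the damaged blocks
  sudo zpool clear POOL_NAME                         # reset per-device error counters
  sudo zpool scrub POOL_NAME                         # the error list is rebuilt from what the scrub finds
  sudo zpool status -v POOL_NAME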

I see no evidence of corruption, nor any issues with apps, VMs, replication, or Rsync backups.

Weird - any chance you can run it directly as root instead? Or directly on the system?

I swear I haven’t seen it shown as running in the GUI for a long time; I’ve always just checked progress in the shell…

Ouch - maybe part of the known instabilities? I’ve had to manually set voltage and frequency curves on my 13900k to get stability at basically stock; much less fun than overclocking for performance…


The Long SMART tests completed on each disk in the pool and report 0 errors.

I was able to get TrueNAS to show me the 1 data errors. The issue was that the dataset containing the errors was locked when I ran sudo zpool status. It’s disappointing it didn’t just tell me that, instead of claiming that the admin didn’t have permission.
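
A quick check for that state next time, before trusting the error list (names are placeholders):

  sudo zfs get keystatus,mounted POOL_NAME/DATASET_NAME   # keystatus "unavailable" means the dataset is locked
  sudo zpool status -v POOL_NAME                          # re-run once the dataset is unlocked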

Now it shows me 244 errors, which appear to be snapshots:

errors: Permanent errors have been detected in the following files:

        POOL_NAME/DATASET_NAME@auto-2025-09-14_04-00:<0x1>
...
        POOL_NAME/DATASET_NAME:<0x1>
...
        POOL_NAME/DATASET_NAME@auto-2025-02-16_04-00:<0x1>

All of the errors it reports are of the same syntax (POOL_NAME/DATASET_NAME@auto-YYYY-MM-DD_HH-MM:<0x1>) except this one: POOL_NAME/DATASET_NAME:<0x1>. Any idea what this indicates?

Sorting the errors chronologically, I see that every daily snapshot of this dataset is in the error list dating back to 2025-01-15, including the snapshot taken this morning. I’m curious why this morning’s snapshot is corrupt but none of the snapshots from other datasets taken this morning are. Perhaps the issue started with the 2025-01-15 snapshot and, since later snapshots reference the same blocks, necessarily carries through to all subsequent ones.

I deleted the corrupted 2025-01-15 snapshot and confirmed it’s removed from the list of errors.
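
Roughly what that looked like; names and the snapshot’s time suffix are placeholders:

  sudo zfs list -t snapshot -o name,creation -s creation POOL_NAME/DATASET_NAME | head
  sudo zfs destroy POOL_NAME/DATASET_NAME@auto-2025-01-15_04-00
  sudo zpool status -v POOL_NAME | grep 2025-01-15   # confirm it no longer appears in the error list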

I tried root but couldn’t switch to that user.

I could have sworn Scale used to show me these tests running in Running Jobs, at least on HDDs.

Possibly. I’ll look into it. I know I applied the motherboard updates to address the microcode issues related to voltage as soon as they came out. Of course, it’s possible some damage may have been done, but I’m not sure how to verify that.

Run Cinebench repeatedly (I enjoyed R23 since it would crash the fastest) or Prime95. Mine was so bad that it would be obvious in the first 30 seconds. Eventually it’d be solid for an hour. After I got it stable for ~12 hours, I decided it was good enough for a gaming rig.
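
On a headless SCALE box, something like stress-ng can play a similar role to Cinebench/Prime95, assuming it’s available on (or installable for) the system; a rough sketch:

  stress-ng --cpu 0 --timeout 1h --metrics-brief   # load every CPU thread for an hour and report brief metrics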


Interesting update:

  1. After deleting the oldest corrupted snapshot, I ran a scrub on the offending pool
  2. The scrub has completed with no errors
  3. The pool is now labeled as healthy by both the GUI and the command line: sudo zpool status -xv -> all pools are healthy

I still have no clue what happened. At no point were any ZFS or SMART errors identified. Will get back if the issue resurfaces.
