Checksum errors without any identifiable data corruption

Hi All,

Recently my TrueNAS 24.04 system started to show checksum errors on a pool.

$ sudo zpool status pool0 -v
  pool: pool0
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 05:03:32 with 0 errors on Mon Dec 22 23:32:01 2025
config:

	NAME                                      STATE     READ WRITE CKSUM
	pool0                                     ONLINE       0     0     0
	  raidz1-0                                ONLINE       0     0     0
	    14fd5317-e933-4fbb-90e1-723242e4d98b  ONLINE       0     0 4.20K
	    711697c4-77ef-4e4a-b97a-800e77418c93  ONLINE       0     0 4.20K
	    12f20257-f9bd-400d-a4c7-5e7e5cb63a41  ONLINE       0     0 4.20K

errors: List of errors unavailable: no such pool or dataset

However, no file errors are listed. Instead, a "no such pool or dataset" error is displayed.

Symptoms and history:

  • No visible issues at all. Everything seems to be working perfectly fine. I just have this ever-increasing number of checksum errors.
  • The first error appeared during a resilvering after I replaced an old drive with a new one and also copied a new file to one of the datasets. This file was then marked as corrupted.
  • After the resilvering finished, I tried removing and restoring the file. This did not clear the error, so I ended up removing the entire dataset and restoring the files from backup. Since then I have had the error displayed above.

My fix attempts so far:

  • I tried recreating the entire dataset that held the single file that was marked as corrupted.
  • I replaced every SATA cable with a brand new one. (Since I always see the exact same number of errors on all of the drives, I did not expect too much from this attempt.)
  • Ran a 48+ hours memtest. 0 errors.
  • Cleared the error count with zpool clear and ran multiple scrubs. (I read on the ZFS GitHub project support that the first scrub does not always clear checksum errors.)

I’m afraid I have a metadata problem and have no choice other than destroying and recreating the pool. Since this would pretty much mean a complete NAS rebuild and I have a large number of apps, I’m afraid the rebuild would take several days, if not weeks, of work. I have the following questions:

  • Is there anything else that I could try to fix my pool before wiping it?
  • If I have to wipe, what would be the best way to do this to keep it as simple and painless as possible? Can you point me to documentation of a similar scenario?
  • I am thinking about temporarily migrating my apps pool to the “backup” pool until I recreate and restore the main data pool. Should that work? Does TrueNAS copy the /mnt/.ix-apps/ folder and all app configurations when I set a different pool under Apps → Configuration → Choose pool?
  • Which configuration settings will I lose when I destroy the pool? For example, will the export settings remain but with an error, or will they be automatically wiped as well? Same for data protection settings. Will all configuration be automatically wiped, or will it be preserved with errors?
  • In case nothing is preserved, what TrueNAS config can I back up before clicking wipe for a faster recovery / reconfiguration? For example, what does the backup file that is dumped before regular upgrades contain? Can I use this for a quicker reconfiguration?
  • Is there an ultimate restore guide that I could consult for learning as much as possible and plan my next move before I destroy anything?

Thank you for any kind of help.

I haven’t seen that before. I’m wondering if the 4.2KiB checksum count across the board is a historical log entry for the one failed file that was replaced, and since the file doesn’t exist anymore it is just a black mark on the pool with no real impact.

I’m guessing there are no active alerts/alarms on the system? If there are none, I’d personally just live with it; that sounds easier than remaking the pool.

The problem is that the number of checksum errors seems to be increasing. Today zpool status looks like this:

$ sudo zpool status pool0 -v
[sudo] password for sapo: 
  pool: pool0
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 05:03:32 with 0 errors on Mon Dec 22 23:32:01 2025
config:

	NAME                                      STATE     READ WRITE CKSUM
	pool0                                     ONLINE       0     0     0
	  raidz1-0                                ONLINE       0     0     0
	    14fd5317-e933-4fbb-90e1-723242e4d98b  ONLINE       0     0 7.15K
	    711697c4-77ef-4e4a-b97a-800e77418c93  ONLINE       0     0 7.15K
	    12f20257-f9bd-400d-a4c7-5e7e5cb63a41  ONLINE       0     0 7.15K
	spares
	  19f61f8b-8d4d-454f-897a-712af481dd64    AVAIL   

errors: List of errors unavailable: no such pool or dataset

I guess I’ll wait a few more days to see if this trend stops or not.

Meanwhile, it would be nice to find TrueNAS documentation that describes a rebuild scenario where the “main” pool must be destroyed and re-created.

It’s not 4.2KiB. It’s 4,200 checksum errors. The next post by @sapo shows this count increased to 7,150 errors.

Because there are so many errors at once, and they affect multiple drives simultaneously with the same count, this is very likely an issue with an HBA controller.
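
To make the "same number on every drive" check less of an eyeball exercise, here is a minimal Python sketch that reads the config table from pasted `zpool status` output and compares the CKSUM column across devices. The helper names (`parse_count`, `cksum_by_device`) are made up for illustration, and the `K` = 1,000 multiplier follows the reading used in this thread; that is an assumption, since the abbreviation may be scaled differently, which is why `base` is adjustable. Running `zpool status -p` should print exact counts and sidestep the question.

```python
from decimal import Decimal

# Sample rows copied from the second status output in this thread.
SAMPLE = """\
NAME                                      STATE     READ WRITE CKSUM
pool0                                     ONLINE       0     0     0
  raidz1-0                                ONLINE       0     0     0
    14fd5317-e933-4fbb-90e1-723242e4d98b  ONLINE       0     0 7.15K
    711697c4-77ef-4e4a-b97a-800e77418c93  ONLINE       0     0 7.15K
    12f20257-f9bd-400d-a4c7-5e7e5cb63a41  ONLINE       0     0 7.15K
spares
  19f61f8b-8d4d-454f-897a-712af481dd64    AVAIL
"""

# Exponent applied to `base` for each suffix (hypothetical helper table).
SUFFIXES = {"K": 1, "M": 2, "G": 3, "T": 4}


def parse_count(text, base=1000):
    """Turn '7.15K' into an integer. base=1000 is an assumption, see above."""
    text = text.strip()
    if text and text[-1] in SUFFIXES:
        return int(Decimal(text[:-1]) * base ** SUFFIXES[text[-1]])
    return int(text)


def cksum_by_device(status_text, base=1000):
    """Map each vdev name to its CKSUM count from the config table."""
    counts = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_table = True
            continue
        if fields and fields[0] == "errors:":
            break
        # Only rows with all five columns carry error counters;
        # the spare rows have fewer fields and are skipped.
        if in_table and len(fields) == 5:
            counts[fields[0]] = parse_count(fields[4], base)
    return counts


counts = cksum_by_device(SAMPLE)
nonzero = {name: n for name, n in counts.items() if n > 0}
same_everywhere = len(set(nonzero.values())) == 1

for name, n in nonzero.items():
    print(name, n)
print("identical counts on all affected devices:", same_everywhere)
```

Identical non-zero counts on every member disk point at something the disks share (controller, backplane, power, RAM) rather than at any single drive, which is the reasoning behind the hardware questions here.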

What hardware do you have? Are you using an HBA? A “SATA card”?


My motherboard is a Supermicro M11SDV-8C-LN4F

The boot drive is an M.2 SSD. I connect the data disks to the motherboard’s SATA ports through a hot-swap backplane that came with the case, which is a regular Mini-ITX enclosure.

Check that.

which one? :roll_eyes:
Proper hot-swap is not a common feature with consumer hardware.