Scrub reports corruption but file is perfect

TrueNAS CORE 13.0-U6.7

I’m booting from a new, high-quality SanDisk USB stick. It’s been in use for MONTHS with no problems. The System Dataset is set to a hard disk array, NOT the boot pool, so the stick gets very little write activity.

Suddenly, a scrub of boot-pool reports data corruption in a particular file. So I check the sha256sum of the file in question and compare it against the sha256sum of the same file from a freshly downloaded TrueNAS image. They are identical.
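
For the curious, the comparison amounted to roughly this (the second path is just wherever you unpack the fresh image, so treat it as illustrative):

    sha256sum /usr/local/lib/libopenblasp-r0.3.18.so
    sha256sum /path/to/extracted-image/usr/local/lib/libopenblasp-r0.3.18.so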

With that in mind, I run a zpool clear on the boot-pool. Then re-scrub. Same exact error, same file again:

root@freenas:/usr/local/lib # zpool status -v boot-pool
  pool: boot-pool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:03:07 with 3 errors on Wed Feb 12 10:57:52 2025
config:

	NAME        STATE     READ WRITE CKSUM
	boot-pool   ONLINE       0     0     0
	  da0p2     ONLINE      24     0    50

errors: Permanent errors have been detected in the following files:

        boot-pool/ROOT/13.0-U6.7@2024-09-25-13:58:19:/usr/local/lib/libopenblasp-r0.3.18.so

Same file, identical SHA256 check.

Does this indicate that it is the checksum that is incorrect? If so, what should I do?

So I fixed my own problem! Guessing that the problem might be corruption of the checksum rather than the file, I copied the original file, deleted the original, and renamed the copy in order to force the (re)creation of a new checksum. I checked that the checksum and permissions were identical, then ran a “zpool clear boot-pool” and a rescrub of the boot pool. The problem was solved!
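
In case it helps anyone, the sequence was roughly the following, run from /usr/local/lib (the .new name is just how I’d sketch it, not necessarily verbatim), followed by the clear and rescrub shown below:

    cp libopenblasp-r0.3.18.so libopenblasp-r0.3.18.so.new
    sha256sum libopenblasp-r0.3.18.so libopenblasp-r0.3.18.so.new    # confirm the two are identical
    ls -l libopenblasp-r0.3.18.so*                                   # confirm permissions match
    rm libopenblasp-r0.3.18.so
    mv libopenblasp-r0.3.18.so.new libopenblasp-r0.3.18.so
    zpool clear boot-pool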

root@freenas:/usr/local/lib # zpool scrub -w  boot-pool
root@freenas:/usr/local/lib # zpool status -v boot-pool
  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:03:09 with 0 errors on Wed Feb 12 17:31:30 2025
config:

	NAME        STATE     READ WRITE CKSUM
	boot-pool   ONLINE       0     0     0
	  da0p2     ONLINE       0     0     0

errors: No known data errors

So now you know what to do should you have a checksum corruption!

Just so I am clear: you copied the file that was said to be corrupt, then deleted the original, and lastly put the copy in its place and verified that the checksum was now good?

If that is what you did, you still have a corrupt file. You need to replace the file with a new file, not an exact copy of the original corrupt file.

No, Joe - although the message said the file was corrupt, I downloaded a new copy of the same TrueNAS version, opened the file archive, and computed the sha256sum of the file inside it. I then compared that checksum value with the sha256sum of the file TrueNAS said was corrupted - the checksums were identical. In other words, it wasn’t that the file contents were corrupt; it was that the ZFS checksum value for that file was corrupt. So copying the local file forced a recomputation of the checksum. This is completely valid because I proved that the file contents were NOT corrupted.

Yes, copying over a known good file can fix a corrupted file, (or ZFS checksum). I’ve done that specific fix on Solaris ZFS, (that poor server was ignored too long before I arrived…).

If the checksum was corrupt, it seems like a memory bit flipped. Not sure of the details, but you are likely using Non-ECC RAM.

With the widespread use of ZFS, both in TrueNAS and elsewhere, seeing odd behavior due to non-ECC RAM is starting to become a thing. (Of course, with some other file systems, you might never know about a corrupt file… unless you needed it, and needed it perfect.)

The only thing is that I am using ECC RAM - it was supplied as ECC RAM by HP - it’s an HP MicroServer.

More likely is that there has been on-disk rather than in-memory corruption. What’s interesting is that the pre-fix status shows multiple errors in both the READ and CKSUM columns. Fixing the checksum on just one file seems to have reduced both of these figures to zero. This seems to indicate that while only one checksum was faulty, the scrub readout shows the impact across the entire Merkle tree.

Interesting.

In theory, a bad checksum in metadata is non-impacting. ZFS by default keeps 2 copies of standard metadata, so it should have been able to recover 100% completely without user intervention by reading the redundant copy and then re-writing the faulty metadata to complete the recovery.
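
If anyone wants to see how their datasets are configured, the relevant property can be checked with something like this (“all” is the default):

    zfs get redundant_metadata boot-pool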

I’ve personally seen this in action. My media server has 2 storage devices; I take a small piece from each for the OS pool, then stripe the rest for the media with no data redundancy (but I have good backups). Every now and then I lose a media file (generally a larger video file). Once I saw a “corrected” error and was speechless for weeks. At least until I remembered about the redundant metadata.

For both copies of the metadata to have had bad checksums for a child file, I would think the corruption would have had to happen in memory.

Of course, there is the potential for a bug in OpenZFS. It is under active development, with new features being added that can / could cause fully functional older code to become broken in odd ways. (See the block cloning bug…)

That is much better than I imagined. Most people will not take it to that level; glad there are people like you who will. Speaking for myself, I would have just deleted the old file, put on a new copy, and called it a day. I can be lazy at times.

As for why it happened, we may never know. It could have been many different things. If you have the time, though, I would run some stress tests on the machine if this type of thing happens again. Given how deep you are willing to go to troubleshoot a problem, I suspect you would do that anyway.
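
For what it’s worth, what I’d reach for is something like an overnight memtest86+ boot for the RAM, plus a long SMART self-test on the boot stick, assuming its USB bridge passes SMART through (many don’t):

    smartctl -t long /dev/da0    # start the long self-test on the stick
    smartctl -a /dev/da0         # check the result once it finishes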

Good luck!