Solved (for me) Metadata 0x0 checksum on ZFS, without rebuilding!

My large Truenas Scale ZFS pool had checksum errors on one of the 4xRaidZ1 VDEVs in the pool.

Everywhere I looked, the Truenas Scale “gurus” only suggested rebuilding the pool. I’m no expert in Truenas Scale, but I didn’t like that answer, so I kept digging. Here’s my situation:

I’m running a Truenas Scale system with a 65TB pool, spread across 12 disks of 12TB each. The setup is configured as 4x RaidZ1, which gave me the 800MBps I was looking for on my 10Gb fiber network. And it works great! But here’s where I messed up: I didn’t fully read before making changes, and since the array was empty at the time, I thought it would be fine to experiment. What I did was enable deduplication. Lesson learned: never, ever do that without knowing what you’re getting into.

I had 256GB of RAM, but even so, deduplication quickly ate up way too much memory. I disabled it almost a day after enabling it, but by that time the damage was done. The problem I didn’t notice immediately was a checksum error. One of my RaidZ1 groups (3 drives) in the pool consistently showed 2044 errors after every scrub, triggering a flood of alerts. When I checked the pool, the error was reported against a “file” called Metadata 0x0, which I could not find anywhere, but which seemed to tie back to metadata from the deduplication I had since disabled.
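For anyone hitting the same thing, this is roughly how it showed up for me on the command line (the pool name here is a placeholder):

```
zpool status -v tank
  ...
errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>
```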

I tried everything: I replaced each drive in that group with new ones, swapped them in and out for testing, but the errors followed the new drives. I changed cables, improved cooling, even swapped the controller, and was just about ready to back everything up and start fresh with a new pool. That’s what all the “gurus” suggested, after all.

But then I started thinking about the checksums. I remembered that during an update to Truenas, the checksum algorithm had changed from SHA256 to SHA512. So I went into System Settings, navigated to Storage > Dataset > Edit Dataset (for the only pool I have) > Advanced, and changed the checksum from SHA512 back to SHA256. After scrubbing, guess what? All the errors disappeared.
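For anyone who prefers the shell, my understanding is that the UI edit boils down to something like this under the covers (pool/dataset name is a placeholder):

```sh
zfs get checksum tank          # show the current algorithm and where it is set
zfs set checksum=sha256 tank   # only affects blocks written after the change
```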

Does that mean I’m still a noob when it comes to Truenas Scale? Maybe (Probably!). I may not know much about Truenas Scale, but fixing this issue after nearly wiping my pool made me want to share this for anyone else in the same situation—especially if you’ve messed with deduplication early on and later faced checksum errors in Metadata 0x0 that wouldn’t scrub clean.

1 Like
  1. The default checksum type (at least on my system[1]) for a root dataset is “on”, which means “fletcher4” for non de-dup pools and “SHA256” for de-dup pools, and the default for sub-datasets is “inherit”. But if you migrated the pool / dataset from an earlier version, it might have been set to SHA512, or it might have been set manually. (You can check what each dataset actually uses with the commands sketched after this list.)

    Only SHA256 or SHA512 (or “skein” or “blake3”) are supported for de-dup, so if it wasn’t set to one of these before you enabled de-dup, then ZFS would presumably change it to one of these when you did enable it.

    See: Checksums and Their Use in ZFS — OpenZFS documentation

    EDIT: I found an old CORE community post that says that for performance reasons the TN UI selects SHA512 when you turn de-dup on.

    [1] The documentation says that ZFS microbenchmarks the best default at boot time. But the two results files (one for fletcher4 and one for all the others) cannot be compared with each other, so I have no idea what my microbenchmark results are or what ZFS will use as a default. What I do know is that all my pools, including the bootpool, use checksum=on.

  2. “The block checksum is calculated when the block is written, so changing the algorithm only affects writes occurring after the change.” Since this is a dataset block-level setting, this would presumably apply also to metadata blocks related to that dataset (and any sub-datasets that have “inherit” set).

    I am therefore somewhat sceptical that simply changing the type (which presumably equates to the TrueNAS middleware issuing a zfs set checksum=XXXXX pool_name/dataset_name under the covers) would actually fix this.

    Equally I would be surprised that Scrub would think it was fixed if in reality it wasn’t.

    The only explanation I can think of is that the metadata block that was corrupted was the one holding the dataset definition, and that when you edited the definition and resaved it, a new metadata block was written and the old one now only exists in any snapshots you may have. IF this is the case, then I think it was only the extreme coincidence that this happened to be the corrupt metadata block that made this fix work.

  3. I can understand why people say that rebuilding the pool from scratch is the only way to fix it, but that may simply be because the Scrub output is somewhat vague about which metadata block is corrupted. The vast majority of metadata blocks are associated with a specific dataset, so IF you can work out which dataset has the broken metadata block, you should be able to create a new dataset, copy the data (without block cloning) to the new dataset, and then delete the old one. This might well resolve the issue without needing to destroy the pool and rebuild it from scratch from backups. But so far, I haven’t seen anyone show how to work out which dataset a metadata block applies to.
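To see what each dataset actually has set, and whether the value is local, inherited, or a default, something along these lines works (pool name is a placeholder):

```sh
# Show the checksum property for every dataset in the pool, plus where it comes from
zfs get -r -o name,property,value,source checksum tank

# And check whether dedup is (still) enabled anywhere
zfs get -r dedup tank
```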

3 Likes

Dedup takes about 5 GB of RAM per TB of deduped data, so your 65 TB pool could easily overflow a seemingly comfortable 256 GB RAM.
Dedup is a resource hog.
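Back-of-envelope, using that rule of thumb (a very rough estimate; the real dedup table size depends on record size and how much data is actually deduped, and “tank” is a placeholder pool name):

```sh
echo "$(( 65 * 5 )) GB"   # ~325 GB worst case for a fully deduped 65 TB pool, vs 256 GB installed

# The actual dedup table (DDT) statistics for a pool can be checked with:
zpool status -D tank
```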

Note that any already-deduped data is still deduped; to get rid of dedup you need to delete the corresponding datasets and restore from backup, or replicate to a non-dedup dataset (untick “Full file system replication”).
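On the command line, the replication route looks roughly like this (dataset names are placeholders; a plain send without -R/-p is what keeps the dedup property from coming along):

```sh
zfs create -o dedup=off tank/restored                       # target with dedup explicitly off
zfs snapshot tank/data@undedup                              # snapshot the deduped dataset
zfs send tank/data@undedup | zfs recv tank/restored/data    # data is rewritten without dedup on receive
```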

“metadata” is not any user-accessible file, it is internal ZFS stuff and there’s no easy way to know what it relates to. If there are also errors in actual files, it might be possible to clear the errors by deleting the files and all snapshots which contain them (a rough sketch follows below). But the only foolproof way to clear errors is to destroy and restore from backup.
“ZFS gurus” are not naughty by nature. :hand_with_index_finger_and_thumb_crossed:
And advice is not bad merely because you don’t like it.
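For the file-level case, the outline is roughly this (paths and snapshot names are placeholders, and none of it helps with <metadata> entries):

```sh
zpool status -v tank                       # lists the affected file paths
rm /mnt/tank/data/affected.file            # delete the live copy
zfs list -t snapshot -r tank/data          # find snapshots that may still reference it
zfs destroy tank/data@example-snapshot     # remove those snapshots
zpool scrub tank                           # re-scrub; errors should clear once nothing references the bad blocks
```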

You might have encountered some bug where ZFS does not properly record a change of default checksum algorithm. (Also note that dedup relies on checksums, so messing with algorithms after the fact will reduce dedup efficiency.)

Some more unasked-for advice you may not like: Your pool has 8 drives’ worth of data and 4 drives’ worth of parity, yet losing one drive would put your entire pool at risk if another incident occurred in the degraded vdev before the resilver completes. 2 * 6-wide raidz2 would give you the same storage efficiency (8:4) and throughput but better resiliency (you can lose one drive without risk, and any two drives as long as no further incident occurs while resilvering), at the cost of half the IOPS. However, the only way to change that is… you guessed it.
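For illustration only, the suggested layout would be created roughly like this (pool and device names are placeholders; and as said, getting there from an existing pool means destroying and recreating it):

```sh
zpool create tank \
  raidz2 sda sdb sdc sdd sde sdf \
  raidz2 sdg sdh sdi sdj sdk sdl
```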

2 Likes

HA! I 100% agree! I am an IT systems and network engineer so I just don’t like going to the last option before trying more things. I got lucky here, but thought I would pass the info along since I didn’t see anyone posting good info about it while I went through this and scrubbed for months on new drives.

Ah, I didn’t address this, but I used a rebalancing script that worked great to fix my deduped files. So… maybe that helped with the metadata issue once the 8 GB or so of deduped data was spread out and re-written without dedup. I turned dedup on, then turned it right off, but didn’t see this problem coming until after I had messed with it AND not rebuilt the pool (which I should have done, and which likely would have avoided my checksum issue before it ever started).

1 Like

I appreciate your thoughtful reply here, and wanted to thank you and respond to this part about throughput. In the older forums, which are now read-only, the subject of throughput was tossed around quite a bit. The common answer was that the vdev count, with each vdev performing roughly like a single drive, decides the total throughput of a pool. I have found that to be true in testing. And since I have a 10GbE network, I could not achieve the 1200MBps +/- that a 10Gbps network would allow (on paper). In my setup, I have 12 drives and 2 hot spares. A great test for me involved mirrored vdevs in a pool, so 6 drives’ worth of stacked throughput at 200MBps +/- per drive. But it did not provide enough storage for my projects. Or so my brain at the time had me believing. In hindsight, that would have been best, because I have other places to store my stupid space-hungry data. My most important data is rsync-replicated to other storage.

But with 4 vdevs I do still get 800MBps, which is not bad and lets me have 65-ish TB attached to my systems with the same throughput (albeit MUCH lower IOPS) as a RAID0 of SATA 3 SSDs. My understanding, based on my testing and reading the old forum, is that a 2 x 6-wide RaidZ2 pool would equate to only about 400MBps of throughput in my scenario, where each drive delivers 200MBps. Please educate me on this if that is incorrect. My testing with different vdev layouts indicated it was a reliable calculation, but I did not set up a 2 x 6-wide RaidZ2 pool in my direct testing.
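For anyone wanting to sanity-check raw pool throughput locally, with the network taken out of the picture, a simple sequential fio run is one way to do it (the path and sizes are just placeholders, and ZFS caching/compression can skew the numbers):

```sh
fio --name=seqwrite --directory=/mnt/tank/fio-test \
    --rw=write --bs=1M --size=8G --numjobs=1 \
    --ioengine=posixaio --end_fsync=1
```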

IMO this would NOT cause a Scrub to change its results.

A scrub reads each block (which includes a flag telling ZFS which checksum algorithm was used for that block when it was written) and then checks that the block and its checksum match using that algorithm.

Changing the dataset checksum algorithm only applies to blocks written after the change was made. So I cannot see how changing this would have made a metadata block correct itself.

The only explanation which makes sense is that the metadata block was one that didn’t need to be read in order for it to be rewritten, and the only one that comes to mind is the metadata block that is rewritten by ZFS when the middleware issues zfs set commands after you edit the dataset checksum type, i.e. the metadata block holding the dataset details. This block is not read by the middleware to get the current checksum type - that info is held in a separate and parallel TrueNAS UI database as far as I am aware.

Thank you for the reply here. I hope it also helps others besides me as we share information. I didn’t mention until a response today that I fixed some of my deduplication, or rather removed it, by using a rebalancing script, which just copies files, deletes the originals, then renames the copies back to the original names, and so on. (I say that for anyone who is not aware what rebalancing is; I’m sure you already know.) So that does not solve the mystery, but it may shed light on it: the checksums would have been recalculated during the rebalancing of about 30TB of data as it found and rewrote my deduped files. That may have been it, and not the scrub at all. But the scrub was needed to confirm the checksums, so until it completed, the pool still showed unhealthy.

I can’t be the only silly person who enabled dedup thinking it was a panacea, without reading enough about it here (!!!), then turned it off but did not run a backup and rebuild a nice new pool, and then later found this metadata 0x0 “file” show up with checksum errors. I counted about 9 different folks asking about metadata 0x0 and checksum errors that even survived drive replacements in the vdevs, but nobody ever indicated they had tested dedup, turned it off, and not rebuilt. I’ll own that. I hope that bit helps someone who has not yet filled their pool and can rebuild before running into issues later. Rebalancing via script takes as long or longer than it would have taken (in my case) to just back up, rebuild a fresh pool, and restore the data. The gurus were correct! That is my real message here LOL.
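For anyone curious, the general shape of the rebalancing approach is something like this (not my exact script; test it on a small directory first, and remember that snapshots will keep holding the old blocks):

```sh
#!/bin/bash
# Rewrite every file in place so its blocks are written again with the
# pool's current settings (no dedup, current checksum). Path is a placeholder.
find /mnt/tank/data -type f -print0 | while IFS= read -r -d '' f; do
    cp -a --reflink=never -- "$f" "$f.rebalance" &&   # make a real copy (new blocks, no block cloning)
    mv -- "$f.rebalance" "$f"                         # replace the original with the fresh copy
done
```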

It is possible that:

  1. The bad metadata block wasn’t needed to get to files; and
  2. When the rebalancing script removed the old copy of files, the bad metadata block was deleted in the cleanup.

But this theory requires you to be able to read/copy and delete files that are related to a bad metadata block without getting a hard file-access error. Metadata block handling is complicated, though, so I guess it is possible, especially if there were no files reported as having bad blocks. I imagine this is possible if e.g. the checksum on the metadata block was bad but the data in the block was good, and if ZFS decided that, when it read the metadata block and found the checksum mismatched, it would nevertheless use the data in the metadata block and see if that allowed it to get at the file you wanted.

Perhaps I can see why: a scrub will report an error with e.g. a minor attribute of a file which cannot be fixed, whereas a read during a rebalancing script is probably only interested in getting at the data, so if ZFS still has what it needs to read the data even if an attribute is screwed, then it will probably do that.

1 Like

Ah, that did the trick of removing dedup. And the metadata error (bug or not… blasphemy!) was related to dedup.

Vdev count is related to IOPS. A vdev, of any kind, has the IOPS of a single drive.
For throughput, on large transactions, type and width matter. Roughly:
n-way mirror, read n, write 1
n-wide raidzp, read and write n-p
So my expectation would be that 2*6Z2 provides roughly the same throughput (on large files) as 4*3Z1, with half the IOPS (so worse performance on multiple smaller transactions) but better resiliency.
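A quick back-of-envelope check of that rule, just counting data disks per layout:

```sh
echo "4 x 3-wide RAIDZ1: $(( 4 * (3 - 1) )) data disks"   # 8
echo "2 x 6-wide RAIDZ2: $(( 2 * (6 - 2) )) data disks"   # 8, so roughly the same large-file throughput
```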