My large TrueNAS Scale ZFS pool had checksum errors on one of its four RAIDZ1 vdevs.
Everywhere I looked, the TrueNAS Scale “gurus” only suggested rebuilding the pool. I’m no expert in TrueNAS Scale, but I didn’t like that answer, so I kept digging. Here’s my situation:
I’m running a TrueNAS Scale system with a 65TB pool, spread across 12 disks of 12TB each. The setup is configured as 4x RAIDZ1 (four 3-disk vdevs), which gave me the 800 MB/s I was looking for on my 10Gb fiber network. And it works great! But here’s where I messed up: I didn’t read up properly before making changes, and since the array was empty at the time, I thought it would be fine to experiment. What I did was enable deduplication. Lesson learned: never, ever do that without knowing what you’re getting into.
I had 256GB of RAM, but even so, deduplication quickly ate up way too much memory. I disabled it almost a day after enabling it, but by that time, the damage was done. The problem I didn’t notice immediately was a checksum error. One of my RAIDZ1 groups (3 drives) in the pool consistently showed 2044 errors after every scrub, triggering a flood of alerts. When I checked the pool, the error wasn’t on a regular file. It was listed against Metadata 0x0, which I could not find anywhere. It seemed to tie back to deduplication metadata, even though deduplication was no longer enabled.
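For anyone wondering how deduplication can chew through that much RAM, here is a rough back-of-the-envelope sketch in Python. It is not a measurement from my system. It assumes the commonly quoted rule of thumb of about 320 bytes of memory per dedup table (DDT) entry and an average block size of 128 KiB; both numbers, and the function name `estimate_ddt_ram_bytes`, are assumptions for illustration only.

```python
# Rough estimate of the RAM a ZFS dedup table (DDT) might need.
# ASSUMPTIONS (not from the post): ~320 bytes per in-core DDT entry
# (a common rule of thumb) and a 128 KiB average block size.

DDT_BYTES_PER_ENTRY = 320        # assumed rule-of-thumb value
AVG_BLOCK_SIZE = 128 * 1024      # assumed average block size in bytes


def estimate_ddt_ram_bytes(unique_data_bytes,
                           avg_block_size=AVG_BLOCK_SIZE,
                           bytes_per_entry=DDT_BYTES_PER_ENTRY):
    """Return an approximate DDT size in bytes for the given unique data."""
    # One DDT entry per unique block; round up to whole blocks.
    blocks = -(-unique_data_bytes // avg_block_size)
    return blocks * bytes_per_entry


if __name__ == "__main__":
    for tb in (1, 10, 65):
        ram = estimate_ddt_ram_bytes(tb * 10**12)
        print(f"{tb:>3} TB of unique data -> ~{ram / 2**30:.1f} GiB of DDT")
```

Under those assumptions the table grows linearly with the amount of unique data written, and smaller blocks make it worse, so even a big-RAM box can feel the squeeze as the pool fills up.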
I tried everything: I replaced each drive in that group with a new one and swapped drives in and out for testing, but the errors simply showed up on the new drives too. I changed cables, improved cooling, even swapped the controller, and was just about ready to back everything up and start fresh with a new pool. That’s what all the “gurus” suggested, after all.
But then I started thinking about the checksums. I remembered that during an update to TrueNAS, the checksum algorithm had changed from SHA256 to SHA512. So I went into System Settings, navigated to Storage > Dataset > Edit Dataset (for the only pool I have) > Advanced, and changed the checksum from SHA512 back to SHA256. After the next scrub, guess what? All the errors disappeared.
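I did all of this through the web UI, but for those who prefer a shell, the same steps can be sketched with the standard `zfs` and `zpool` commands. The small Python wrapper below only builds and prints the commands by default (a dry run). The pool and dataset name `tank` is a placeholder, not my real pool name, and the helper names are made up for this example.

```python
import subprocess


def checksum_fix_commands(pool, dataset, algorithm="sha256"):
    """Build the CLI steps: inspect, change the checksum, scrub, re-check."""
    return [
        # Show the checksum algorithm currently set on the dataset.
        ["zfs", "get", "checksum", dataset],
        # Switch the checksum property (it applies to newly written blocks).
        ["zfs", "set", f"checksum={algorithm}", dataset],
        # Start a scrub; this returns right away and runs in the background.
        ["zpool", "scrub", pool],
        # Show scrub progress and any files or objects with errors.
        ["zpool", "status", "-v", pool],
    ]


def run(commands, dry_run=True):
    """Print each command; only execute it when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "tank" is a hypothetical name; substitute your own pool and dataset.
    run(checksum_fix_commands("tank", "tank"))
```

Leave `dry_run=True` until the printed commands look right for your setup, and run `zpool status -v` again once the scrub has finished to see whether the error count has cleared.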
Does that mean I’m still a noob when it comes to TrueNAS Scale? Maybe (probably!). I may not know much about TrueNAS Scale, but fixing this issue after nearly wiping my pool made me want to share it for anyone else in the same situation, especially if you messed with deduplication early on and later faced checksum errors in Metadata 0x0 that wouldn’t scrub clean.