Checksum Errors On 1st Scrub

I created my first TrueNAS system in October and apparently didn’t run a scrub until now. First question - is there a way to view scrub history? ‘zpool status’ only shows the last scrub.

The scrub reported checksum errors, says it repaired 4.43M, and gave a list of corrupted files.

What should I do here? From reading around, it sounds like I should remove the corrupted files and run another scrub to see what happens. Or I should remove the files, manually clear the checksum errors, and run another scrub.

I’m not clear on whether I should clear the errors, and I don’t understand whether all the checksum errors are caused by the two corrupted files or if there’s more to it.

Specs & further info:

Bare metal
Corsair RM850x
Core i9-12900k
96GB DDR5 4000MT/s
ASUS Z790-V AX Prime
Intel Arc A380
HL-15 chassis + backplane (connects to LSI)
LSI 9305-16i (effectively cooled by HL-15 mid row of fans)
Intel x520-DA2 10G NIC
2x4TB WD Red CMR, 6x18TB EXOS (4 mirrored vdevs; this is the affected pool)
2x1TB NVMe SSDs (SN770 / SN850x - app pool)
1x1TB SATA SSD (dump pool)
2x256GB SATA SSDs (boot pool)

I copied about 40TB of files from another NAS to this system immediately after building it. I have 4 mirrored vdevs. Only two vdevs were in the system upon initial creation, the other 2 were added later.

8 disks total. 6 of the disks reported checksum errors in various amounts, with the highest being 13 and the lowest 1. Both disks in two of the vdevs reported errors, while the other two affected disks were split across two different vdevs, so not every affected vdev had errors on both of its disks (not sure if that’s relevant).

Many forum threads mention cabling or LSI cards as a cause of checksum errors, particularly when it happens to all disks and across multiple scrubs. I’m not sure that fits my scenario quite yet, but this is the only pool using the LSI card and the HL-15’s backplane. I have a number of Docker containers running on a separate NVMe SSD pool without issue, and I’d expect issues with the PSU, CPU, or memory to manifest in some way on that pool as well. So from a hardware perspective, the LSI card, backplane, or SAS cables would be my first guess. The LSI was purchased off eBay. I know these cards run hot, but the stock fans that came with the HL-15 run at 100% 24x7 and blast all my cards with good airflow.

Appreciate any advice!

When they were run? Yes. The results after each scrub? No.

zpool history <poolname> | grep scrub
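
Depending on the OpenZFS version, the internally logged events may also record scrub activity; a variation worth trying (I haven’t verified exactly what TrueNAS logs here, and the pool name is a placeholder):

zpool history -i <poolname> | grep -i scrub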

Your pool might be healthy again, assuming that those files/blocks were repaired and are no longer corrupted.

What is the actual output of this? (You can censor any private information.)

zpool status -v <poolname>

That’s concerning. You need to investigate why errors are hitting multiple disks spread across your pool/vdevs.

Was there any point in the past where a single disk was hitting errors, and it could have been tested or replaced before any other disks started showing signs?


What is the PSU? 8 spinners + 4 SSDs is notable.

This is not to say that the culprit is not the HBA, temperature, airflow, or cabling. It could very well be poor airflow and high temperatures on the card, even if you think it’s getting adequate cooling. (This one’s out of my wheelhouse, sorry.)

Thanks for the response Winnie!

PSU is Corsair RM850x.

No individual disks within a vdev were replaced after adding. But as mentioned, an initial pool was configured with a single vdev and the rest were added over the course of a month or so.

Are you aware of any correlation between number of checksum errors and corrupt files? For example, do checksum errors always follow corrupt files or should I view these as separate issues?

I’m also curious if checksum errors are uncommon in this scenario where I copied 40TB of files from another NAS. Perhaps files were already corrupt on the other NAS and this isn’t as alarming as it seems. But I guess that circles back to above regarding whether corrupt files and checksum errors typically appear in tandem.

See screenshots below of CLI output. Sorry for the multiple images, I only have VPN access on my phone at present. I omitted the listed corrupt files.

[Screenshots of zpool status -v output omitted.]

Thanks!

You could be looking at a deeper issue, in which some blocks are permanently corrupt on certain vdevs. Something’s going on. I highly doubt that 6 different HDDs are all showing signs of failure at the same time, independent of each other.

I would suspect a common denominator: cabling, HBA, temperature, and/or PSU.

You could try to run another full scrub, and see if it is able to repair the corrupted blocks with good copies from mirrored HDDs.

EDIT: :warning: This might not be the best idea, since a scrub will keep all drives reading nonstop, which might exacerbate the issue of a faulty HBA and/or card temperatures. You might want to address the underlying problem first.
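
For reference, if you do decide to scrub anyway, it can be started, paused, and stopped from the shell (pool name is a placeholder):

zpool scrub <poolname>        # start a scrub
zpool scrub -p <poolname>     # pause a running scrub (run zpool scrub <poolname> again to resume)
zpool scrub -s <poolname>     # stop/cancel a running scrub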


“Corruption” detected by ZFS is not the same as “file corruption”, per se. If your other NAS or computer has an album of photos stored on it, and then some of these files become “corrupt” or “mangled”, they will still be considered “good” when copied over to your ZFS server.

Why? Because ZFS will gladly receive this new data to be written, divide the files into “blocks”, create checksums for these blocks, and then write them to disk. Yes, even the “mangled” JPEG photos are considered “good”, since their checksums will be verified upon reading them or scrubbing the pool.

This.
Such widespread checksum errors suggest something other than the individual drives.


If I understand you correctly, you view this as a hardware issue since there is not a strong enough correlation between the corrupt files and number of checksum errors. Is that correct?

If so, I’ll take your advice and instead of scrubbing again, I will replace the HBA and all SAS cables, delete the corrupted files, and then re-scrub. This seems like a safer approach and will further isolate the issue if the errors persist. I happen to have an older model HBA and spare SAS cables, so that is not too much effort or cost.

A few additional notes:

Regarding HBA temps, there is a 3x120mm mid-plane fan wall sitting right in front of the PCIe cards. Cold air is blasting out of the back of the case right next to the cards. I know the HBAs run hot, same with the x520, and there have been no issues there. My rack is also in a cool basement. Not to say the card itself doesn’t have issues; it was an eBay purchase.

Given that this case has a backplane, that adds another variable. I also purchased shorter cables from Cablemod for the PSU > backplane connections for cable management purposes, so I’ll swap those to the original PSU cables as part of this.

As mentioned, there are other pools on this machine running a number of different container services that don’t connect to the backplane, HBA, or SAS cables. This leads me to believe things like memory and PSU would be the last to check.

Good point about how ZFS views corruption. I wonder if the corrupted files occurred while copying the 40TB over the network at 1Gbps. This was done from a Synology NAS using rsync over SSH.

Thanks again.

You don’t need to delete anything. You might get lucky and a subsequent scrub will repair those blocks, once all the drives “behave” in tandem.

You can do this later, if the same files/blocks show corruption, after addressing the other hardware-related issues.

When I had 6 spinning HDDs, the faulty PSU did not affect my 4 SSDs nor 5 of the 6 HDDs. Only 1 HDD was ever affected. When I shuffled my drives around, a different HDD would be affected. (Whichever drive was the last one on the SATA power daisy chain would suffer.)

Replacing the PSU immediately resolved this issue.

It might have nothing to do with your case, but it does show that a faulty PSU is hard to diagnose. The problems seem “random” or uncertain.


That would only be “corruption” in the sense of a discrepancy between the original file and its copy. ZFS has no knowledge of this. Whatever it receives, it checksums and writes to disk.

When ZFS alerts of corruption, it’s referring to blocks that were already written to disk, whose checksums no longer match what is expected. (This has nothing to do with the original files on Synology or files that might have been ruined from a failing NIC or network device.)
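
If you want to rule out a bad transfer, you could compare the copies against the originals with a checksum-only rsync dry run. The host and paths below are made up; substitute your Synology share and dataset, and note that this reads every file on both ends, so it will take a long time over 1Gbps:

# -n = dry run, -a = recurse/preserve attributes, -v = verbose, -c = compare file contents by checksum
rsync -navc user@synology:/volume1/share/ /mnt/tank/share/

Any file it lists has different contents on the two systems, regardless of what ZFS thinks of its own checksums.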

Thanks again for all the insight.

I ordered some new SAS cables and have a spare 9300-8i HBA I can throw in since I only have 8 drives currently.

Here’s the plan.

  • Replace HBA
  • Replace all SAS cables
  • Replace Cablemod cables w/original PSU versions
  • Reseat all drives on backplane
  • Scrub

One other question - I haven’t scheduled scrubs on the NVMe or SATA SSD pools because it seemed unnecessary, but would doing so be worth it to see if errors appear on those too? That would further point towards PSU / memory.

Corruption that cannot be detected by ZFS:
[diagram omitted]

Corruption detected (and possibly self-healed) by ZFS:
[diagram omitted]

ZFS can only deal with what is on the right half of each example. Everything on the left half is unrelated to what it can detect and repair.

Scrubbing SSD pools is just as important, and in fact those scrubs are faster and less stressful on the system.

A monthly scrub task should suffice.
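
If you don’t want to wait for a scheduled task, you can also kick off one-off scrubs by hand (the pool names below are placeholders for your SSD pools):

zpool scrub <app-pool>
zpool scrub <dump-pool>
zpool status -v               # with no pool name, this reports the status of every pool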


Remember to issue a zpool clear <poolname> before you start the new scrub.

What is the significance in running that command? I assume it clears the checksum errors, but is there a particular reason that’s important before running another scrub?

Thanks.

So that if the subsequent scrub (after addressing the hardware) finishes without errors, you’ll have a clean pool status. If you don’t clear the pool’s status, then the “errors” might be from earlier.
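
In other words, something like this once the hardware work is done, so the next report only reflects the new scrub (pool name is a placeholder):

zpool clear <poolname>        # clear the accumulated error counters and device states
zpool scrub <poolname>        # start a fresh scrub
zpool status -v <poolname>    # check progress, error counts, and any remaining corrupt files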

Does that impair the pool’s ability to repair the errors in any way?

For example, if I leave them and the hardware issue is fixed, wouldn’t this scrub repair them? If I clear them, does it still know to repair them?

Thanks.

No, it won’t impair ZFS self-healing.

The status of corrupt “files” is for you to review. To be pedantic, it’s only telling you which files are corrupt, not which actual blocks. (Many files, especially larger ones, are made up of multiple blocks.)

So that status report is only for you. If you clear it, and the same blocks remain corrupted and unrepairable, it will print them on the pool’s status again.

If you want to keep a text file of the current status readout, you can do something like this:

zpool status -v <poolname> > /path/to/poolstatus.txt

You should run long SMART tests on all your disks and review the results.

Sometimes cabling/power/HBA glitches will show up as UDMA CRC errors.

Other times errors will show up as reallocated/pending sectors etc.

Basically, you’re looking for a smoking gun. The smart tests may help to diagnose the errors as either disk related or something else.
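
As a rough sketch with smartmontools (device names are placeholders, and SAS drives or drives behind some HBAs may report attributes differently):

smartctl -t long /dev/sdX     # start a long self-test; this can take many hours on 18TB drives
smartctl -a /dev/sdX          # once it finishes, review the attributes and self-test log
# look for attributes like UDMA_CRC_Error_Count, Reallocated_Sector_Ct, and Current_Pending_Sector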


After convincing myself that the HBA temperature wasn’t the issue, I decided to investigate again before shutting my server down for maintenance. I found it hotter to the touch than expected. Instead of replacing the card, I re-pasted the heatsink, and mounted a small Noctua fan pointing down directly above it.

The existing paste was like a rock; I had to scrape it off the heatsink. While researching the re-paste process, I saw other forum threads where doing this solved checksum errors for people.

After powering on, I went to run the zpool clear command on the pool before initiating a new scrub, but all errors were gone. I assume this happens after a reboot?

I’ve opted to troubleshoot things one at a time to pinpoint the issue, rather than changing all the previously mentioned items at once. So for now, I re-pasted, reseated the card & cables, and installed the Noctua fan. I also checked all physical components & cabling, and reseated all drives.
