ZFS Scrubs with no ECC (what could possibly go wrong?)

thomas-hn · January 13, 2025, 9:06am

Hello,

I know the ECC vs. non-ECC topic is a very sensitive one here in the forum.
But I would like to understand exactly how a scrub works on ZFS, so I can make my own judgement on whether ECC is really needed (I had a debate with a colleague some days ago; I argued for ECC, he argued against).

Let’s assume ZFS does a scrub over all the data on a pool and at least one of the Non-ECC RAM modules has some permanently faulty bits.

My current understanding of a scrub is:

1) ZFS reads a block of data (including a checksum) from the pool.
2) This information is stored into RAM.
3) ZFS calculates the checksum over the stored data in RAM.
4) ZFS compares the newly calculated checksum with the existing one in RAM (the one read from the pool).
5) If both checksums are fine, nothing to be done. Otherwise, ZFS reads the same data from another (redundant) location, copies this data into RAM, does some verification via checksum and replaces the faulty data with this redundant copy.
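
The steps as listed can be sketched as a toy model. This is only an illustration of the flow described in the post, not ZFS's actual code: the names `checksum` and `scrub_block` are hypothetical, SHA-256 stands in for whatever checksum the pool uses, and each "disk copy" is just a `bytearray`.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    """Stand-in checksum (SHA-256); an assumption made for this sketch."""
    return hashlib.sha256(data).digest()


def scrub_block(copies: list[bytearray], stored_checksum: bytes) -> str:
    """Toy scrub of one block that has several redundant copies.

    Returns "ok", "repaired", or "unrecoverable".
    """
    good = None
    bad = []
    for index, copy in enumerate(copies):
        in_ram = bytes(copy)                      # steps 1-2: read into RAM
        if checksum(in_ram) == stored_checksum:   # steps 3-4: verify
            if good is None:
                good = in_ram
        else:
            bad.append(index)
    if not bad:
        return "ok"
    if good is None:
        # No copy verifies, so there is nothing trustworthy to write back.
        return "unrecoverable"
    for index in bad:
        copies[index][:] = good                   # step 5: repair from a verified copy
    return "repaired"
```

In this model a repair is only ever written from a copy that has itself passed the checksum, which is the property the rest of the thread argues about.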

So, what happens if something goes wrong:

Steps 3) and 4): If ZFS calculates a wrong checksum over correct data (because of faulty bits in RAM), it assumes the data is faulty and tries to correct it. As long as we have some redundancy, it should be fine.
Step 5): If ZFS detects a problem in steps 3) and 4) and tries to correct the data, it has to copy the redundant data into RAM before it can repair the faulty blocks. What will happen if this copy goes wrong because of faulty RAM bits? As I understand it, ZFS could then detect faulty blocks on the pool which cannot be repaired, because the redundant data also appears faulty. In this case data could get lost (as ZFS thinks the data is faulty), even though the data on the drives might be fine and only the RAM has an issue.

Is this understanding correct? Please correct me if I’m wrong.

Thanks a lot in advance,

Thomas

jro · January 13, 2025, 1:46pm

You’d get ZFS errors complaining about checksums not matching, but because the checksums don’t match, ZFS won’t touch anything on the disk. You’d presumably get a whole boatload of these checksum errors across all of your disks because the same faulty RAM is re-used during the scrub process. At that point, you’d probably conclude that you’ve got other weird hardware issues rather than all of your disks failing simultaneously.
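
A toy simulation of the scenario described above, under loud assumptions: the "faulty RAM" is modelled as a single stuck-at-1 bit at a fixed buffer offset, every read passes through that same buffer, and the scrub only counts mismatches without writing anything. The function names and the dictionary of disks are hypothetical and have nothing to do with real ZFS internals.

```python
import hashlib


def read_through_bad_ram(on_disk: bytes, stuck_offset: int) -> bytes:
    """Copy a block into 'RAM' where one bit is permanently stuck at 1."""
    buffer = bytearray(on_disk)
    buffer[stuck_offset] |= 0x01
    return bytes(buffer)


def scrub_with_bad_ram(
    disks: dict[str, bytes], stored_checksum: bytes, stuck_offset: int
) -> dict[str, int]:
    """Count checksum mismatches per disk; the on-disk bytes are never modified."""
    errors = {name: 0 for name in disks}
    for name, on_disk in disks.items():
        in_ram = read_through_bad_ram(on_disk, stuck_offset)
        if hashlib.sha256(in_ram).digest() != stored_checksum:
            errors[name] += 1
    return errors
```

With identical, intact data on every disk, the same bad cell produces a mismatch for each of them, which is the "all disks failing at once" pattern that points at something other than the disks.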

ZFS without ECC is arguably safer than most other filesystems without ECC. Still, ECC isn’t that expensive and might save you a headache, so it’s probably worth the investment.

Arwen · January 13, 2025, 5:16pm

The scenario is commonly referred to as the ZFS Scrub Of Death, though it is generally described with non-permanent RAM errors.

It is in some ways a Myth of Epic proportions.

But, my own opinion is that under extremely odd and unusual cases, non-ECC RAM could corrupt a pool during a scrub. My own take is that at some point, we will have enough data stored on ZFS using non-ECC RAM that statistically The ZFS Scrub Of Death will happen. Certainly not a regular occurrence.

On the other hand, we have seen many ZFS pool corruption problems here in the TrueNAS forums. Some are clearly unrelated to non-ECC RAM:

Using TrueNAS as a VM without proper safeguards (aka Proxmox & TN accessing the same pool at the same time)
Using gamer system boards that default to over-clocking (where bit flips might crash a game… but in TrueNAS corrupt a pool)
Using a striped pool
Using SMR disks
Using wide RAID-Z1 with very large disks
ZFS software bug (and yes, there have been such)

In some cases, we have not found an obvious, or even an odd theoretical, cause of pool corruption. So it is entirely possible that they got the pool corruption from non-ECC RAM. @jro’s comment about ZFS without ECC RAM applies.

Because of this, “RAID is not a backup”.

etorix · January 13, 2025, 6:38pm

We’ve had ONE actual report of a Scrub of Death:

Mythical, certainly. Blown out of proportion to its actual probability, most likely.
But the probability is not zero.

DAVe3283 · January 14, 2025, 5:54am

As the survivor of said Scrub of Death (which I also thought to be implausible previously), I want to note the stick of RAM was very bad. It would only take seconds in MemTest before errors started appearing.

Without ECC, and with most of the memory on that server consumed by ARC, I had no obvious sign that RAM was failing. Programs weren’t crashing, the system wasn’t rebooting, everything seemed fine. I did notice random checksum errors when reading the pool, but without any email notifications, it got pretty bad before I even noticed those. And when I did, I incorrectly assumed it was a failing HBA: I did not have great cooling on the PCIe slots, and figured that after years, I had cooked the HBA.

When a replacement HBA and fan were installed, I kicked off a scrub to clear the pool errors, and the rest is history.

Moral of the story is if you don’t have ECC, and you start seeing random checksum errors, disable scrubs and test your RAM! But after this experience, it is ECC or nothing for my future servers.
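
One way to spot the "random checksum errors" mentioned above is to look at the CKSUM column of `zpool status`. The sketch below parses that text and returns the rows with a non-zero counter. The function name is hypothetical, and it assumes the usual `NAME STATE READ WRITE CKSUM` table layout; the output format can differ between OpenZFS versions, so treat it as a starting point rather than a robust parser.

```python
def devices_with_checksum_errors(status_text: str) -> dict[str, str]:
    """Return {row_name: cksum_counter} for every table row whose CKSUM is not "0"."""
    flagged: dict[str, str] = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_table = True          # header row of a pool's device table
            continue
        if line.startswith("errors:"):
            in_table = False         # summary line that follows the table
            continue
        if not in_table or len(fields) < 5:
            continue
        name, cksum = fields[0], fields[4]
        # Counters stay strings because large values may be printed abbreviated.
        if cksum[0].isdigit() and cksum != "0":
            flagged[name] = cksum
    return flagged
```

If several devices show up at once, the posts above suggest suspecting shared hardware (RAM, HBA) before the disks themselves.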

pmh · January 14, 2025, 8:42am

This might be an interesting read:

https://jrs-s.net/2015/02/03/will-zfs-and-non-ecc-ram-kill-your-data/

Arwen · January 14, 2025, 11:15am

Thank you for clarifying that detail.

I had minor doubts that your problem could be The Scrub Of Death. But, “RAM was very bad” now makes sense. Add in that you had occasional & random checksum errors, and you have a believer here.

Now how do we go about awarding you the dubious honor of the first documented case of The ZFS Scrub Of Death?

Sure we can get you a plaque. Maybe even a statue, but of what?

There was a line in the linked article from @pmh that I liked:

And let’s assume that we have RAM that not only isn’t working 100% properly, but is actively goddamn evil and trying its naive but enthusiastic best to specifically kill your data during a scrub:

Sometimes I feel that hardware is evil. However, in reality it is usually simple failure.

etorix · January 14, 2025, 11:57am

ericloewe · January 14, 2025, 1:06pm

Anyone who claims that has never dealt with printers.

pmh · January 14, 2025, 1:29pm

daryusnslr · January 15, 2025, 7:02am

This discussion made me wonder about one of my file servers that does not have ECC memory and actually has a cheap consumer motherboard too. I have never encountered a ZFS error in my limited experience with TrueNAS (~3 years). Would it be possible to set an alert for such ZFS errors that may occur during a scrub? In Core 13.3’s GUI under related alert settings I can only see items for scrub started, finished, paused, or “failed to start”.
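
Outside the GUI, one possible approach is a small script run on a schedule. The sketch below is an assumption-laden example, not a TrueNAS feature: it relies on `zpool status -x` printing `all pools are healthy` when nothing is wrong, and the function names are hypothetical. How the alert is delivered (email, log, etc.) is left out.

```python
import subprocess

# Message that `zpool status -x` is assumed to print when no pool has problems.
HEALTHY_MESSAGE = "all pools are healthy"


def zpool_health_output() -> str:
    """Run `zpool status -x` and return its standard output as text."""
    result = subprocess.run(
        ["zpool", "status", "-x"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip()


def needs_alert(status_output: str) -> bool:
    """True if the output is anything other than the healthy message."""
    return status_output.strip() != HEALTHY_MESSAGE
```

A scheduled job could call `needs_alert(zpool_health_output())` and send a notification whenever it returns `True`.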

Ian_Posner · January 20, 2025, 9:16am

Is your data more valuable than the price differential between ECC and non-ECC? If so, just pay up and move on.

etorix · January 20, 2025, 12:04pm

…noting that second-hand DDR4 RDIMM is typically much cheaper than ECC UDIMM.