ZFS Scrubs with no ECC (what could possibly go wrong?)

Hello,

I know the ECC vs. non-ECC topic is a very sensitive one here in the forum.
But I would like to understand exactly how a scrub works on ZFS so I can judge for myself whether ECC is really needed (I had a debate with a colleague a few days ago; I argued for ECC, he argued against).

Let’s assume ZFS does a scrub over all the data on a pool and at least one of the Non-ECC RAM modules has some permanently faulty bits.

My current understanding is for a scrub:

  1. ZFS reads a block of data (including a checksum) from the pool.
  2. This information is stored into RAM.
  3. ZFS calculates the checksum over the stored data in RAM.
  4. ZFS compares the newly calculated checksum with the existing one in RAM (the one read from the pool).
  5. If the two checksums match, nothing needs to be done. Otherwise, ZFS reads the same data from another (redundant) location, copies it into RAM, verifies it against the checksum, and replaces the faulty data with this redundant copy.
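
The five steps above can be sketched in a few lines of Python. This is only a toy model under loud assumptions: the "pool" is two in-memory mirror copies, SHA-256 from `hashlib` stands in for whatever checksum the pool actually uses, and `checksum` / `scrub_block` are names made up for this post, not real ZFS functions. The sketch also checks every copy up front, which is a simplification of the step-by-step description.

```python
import hashlib
from typing import List


def checksum(data: bytes) -> bytes:
    # Assumption: SHA-256 as a stand-in for the pool's checksum algorithm.
    return hashlib.sha256(data).digest()


def scrub_block(copies: List[bytearray], stored_sum: bytes) -> str:
    """Verify each mirror copy of one block; repair bad copies from a good one."""
    good = None
    bad = []
    for i, copy in enumerate(copies):
        # Steps 1-4: read the copy, recompute its checksum, compare with the stored one.
        if checksum(bytes(copy)) == stored_sum:
            if good is None:
                good = i
        else:
            bad.append(i)
    if not bad:
        return "ok"
    if good is None:
        # No copy verifies, so there is nothing trustworthy to repair from.
        return "unrecoverable"
    # Step 5: overwrite each bad copy with a copy that passed verification.
    for i in bad:
        copies[i][:] = copies[good]
    return "repaired"


block = b"important data"
stored = checksum(block)
mirror = [bytearray(block), bytearray(block)]
mirror[0][0] ^= 0x01  # simulate on-disk corruption of the first copy

result = scrub_block(mirror, stored)
```

In this model a repair is only ever written from a copy that matched the stored checksum, which is the property the rest of the thread argues about.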

So, what happens if something goes wrong:

  • Steps 3) and 4): If ZFS calculates a wrong checksum over correct data (because of faulty bits in RAM), it assumes the data is faulty and tries to correct it. As long as we have some redundancy, this should be fine.
  • Step 5): If ZFS detects a problem in steps 3) and 4) and tries to correct the data, it has to copy the redundant data into RAM before it can repair the faulty blocks. What happens if this copy goes wrong because of faulty RAM bits? As I understand it, ZFS could then detect faulty blocks on the pool that cannot be repaired, because the redundant data also appears faulty. In this case data could be lost (as ZFS thinks the data is faulty), even though the data on the drives might be fine and only the RAM has an issue.
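
The second scenario can be modelled roughly as below. Everything here is hypothetical: a single bit stuck at 1 at a fixed offset, every read landing on that same faulty cell, and SHA-256 as the stand-in checksum. Real memory allocation is far less tidy, so treat this as an illustration of the idea rather than a prediction.

```python
import hashlib

# Hypothetical fault: byte 3 of the RAM buffer has bit 0x10 stuck at 1.
STUCK_OFFSET = 3
STUCK_MASK = 0x10


def load_into_faulty_ram(disk_block: bytes) -> bytes:
    """Copy a block from 'disk' into a buffer that contains the stuck bit."""
    buf = bytearray(disk_block)
    buf[STUCK_OFFSET] |= STUCK_MASK
    return bytes(buf)


def verify(disk_block: bytes, stored_sum: bytes, loader) -> bool:
    """Recompute the checksum over the in-RAM copy and compare with the stored one."""
    return hashlib.sha256(loader(disk_block)).digest() == stored_sum


block = b"data on a healthy disk"
stored_sum = hashlib.sha256(block).digest()
disks = [block, block]  # two mirror copies, both intact on disk

results_bad_ram = [verify(d, stored_sum, load_into_faulty_ram) for d in disks]
results_good_ram = [verify(d, stored_sum, bytes) for d in disks]
```

With the faulty loader both copies fail verification even though the on-disk bytes never changed; with healthy RAM both pass. In this model nothing gets written back, because no copy ever verifies.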

Is this understanding correct? Please correct me if I’m wrong.

Thanks a lot in advance,

Thomas

You’d get ZFS errors complaining about checksums not matching, but because the checksums don’t match, ZFS won’t touch anything on the disk. You’d presumably get a whole boatload of these checksum errors across all of your disks because the same faulty RAM is re-used during the scrub process. At that point, you’d probably conclude that you’ve got other weird hardware issues rather than all of your disks failing simultaneously.

ZFS without ECC is arguably safer than most other filesystems without ECC. Still, ECC isn’t that expensive and might save you a headache, so it’s probably worth the investment.

The scenario is commonly referred to as the ZFS Scrub Of Death, though it is generally described with non-permanent RAM errors.

It is in some ways a myth of epic proportions.

But, my own opinion is that under extremely odd and unusual cases, non-ECC RAM could corrupt a pool during a scrub. My own take is that at some point, we will have enough data stored on ZFS using non-ECC RAM that statistically The ZFS Scrub Of Death will happen. Certainly not a regular occurrence.


On the other hand, we have seen many ZFS pool corruption problems here in the TrueNAS forums. Some are clearly unrelated to non-ECC RAM:
  • Using TrueNAS as a VM without proper safeguards (such as Proxmox and TrueNAS accessing the same pool at the same time)
  • Using gamer system boards that default to overclocking (where bit flips might crash a game… but in TrueNAS corrupt a pool)
  • Using a striped pool
  • Using SMR disks
  • Using wide RAID-Z1 with very large disks
  • ZFS software bugs (and yes, there have been some)

On occasion, we have not found an obvious, or even an odd theoretical, cause of pool corruption. So it is entirely possible that those pools were corrupted by non-ECC RAM. @jro’s comment about ZFS without ECC RAM applies.

Because of this, “RAID is not a backup”.

We’ve had ONE actual report of a Scrub of Death:

Mythical, certainly. Blown out of proportion relative to its actual probability, most likely.
But the probability is not zero.

As the survivor of said Scrub of Death (which I also thought to be implausible previously), I want to note the stick of RAM was very bad. It would only take seconds in MemTest before errors started appearing.

Without ECC, and with most of the memory on that server consumed by ARC, I had no obvious sign that RAM was failing. Programs weren’t crashing, the system wasn’t rebooting, everything seemed fine. I did notice random checksum errors when reading the pool, but without any email notifications, it got pretty bad before I even noticed those. And when I did, I incorrectly assumed it was a failing HBA-- I did not have great cooling on the PCIe slots, and figured that after years, I had cooked the HBA.

When a replacement HBA and fan were installed, I kicked off a scrub to clear the pool errors, and the rest is history.

Moral of the story is if you don’t have ECC, and you start seeing random checksum errors, disable scrubs and test your RAM! But after this experience, it is ECC or nothing for my future servers.

This might be an interesting read:

https://jrs-s.net/2015/02/03/will-zfs-and-non-ecc-ram-kill-your-data/

Thank you for clarifying that detail.

I had minor doubts that your problem could be The Scrub Of Death. But, “RAM was very bad” now makes sense. Add in that you had occasional & random checksum errors, and you have a believer here.

Now how do we go about awarding you the dubious honor of the first documented case of The ZFS Scrub Of Death?

Sure we can get you a plaque. Maybe even a statue, but of what?

There was a line in the linked article from @pmh that I liked:

And let’s assume that we have RAM that not only isn’t working 100% properly, but is actively goddamn evil and trying its naive but enthusiastic best to specifically kill your data during a scrub:

Sometimes I feel that hardware is evil. However, in reality it is usually simple failure.

Anyone who claims that has never dealt with printers.

This discussion made me wonder about one of my file servers that does not have ECC memory and actually has a cheap consumer motherboard too. I have never encountered a ZFS error in my limited experience with TrueNAS (~3 years). Would it be possible to set an alert for such ZFS errors that may occur during a scrub? In Core 13.3’s GUI under related alert settings I can only see items for scrub started, finished, paused, or “failed to start”.
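
One possible approach outside the GUI is a small script, run on a schedule, that looks at the per-device error counters in `zpool status` output. The sketch below assumes the usual `NAME STATE READ WRITE CKSUM` column layout; the sample text is made up for illustration, and `devices_with_errors` is a name invented for this post. Counters are compared as strings against `"0"` so that abbreviated values are still caught.

```python
def devices_with_errors(status_text: str) -> dict:
    """Return {device: (read, write, cksum)} for rows with any non-zero counter."""
    flagged = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_config = True  # device rows follow this header
            continue
        if line.strip().startswith("errors:"):
            in_config = False  # end of this pool's device table
            continue
        # Rows with fewer than five fields (blank lines, section labels) are skipped.
        if in_config and len(fields) >= 5:
            counters = tuple(fields[2:5])
            if any(c != "0" for c in counters):
                flagged[fields[0]] = counters
    return flagged


# Illustrative sample only, not output captured from a real system.
SAMPLE = """\
  pool: tank
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada0    ONLINE       0     0     0
            ada1    ONLINE       0     0     7

errors: No known data errors
"""

flagged = devices_with_errors(SAMPLE)
```

On a live system the text could come from `subprocess.run(["zpool", "status"], capture_output=True, text=True).stdout`, with a non-empty result triggering whatever notification you prefer.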

Is your data more valuable than the price differential between ECC and non-ECC? If so, just pay up and move on.

…noting that second-hand DDR4 RDIMM is typically much cheaper than ECC UDIMM.

It does seem a heated topic, but it really should be a question of stable RAM vs. unstable RAM rather than ECC vs. non-ECC. Stable non-ECC is no more dangerous than stable ECC. There seems to be a theory that RAM which isn’t ECC is prone to random bit flips and the like, making it inherently unstable. That isn’t the case: non-defective RAM, alongside non-defective components in the system and run within spec, will be stable, with no bit flips or stuck bits.

Now here is my take on ZFS, scrubbing etc.

I don’t really have an opinion on exactly what is done during a scrub, but I do have an opinion on what is likely to happen if you have RAM instability.

So if you have a level of instability that isn’t severe enough to be noticeable from day to day (no kernel panics, no apps crashing, no services crashing), the chances are the vast majority of I/Os will not be affected by RAM-induced corruption, meaning there would not be enough compromised I/O operations for a scrub to break a pool. The same goes for normal usage breaking a pool. I think the only thing that might be seen is the occasional cksum error, which gets automatically repaired (assuming you are not using a stripe or a single disk).

If the RAM is bad enough that it’s routinely compromising I/O at a high frequency, the chances are you will be noticing much bigger things like kernel panics, crashing daemons, crashing games and so on, or, if TrueNAS is in a VM, the VM crashing. This level of instability would likely be picked up within a few seconds of running an online RAM stress test (not memtest86).

Generally, the vast majority of systems with RAM instability will be down to an unstable factory-overclocked configuration, aka XMP. ECC also has the advantage that these modules are typically not gamer-oriented, so their specified speeds and timings are far more sensible, which contributes heavily to their stability. So yeah, if not using ECC, run at JEDEC, not XMP, test it with something like stressapp, and don’t fiddle with things like tREFI. ECC, I think, is more about insurance and being assured than a requirement. Of course, if mainstream consumer kit supported ECC, I think more people would be using it, but it typically requires specific boards and CPUs. Ironically, cheaper non-gamer-oriented DIMMs may be a better bet than premium DIMMs: I remember a faulty motherboard I had where both premium gamer DIMMs wouldn’t even POST reliably at JEDEC, but the cheap DIMM, which had much looser timings, had no issues.

Finally, ZFS is not special in this respect: unstable RAM has the potential to destroy any file system, or the data on it. If anything, ZFS is more resilient due to metadata backups and its ability to carry out repairs.

It is my understanding that all used blocks in the ZFS pool are checked against the checksum stored in their parent entry.

Data block checksums are stored in the directory metadata, and the checksum for that directory metadata is in turn stored in the parent directory’s metadata, and so on all the way through the file system. The scrub works through this from the top down.
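
The "checksum lives in the parent" idea can be illustrated with a toy tree. This is a sketch only: `Node`, `digest` and `walk` are invented names, SHA-256 is an assumed stand-in checksum, and the on-disk layout of a real pool is far more involved (for one thing, something has to hold the checksum of the top-level block, which this model leaves out).

```python
import hashlib


def digest(data: bytes) -> bytes:
    # Assumption: SHA-256 as a stand-in for the pool's checksum algorithm.
    return hashlib.sha256(data).digest()


class Node:
    """A block plus the checksums it holds for each of its children."""

    def __init__(self, payload: bytes, children=()):
        self.payload = payload
        self.children = list(children)
        # The parent records each child's checksum at write time.
        self.child_sums = [digest(c.serialize()) for c in self.children]

    def serialize(self) -> bytes:
        # A block's content includes the checksums of its children.
        return self.payload + b"".join(self.child_sums)


def walk(node, path="root"):
    """Top-down walk yielding the path of every child that fails verification."""
    for i, (child, expected) in enumerate(zip(node.children, node.child_sums)):
        child_path = f"{path}/{i}"
        if digest(child.serialize()) != expected:
            yield child_path
        yield from walk(child, child_path)


leaf_a = Node(b"file data A")
leaf_b = Node(b"file data B")
directory = Node(b"dir metadata", [leaf_a, leaf_b])
root = Node(b"top-level metadata", [directory])

clean = list(walk(root))
leaf_b.payload = b"file data X"  # simulate silent corruption of one data block
damaged = list(walk(root))
```

Because each checksum sits one level above the block it protects, a damaged block cannot vouch for itself; the walk pins the mismatch on the one corrupted leaf.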

Yes.

Partly right. If the unstable RAM caused a bit flip in metadata AFTER it was checksummed, and both metadata copies were taken from that corrupted RAM copy, then it is possible that the underlying data could be gone. ZFS won’t use either copy of the metadata because the checksum comparison fails.

Next, if the unstable RAM caused a bit flip before the metadata was checksummed, then ZFS would not know that the metadata is corrupt, potentially causing the OS to crash when ZFS attempts to use that metadata.

Of course, if only 1 copy of the metadata is corrupted, and the other is good, well, ZFS will use the good checksummed copy and fix the other.
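
The difference between a flip before and after checksumming can be shown with a small model. Assumptions, stated loudly: SHA-256 stands in for the real checksum, both copies are written from the same RAM buffer, and `write_two_copies` / `readable` are names made up for this sketch.

```python
import hashlib


def flip_bit(data: bytes, byte_index: int = 0, mask: int = 0x01) -> bytes:
    """Return a copy of data with one bit inverted (the simulated RAM error)."""
    buf = bytearray(data)
    buf[byte_index] ^= mask
    return bytes(buf)


def write_two_copies(metadata: bytes, flip: str):
    """Return (stored_checksum, [copy1, copy2]) as they would land on disk.

    flip is "none", "before" (bit flips before checksumming) or
    "after" (bit flips after checksumming, before both copies are written).
    """
    if flip == "before":
        metadata = flip_bit(metadata)
    stored_sum = hashlib.sha256(metadata).digest()
    if flip == "after":
        metadata = flip_bit(metadata)
    return stored_sum, [metadata, metadata]


def readable(stored_sum: bytes, copies) -> bool:
    """True if at least one copy matches the stored checksum."""
    return any(hashlib.sha256(c).digest() == stored_sum for c in copies)


original = b"metadata block"
outcome = {}
for mode in ("none", "before", "after"):
    stored_sum, copies = write_two_copies(original, mode)
    # Record (does any copy verify?, is the content what was intended?)
    outcome[mode] = (readable(stored_sum, copies), copies[0] == original)
```

In this model "after" leaves two copies that both fail verification, while "before" produces copies that verify but no longer hold the intended content, which is the undetected case described above.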

…or even that is noticeable, but not clearly pointing to defective RAM.

ECC modules can fail just as well as non-ECC but ECC will report a failure while non-ECC will let you go through all steps of troubleshooting until you pay your respect to @winnielinnie and memtest™ your RAM.
Peace of mind is well worth having ECC RAM.
