Hello,
I know the ECC vs. non-ECC topic is a very sensitive one here in the forum.
But I would like to understand exactly how a scrub works on ZFS, so I can judge for myself whether ECC is really needed (I had a debate with a colleague a few days ago: I argued for ECC, he argued against).
Let’s assume ZFS scrubs all the data on a pool and at least one of the non-ECC RAM modules has some permanently faulty bits.
My current understanding of a scrub is:
1) ZFS reads a block of data (including its checksum) from the pool.
2) This information is stored in RAM.
3) ZFS calculates the checksum over the data stored in RAM.
4) ZFS compares the newly calculated checksum with the existing one in RAM (the one read from the pool).
5) If both checksums match, nothing needs to be done. Otherwise, ZFS reads the same data from another (redundant) location, copies it into RAM, verifies it against the checksum, and replaces the faulty data with this redundant copy.
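
To make the steps above concrete, here is a minimal Python sketch of how I picture a scrub of a single block. It is only a toy model of my understanding, not ZFS's actual code: `checksum`, `scrub_block`, and the use of SHA-256 as the checksum are assumptions I made for illustration.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    # Stand-in for the block checksum (assumption: SHA-256, chosen for illustration).
    return hashlib.sha256(data).digest()


def scrub_block(copies: list, stored_checksum: bytes) -> str:
    """Toy model: verify every redundant copy of one block and repair bad ones.

    `copies` is a list of bytearrays, one per redundant on-disk copy.
    """
    good = None   # first copy whose checksum matched
    bad = []      # indexes of copies whose checksum did not match

    for i, on_disk in enumerate(copies):
        in_ram = bytes(on_disk)                   # steps 1-2: read the copy into RAM
        if checksum(in_ram) == stored_checksum:   # steps 3-4: recompute and compare
            if good is None:
                good = in_ram
        else:
            bad.append(i)

    if not bad:
        return "ok"
    if good is None:
        return "unrecoverable"                    # no copy passed verification

    for i in bad:                                 # step 5: overwrite bad copies
        copies[i][:] = good
    return "repaired"


data = b"some block contents"
stored = checksum(data)
mirror = [bytearray(data), bytearray(b"some block c0ntents")]  # second copy is damaged
status = scrub_block(mirror, stored)
```

With healthy RAM, `status` ends up as `"repaired"` and both entries of `mirror` hold the original data again.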
So, what happens if something goes wrong?
- Steps 3) and 4): If ZFS calculates a wrong checksum over correct data (because of faulty bits in RAM), it assumes the data is faulty and tries to correct it. As long as we have some redundancy, it should be fine.
- Step 5): If ZFS detects a problem in steps 3) and 4) and tries to correct the data, it has to copy the redundant data into RAM before it can repair the faulty blocks. What happens if this copy also goes wrong because of faulty RAM bits? As I understand it, ZFS could then report faulty blocks on the pool that cannot be repaired, because the redundant data also appears to be faulty. In this case data could get lost (as ZFS thinks the data is faulty), even though the data on the drives might be fine and only the RAM has an issue.
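
The Step 5) scenario I am worried about can be sketched the same way. Again, this is only a toy model under my own assumptions: `load_into_faulty_ram` and `scrub_with_faulty_ram` are hypothetical helpers, SHA-256 stands in for the checksum, and I assume every copy is loaded through the same RAM location with a bit stuck at 1, which is a simplification.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    # Stand-in for the block checksum (assumption: SHA-256, chosen for illustration).
    return hashlib.sha256(data).digest()


def load_into_faulty_ram(on_disk: bytes, stuck_byte: int, stuck_mask: int) -> bytes:
    """Copy a block into 'RAM' where the bits in stuck_mask at stuck_byte always read as 1."""
    buf = bytearray(on_disk)
    if stuck_byte < len(buf):
        buf[stuck_byte] |= stuck_mask
    return bytes(buf)


def scrub_with_faulty_ram(copies: list, stored_checksum: bytes,
                          stuck_byte: int, stuck_mask: int) -> str:
    """Toy model: every redundant copy passes through the same faulty RAM location."""
    for on_disk in copies:
        in_ram = load_into_faulty_ram(on_disk, stuck_byte, stuck_mask)
        if checksum(in_ram) == stored_checksum:
            return "ok"
    return "unrecoverable"   # no copy verified, although the disks were never modified


block = bytes(16)            # all-zero block, so a stuck-at-1 bit really changes it
stored = checksum(block)
mirror = [block, block]      # both on-disk copies are correct

result = scrub_with_faulty_ram(mirror, stored, stuck_byte=3, stuck_mask=0x01)
disks_intact = all(checksum(copy) == stored for copy in mirror)
```

In this model `result` is `"unrecoverable"` while `disks_intact` is `True`: the scrub reports an error that exists only in RAM. Whether real ZFS behaves like this is exactly what I am asking.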
Is this understanding correct? Please correct me if I’m wrong.
Thanks a lot in advance,
Thomas

