Discussion - SCRUB Errors - Is data really gone?

Which ZFS SCRUB errors specifically indicate that a file is corrupt and the data returned to the user is bad? In other words, if this was the user's only copy of the data, that person no longer has that data.

And I’d like to preface this conversation with the fact that we are not willing to pay the costs of a data recovery company as the average person would not be able to afford it.

This topic comes up almost daily when someone reports an error message on a corrupt piece of data, and a forum thread I read this morning contained some contradictions and interesting points, which led me to read into it a bit more.

Before this morning I would have said that any SCRUB data corruption meant that the data is no longer available to the user, but now I’m thinking it is a bit more complicated than that and may depend on the pool layout, but maybe not.

Specifically, if we have a MIRROR and a SCRUB says one of the drives has CKSUM errors and then lists a corrupt file name, does that mean the file is corrupt on just the one drive, on both drives, or for the pool as a whole?

It would be nice to make a simple chart that lists pool layouts and the various scenarios, and states whether the file is corrupt or not, and if not, where the good data resides.

For example: RAIDZ2 using 5 drives, a SCRUB reports permanent errors for file “ABCD”, and the CKSUM column shows a value of 34 for one of the drives while all other drives show no errors. What does this mean?

In the past I’d say it meant that the file “ABCD” was corrupt and not recoverable. But what about a MIRROR? Same thing? I don’t know.

Based on how I read the statement below, if ZFS reports a corrupted file, it is corrupt for the entire pool, even if only one drive is showing CKSUM errors. And let’s keep in mind that it does not matter how the corruption occurred, only that the corruption does exist.

From an Oracle Document:
Data corruption errors are always fatal. Their presence indicates that at least one application experienced an I/O error due to corrupt data within the pool. Device errors within a redundant pool do not result in data corruption and are not recorded as part of this log. By default, only the number of errors found is displayed.

Now if a SCRUB does not report a file error but does report CKSUM errors for a drive, that means the data is still intact but a drive may be having a problem. Again, my interpretation.

Feel free to post any factual data you can locate and your interpretation of that data, but the most important thing is to make it simple and easy to read. I’d like to add the results of this discussion to @Arwen’s ZFS pools & power loss Resource once this is clear, since unstable systems and power loss seem to be blamed for a lot of these kinds of errors.

It is my understanding that the stats in zpool status are:

  • Read Errors are disk read sector errors detected and not correctable by sector ECC. (Aka Bit Rot… where weak magnetic bits flip, or SSD cells drain too far.)
  • Write Errors are disk errors saying it can’t perform the write, (no more spares, loss of sector header on HDDs)
  • Checksum Errors are data that was read successfully, (no Read Error as above), but the checksum stored in the ZFS metadata for that data block failed to match.
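To make the Checksum Error case concrete, here is a small sketch of the end-to-end checksum idea (a simplified model with hypothetical helper names, not actual ZFS code): the checksum is recorded separately from the data at write time, so even a read the drive considers successful can fail verification later.

```python
import hashlib

def write_block(data: bytes) -> tuple[bytes, str]:
    """On write, ZFS records a checksum of the block in the parent
    metadata (the block pointer), separate from the data itself."""
    return data, hashlib.sha256(data).hexdigest()

def read_block(stored: bytes, expected_cksum: str) -> bool:
    """On read, the freshly computed checksum must match the one
    recorded at write time; a mismatch counts as a CKSUM error."""
    return hashlib.sha256(stored).hexdigest() == expected_cksum

block, cksum = write_block(b"important file contents")
assert read_block(block, cksum)        # clean read: checksum matches

# Simulate bit rot the drive's own sector ECC failed to catch:
# flip one bit, so the read "succeeds" but the data is wrong.
rotted = bytes([block[0] ^ 0x01]) + block[1:]
assert not read_block(rotted, cksum)   # this is what shows up as a CKSUM error
```

(SHA-256 stands in for whichever checksum the pool actually uses, such as fletcher4.)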

Now for the causes of Checksum errors, there can be several:

  • Bit Rot - Where enough data bits flipped on a disk sector, in such a way that the disk sector error correcting code still matches, so no Read Error
  • Original write was in wrong place - The original data was written in the wrong place, basically a firmware bug in the drive. So a later read, reads random data that won’t match the ZFS data checksum.
  • Evil RAM - Bad RAM caused a bit flip of a block to be written, AFTER it was checksummed, but before it was written.
  • Drive cable - A loose connector, electrical noise, or a too-long cable can cause bit flips in data. Generally SATA only, as SAS has a checksum on data transfers, (if I remember correctly), and SAS would likely perform a retry.
  • Power supply - A loose cable, low voltage, or noisy power can cause bad data.

Of course, Evil RAM / Bad RAM that alters the data before it is checksummed won’t be detected as bad by ZFS… This is generally prevented by ECC RAM.
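A short sketch of why the Evil RAM case is invisible to ZFS (again a simplified model, not real ZFS code): if a bit flips in the buffer before the checksum is computed, the checksum faithfully describes the already-wrong data, so every later read verifies cleanly.

```python
import hashlib

def checksum(data: bytes) -> str:
    # Stand-in for the pool's block checksum (e.g. fletcher4 or sha256).
    return hashlib.sha256(data).hexdigest()

original = b"application data"

# Evil RAM: a bit flips in the write buffer BEFORE checksumming...
corrupted_in_ram = b"applicatioN data"

# ...so the checksum is computed over the already-bad bytes.
stored_cksum = checksum(corrupted_in_ram)

# Every subsequent read verifies perfectly: ZFS reports no error,
# even though the on-disk data never matched what the app wrote.
assert checksum(corrupted_in_ram) == stored_cksum
assert corrupted_in_ram != original
```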

Now whether you get permanent data loss depends on redundancy and what type of error.

  • Bit Rot - Should be recoverable with any type of redundancy, (copies=2/3, Mirror, RAID-Zx, dRAID), assuming bit-rot has not affected the redundant data.
  • Original write was in wrong place - Should be recoverable with any type of redundancy, (copies=2/3, Mirror, RAID-Zx, dRAID), assuming only 1 drive mis-wrote data.
  • Drive cable / Power supply - These, in my opinion, can lead to permanent data loss. But not always. This is specific to each reported case, and hard to nail down.

Thus, to get permanent data loss, as shown with zpool status, (with or without -v), the data appears bad to ZFS. Meaning any redundancy was tried and failed to produce data with a good checksum. In the case of a 2 way Mirror, failure of one copy would cause an immediate attempt on the other copy. And if that failed too, then data loss for that block / file.

Now if the same 2 way Mirror had the first copy read bad, but the second good, then the appropriate zpool status statistics would be incremented, the good data written to the bad location, and the good data supplied to the user, (if it was not a scrub). No permanent error.
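The two mirror scenarios above can be sketched as a read loop over the copies (a simplified model with hypothetical names, not actual ZFS code): try each copy in turn, repair any bad copies from a good one, and only report a permanent error when every copy fails its checksum.

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def mirror_read(copies: list[bytes], good_cksum: str):
    """Simplified 2-way mirror read. Returns (data, cksum_errors) on
    success; raises on a permanent error (all copies bad)."""
    errors = 0
    for copy in copies:
        if checksum(copy) == good_cksum:
            # Self-heal: rewrite any bad copy from the good one.
            for j in range(len(copies)):
                if checksum(copies[j]) != good_cksum:
                    copies[j] = copy
            return copy, errors
        errors += 1  # increments the CKSUM column for that vdev
    raise IOError("permanent error: no copy passed its checksum")

good = b"block contents"
cksum = checksum(good)

# Case 1: first copy rotted, second good -> data recovered, bad copy repaired.
copies = [b"blockXcontents", good]
data, errs = mirror_read(copies, cksum)
assert data == good and errs == 1 and copies[0] == good

# Case 2: both copies bad -> permanent data loss for this block.
try:
    mirror_read([b"bad1", b"bad2"], cksum)
except IOError:
    pass  # this is the "permanent errors" case zpool status -v reports
```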

As has been shown before here in the forums, errors caused by bad drive cables, (data or power), once fixed, can allow a ZFS scrub to succeed without making any corrections for the previously perceived errors.


As for adding to the ZFS pools & power loss Resource, I want to keep that related to power loss. Pool loss is a whole different resource that probably needs to be written.

It is just that lots of people complain that “I lost power, and now my pool won’t import”. That is likely not ZFS’ fault and I wanted to show why that is probably the case.