Freaking SSD "spare space"... how does it work? šŸ¤”

I'm tired. My coffee isn't helping. Pardon this thread, which may in fact be nothing more than rambling nonsense.


SSDs, whether they use the NVMe or SATA interface, and whether they come in the 2.5" or M.2 form factor, have something called "available spare space" or "spare cells".

:question: Am I right so far?

If no, please correct me. :stop_sign:

If yes, please continue. :point_down:


The ā€œspare spaceā€ in an SSD is not used for wear-leveling. In fact, itā€™s exclusively used in the event of I/O errors. (Write errors? Read errors? Hold on to this thought, since Iā€™ll revisit it later in this post when I bring up ZFS.)

:question: Am I right so far?

If no, please correct me. :stop_sign:

If yes, please continue. :point_down:


An SSD's "spare space" is essentially the same as an HDD's "spare sectors", correct? Meaning that when an SSD has to eat into its "spare space", it's really not much different from what an HDD does when it comes across a bad sector, which it then "relocates" to a "spare" sector?

Is there a fundamental difference between an SSD employing its "spare space" and an HDD employing "relocated" sectors? It seems like it's exactly the same thing.

:question: Am I right so far?

If no, please correct me. :stop_sign:

If yes, please continue. :point_down:
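(For reference, the only host-visible evidence in either case seems to be the SMART counters. Here's a rough sketch of checking the HDD side via smartctl's JSON output; the device name and the exact JSON keys are my assumptions, so verify against your own drive before trusting them.)

```python
# Rough sketch: read an HDD's reallocated-sector counters from smartctl's
# JSON output (smartmontools 7.x). The device name and exact JSON keys are
# assumptions -- verify against `smartctl -j -a /dev/sdX` on your system.
import json
import subprocess

out = subprocess.run(["smartctl", "-j", "-a", "/dev/sda"],   # hypothetical device
                     capture_output=True, text=True)
smart = json.loads(out.stdout)

for attr in smart.get("ata_smart_attributes", {}).get("table", []):
    if attr.get("id") in (5, 196):   # Reallocated_Sector_Ct, Reallocated_Event_Count
        print(attr["name"], "=", attr["raw"]["value"])
```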


The above all assumes a single drive, regardless of the filesystem.

But then how does ZFS redundancy work with this?

Upon coming across a "read" or "checksum" error on a block, ZFS obviously has a "good copy" of the block (mirror vdev) or can "recreate" it via parity information (RAIDZ vdev).

I would assume that ZFS attempts to "re-write" the good block on the problematic drive.

Does the SSD immediately employ its "spare space", in the same way that an HDD treats "bad blocks"?

Does using ZFS change anything? (That is, when ZFS comes across a block that is unreadable or "corrupt", does its attempt to "re-write" a good copy of the block immediately get intercepted and placed into the "available spare space" area of the SSD?)

Is there anything unique about ZFS where it ignores the logical block address (of the currently corrupt/unreadable data) and simply writes the "good copy" at a different LBA? (That is, the SSD does not employ its "available spare space"; rather, ZFS writes to a normal area, as with any other new write.)

Does this mean ZFS leaves it up to the drive's firmware to "deal with" the bad cell/sector at some later point in time?

If ZFS reports a "checksum" error, even though the block is "readable", does the SSD not know about this, and thus not employ its "available spare space", since ZFS can simply write a "good copy" of the block elsewhere on the SSD? Or does ZFS do something with the corrupted block to signal to the SSD's firmware, "Hey, you've got a problem here. You should mark this cell as 'bad' and use your 'available spare space'."
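To spell out my mental model of the mirror case, here's a toy sketch in Python. It's purely illustrative and reflects my assumptions about the flow, not how the real ZFS code is structured:

```python
# Toy model of my understanding of ZFS self-healing on a mirror vdev.
# Illustration only; nothing here resembles the real implementation.
import hashlib

def cksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()    # stand-in for fletcher4/sha256

def self_heal_read(mirror, offset, length, expected):
    """mirror: list of bytearrays standing in for the disks of a mirror vdev."""
    good, bad = None, []
    for disk in mirror:
        data = bytes(disk[offset:offset + length])
        if cksum(data) == expected:
            good = data                     # at least one clean copy exists
        else:
            bad.append(disk)                # this would bump the CKSUM counter
    if good is None:
        raise IOError("no good copy left -> permanent error")
    for disk in bad:
        # Rewrite the good data at the same logical offset on the bad disk;
        # where that lands on the physical media is the drive's problem.
        disk[offset:offset + length] = good
    return good

# Example: one half of the mirror has a flipped bit.
a = bytearray(b"hello world!")
b = bytearray(b"hello w0rld!")
expected = cksum(bytes(a))
assert self_heal_read([a, b], 0, 12, expected) == b"hello world!"
assert bytes(b) == b"hello world!"          # the bad copy was repaired in place
```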

Just wanted to congratulate you on your awesome face paint / tattoo job. :rofl:


[image: Captain America "I understood that reference"]

So, generalizations ahead. SSD manufacturers are very cagey about what they do and do not publicly say (or want publicly said) about their firmware.

Extra NAND in an SSD is used for both wear-leveling and sparing, as well as for internal parity (RAIN) that lets the drive correct errors on its own.

During a read, if the SSD's internal checksumming says that a block of NAND is bad/failed and needs to be spared out, it will do that before ZFS ever knows about it: it returns the corrected data and tags the cell to be spared out. Similarly, during a write, if a cell fails to program, the SSD spares it out and still reports the write as good.
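In toy-Python terms, my understanding of the read path is roughly this (every name, threshold, and data structure here is invented, since no vendor publishes the real logic):

```python
# Toy sketch of the read path described above. Names, thresholds, and the
# cell representation are all invented; real firmware is vendor-specific
# and not published.

ECC_CORRECTABLE_BITS = 8        # made-up correction limit
UNCORRECTABLE = object()        # the only outcome ZFS ever gets to see

def firmware_read(cell):
    """cell: dict like {'data': bytes, 'bit_errors': int, 'retire': bool}."""
    if cell["bit_errors"] == 0:
        return cell["data"]                       # clean read
    if cell["bit_errors"] <= ECC_CORRECTABLE_BITS:
        cell["retire"] = True                     # tag the cell to be spared out
        return cell["data"]                       # corrected data, host sees success
    return UNCORRECTABLE                          # only now does the error surface

# A marginal cell is silently corrected and queued for retirement:
cell = {"data": b"payload", "bit_errors": 3, "retire": False}
assert firmware_read(cell) == b"payload" and cell["retire"]
```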

This all happens in the firmware of the drive, and it increments the counter in the SMART/SCSI page data for used spare cells.
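If you want to watch that counter from the host side, it shows up in the NVMe health log. Minimal sketch against smartctl's JSON output; the device name and the key names are my assumptions (smartmontools 7.x), so double-check them on your own drive:

```python
# Minimal sketch: read the NVMe "available spare" fields from smartctl's JSON
# output. Key names are assumed from smartmontools 7.x -- verify with
# `smartctl -j -a /dev/nvme0` before trusting them.
import json
import subprocess

out = subprocess.run(["smartctl", "-j", "-a", "/dev/nvme0"],   # hypothetical device
                     capture_output=True, text=True)
log = json.loads(out.stdout).get("nvme_smart_health_information_log", {})

print("available spare :", log.get("available_spare"), "%")
print("spare threshold :", log.get("available_spare_threshold"), "%")
print("percentage used :", log.get("percentage_used"), "%")    # wear estimate
```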

If the SSD isn't able to correct the data, for some reason, that's when ZFS kicks in.

The key to all of this is that spare NAND/sectors aren't directly addressable by the host. It's all up to the drive's firmware to detect and handle it. When ZFS does the copy-on-write thing and writes a refreshed copy of checksum-corrected data, it's writing it to "the drive", which then writes it to "available NAND" - whether that comes from the "main" or "spare" pool is up to the firmware.
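Here's a toy flash-translation-layer sketch of that point: the host only ever addresses LBAs, while the firmware owns the LBA-to-physical mapping and a pool of physical blocks (including the over-provisioned spares) that the host can never address directly. Entirely illustrative, all names made up:

```python
# Toy FTL sketch. The host addresses LBAs; the firmware owns the mapping and
# the physical pool (main + spare). Everything here is invented for illustration.

class ToyFTL:
    def __init__(self, host_lbas=1000, physical_blocks=1100):
        # 100 extra physical blocks = over-provisioning/spare the host never sees.
        self.host_lbas = host_lbas
        self.map = {}                             # LBA -> physical block
        self.free = list(range(physical_blocks))  # main + spare, one internal pool

    def write(self, lba, data):
        assert 0 <= lba < self.host_lbas, "host can only address its own LBAs"
        old = self.map.get(lba)
        new = self.free.pop(0)                    # firmware picks the NAND, not the host
        nand_program(new, data)                   # stand-in for the media write
        self.map[lba] = new                       # remap the LBA to the fresh block
        if old is not None:
            self.free.append(old)                 # old block recycled (or retired if bad)

def nand_program(block, data):
    pass                                          # placeholder for the real program op

# When ZFS "rewrites the corrected block at the same LBA", the firmware is
# free to put it on completely different NAND:
ftl = ToyFTL()
ftl.write(42, b"good copy")
ftl.write(42, b"good copy, refreshed")            # same LBA, different physical block
```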

Basically

[image]


Here's a key point... does ZFS do the "copy-on-write thing" when it's refreshing a corrupt block? Or does it just rewrite the corrupt block?

If it did the cow thing then the block would not be corrected in snapshots.

I suspect, but have not verified, that ZFS actually just rewrites the block. It's up to the drive (HDD or SSD) to decide where on the media to actually write the LBA.

The risk is that the block write may be interrupted... but it's corrupt anyway, so that doesn't matter. And the checksums are presumably still correct. And those checksums are themselves stored in blocks.
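To illustrate why an in-place rewrite with the same data doesn't disturb anything above it: in ZFS the checksum of a block lives in the block pointer that references it. Toy sketch (sha256 standing in for fletcher4/sha256; everything else is invented):

```python
# Toy sketch of "the checksum lives in the parent". Repairing a block in place
# with the *same* data leaves every checksum above it valid, so nothing else
# in the tree needs to be rewritten.
import hashlib

def cksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()    # stand-in for fletcher4/sha256

class BlockPtr:
    def __init__(self, address, data):
        self.address = address              # where the child lives (DVA/LBA-ish)
        self.checksum = cksum(data)         # parent stores the child's checksum

media = {}                                  # address -> bytes, our toy "disk"

# Write a data block and the pointer that references it.
media["dva:100"] = b"important data"
ptr = BlockPtr("dva:100", media["dva:100"])

# Corruption on one copy...
media["dva:100"] = b"imp0rtant data"
assert cksum(media["dva:100"]) != ptr.checksum     # ZFS flags a CKSUM error

# Self-heal: rewrite the known-good bytes at the same address. The pointer
# (and everything above it) still matches, so nothing else changes.
media["dva:100"] = b"important data"
assert cksum(media["dva:100"]) == ptr.checksum
```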

Turtles all the way down.


But why are there elephants on top?
