Freaking SSD "spare space"... how does it work? šŸ¤”

I'm tired. My coffee isn't helping. Pardon this thread, which may in fact be nothing more than rambling nonsense.


SSDs, whether they use the NVMe or SATA interface, and whether they come in the 2.5" or M.2 form factor, have something called "available spare space" or "spare cells".

:question: Am I right so far?

If no, please correct me. :stop_sign:

If yes, please continue. :point_down:


The ā€œspare spaceā€ in an SSD is not used for wear-leveling. In fact, itā€™s exclusively used in the event of I/O errors. (Write errors? Read errors? Hold on to this thought, since Iā€™ll revisit it later in this post when I bring up ZFS.)

:question: Am I right so far?

If no, please correct me. :stop_sign:

If yes, please continue. :point_down:


An SSD's "spare space" is essentially the same as an HDD's "spare sectors", correct? Meaning that when an SSD has to eat into its "spare space", it's really not much different from what an HDD does when it comes across a bad sector, which it then "relocates" to a "spare" sector?

Is there a fundamental difference between an SSD employing its "spare space" and an HDD employing "relocated" sectors? It seems like it's exactly the same thing.

:question: Am I right so far?

If no, please correct me. :stop_sign:

If yes, please continue. :point_down:
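(For reference, the only host-visible evidence in either case seems to be the SMART counters. Here's a rough sketch of checking the HDD side via smartctl's JSON output; the device name and the exact JSON keys are my assumptions, so verify against your own drive before trusting them.)

```python
# Rough sketch: read an HDD's reallocated-sector counters from smartctl's
# JSON output (smartmontools 7.x). The device name and exact JSON keys are
# assumptions -- verify against `smartctl -j -a /dev/sdX` on your system.
import json
import subprocess

out = subprocess.run(["smartctl", "-j", "-a", "/dev/sda"],   # hypothetical device
                     capture_output=True, text=True)
smart = json.loads(out.stdout)

for attr in smart.get("ata_smart_attributes", {}).get("table", []):
    if attr.get("id") in (5, 196):   # Reallocated_Sector_Ct, Reallocated_Event_Count
        print(attr["name"], "=", attr["raw"]["value"])
```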


The above all assumes a single drive, regardless of the filesystem.

But then how does ZFS redundancy work with this?

Upon coming across a "read" or "checksum" error on a block, ZFS obviously has a "good copy" of the block (mirror vdev) or can "recreate" it via parity information (RAIDZ vdev).

I would assume that ZFS attempts to "re-write" the good block on the problematic drive.

Does the SSD immediately employ its "spare space", in the same way that an HDD treats "bad blocks"?

Does using ZFS change anything? (That is, when ZFS comes across a block that is unreadable or "corrupt", does its attempt to "re-write" a good copy of the block immediately get intercepted and placed into the "available spare space" area of the SSD?)

Is there anything unique about ZFS where it ignores the logical block address (of the currently corrupt/unreadable data) and simply writes the "good copy" at a different LBA? (That is, the SSD does not employ its "available spare space"; rather, ZFS writes to a normal area, as with any other new write.)

Does this mean ZFS leaves it up to the drive's firmware to "deal with" the bad cell/sector at some later point in time?

If ZFS reports a "checksum" error, even though the block is "readable", does the SSD not know about this, and thus not employ its "available spare space", since ZFS can simply write a "good copy" of the block elsewhere on the SSD? Or does ZFS do something with the corrupted block to signal to the SSD's firmware, "Hey, you've got a problem here. You should mark this cell as 'bad' and use your 'available spare space'."
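To spell out my mental model of the mirror case, here's a toy sketch in Python. It's purely illustrative and reflects my assumptions about the flow, not how the real ZFS code is structured:

```python
# Toy model of my understanding of ZFS self-healing on a mirror vdev.
# Illustration only; nothing here resembles the real implementation.
import hashlib

def cksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()    # stand-in for fletcher4/sha256

def self_heal_read(mirror, offset, length, expected):
    """mirror: list of bytearrays standing in for the disks of a mirror vdev."""
    good, bad = None, []
    for disk in mirror:
        data = bytes(disk[offset:offset + length])
        if cksum(data) == expected:
            good = data                     # at least one clean copy exists
        else:
            bad.append(disk)                # this would bump the CKSUM counter
    if good is None:
        raise IOError("no good copy left -> permanent error")
    for disk in bad:
        # Rewrite the good data at the same logical offset on the bad disk;
        # where that lands on the physical media is the drive's problem.
        disk[offset:offset + length] = good
    return good

# Example: one half of the mirror has a flipped bit.
a = bytearray(b"hello world!")
b = bytearray(b"hello w0rld!")
expected = cksum(bytes(a))
assert self_heal_read([a, b], 0, 12, expected) == b"hello world!"
assert bytes(b) == b"hello world!"          # the bad copy was repaired in place
```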

Just wanted to congratulate you on your awesome face paint / tattoo job. :rofl:


[image: Captain America "I understood that reference"]

So, generalizations ahead. SSD manufacturers are very cagey about what they do and do not publicly say (or want publicly said) about their firmware.

Extra NAND in an SSD is used for both wear-leveling and sparing, as well as for internal parity (RAIN) that lets the drive correct errors on its own.

During a read, if the SSD's internal checksumming says that a block of NAND is bad/failed and needs to be spared out, it will do that before ZFS ever knows about it: it returns the corrected data and tags the cell to be spared out. Similarly, during a write, if a cell fails to program, the SSD spares it out and still reports the write as good.
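In toy-Python terms, my understanding of the read path is roughly this (every name, threshold, and data structure here is invented, since no vendor publishes the real logic):

```python
# Toy sketch of the read path described above. Names, thresholds, and the
# cell representation are all invented; real firmware is vendor-specific
# and not published.

ECC_CORRECTABLE_BITS = 8        # made-up correction limit
UNCORRECTABLE = object()        # the only outcome ZFS ever gets to see

def firmware_read(cell):
    """cell: dict like {'data': bytes, 'bit_errors': int, 'retire': bool}."""
    if cell["bit_errors"] == 0:
        return cell["data"]                       # clean read
    if cell["bit_errors"] <= ECC_CORRECTABLE_BITS:
        cell["retire"] = True                     # tag the cell to be spared out
        return cell["data"]                       # corrected data, host sees success
    return UNCORRECTABLE                          # only now does the error surface

# A marginal cell is silently corrected and queued for retirement:
cell = {"data": b"payload", "bit_errors": 3, "retire": False}
assert firmware_read(cell) == b"payload" and cell["retire"]
```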

This all happens in the firmware of the drive, and it increments the counter in the SMART/SCSI page data for used spare cells.
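If you want to watch that counter from the host side, it shows up in the NVMe health log. Minimal sketch against smartctl's JSON output; the device name and the key names are my assumptions (smartmontools 7.x), so double-check them on your own drive:

```python
# Minimal sketch: read the NVMe "available spare" fields from smartctl's JSON
# output. Key names are assumed from smartmontools 7.x -- verify with
# `smartctl -j -a /dev/nvme0` before trusting them.
import json
import subprocess

out = subprocess.run(["smartctl", "-j", "-a", "/dev/nvme0"],   # hypothetical device
                     capture_output=True, text=True)
log = json.loads(out.stdout).get("nvme_smart_health_information_log", {})

print("available spare :", log.get("available_spare"), "%")
print("spare threshold :", log.get("available_spare_threshold"), "%")
print("percentage used :", log.get("percentage_used"), "%")    # wear estimate
```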

If the SSD isn't able to correct the data, for some reason, that's when ZFS kicks in.

The key to all of this is that spare NAND/sectors aren't directly addressable by the host. It's all up to the drive's firmware to detect and handle it. When ZFS does the copy-on-write thing and writes a refreshed copy of checksum-corrected data, it's writing it to "the drive", which then writes it to "available NAND" - whether that comes from the "main" or "spare" pool is up to the firmware.
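Here's a toy flash-translation-layer sketch of that point: the host only ever addresses LBAs, while the firmware owns the LBA-to-physical mapping and a pool of physical blocks (including the over-provisioned spares) that the host can never address directly. Entirely illustrative, all names made up:

```python
# Toy FTL sketch. The host addresses LBAs; the firmware owns the mapping and
# the physical pool (main + spare). Everything here is invented for illustration.

class ToyFTL:
    def __init__(self, host_lbas=1000, physical_blocks=1100):
        # 100 extra physical blocks = over-provisioning/spare the host never sees.
        self.host_lbas = host_lbas
        self.map = {}                             # LBA -> physical block
        self.free = list(range(physical_blocks))  # main + spare, one internal pool

    def write(self, lba, data):
        assert 0 <= lba < self.host_lbas, "host can only address its own LBAs"
        old = self.map.get(lba)
        new = self.free.pop(0)                    # firmware picks the NAND, not the host
        nand_program(new, data)                   # stand-in for the media write
        self.map[lba] = new                       # remap the LBA to the fresh block
        if old is not None:
            self.free.append(old)                 # old block recycled (or retired if bad)

def nand_program(block, data):
    pass                                          # placeholder for the real program op

# When ZFS "rewrites the corrected block at the same LBA", the firmware is
# free to put it on completely different NAND:
ftl = ToyFTL()
ftl.write(42, b"good copy")
ftl.write(42, b"good copy, refreshed")            # same LBA, different physical block
```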

Basically

[image]


Here's a key point... does ZFS do the "copy-on-write thing" when it's refreshing a corrupt block? Or does it just rewrite the corrupt block?

If it did the cow thing then the block would not be corrected in snapshots.

I suspect, but have not verified, that ZFS actually just rewrites the block. It's up to the drive (HDD or SSD) to decide where on the media to actually write the LBA.

The risk is that the block write may be interrupted... but it's corrupt anyway, so that doesn't matter. And the checksums are presumably still correct. And those checksums are themselves stored in blocks.
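To illustrate why an in-place rewrite with the same data doesn't disturb anything above it: in ZFS the checksum of a block lives in the block pointer that references it. Toy sketch (sha256 standing in for fletcher4/sha256; everything else is invented):

```python
# Toy sketch of "the checksum lives in the parent". Repairing a block in place
# with the *same* data leaves every checksum above it valid, so nothing else
# in the tree needs to be rewritten.
import hashlib

def cksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()    # stand-in for fletcher4/sha256

class BlockPtr:
    def __init__(self, address, data):
        self.address = address              # where the child lives (DVA/LBA-ish)
        self.checksum = cksum(data)         # parent stores the child's checksum

media = {}                                  # address -> bytes, our toy "disk"

# Write a data block and the pointer that references it.
media["dva:100"] = b"important data"
ptr = BlockPtr("dva:100", media["dva:100"])

# Corruption on one copy...
media["dva:100"] = b"imp0rtant data"
assert cksum(media["dva:100"]) != ptr.checksum     # ZFS flags a CKSUM error

# Self-heal: rewrite the known-good bytes at the same address. The pointer
# (and everything above it) still matches, so nothing else changes.
media["dva:100"] = b"important data"
assert cksum(media["dva:100"]) == ptr.checksum
```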

Turtles all the way down.


But why are there elephants on top?
