Unexpectedly high SSD-wear

Likely because they’re moving over to docker entirely by end of year. Hopefully we’ll have fewer woes in general then.

Thanks for clarifying on why that change actually could make a difference. I’d be interested to hear back from you on before/after results after some burn-in.

I could be wrong, but TXG synchronization will block reads until it is finished.
If you want to increase it, you should check the responsiveness when the write load is high.

This may be true. I cannot test it because in my setup the network connection is always the limiting factor, so I can’t see a difference here.

Another minor update on k3s:

I did not find a way to change the path of the k3s logfile, but as I am not interested in these logs anyway, I just linked it to /dev/null with

sudo rm /var/log/k3s_daemon.log && sudo ln -s /dev/null /var/log/k3s_daemon.log

It seems as if this reduces write load by some more KB/s. Not sure what happens if logrotate kicks in.
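One way to check whether a change like this really lowers the write load is to sample the kernel’s per-device counters over a fixed interval, before and after the change. The following is only a minimal sketch, assuming a Linux system that exposes /proc/diskstats. The function names and the device name `sda` are hypothetical and not from this thread.

```python
import time

# /proc/diskstats reports sectors in 512-byte units, independent of the
# device's real sector size.
SECTOR_BYTES = 512


def sectors_written(diskstats_text, device):
    """Return the 'sectors written' counter for one device."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Layout: major, minor, name, then I/O counters.
        # Zero-based index 9 is the number of sectors written.
        if len(fields) >= 10 and fields[2] == device:
            return int(fields[9])
    raise ValueError(f"device {device!r} not found")


def write_rate_kb_s(before_text, after_text, device, seconds):
    """Average write rate in KiB/s between two diskstats snapshots."""
    delta = sectors_written(after_text, device) - sectors_written(before_text, device)
    return delta * SECTOR_BYTES / 1024 / seconds


def sample(device, seconds=60):
    """Take two snapshots 'seconds' apart and return the write rate."""
    with open("/proc/diskstats") as f:
        before = f.read()
    time.sleep(seconds)
    with open("/proc/diskstats") as f:
        after = f.read()
    return write_rate_kb_s(before, after, device, seconds)
```

Calling `sample("sda", 60)` once before and once after the change (with `sda` replaced by the actual boot device) gives two numbers to compare. A longer interval smooths out bursts such as TXG flushes.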

I’ll leave it like this for the moment and see what happens over the next few days.

Thank you for the explanation.

During the first installation, SCALE formats the whole disk, including a huge additional partition whose size depends on the size of the disk. To my understanding, this space is not necessarily in use. So if some portion of it were left unformatted, the disk could use it as overprovisioning and reach a much longer lifetime.

Now to the next question: how can I tell the installation program to leave some part of the disk unformatted?

PS: my boot disk is 256 GB.

You can’t.

If you can’t stand the idea, get an even smaller and cheaper boot drive…

I fully agree with the statement! I was not talking about using the disk for something else; I was talking about extending the lifetime of the boot disk by not using it fully. Overprovisioning is one way to do that (SSDs support it). Your hint to buy a smaller one sounds a little aggressive to me (but it is possible my English is not good enough). Anyhow, I would like to find a way to extend the lifetime of the boot disk, and overprovisioning can be a solution if the writes can’t be limited.

Isn’t that drive-specific, in that you’d have to use a software tool from the manufacturer, like Samsung Magician, to set a larger part aside for overprovisioning?

Also, wear levelling should lead to a more even distribution of writes on a larger disk anyway? Not sure about this, though.

To my understanding, the disk controller can use cells from the unformatted area. A lot of companies use this to extend SSD life at the cost of less usable space. It’s common practice. Unfortunately, the controller can only distribute writes within the boundary of the partition.

Before I forget, thank you for the warm welcome.

I thought you were lamenting the “waste” of space and wanted to use the capacity for something else. Sorry for the misunderstanding.
Explicit overprovisioning is generally available only on data centre drives. If that’s not the case for your drive, use it as it is and do not overthink it.

Accepted!

I cannot agree with “only datacenter disks do that”. By default, datacenter disks simply have more reserved space than consumer disks, for example 3.84 TB vs 4 TB with the same cell count behind them. But it’s possible to decrease the usable size by not formatting the disk fully, so that the controller can use the rest to substitute worn-out cells. For example, QNAP offers SSD overprovisioning formatting by default.
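To put rough numbers on this, here is a small sketch using one common convention for expressing overprovisioning: spare capacity divided by usable capacity. This convention is an assumption here, and the function name is hypothetical. The 3.84 TB / 4 TB figures come from the post above; the 256 GB disk with 20% left unformatted is only an illustration based on the sizes mentioned in this thread.

```python
def op_percent(physical_gb, usable_gb):
    """Overprovisioning as a percentage of the usable capacity.

    Assumed convention: (physical - usable) / usable * 100.
    """
    if not 0 < usable_gb <= physical_gb:
        raise ValueError("usable capacity must be positive and <= physical")
    return (physical_gb - usable_gb) / usable_gb * 100


# Same cells sold as 4 TB (consumer) or 3.84 TB (datacenter):
extra_dc = op_percent(4000, 3840)

# A 256 GB boot disk with 20% of the space left unpartitioned:
extra_user = op_percent(256, 256 * 0.8)

print(f"datacenter-style reserve: {extra_dc:.1f}%")
print(f"user reserve on 256 GB:   {extra_user:.1f}%")
```

Note that this only describes how much space is set aside. Whether the controller can actually make use of unpartitioned space depends on the drive and on that space having been trimmed.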

Unlike a hard drive, where sectors as seen by the O/S are hard-mapped to physical sectors, in an SSD the sectors as seen by the O/S are soft-mapped to memory cells. Each cell can be written to a limited number of times before it fails (the number depending on the technology used), and depending on the number of cells (which is more than the stated capacity) the SSD will have an expected life measured in TOTAL BYTES WRITTEN (TBW). For a typical data storage SSD, some parts of the drive will contain non-volatile files that don’t change (and whose cells don’t accumulate writes), and some parts will contain volatile data that is always changing, so the cell wear is not necessarily even across all of the cells. (More sophisticated firmware can sometimes try to fix this by moving non-volatile data from low-write cells to higher-write cells, but this uses up a write, so the firmware needs to be sophisticated enough not to make things worse.) To optimise the cell wear the firmware also needs to know which cells are not in use, and the TRIM operation allows the O/S to let the SSD firmware know about cells that previously held files which are now deleted, so that these cells are NOT in use.
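As a back-of-the-envelope illustration of what a TBW rating means in practice, the sketch below converts a sustained write rate into the number of years needed to reach that rating. All numbers are hypothetical (a 150 TB rating and a constant 100 KB/s write load), the function name is made up, and 1 KB is taken as 1000 bytes. The write amplification factor is an assumption that real drives do not report in a uniform way.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def years_to_reach_tbw(tbw_terabytes, write_rate_kb_s, write_amplification=1.0):
    """Years until the rated TBW is reached at a constant host write rate.

    write_amplification > 1 models the drive writing more to NAND
    than the host sent (an assumed, drive-specific factor).
    """
    bytes_per_year = write_rate_kb_s * 1000 * write_amplification * SECONDS_PER_YEAR
    return tbw_terabytes * 1e12 / bytes_per_year


# Hypothetical drive rated for 150 TBW under a steady 100 KB/s write load:
print(f"{years_to_reach_tbw(150, 100):.1f} years")           # about 47.6 years
print(f"{years_to_reach_tbw(150, 100, 4.0):.1f} years")      # about 11.9 years
```

The second line shows why small writes matter: if each small write costs the drive several times as much NAND wear, the same host write rate uses up the rating proportionally faster.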

Overprovisioning is a fancy way of saying that you only partition (say) half of the disk. Primarily this reduces the amount of data you can hold, in the hope that this then reduces the amount of data written. Provided that the O/S includes unallocated areas of the disk in its TRIM list, the overprovisioning also helps the firmware spread the writes you make over a larger number of cells, but if the SSD thinks the unallocated areas of the disk are in use, then it will hinder rather than help.

How this all impacts the lifetime of an SSD depends on what it is used for.

For an SSD used for SLOG, I would argue that overprovisioning will probably not help. The amount of writes will not change, but if for any reason TRIM does not operate on the unallocated areas, and the firmware thinks these cells are in use, then the effect will be worse. However, provided that the unpartitioned areas were trimmed at some point, and provided ZFS issues periodic TRIMs for the partitioned areas, I would guess that using the full size of the SSD for the SLOG vdev would be the best way of allowing the firmware to optimise the cell writes.

But for a boot drive, things will be different. Much of the operating system will be static files. The TrueNAS configuration file and the syslog will be volatile, as will any swap area on the device. For performance reasons these would be better placed on an SSD than on an HDD. I would argue that two identical SSDs, or one that is twice the size, would likely have the same (or at least a very similar) TBW lifetime (which is based on the total number of cells), so the location of the volatile parts makes little difference in the long run (and the configuration file is as vital as the rest of the boot drive). However, splitting the same number of cells across 2x SSDs managed by two independent firmware instances will be less effective than one SSD where the firmware can optimise the cell usage as a whole.

Summary: My best guess: Buy a larger SSD than you need, but don’t overprovision.

Thank you for your explanation.
The conclusion of my idea, with only one SSD in use: if the Core installer left 15-20% of the unneeded space unformatted, boot SSDs could have a much longer lifetime, and the remaining usable space would not be attractive for some “tweakers” to use for something else. TRIM would then have the unused “blank space” available for substitution; it certainly already does this with spare cells (every SSD has some, more or fewer depending on datacenter vs consumer).

I agree that overprovisioning is effectively the same as (or at least very similar to) ensuring that maximum space utilisation is 15%-20% less than the normal 80%, i.e. (say) 60%.

However, for a given amount of data and writes, I do not believe that overprovisioning will increase the lifetime of the drive; it could even reduce it if (for any reason) the firmware thinks that the non-partitioned space is not trimmed.

Additionally, whilst having unpartitioned space does ensure that you don’t inadvertently use more space than intended, the consequence would also be that you might go over 80% when you otherwise wouldn’t (resulting in massive write performance degradation), and you might reach 100% when you otherwise wouldn’t, which for a boot drive might well result in an O/S crash.

So for these reasons I remain of the opinion that overprovisioning is not beneficial for a boot drive.

Can we prove it with some science, because saying no to new ideas is inherent in mankind?
Sorry for straying from the topic. Overprovisioning has been a well-established procedure in the IT world for as long as SSDs have existed. I know TrueNAS has more problems to solve. This is something that happens only sometimes (there were a lot of complaints over the years, first with dying USB sticks) and it was always discussed away by premium members (“Use an SSD”). It now exists with SSDs too, so why not discuss it more proactively? Overprovisioning for the boot disk would be a snap to implement for the highly skilled programmers at TrueNAS. And there may be a possibility to end a never-ending story.

OK - I went to research this and found “SSD Over-provisioning (OP) - Kingston Technology”.

At a raw hardware level, NAND memory has the following characteristics:

  1. It is organised into pages which are much bigger than a single sector. A page needs to be erased (as a background task) before it can be written to.[1]

  2. To write a sector, you need to read in any other sectors that will become part of the page containing the written sector, and write them all to a new page.

  3. Once a page has been copied, it is then queued to be erased as a separate operation running in the background. Only once a page has been erased can it be used to process a write operation.[2]

  4. If the pool of erased pages is empty, then the SSD has to wait until a block of pages has been newly erased before it can do the next write, and write performance will suffer significant degradation. So maintaining a pool of pre-erased pages (from built-in overprovisioning, user overprovisioning and pages identified by TRIM) for immediate use by writes is essential to preventing the pool from becoming empty, and thus for maintaining SSD write performance.

  5. When you delete a file, the filesystem (like ZFS) only rewrites the directory/metadata blocks and free-space chain, and does not rewrite the blocks that the file itself uses. (In ZFS this often does not happen when the file is deleted by the user, but rather when it is no longer in any snapshots i.e. when the last snapshot it is in is deleted.) TRIM is used to inform the SSD of the sectors that are on the free-space chain, and the SSD firmware can then start to erase those pages that entirely consist of sectors on the free-space chain.

Suppose there are 64x 512B sectors in one 32KB page. If you wrote data to only every 64th sector starting at sector 0, the drive would appear to the O/S to be only c. 1.6% used (1/64), but it would be using the same number of pages as if it were 100% full. If you then attempted to write another sector every 64 sectors starting at sector 32, then each existing page would need to be copied, with the newly sent sector merged in, and written to a new page, and then the old page would need to be erased.
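The arithmetic of this example can be sketched in a few lines of Python. This is a toy model of the simplified page scheme described above, not of how any real SSD firmware works. The drive size of 1000 pages and all names are hypothetical.

```python
SECTORS_PER_PAGE = 64            # 64 x 512 B sectors = one 32 KB page
TOTAL_PAGES = 1000               # hypothetical tiny drive
TOTAL_SECTORS = SECTORS_PER_PAGE * TOTAL_PAGES


def usage(written_sectors):
    """Return (fraction of sectors used, fraction of pages used)."""
    sectors = set(written_sectors)
    pages = {s // SECTORS_PER_PAGE for s in sectors}
    return len(sectors) / TOTAL_SECTORS, len(pages) / TOTAL_PAGES


def pages_to_rewrite(existing_sectors, new_sectors):
    """Pages that already hold data and must be copied to take a new sector."""
    used_pages = {s // SECTORS_PER_PAGE for s in existing_sectors}
    hit_pages = {s // SECTORS_PER_PAGE for s in new_sectors}
    return len(used_pages & hit_pages)


# First pass: every 64th sector, starting at sector 0.
first_pass = range(0, TOTAL_SECTORS, SECTORS_PER_PAGE)
logical, physical = usage(first_pass)
print(f"O/S view: {logical:.2%} used, page view: {physical:.0%} used")

# Second pass: every 64th sector, starting at sector 32.
second_pass = range(32, TOTAL_SECTORS, SECTORS_PER_PAGE)
print("pages copied and erased:", pages_to_rewrite(first_pass, second_pass))
```

In this model the second pass touches every page that the first pass already occupied, so each of those small writes costs a full page copy plus a later erase.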

HOWEVER…

It is not as simple as that. ZFS has a block size which is multiple sectors and which may well be at least as big as a page. In other words disk usage may match page usage much more closely than the example above.

Bottom line

The bottom line is that for write intensive workloads, such as SLOG, user overprovisioning can be very beneficial to ensuring that the pool of erased pages never reaches zero. But for non-write intensive workloads, the built-in over-provisioning is probably perfectly adequate.


  1. It’s actually more complicated than this. Pages (of e.g. 4K) are grouped into blocks (of e.g. 64K). Only blocks can be erased (as a whole) but pages can be written individually providing that the corresponding space in the block is erased and hasn’t previously been written to.

  2. It’s a bit more complicated than this. In a process called Garbage Collection, the firmware may merge pages from one block into another in order to use up spare pages in already written blocks, and create other blocks that have no used pages and which can thus be erased.

Thank you for digging deeper.

For the boot drive, I see mostly write activity and seldom reads. To break the last post down, overprovisioning can extend boot drive SSD lifetime as well.

There is a difference between manufacturer over provisioning and user over-provisioning. Manufacturer overprovisioning of a 500GB SSD involves adding more cells and thus increasing TBW. User overprovisioning doesn’t add more cells, and doesn’t change TBW.

If you think that user overprovisioning can lead to a higher TBW, then please provide evidence.

And whilst a boot drive may do mostly writes, i.e. a high % of its operations are writes (because in normal operation it is primarily read during the boot itself, but after that it is used to write the configuration files and syslog), that does NOT mean that it is write-intensive (quantity of write operations per second), nor that it will benefit from user overprovisioning.

I think it is important to note that there are two different ways to overprovision a drive, and they can have very different performance implications (also dependent on how the drive’s controller does things). One of these ways I will refer to as partition-based, since that is how it works: you as the user define a partition that is less than the full size of the drive and use that. The other I will refer to as LBA-based, since (assuming the drive supports it) you use management software from the manufacturer to redefine the size of the drive that gets presented to the operating system.

The major reason to make this distinction is that, based on the percentage of over provisioned NAND that the controller does not have to present up the stack, different strategies to better manage NAND lifecycles may be available to the controller that otherwise would not.
