Unexpectedly high SSD-wear

mat.bitty · May 26, 2024, 9:01pm

Hi,
this is my first question here, so please excuse any mistakes.

I am currently running TrueNAS SCALE 24.04.0 on a regular low-energy x86 PC as my storage and app host. My main usage is file storage and backup. The apps I have installed are Pi-hole, UrBackup and Gitea. Additionally, I have a virtual machine set up in QEMU with Debian and openHAB.

In general, everything is running perfectly fine and I really love the great features that TrueNAS offers.

However, one of the SSDs in the system died a few days ago. After replacing it, I took a closer look at what is happening and realized that TrueNAS (and the services running on it) writes approx. 80 GB/day to the SSDs according to the S.M.A.R.T. reports, which leads to high SSD wear.

Looking at the reports for all my disks in the TrueNAS GUI, it shows an average write rate to the system disk of approx. 200 KB/s (which matches the approx. 16 GB/day that S.M.A.R.T. shows for this disk). For the disks holding my apps and the VM, it shows an average write rate of approx. 1.5 MB/s (the S.M.A.R.T. report for this disk shows approx. 85 GB/day), which is really a lot, especially as the “real” data written to the disk changes by at most 1 GB/day.
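
As a rough cross-check of these figures, an average write rate can be converted to a daily volume. This is only a sketch: the function name is made up for illustration, and decimal units (1 GB = 1,000,000 KB) are assumed, which may not match how the GUI or the drive counts.

```shell
# Convert an average write rate in KB/s to GB/day.
# Assumes decimal units: 1 GB = 1,000,000 KB; a day has 86,400 seconds.
kbps_to_gb_per_day() {
    awk -v rate="$1" 'BEGIN { printf "%.1f\n", rate * 86400 / 1000000 }'
}

kbps_to_gb_per_day 200    # system disk     -> 17.3
kbps_to_gb_per_day 1500   # apps/VM disks   -> 129.6
```

The GUI average and the S.M.A.R.T. counters cover different time windows, so the two numbers do not have to agree exactly.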

First question: What is TrueNAS constantly writing to the system disk? Can I disable this to ensure the longevity of my disks?
Second question: Looking into my virtual machine with htop, it writes approx. 100-200 KB/s to disk. Why is TrueNAS writing more than 1 MB/s?

I would really appreciate any hint. Just let me know if more information is required.

Thank you so much!

Sara · May 27, 2024, 5:05am

Unfortunately, I can’t tell you what is causing the writes.

We don’t know what SSDs you are using, but 80GB/day is nothing to most SSDs.
Even for cheaper consumer drives, this should not exceed their TBW.

You could change the system dataset and syslog location.

Do you use RAIDZ? Could this be read/write amplification?

Maybe your SSD failure had nothing to do with SSD wearout and this was just a coincidence.

mat.bitty · May 30, 2024, 8:08am

Hi Sara,
thank you very much for the quick and helpful reply. After digging a little deeper into the issue, I found out the following:

Half of the writes were related to some logging inside my virtual machine, which I have now turned off. Inside the machine it was only 200 KB/s, which would have been OK, but it looks as if write amplification made it really bad here.
One of my SSDs was extremely cheap (as I was not expecting so many writes) and is showing a health of 60% after approx. 22 TB written, which is approx. 300 days based on the 80 GB/day.
The related disks are in a ZFS mirror with mixed capacity. I was not aware of ZFS write amplification, but that seems to be a huge part of the issue. As I could not find much information about it: is there a way to reduce write amplification in TrueNAS?
A huge part of the writes seems to come from the mere existence of Kubernetes. They are still there if I stop all apps and just leave the app environment configured. htop also shows some Kubernetes services writing to disk in this case. If I unset the Kubernetes volume, the writes are gone (but so are the apps). Is there a way to disable Kubernetes logging?
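
The wear figures for the cheap SSD can be put into a small back-of-the-envelope calculation. This is a sketch under a loud assumption: that the reported health percentage falls linearly with the amount of data written, which a given drive's firmware may not follow. The function names are illustrative only.

```shell
# Figures quoted in the post; the linear-wear model is an assumption.
written_tb=22      # data written so far (TB)
health_pct=60      # remaining health reported by the drive (%)
rate_gb_day=80     # observed write rate (GB/day)

# projected_total_tb WRITTEN_TB HEALTH_PCT: endurance implied by linear wear
projected_total_tb() {
    awk -v w="$1" -v h="$2" 'BEGIN { printf "%.1f\n", w * 100 / (100 - h) }'
}

# days_to_write TB GB_PER_DAY: days needed to write TB at the given rate
days_to_write() {
    awk -v tb="$1" -v r="$2" 'BEGIN { printf "%.1f\n", tb * 1000 / r }'
}

total_tb=$(projected_total_tb "$written_tb" "$health_pct")
remaining_tb=$(awk -v t="$total_tb" -v w="$written_tb" 'BEGIN { printf "%.1f\n", t - w }')

echo "days of writing so far:   $(days_to_write "$written_tb" "$rate_gb_day")"    # 275.0
echo "projected endurance (TB): $total_tb"                                        # 55.0
echo "days left at this rate:   $(days_to_write "$remaining_tb" "$rate_gb_day")"  # 412.5
```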

Any further help would be appreciated, as I don’t want to wear out my hardware if I can avoid it.

Thank you!

Sara · May 30, 2024, 9:08am

Hi Mat
I don’t use TrueNAS for apps, only for storage. That is why I probably can’t help you much here.

The default volblocksize should be 16k (at least that is now the default for ZFS). So TrueNAS provides blocks of 16k size. But with mirrors, I am not sure if you experience write amplification at all. Maybe for writes smaller than 16k? Do you have small writes, maybe from something like a database?

But even then, your almost 10x write amplification seems way too high for a mirror. Not really sure where the problem could be.

essinghigh · May 30, 2024, 9:23am

Ignore, misread.

etorix · May 30, 2024, 10:55am

I think you have to disable apps entirely to shut Kubernetes down.

mat.bitty · May 30, 2024, 2:57pm

Thanks for all your answers.

After some googling, I now understand the meaning of the “Record Size” setting. In my setup the default was 256K, which I have now reduced to 16K for all app and VM related drives, as I think that in this case mainly databases and log files with small changes will be written and the smaller chunk size is more appropriate.

As far as I understand, I just need to change the setting (as I have now done) and it will take effect whenever sectors are written. I am not sure if it can solve my problem without deleting all the data and recreating the volumes. I think I’ll just wait for a few days and see how this evolves.

Regarding Kubernetes: I like it and need it up and running for my apps. My question was just if I can somehow set its log-level to something like “error” as it seems to be quite verbose (at least regarding its write activity).

Sara · May 30, 2024, 4:45pm

That is not it. Record size is for datasets. Recordsize describes the maximum size and is not a fixed size. Your 256k dataset can have a 64k write.

For zvols, it is different. TrueNAS provides blocks that have a volblocksize (default 16k). Here you get amplification if the block is bigger than the writes your workload puts into it.
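
The zvol case can be illustrated with a deliberately simplified model: every write is rounded up to whole blocks of volblocksize. This is only a sketch; it assumes block-aligned writes and ignores compression, metadata and mirroring, and the function name is made up for illustration.

```shell
# amplification WRITE_KB BLOCK_KB
# Ratio of data rewritten on disk to data actually changed, assuming each
# write is aligned and rounded up to whole blocks of BLOCK_KB.
amplification() {
    awk -v w="$1" -v b="$2" 'BEGIN {
        blocks = int((w + b - 1) / b)      # whole blocks touched
        printf "%.1f\n", blocks * b / w
    }'
}

amplification 4 16    # 4k write into a 16k block    -> 4.0
amplification 16 16   # write matches the block size -> 1.0
amplification 20 16   # 20k write spans two blocks   -> 1.6
```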

I don’t know about kubernetes. I only know that Proxmox uses zvols for VMs and datasets for lxc containers.

I am pretty sure the 256K default is not right. The default recordsize is 128k, unless that changed recently.

Fleshmauler · May 31, 2024, 12:14am

The option to change the syslog location has been disabled and the logs are now fixed to the boot drive(s); see [NAS-129237] - iXsystems TrueNAS Jira to suggest bringing it back if you (like me) miss it.

Though considering it seems to be apps logging, this is likely unrelated. [Edit] - I’d see if it is possible within the app to reduce logging.

winnielinnie · May 31, 2024, 12:16am

Yay…

B52 · May 31, 2024, 11:35am

Hi,
This is my first post here, so please excuse any mistakes. The same permanent hammering (writes) can be seen on my boot disk. There are no applications installed. The average write rate is about 138 KiB/s. That does not sound like much, but in the long run it wears out the boot SSD; additionally, I can’t see any comparable reads. So the question is, what is the data for? What happens if the disk is full? I guess it would be helpful to have someone with more skills than I have keep an eye on it.

Fleshmauler · May 31, 2024, 8:07pm

Logs rotate, so to my understanding they shouldn’t ever really fill the disk. For the hit on your boot drives, see my link and vote on it; hopefully iX will go back to giving us the option of choosing which pool the logs are saved on. Since my spinning rust never spins down, I’d personally prefer to have the logs there.

[Edit] You also asked what the logs are for - general troubleshooting & system health purposes. If anything is going wrong at all and it isn’t an obviously physical issue, logs will very much be of assistance for either you or iX to find a solution.

mat.bitty · May 31, 2024, 8:09pm

Hi,
thanks again for the fruitful discussion that helped me to find more readings about ZFS and TrueNAS. The more I read, the more I see that ZFS is really a masterpiece. There is so much in it…

However, Sara, you were right, my volblocksize is 16K and should be fine for my case. As I do not use RAID-Z but pure mirroring there should not be much write amplification. The default record size was 128K in my system, I obviously just changed it to 256K for one of my ZVOLs during testing. Anyway, changing the setting did not influence write behaviour.

After some more googling I found many people complaining about the high write loads that Kubernetes (k3s) and TrueNAS itself generate and the SSD wear they may cause. One of the solutions I found is this one: https://www.reddit.com/r/truenas/comments/n2whcy/scale_temporary_solution_to_excessive_writes_to/
I now added the line
echo 60 > /sys/module/zfs/parameters/zfs_txg_timeout
to the init scripts as a postinit script (in the GUI). On every boot, the timeout for ZFS transaction groups is increased from 5 seconds (the default) to 60 seconds. This way, data that is to be written to the disks is kept in memory for up to 60 seconds and then flushed to disk. From what I have seen so far, this reduces the amount of data written to disk significantly in my case. I will test for a few days and let you know if any issues appear.
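
A slightly more defensive variant of that one-liner could look like the sketch below. The parameter path is the one quoted above; the function name, the optional second argument and the messages are illustrative assumptions, not anything TrueNAS provides.

```shell
#!/bin/sh
# set_txg_timeout SECONDS [PARAM_FILE]
# Writes SECONDS to the ZFS txg timeout parameter after basic sanity checks.
# PARAM_FILE defaults to the path from the post; the override exists only so
# the function can be tried against a scratch file.
set_txg_timeout() {
    seconds="$1"
    param="${2:-/sys/module/zfs/parameters/zfs_txg_timeout}"

    # Accept only positive integers.
    case "$seconds" in
        ''|*[!0-9]*) echo "invalid timeout: $seconds" >&2; return 1 ;;
    esac
    [ "$seconds" -gt 0 ] || { echo "timeout must be > 0" >&2; return 1; }

    # The file only exists while the zfs module is loaded; writing needs root.
    if [ ! -w "$param" ]; then
        echo "cannot write $param" >&2
        return 1
    fi

    echo "old value: $(cat "$param")"
    echo "$seconds" > "$param"
    echo "new value: $(cat "$param")"
}

# Usage in a post-init script (run as root):
#   set_txg_timeout 60
```

The setting does not survive a reboot, which is why it has to be applied from an init script each time.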

Fleshmauler · May 31, 2024, 8:19pm

Wouldn’t that only increase the time data is spent in memory & possibly increase chances/amount of data loss on system hang/power outage/whatever? I could be wrong, but I don’t see that this would actually decrease the amount of data written, only how long the system waits before writing it to disk? Someone will hopefully correct me & shine more light on this.

[edit; one day I’ll fully think out a post before sending it]
Also the term ‘timeout’ to me usually specifies the maximum amount of time, not the set amount of time. Like changing the ‘timeout’ value for a postinit script isn’t going to change how long it takes for the script to do the needful, just how long the system waits for the script to complete the needful before giving up on the script entirely & killing the process.

Either way, maybe I’m way off on this and this is indeed something good to tweak for those of us concerned about wearing out SSDs.

mat.bitty · May 31, 2024, 8:28pm

From what I have seen so far, you are right in theory.

But for my system I see fewer writes overall. My guess:

As mainly small changes to log-files are written, the chunk sizes increase, resulting in less overhead data (file system indices, metadata, checksums) and
Some of the changes may “cancel out” as they are changed forward and backward in memory, before being written to disk.
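
The first guess can be turned into a toy calculation. It assumes that a transaction group is committed every time the timeout expires (which only happens while there is pending data) and that each commit costs a fixed amount of extra metadata. The 64 KB per commit is purely an illustrative number, not a measured ZFS value, and the function names are made up.

```shell
# commits_per_day TIMEOUT_SECONDS: upper bound on timer-driven commits per day
commits_per_day() {
    echo $(( 86400 / $1 ))
}

# overhead_mb_per_day TIMEOUT_SECONDS PER_COMMIT_KB
# Fixed per-commit overhead summed over one day (decimal MB).
overhead_mb_per_day() {
    awk -v n="$(commits_per_day "$1")" -v kb="$2" \
        'BEGIN { printf "%.1f\n", n * kb / 1000 }'
}

commits_per_day 5            # -> 17280
commits_per_day 60           # -> 1440
overhead_mb_per_day 5 64     # -> 1105.9
overhead_mb_per_day 60 64    # -> 92.2
```

Whatever the real per-commit cost is, in this model it shrinks by the same factor of 12 when going from 5 to 60 seconds.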

Regarding Kubernetes, I also think I found the reason for its verbose logging which seems to be related to this issue:
https://ixsystems.atlassian.net/browse/NAS-116443
that the TrueNAS-Team does not want to fix for some reason…

mat.bitty · May 31, 2024, 8:31pm

Regarding your edit:

The line of code that I added to postinit changes the ZFS timeout on the machine, i.e. the time after which any pending transaction is forced to be written to disk. Besides the timeout, there is also a size limit on the pending data that may trigger the flush earlier.

So the init script changes a ZFS-setting that is valid until a restart. This way it influences all writes to ZFS-volumes on the machine.

Fleshmauler · May 31, 2024, 8:44pm

Likely because they’re moving over to docker entirely by end of year. Hopefully we’ll have less woes in general then.

Thanks for clarifying on why that change actually could make a difference. I’d be interested to hear back from you on before/after results after some burn-in.

mvd8tvJI · May 31, 2024, 8:58pm

I could be wrong, but TXG synchronization will block reads until it is finished.
If you want to increase it, you should check the responsiveness when the write load is high.

mat.bitty · May 31, 2024, 9:10pm

This may be true. I cannot test it, as in my setup the network connection is the limiting factor, so I can’t see a difference here.

mat.bitty · May 31, 2024, 9:15pm

Another minor update on k3s:

I did not find a way to change the path of the k3s logfile, but as I am not interested in these logs anyway, I just linked it to /dev/null with

sudo rm /var/log/k3s_daemon.log && sudo ln -s /dev/null /var/log/k3s_daemon.log

It seems as if this reduces the write load by a few more KB/s. Not sure what happens when logrotate kicks in.

I’ll leave it like this for the moment and see what happens during the next days.
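
For before/after comparisons of changes like these, the write rate of a disk can be sampled from /proc/diskstats, where the tenth field is the cumulative number of 512-byte sectors written. The sketch below uses made-up function names, and the device name sda is only an example.

```shell
# sectors_written DEVICE [STATS_FILE]
# Print the cumulative count of 512-byte sectors written to DEVICE.
sectors_written() {
    awk -v dev="$1" '$3 == dev { print $10 }' "${2:-/proc/diskstats}"
}

# rate_kbs BEFORE AFTER SECONDS: average write rate in KB/s (decimal KB)
rate_kbs() {
    awk -v a="$1" -v b="$2" -v s="$3" \
        'BEGIN { printf "%.1f\n", (b - a) * 512 / 1000 / s }'
}

# write_rate_kbs DEVICE SECONDS: sample twice and report the average
write_rate_kbs() {
    before=$(sectors_written "$1")
    sleep "$2"
    after=$(sectors_written "$1")
    rate_kbs "$before" "$after" "$2"
}

# Example: average write rate of sda over one minute
#   write_rate_kbs sda 60
```

A longer sampling interval smooths out the bursts caused by the periodic flushes.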