I’ve seen a production system lock up completely once a pool hit 100% storage and become totally unresponsive. Luckily, it wasn’t my service, so it wasn’t my problem. Sadly, because it wasn’t my problem I have no clue how it was fixed, other than the vendor being involved.
Not a guarantee that you’ll experience anything like that, but I’d personally make some space asap.
A copy-on-write filesystem needs to record metadata changes before deleting anything. So ZFS needs free space to free space… At 100% full no change is possible, so the pool is locked read-only and can only be unlocked by adding space (remember the catch-22: you can’t free space by deleting files/snapshots until you have added space).
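One trick people use to sidestep that catch-22 (not an official TrueNAS feature, just a common community habit) is to park a small reservation on a throwaway dataset, so there’s always some space you can hand back in an emergency. A rough sketch, assuming your pool is called `tank`:

```sh
# Keep a dataset whose only job is to hold a reservation you can release later.
zfs create -o refreservation=10G tank/reserve

# If the pool ever gets dangerously full, give the space back so
# deletes and snapshot destroys can complete again:
zfs set refreservation=none tank/reserve
```

The 10G figure is arbitrary; the point is simply that the pool can never report zero bytes free while the reservation is in place.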
With a raidz1 vdev you cannot add a drive temporarily, because once added it cannot be removed.
Raidz expansion might work at 100% full (or not?) and save you, but do you want to be the guinea pig and find out? (If so, please report!)
Files on this pool can only be uploaded/created manually, no services or apps doing that, so I’m in control. I always leave about 50-100 GB free. Now I understand that filling to 100% will mess up my pool, and I will try to upgrade asap, but I’m not gonna let it fill up to 100%.
So my question still stands. What about 85, 90, 95% disk space utilization? Should I keep disk space usage to no more than 80% (as TrueNAS recommends), or is ~92-97% fine as long as I’m OK with the performance hit and don’t let it fill up to 100%?
Mind that automatic snapshots could still take up some space (not much, but at the point your pool is at, every byte counts).
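If you want to see exactly where you stand, the stock ZFS tools will tell you. A quick sketch, again assuming the pool is named `tank`:

```sh
# Pool-level numbers: allocated, free, capacity % and free-space fragmentation
zpool list -o name,size,alloc,free,cap,frag tank

# Per-dataset breakdown, including how much is pinned by snapshots (USEDSNAP)
zfs list -o space -r tank

# Snapshots sorted with the largest first, handy for deciding what to prune
zfs list -t snapshot -o name,used -S used -r tank
```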
There will be a performance hit at some point, depending on your work pattern, because ZFS wants free space, ideally contiguous free space, for everything. But if your pool is mostly WORM it may not matter (which is probably the case, because you should be long past that point already).
If you accept the above, and understand you’re living dangerously, you may carry on.
Suhu, your problem is how you are going to ‘add space’.
Replacing each 4 TB drive one at a time with a larger drive and resilvering the pool until all drives are upgraded? Three replacements, three resilvers. Not sure I would trust that with your pool above 90%.
Adding another VDEV means adding 3 drives in RAID-Z1, but it requires 3 additional SATA ports. This would probably be the safest option, and the additional space is available right away. Your data would still be distributed above 95% on the first VDEV, and new data would, hopefully, all be added to the second VDEV.
You could create another pool, with a single drive VDEV or two drives in a mirror VDEV. You would now have two pools and would have to move data from the full pool to the other.
There is also RAID-Z expansion, expanding your current VDEV from 3 to 4 hard drives. With how full your current pool is, none of us would recommend that: if it fails, you could lose the entire pool and its data.
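For reference, the rough commands behind those options look something like this (disk names are placeholders, the pool name `tank` is an assumption, and on TrueNAS you would normally do all of this through the UI rather than the shell):

```sh
# Replace drives one at a time, resilvering between each swap
zpool set autoexpand=on tank            # let the vdev grow once every member is larger
zpool replace tank <old-disk> <new-disk>
zpool status tank                       # wait for the resilver to finish before the next swap

# Add a second 3-disk raidz1 vdev (needs 3 free SATA ports)
zpool add tank raidz1 <disk1> <disk2> <disk3>

# Or build a separate pool (single disk or mirror) and move data over
zpool create backup mirror <diskA> <diskB>
```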
The 80%, 90% rules give the user time to perform upgrades.
The 90% mark is probably a good place to stop.
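If you want “90% and no further” to be something the system enforces rather than something you remember, a quota on the top-level dataset is one way to do it. A sketch, with an assumed pool name and a purely illustrative figure:

```sh
# Cap the top-level dataset so ordinary writes fail before the pool itself is full.
# 6.5T is only an example (roughly 90% of the usable space of a 3x4TB raidz1);
# adjust it to your actual pool.
zfs set quota=6.5T tank
zfs get quota,used,available tank
```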
Edit: Just be glad you aren’t using block storage. We recommend 50% free space there.
No redundancy is lost in raidz1 vdev expansion.
One drive fails: Data is still there.
Two drives fail: Pool is lost.
Expansion, replacing drives, and adding a further vdev are all valid solutions. (Expansion with a larger drive, and then replacing the others, is also possible.) Make a choice at your earliest convenience.
Just purchased another 4 TB drive (unfortunately I don’t have access to eBay with its prices and buyer protection in my country, so I paid ~$125 for a new WD Red Plus 4 TB). My plan is to unload some data onto an external drive and the cloud, maybe delete something if possible, and then upgrade using RAID-Z expansion.
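For what it’s worth, on OpenZFS versions that ship RAID-Z expansion (2.3 and the TrueNAS releases based on it), the expansion itself is a single attach of the new disk to the existing vdev. A sketch with assumed names (check `zpool status` for your real vdev name, which is usually `raidz1-0`):

```sh
# Attach the new disk to the existing raidz1 vdev to grow it from 3 to 4 drives
zpool attach tank raidz1-0 <new-disk>

# Expansion runs in the background; this shows its progress
zpool status tank
```

Doing the unload-to-external/cloud step first is sensible either way: expansion rewrites a lot of existing data, and the more free space it has to work with, the better.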
Thanks again everyone for your help. I will try not to let it happen in the future
The main issue is excessive fragmentation of data that is written.
In essence, zfs will be scavenging little bits of storage scattered all over your disk to write new files.
These files will always be fragmented now, which means that when reading them the drive heads may have to make a crazy, hectic journey all over the disk… perhaps hundreds of thousands or even millions of additional seeks.
Read/write performance could end up in the single digits.
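You can actually watch this happening: the FRAG column in `zpool list` reports how fragmented the free space is, which is exactly what forces those scattered writes. A quick check (pool name assumed):

```sh
# Free-space fragmentation; high values on a nearly full pool mean new
# writes get chopped into small pieces scattered across the disks.
zpool list -o name,cap,frag tank
zpool get fragmentation tank
```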
While I know we are mostly sysadmins/engineers/developers here, this is one of the best “layman’s” explanations of the ZFS 80% rule. I use this to explain to management why we need to upgrade things:
I’ve worked in IT for well over 20 years but some of the smartest IT folk I know have never worked in the industry.
The IT industry has changed a lot in my time. It’s gone from the classic IT Geek running the whole show to people in suits that often don’t know the first thing about what we would call IT.