I have a ZFS volume that has been filled up: only 6 GB left out of 1.5 TB (I am pretty sure dedup was enabled).
TrueNAS Community Edition 25.04.2.1 crashes and reboots too quickly for me to install the pending update.
I was able to cancel the scrub that was automatically launched after the import. I thought it was responsible for the issue, as it stalled after about 600 MB; I left it running for more than 24 hours without any progress.
I was able to import the pool on a Debian live system and tried to remove some data. The removal freezes quite quickly; the live CD keeps running but the pool becomes inaccessible (every command on the pool just hangs without any feedback). After rebooting the live CD, the removed data is still there and the space usage has not changed.
I tried the TRIM command; the freeze still occurs.
dmesg shows errors like “task zpool blocked for more than 240 seconds” and “task z_fr_iss blocked for more than 300 seconds”.
These are more or less the same messages I see under TrueNAS before the automatic reboot.
This is a test machine, so it is not a big deal, but I would like to see if it is possible to recover from such a situation (and this thread can serve as a reference for other people).
ZFS appears to me to be a robust solution, so I cannot imagine that filling a volume to 100% can permanently kill the pool; and if that is the case, why not return a “disk full” status at, for example, 90%? I did receive the warnings, but as it happened overnight, I did not have time to cancel the task that filled the volume.
I hope I have provided all the necessary information; feel free to ask for anything else.
Let me know what I can try and I will post the results in this thread.
Going to start with the useful advice first: see if there are any snapshots on the pool that you can delete. It’ll likely be the easiest & quickest way to free up some space.
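If you want to check from the shell, something along these lines should list the snapshots sorted by size and let you destroy the biggest one (yourpoolname and the dataset/snapshot names are placeholders, adjust to your layout):
zfs list -t snapshot -o name,used -s used -r yourpoolname
zfs destroy yourpoolname/dataset@snapname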
Now for the rest of the advice:
I’ve seen production systems at 100% usage become completely unresponsive to any & all commands even after reboot - so, this is well within the realm of possibility.
First of all, thanks for taking the time to help me.
Unfortunately, no snapshots, and even if I had some, I guess I would hit the same issue as when trying to delete the data: a freeze.
I received the alerts, but during the night, while the script was still running and I was… sleeping. By the time I woke up, it was too late.
If you have seen such behavior on production systems, wouldn't it be an improvement to prevent this?
I can try to copy my data to another NAS and put it back, but it will take quite a long time; I cannot imagine the downtime in a production environment.
The hard disks are mechanical ones; as I said, this is a test environment: a desktop computer with desktop hard disks. It is slow, but I have been playing with FreeNAS and then TrueNAS for years without issues.
Is there anything I can try, or should we consider the pool unusable for good and leave other people in such a situation without a solution?
I asked because I thought that TRIM only applied to SSDs. However, a quick search showed that it can also apply to SMR HDDs, and ZFS doesn’t “like” SMR drives. It is perhaps unrelated to your issue, though…
Unfortunate that there aren’t any snapshots; are you sure, though? What’s the output of: zfs list -t snapshot
Maybe one was created at some point & has been eating up untold amounts of space. It might explain why deleting doesn’t do the needful, as the space wouldn’t free up until the snapshot is gone.
Edit: things aren’t hopeless, but we need to find a way to clear up some space & then life will be good again.
Only “system” snapshots, and they are very small.
I reinstalled 25.10 and left the system alone: as long as the pool is not imported, the computer does not reboot.
As soon as I import the pool, it freezes again and I am unable to delete data.
Unless there is some magic idea/command, I guess the pool is lost for good.
Meanwhile, I will check whether I can at least copy the data off; I will also see whether I need to import it read-only to avoid the freeze…
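For reference, I believe the read-only import from the live system would be something like this (yourpoolname is a placeholder; -R just keeps the mounts under /mnt):
zpool import -o readonly=on -R /mnt yourpoolname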
@HoneyBadger sorry to bug, but what’d be the next steps if trying to delete files hangs the system & no snapshots available to prune on a 100% full pool?
The underlying problem here is that, due to ZFS’s COW* architecture, your pool is in a state where it can do almost nothing. If the system were up and the pool imported, I would suggest trying to add a device, just to create some space to work in. Unfortunately, I suspect that when the pool is being imported some data needs to be written out (ZIL playback, maybe) and there is insufficient space to do so. Until that is resolved I don’t think the import can succeed.
I am not aware of any way out of this situation. In the early days of ZFS I would always create a dataset of 1 GiB (small pools) or 4 GiB (large pools) that remained unmounted (no mount point set) with both quota and reservation set. This gave me some buffer if I did happen to fill a pool to 100%. I believe that OpenZFS does reserve a certain amount of space for this, but I have not run a pool out of space hard enough to need it recently.
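For anyone who wants to reproduce that old trick, it was essentially this (yourpoolname/spacer is just an example name, size per the figures above):
zfs create -o mountpoint=none -o reservation=4G -o quota=4G yourpoolname/spacer    # unmounted dataset that pre-claims 4 GiB
zfs set reservation=none yourpoolname/spacer    # if the pool ever fills up, release the buffer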
A good rule of thumb for space: once you get above 80% used, you need to grow the pool or destroy/delete some data (or snapshots).
*COW: Copy on Write. ZFS never overwrites any data in place; it always allocates new space for any writes. This means that it needs space to write that new data (which may be user data or internal ZFS metadata). If you fill a pool completely (no free space), ZFS cannot allocate any new space and (virtually) all operations come to a halt looking for free space. At that point the only option to recover the pool is to grow it so that there is free space available.
I will be very interested to learn whether a read-only import succeeds. I do not know how read-only a read-only import actually is; for example, is a history record written logging the read-only import?
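One crude way to check (just a thought, not something I have verified): note the tail of the pool history before the read-only import and compare it afterwards:
zpool history yourpoolname | tail -n 5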
For this test environment I cannot add space, but maybe that is a clue for someone else who runs into such a situation and needs to solve it?
As the machine crashes very quickly (in less than 15 minutes), I doubt a replication would work.
The import succeeds, but it is what causes the machine to crash.
I have imported the pool under my Debian live system (I have not set it to read-only) and I am copying some data to an external drive; it has been running for several hours without a crash. Once I have finished the copy, I will check the data a little (I will probably not detect any corruption), then I will destroy the pool and recreate it before putting the data back.
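The copy itself is nothing fancy, roughly this (both paths are placeholders for wherever the pool and the external drive are mounted on the live system):
rsync -aHAX --progress /mnt/yourpoolname/ /mnt/external-drive/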
I am aware of the 80% best practice, but as I said, this happened unexpectedly because of a script running overnight, something that can happen in a production environment as well.
Based on the “reservation” suggestion, I think this could be an internal behavior of ZFS, or maybe TrueNAS could handle it: make such a reservation that only the system is able to write into, with a size depending on the volume size, as suggested.
I will continue testing after Monday and let you know the results, but meanwhile, if you want me to try anything before I destroy the pool, feel free to ask.
6 GB free is not 100% full, so I think we’re actually seeing a dedup-related problem if this is correct:
The symptom of “imports okay, but freezes on delete” is pretty indicative of blocking on DDT records. More RAM might be able to partially mitigate this if you’ve crossed the max RAM boundary, but it’s more likely binding up on the disk IO itself.
@HomeBoy if you can get the pool imported and you aren’t actively copying data off, please try running
zpool status -D yourpoolname
It should show a deduplication histogram at the bottom if you have it enabled, and that’s what we want to see.
(You likely shouldn’t enable dedup on the newly created pool either.)
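If you want to double-check on the rebuilt pool, something like this would do (yourpoolname is a placeholder; note that dedup=off only affects newly written data, it does not un-dedup existing blocks):
zfs get dedup yourpoolname
zfs set dedup=off yourpoolname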
Yeah, I’ve hit that before. When ZFS runs completely out of space it just locks up. What worked for me was adding a small temporary drive with zpool add, freeing a few gigs, then removing it once the pool was stable again. Not pretty but it saved the pool.
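Roughly what I did, for reference (pool and device names are placeholders; zpool may want -f because the temporary disk has no redundancy, and removing a top-level vdev afterwards only works on pools without raidz vdevs):
zpool add yourpoolname /dev/sdX      # temporarily grow the pool with a spare disk
# delete data / snapshots to free some space, then
zpool remove yourpoolname /dev/sdX   # device removal; not supported if the pool contains raidz vdevs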
Please find the result of the command below; I have also included a df to show the 100% usage.
I cannot start 25.10 anymore; the kernel panics while importing the pool (which is why I booted the Debian live system again to get the requested data). I am close to destroying the pool; I will just wait for your feedback.
You do appear to at least be getting some value out of deduplication, but the updates to your DDT are likely what is killing your performance and causing the syncio/deadman timer to trip, as it’s small/sub-4K IO on a RAIDZ1.
Is the pool presently mounted read-only on the Debian instance? That’s likely what lets it remain conscious here.
Look at the DSIZE columns for allocated (actually used on disk) vs referenced (logically written) - you’re squashing what would’ve taken 1.89T of disk space into 1.25T instead, yielding approximately a 1.5:1 dedup ratio.
However, that’s costing you ~2.38G of memory to index, and more crucially 10.7G on disk to store. That’s a small amount of space, yes, but that 10.7G is composed of tiny sub-4K sized records. Deleting data causes ZFS to have to read through those deduplication tables (in-RAM, quick) and then have to update and decrement the counters (on-disk, very slow) because you’re asking a RAIDZ1 of spinning disks to do lots of little I/Os.
You can try to delete just a single large file - don’t do a recursive delete - and see if it completes the free.
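To pick a candidate, something like this on the mounted dataset should work (the paths are placeholders for wherever the pool is mounted and whichever file you choose):
find /mnt/yourpoolname -xdev -type f -size +1G -printf '%s %p\n' | sort -rn | head
rm '/mnt/yourpoolname/path/to/one-large-file'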
Start a second terminal/SSH and run tail /proc/spl/kstat/zfs/dbgmsg on a watch or cycle to keep an eye on whether or not it’s doing frees/deletes, or if it’s just a flurry of metaslab_load and metaslab_unload as it tries to juggle the dedup table in and out of memory.
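For example (the interval is arbitrary):
watch -n 5 'tail -n 25 /proc/spl/kstat/zfs/dbgmsg'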