Pool filled up, now can't mount/import it

This is on TrueNAS 13 Core.

My 8TB pool took on about 1TB of writes overnight and I expect it filled up completely. I know that has consequences, but I’m trying to save the data.

The errors I get when importing the pool:

Syncing all disks complete!
Alarm clock
Starting file system checks:
Mounting local filesystems:.
Beginning pools import
Importing Main Pool
vdev.c:161:vdev_dbgmsg(): disk vdev '/dev/ada1': best uberblock found for spa $import. txg 44704638
spa_misc.c:419:spa_load_note(): spa_load($import, config untrusted): using uberblock with txg=44704638
spa.c:8392:spa_async_request(): spa=$import async request task=2048
spa_misc.c:419:spa_load_note(): spa_load($import, config trusted): LOADED
spa_misc.c:419:spa_load_note(): spa_load($import, config trusted): UNLOADING
spa.c:6110:spa_import(): spa_import: importing Main Pool
spa_misc.c:419:spa_load_note(): spa_load(Main Pool, config trusted): LOADING
vdev.c:161:vdev_dbgmsg(): disk vdev '/dev/ada0': best uberblock found for spa Main Pool. txg 44704638
spa_misc.c:419:spa_load_note(): spa_load(Main Pool, config untrusted): using uberblock with txg=44704638

That last bit keeps repeating a few times with different txg values.

After that it seems to run out of memory:

pid 322 (python3.9), jid 0, uid 0, was killed: failed to reclaim memory
pid 486 (python3.9), jid 0, uid 0, was killed: failed to reclaim memory
pid 485 (python3.9), jid 0, uid 0, was killed: failed to reclaim memory
pid 468 (trace), jid 0, uid 0, was killed: failed to reclaim memory
pid 347 (python3.9), jid 0, uid 0, was killed: failed to reclaim memory
pid 352 (python3.9), jid 0, uid 0, was killed: failed to reclaim memory
pid 467 (python3.9), jid 0, uid 0, was killed: failed to reclaim memory

And it crashes.

What I’ve tried:

  • Booting in Single User mode: works, but importing the pool fails in the same way. I can import with the -o readonly=on flag, but that doesn’t let me fix anything. I also don’t have 8+TB of storage lying around to copy the data to.
  • zdb -e -bcsv "Main Pool" - took 3 days, didn’t fix anything
  • zfs list -t snapshot -r "Main Pool" → removed the snapshots it listed (there were only a few system snapshots, a few MB in total)
  • zfs create -V 32G -o org.freebsd:swap=on -o checksum=off -o compression=off -o dedup=off -o sync=disabled -o primarycache=none boot-pool/swap → create a swap zvol on the boot pool (128GB SSD), but the system doesn’t seem to use it.

I’ve tried plenty of other things that didn’t work, like rolling back to a previous working txg with zpool import -T xxxxxxxx, but that errors out with metadata corruption. I also tried importing with -F (rewind) and -FX (rewind much further back), but both did nothing.
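
For completeness, the import attempts looked roughly like this (the txg is a placeholder, and I may have combined the flags slightly differently at the time):

# read-only import works, but nothing can be repaired this way
zpool import -o readonly=on -R /mnt "Main Pool"
# rewind attempts; both completed but changed nothing
zpool import -F -R /mnt "Main Pool"
zpool import -FX -R /mnt "Main Pool"
# import at an explicit earlier txg; fails with a metadata corruption error
zpool import -T xxxxxxxx -R /mnt "Main Pool"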

Is there anything left to try?

I’ve been trying to find the cause but I’m coming up empty. All datasets report at least a dozen GB of free space (the pool’s total free space), but I still expect the cause to be that the pool is completely full, because of the “failed to reclaim memory” errors, which to me look like a swap problem.

I’ve tried (in Single User mode) zpool import -f -R /mnt -FX "Main Pool" and that succeeds, but the datasets don’t mount because the filesystem is read-only in Single User mode. I’m currently running a scrub to see whether it completes and somehow fixes things, but I’m at a loss for what to do if this fails as well.
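
(For reference, the scrub itself is nothing special; I start it and keep an eye on it with roughly the following:)

# start the scrub on the imported pool
zpool scrub "Main Pool"
# shows scrub progress and any errors found so far
zpool status -v "Main Pool"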

A way to discard some of the last changes it has queued might fix the issue, but I can’t find one.

I tried booting the server without the data disks, which works. I can reinsert the disks after a successful boot, but running the above command (without -f) fails again, with the error:
swap_pager indefinite wait buffer

Apparently the swap space on the boot-pool fills up and causes the crash. I’m trying to figure out how to increase it, since only a fraction of the 128GB drive is in use. I’m not sure it will help, though, if the import really tries to work through more than 100GB of data.
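
What I’ve tried for the swap, roughly, using the 32GB zvol created earlier (I’m not sure this is the proper way to do it on Core):

# enable the zvol from the list above as additional swap (FreeBSD)
swapon /dev/zvol/boot-pool/swap
# check which swap devices are active and how full they are
swapinfo -h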

Are you using De-Dup?

ZFS De-Dup is memory intensive and can cause pools to be un-importable due to low memory. It’s a bit tricky because someone might just try a reboot and find that their pool with De-Dupped datasets is no longer importable.

This is one reason why ZFS De-Dup is not recommended for casual users.
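
A quick way to check, with the pool imported (read-only should be fine for this; pool name taken from your post):

# shows the dedup property for every dataset in the pool
zfs get -r dedup "Main Pool"
# prints dedup table (DDT) statistics if dedup was ever used
zpool status -D "Main Pool"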


Full system specs, please. I’m not entirely sure the “failed to reclaim memory” errors refer to your pools.

@Arwen I’m not using the feature, but thanks for the suggestion!
@Davvo It’s an HP MicroServer G7 N40L (AMD Turion II Neo N40L) with 8GB of DDR3 ECC memory, running TrueNAS 13.0-U6.2 (updated a few days ago as a troubleshooting step) on a 120GB SanDisk X400 SSD (previously 13.0-U3 and 12.0-U7). The storage controller is configured as AHCI with write cache disabled. The zpool consists of 2 mirrors (2x 3TB WD Green, 2x 6TB Seagate SkyHawk Surveillance), which have shown no issues except for 1 URE on one of the 3TB drives during scrubs or SMART tests.

I’d urge you not to look at this as a new setup or a new issue; this system has been running in this configuration for years without problems. It also successfully finished a 16-hour scrub yesterday (imported read-only) and a 56-hour zdb -b run (verifying checksums of metadata blocks) while the pool was not imported.

It’s also stable when the pool is imported read-only (via Single User mode or a normal boot), and it failed during a write while the pool was filling up overnight. The system is monitored by Zabbix, which saw the boot-pool’s swap space (32GB) go over 100% usage before it crashed while I was troubleshooting yesterday.

Oh. :no_mouth:

The issue might just be the low RAM; in that case adding more would help.
The pool being stable in read-only, however, makes me suspect the storage controller going rogue…

Now, understanding what change made a previously stable system unstable will be harder.
What is the status of the boot drives?

The boot drive is a single SanDisk X400 120GB SSD; I’m not sure what I can tell you about it except that it seems to work.
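
If it helps, I can pull the SMART data for it with something like this (the device node below is an example, not necessarily what the SSD shows up as here):

# full SMART report for the boot SSD (example device node)
smartctl -a /dev/ada4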

I have the pool mounted read-only and am now copying it to an 8TB drive I just bought, using a simple cp -R in the Shell. It looks like I’ll need an extra drive though, as the data only just won’t fit (7.4TB of data vs 7.2TB of space).
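
The copy itself is roughly the following (the temporary pool name, mountpoint and device node are examples, not necessarily what I typed):

# single-disk temporary pool on the new 8TB drive
zpool create -m /mnt/temp temp-pool /dev/ada5
# copy everything off the read-only pool, preserving attributes
cp -Rpv "/mnt/Main Pool/" /mnt/temp/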

No errors yet about swap space or anything else; it has copied over 450GB at the time of writing.

I found the following thread: Re: ... was killed: a thread waited too long to allocate a page [actually about the “was killed: failed to reclaim memory” problem].

I’m not sure it’s the same cause, maybe the same effect.

My issue arises as soon as the pool is imported, at any time, unless it’s imported read-only.
I don’t use rsync on this machine at all; it’s just a NAS with only SMB active. It crashes well before SMB comes up (during boot, for instance), when it doesn’t even have a network connection yet.

My ARC size is 5.77GB at the moment.
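
(Checked with the usual sysctl; roughly:)

# current ARC size in bytes (FreeBSD / Core)
sysctl kstat.zfs.misc.arcstats.size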

Overnight the copy to the temporary pool finished without issue. I’m now scrubbing it to make sure the data is intact.

I omitted one folder of about 408GB, as it is no longer needed.

What’s strange is that the defective pool shows 7.85 TiB used, while the temporary pool shows 6.86 TiB used. Strange, as only about 0.4 TiB should differ (the omitted folder), not 0.99 TiB. I checked and everything seems to have been copied over. I also have no snapshots on either pool, and it’s mostly single large files.
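
For reference, I compared the two with something like the following ("temp-pool" stands in for whatever the temporary pool is actually called):

# per-dataset space breakdown: data, snapshots, children, reservations
zfs list -o space -r "Main Pool"
zfs list -o space -r temp-pool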

I also found that when the pool crashed, the last reading from Zabbix (the monitoring solution) was 99.25% of space used, and I believe it only polls every few minutes.

When the scrub finishes, I’m going to destroy the defective pool, recreate it, and copy the data back.

The scrub of the temp pool came back OK. I’ve recreated the main pool; that didn’t work while it was imported read-only, but it did work once the pool wasn’t imported. The datasets are created and the data is now being copied from the temp pool on the 8TB disk to the recreated main pool.
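
What the rebuild amounts to, roughly, whether done from the UI or the Shell (device names and the temp-pool mountpoint are examples):

# destroy the defective pool and recreate it with the same two-mirror layout
zpool destroy "Main Pool"
zpool create "Main Pool" mirror ada0 ada1 mirror ada2 ada3
# copy the data back from the temporary pool
cp -Rpv /mnt/temp/ "/mnt/Main Pool/"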

@Davvo you mentioned the boot drive(s): it is now giving “read DMA” errors, 3 in the last hour. Strange, as I haven’t seen these before. I’ll reinstall on a different SSD after the copy is done.


Today I changed the boot disk (same brand and model) and reinstalled TrueNAS. I mistakenly had TrueNAS Scale loaded on the installation USB drive instead of Core, which was another troubleshooting step I had been considering earlier. I only noticed this after it booted for the first time.

Everything went well; even loading the Core config file went without a hitch. I’m now checking that the shares work correctly. I wanted to end up on Scale anyway; it was already on the agenda for later.

For people who have this issue and skipped to the end:

I was not able to solve the issue. The cause in my case still appears to be that the zpool filled up to capacity (quotas don’t seem to change the outcome) and it could no longer apply its outstanding changes. The only solution was to disconnect the drives, let the system boot, import the pool read-only through the Shell, and copy the data to a new disk or pool.
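
In commands, the recovery amounted to roughly this ("/mnt/temp" is just an example mountpoint for the pool on the spare disk):

# with the data disks reconnected after a successful boot, from the Shell:
zpool import -o readonly=on -R /mnt "Main Pool"
# copy everything to a separate disk/pool, then destroy and rebuild from that copy
cp -Rpv "/mnt/Main Pool/" /mnt/temp/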