PSA: Dragonfish may fail to import pools with excessive snapshots

I’d previously reported that I wasn’t able to upgrade to Dragonfish from Cobia: when Dragonfish would boot, the pool import process would take over 15 minutes and then time out, so the system came up without the apps pool online and the apps service therefore failed to start. Naturally, I reported a bug.

Investigation of that and a few related tickets revealed a regression in upstream ZFS, triggered by having too many snapshots (over 100k of them, in my case). And while I still have to figure out where some of them were coming from, some aggressive pruning got me down to about 10k. I then tried the upgrade again (to 24.04.1.1), and everything worked as expected.

It doesn’t seem like this has hit too many people, but if you’re one of them, try cleaning up your snapshots.
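
If you want a quick sense of whether you’re anywhere near that territory, counting the snapshots on the pool from a shell is enough; something along these lines works, with “tank” standing in for your pool name:

zfs list -H -o name -t snapshot -r tank | wc -l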


Did you have snapshots of your application pool? I know TrueCharts used to recommend taking snapshots and replicating them offsite.

I did, but that wasn’t the bulk of them.

I hope you got to the bottom of the why, then. That’s a lot of snapshots! I have 1,082 covering a year of snapshots, but mine are monthly after the first month, with weekly and daily snapshots inside that window. It’s good to know, though, that this can cause issues with the latest Dragonfish. Those can be a mess to clean up sometimes!

Not quite, but I have a few ideas to run down. One of the problems, which accounted for over half of them, was that the “pull” replication from my Proxmox servers wasn’t expiring old snapshots on my NAS (they were/are going away on the Proxmox servers). This is how I’d been doing it:

…but apparently that process needs some tweaking.
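
In the meantime, here’s a rough sketch of one way to expire replicated snapshots by age on the receiving side. The tank/proxmox path and the 28-day window are just placeholders (not what my setup actually uses), and you’d want to be careful not to destroy the snapshot your next incremental replication depends on:

#!/bin/bash
# Destroy snapshots under a given dataset once they're older than a cutoff.
# "tank/proxmox" and the 28-day window are placeholders; adjust to taste.
cutoff=$(date -d '28 days ago' +%s)
zfs list -H -p -o name,creation -t snapshot -r tank/proxmox | while read -r snap created
do
    if [ "$created" -lt "$cutoff" ]; then
        echo "zfs destroy $snap"
        # zfs destroy "$snap"    # uncomment once the echoed list looks right
    fi
done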

As a further PSA for future readers:

The UI issues an alert beyond 10K snapshots for a reason. We realize snaps are theoretically “unlimited”, but there are practical reasons why you should keep them pruned to reasonable levels to avoid exposing edge cases like this :slight_smile:

In this case we are going to fix the underlying changes that made it take forever to import the pool, but there could be others lurking…


Even with my pretty-brutal “pruning”, I have it “down to” about 10k right now. That’s still a lot more than I’d like, but a lot better than 100k.


Did you use HeavyScript?

I remember reading on the old forums (if memory serves me right, which it almost never does) that HeavyScript goes overboard with snapshots.

I do, and it looks like it accounts for about 5k of my total.

5K of the remaining 10K?

Or 5K of the 100K you started with?

Yes, this. I didn’t touch any of the snapshots it created.

Yes, the number sounds high. But its “backup” takes a snapshot of every dataset in ix-applications every time it’s run, and retains it for as many days as you tell it to (28 days in my case). And if I check how many datasets are in there, I get:

root@truenas[~]# zfs list | grep software | grep ix-applications | wc -l
188

Multiply that by 28, and there’s 5k+.
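
For anyone trying to figure out where their own snapshots are piling up, a rough tally per dataset helps; adjust the grep (here filtering on ix-applications) to whatever you’re hunting for:

zfs list -H -o name -t snapshot | grep 'ix-applications' | cut -d@ -f1 | sort | uniq -c | sort -rn | head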


That can definitely be true, and it’s made even worse with PVCs, since those add datasets too. Really, I don’t see a need to keep more than a couple of days’ worth, since you’re almost never, if ever, going to use those, IMHO. Useful for old TrueCharts apps with PVCs; not terribly useful otherwise.

I have 85 datasets in ix-applications. It will be interesting to see how this works on Eel: how many datasets will there be in ix-applications per app?


…and just as a data point, deleting the snapshots, even by way of a script, took well over 24 hours; when they get up to that many, it isn’t a quick process. But it also freed up about 20 TB on my pool. In case anyone’s interested, here’s the script I used:

#!/bin/bash
# Walk every snapshot whose full name matches both grep patterns and destroy
# it, echoing each command so the run can be followed (or logged with tee).
for std in $(zfs list -H -o name -t snapshot | grep -- 'tank' | grep -- 'auto')
do
    echo "zfs destroy $std"
    zfs destroy "$std"
done

Nothing too fancy here, and you’d edit the grep statements to hit particular patterns in the snapshot names. (The old backtick syntax for command substitution still works, but it’s considered deprecated, so I’ve used $( ... ) above.)
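
Before letting a loop like that run for a day, it’s also worth sanity-checking how many snapshots the patterns actually match, e.g.:

zfs list -H -o name -t snapshot | grep -- 'tank' | grep -- 'auto' | wc -l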


You can use % to bulk delete (as a single operation) an entire sequential batch of snapshots.

As an example:

zfs destroy -nv tank/mydata@auto-2022-01-31%auto-2024-06-01

This will destroy all snapshots from @auto-2022-01-31 through @auto-2024-06-01, inclusive.

When I say “all”, I mean “all”. The names don’t matter, and neither does any “timestamp” embedded in a snapshot’s name: it will bulk destroy every snapshot within the specified range, based on creation time, regardless of what the snapshots are named.


If you want to play it safe, you can output what will be destroyed into a text file, just in case you want to review it before committing to it:

zfs destroy -nv tank/mydata@auto-2022-01-31%auto-2024-06-01 | tee these-snaps-would-be-destroyed.txt

You’ll notice I left the -n (dry run) flag in my examples. :wink:

Including children?

Only if you specify -r in the parameters.

EDIT: I just don’t recommend this, since child datasets may hold data of different importance than their parent, including snapshots “within the range” that you do not want touched (even if a snapshot with that name does not exist in the parent).

To use the above example: if there is a snapshot named @important-stuff-in-here, and it was created any time between January 31, 2022 and June 1, 2024, it will also be destroyed.

Even if this snapshot only exists in the child dataset.
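
If you do go the -r route anyway, it’s worth running it with -n and -v first, since the dry-run output will list the child-dataset snapshots that fall inside the range:

zfs destroy -rnv tank/mydata@auto-2022-01-31%auto-2024-06-01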
