Local snapshots no longer being automatically destroyed by retention policy

MikeyG · April 28, 2024, 12:42am

All of a sudden, my snapshots are not being deleted. To double-check, I created a new dataset and a new snapshot task whose snapshots are supposed to be removed after an hour, and they are not being removed. There are no replication tasks associated with my new test dataset.

Snapshot list for test dataset where many should be deleted by now:

RAIDZ1/Test@auto-2024-04-27_15-00                         0B      -      140K  -
RAIDZ1/Test@auto-2024-04-27_15-05                         0B      -      140K  -
RAIDZ1/Test@auto-2024-04-27_15-10                         0B      -      140K  -
RAIDZ1/Test@auto-2024-04-27_15-15                         0B      -      140K  -
RAIDZ1/Test@auto-2024-04-27_15-20                         0B      -      140K  -
RAIDZ1/Test@auto-2024-04-27_15-25                         0B      -      140K  -
RAIDZ1/Test@auto-2024-04-27_15-30                         0B      -      140K  -
RAIDZ1/Test@auto-2024-04-27_15-35                         0B      -      140K  -
RAIDZ1/Test@auto-2024-04-27_15-40                         0B      -      140K  -
RAIDZ1/Test@auto-2024-04-27_15-45                         0B      -      140K  -
RAIDZ1/Test@auto-2024-04-27_15-50                         0B      -      140K  -
RAIDZ1/Test@auto-2024-04-27_15-55                         0B      -      140K  -
RAIDZ1/Test@auto-2024-04-27_16-00                         0B      -      140K  -
RAIDZ1/Test@auto-2024-04-27_16-05                         0B      -      140K  -
RAIDZ1/Test@auto-2024-04-27_16-10                         0B      -      140K  -
RAIDZ1/Test@auto-2024-04-27_16-15                         0B      -      140K  -
RAIDZ1/Test@auto-2024-04-27_16-20                         0B      -      140K  -
RAIDZ1/Test@auto-2024-04-27_16-25                         0B      -      140K  -
RAIDZ1/Test@auto-2024-04-27_16-30                         0B      -      140K  -
RAIDZ1/Test@auto-2024-04-27_16-35                         0B      -      140K  -
RAIDZ1/Test@auto-2024-04-27_16-40                         0B      -      140K  -
RAIDZ1/Test@auto-2024-04-27_16-45                         0B      -      140K  -
RAIDZ1/Test@auto-2024-04-27_16-50                         0B      -      140K  -
RAIDZ1/Test@auto-2024-04-27_16-55                         0B      -      140K  -
RAIDZ1/Test@auto-2024-04-27_17-00                         0B      -      140K  -
RAIDZ1/Test@auto-2024-04-27_17-05                         0B      -      140K  -
RAIDZ1/Test@auto-2024-04-27_17-10                         0B      -      140K  -
RAIDZ1/Test@auto-2024-04-27_17-15                         0B      -      140K  -
RAIDZ1/Test@auto-2024-04-27_17-20                         0B      -      140K  -
RAIDZ1/Test@auto-2024-04-27_17-25                         0B      -      140K  -
RAIDZ1/Test@auto-2024-04-27_17-30                         0B      -      140K  -
RAIDZ1/Test@auto-2024-04-27_17-35                         0B      -      140K  -
RAIDZ1/Test@auto-2024-04-27_17-40                         0B      -      140K  -
RAIDZ1/Test@auto-2024-04-27_17-45                         0B      -      140K  -
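
For anyone who wants to check which of these should already be gone, here is a minimal sketch in Python. It assumes the naming schema is `auto-%Y-%m-%d_%H-%M` (inferred from the names above) and a one-hour lifetime; the function name `expired_snapshots` and the reference time are made up for illustration, and this is not how zettarepl itself decides what to destroy.

```python
from datetime import datetime, timedelta

# Assumption: naming schema inferred from the snapshot names in the listing.
NAMING_SCHEMA = "auto-%Y-%m-%d_%H-%M"


def expired_snapshots(names, now, lifetime=timedelta(hours=1)):
    """Return the snapshot names whose embedded timestamp is older than `lifetime`."""
    expired = []
    for full_name in names:
        # Split "pool/dataset@snapshot" into its dataset and snapshot parts.
        _dataset, _, snap = full_name.partition("@")
        try:
            created = datetime.strptime(snap, NAMING_SCHEMA)
        except ValueError:
            continue  # name does not match the schema, so skip it
        if now - created > lifetime:
            expired.append(full_name)
    return expired


# A few names taken from the listing; `now` is a hypothetical reference time.
snapshots = [
    "RAIDZ1/Test@auto-2024-04-27_15-00",
    "RAIDZ1/Test@auto-2024-04-27_16-40",
    "RAIDZ1/Test@auto-2024-04-27_16-45",
    "RAIDZ1/Test@auto-2024-04-27_17-45",
]
now = datetime(2024, 4, 27, 17, 45)
overdue = expired_snapshots(snapshots, now)
print(overdue)
```

With a one-hour lifetime, anything stamped more than an hour before the reference time counts as overdue here; a snapshot exactly one hour old is not.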

These are the settings for the snapshot task:

I'm not sure if it's the right log, but zettarepl.log shows the snapshots being created and never deleted:

[2024/04/27 17:15:00] INFO     [MainThread] [zettarepl.zettarepl] Created ('RAIDZ1/Test', 'auto-2024-04-27_17-15')
[2024/04/27 17:20:00] INFO     [MainThread] [zettarepl.zettarepl] Scheduled tasks: [<Periodic Snapshot Task 'task_45'>]
[2024/04/27 17:20:00] INFO     [MainThread] [zettarepl.snapshot.create] On <Shell(<LocalTransport()>)> creating recursive snapshot ('RAIDZ1/Test', 'auto-2024-04-27_17-20')
[2024/04/27 17:20:00] INFO     [MainThread] [zettarepl.zettarepl] Created ('RAIDZ1/Test', 'auto-2024-04-27_17-20')
[2024/04/27 17:25:00] INFO     [MainThread] [zettarepl.zettarepl] Scheduled tasks: [<Periodic Snapshot Task 'task_45'>]
[2024/04/27 17:25:00] INFO     [MainThread] [zettarepl.snapshot.create] On <Shell(<LocalTransport()>)> creating recursive snapshot ('RAIDZ1/Test', 'auto-2024-04-27_17-25')
[2024/04/27 17:25:00] INFO     [MainThread] [zettarepl.zettarepl] Created ('RAIDZ1/Test', 'auto-2024-04-27_17-25')
[2024/04/27 17:30:00] INFO     [MainThread] [zettarepl.zettarepl] Scheduled tasks: [<Periodic Snapshot Task 'task_45'>]
[2024/04/27 17:30:00] INFO     [MainThread] [zettarepl.snapshot.create] On <Shell(<LocalTransport()>)> creating recursive snapshot ('RAIDZ1/Test', 'auto-2024-04-27_17-30')
[2024/04/27 17:30:00] INFO     [MainThread] [zettarepl.zettarepl] Created ('RAIDZ1/Test', 'auto-2024-04-27_17-30')
[2024/04/27 17:35:00] INFO     [MainThread] [zettarepl.zettarepl] Scheduled tasks: [<Periodic Snapshot Task 'task_45'>]
[2024/04/27 17:35:00] INFO     [MainThread] [zettarepl.snapshot.create] On <Shell(<LocalTransport()>)> creating recursive snapshot ('RAIDZ1/Test', 'auto-2024-04-27_17-35')
[2024/04/27 17:35:00] INFO     [MainThread] [zettarepl.zettarepl] Created ('RAIDZ1/Test', 'auto-2024-04-27_17-35')

The last time zettarepl.log shows local snapshots being destroyed was April 17th, which I believe was a day or two before I started the replication task that is currently running:

[2024/04/17 18:00:10] INFO     [retention] [zettarepl.zettarepl] Retention destroying local snapshots: []
[2024/04/17 18:30:10] INFO     [retention] [zettarepl.zettarepl] Retention destroying local snapshots: []
[2024/04/17 19:30:10] INFO     [retention] [zettarepl.zettarepl] Retention destroying local snapshots: []
[2024/04/17 20:00:10] INFO     [retention] [zettarepl.zettarepl] Retention destroying local snapshots: []
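
To find the last retention run without scrolling through the log by hand, something like the following could work. It is only a sketch: the line format is assumed from the excerpts pasted in this post, and `last_retention_run` is a hypothetical helper, not part of zettarepl.

```python
import re
from datetime import datetime

# Assumption: log lines look like the excerpts above:
# [YYYY/MM/DD HH:MM:SS] LEVEL [thread] [logger] message
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\]\s+\w+\s+"
    r"\[(?P<thread>[^\]]+)\]\s+\[(?P<logger>[^\]]+)\]\s+(?P<msg>.*)$"
)


def last_retention_run(lines):
    """Return the timestamp of the most recent retention message, or None."""
    latest = None
    for line in lines:
        match = LINE_RE.match(line)
        if not match or not match.group("msg").startswith("Retention destroying"):
            continue
        ts = datetime.strptime(match.group("ts"), "%Y/%m/%d %H:%M:%S")
        if latest is None or ts > latest:
            latest = ts
    return latest


# Sample lines copied from the log excerpts in this post.
sample = [
    "[2024/04/17 19:30:10] INFO     [retention] [zettarepl.zettarepl] Retention destroying local snapshots: []",
    "[2024/04/17 20:00:10] INFO     [retention] [zettarepl.zettarepl] Retention destroying local snapshots: []",
    "[2024/04/27 17:35:00] INFO     [MainThread] [zettarepl.zettarepl] Created ('RAIDZ1/Test', 'auto-2024-04-27_17-35')",
]
last_run = last_retention_run(sample)
print(last_run)
```

To run it against the real log, pass an open file object instead of the `sample` list; the path to zettarepl.log depends on the system, so it is left out here.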

About 10 days ago I started a replication task on another dataset to a remote site. That task is still ongoing since the link is slow. I would think replication should be totally independent of snapshot deletion, unless “save pending snapshots” is selected on the replication task, which is of course not the case for a new dataset created specifically to test this. Thought I would mention it just in case.

Anyway, any ideas on why this would stop working all of a sudden?

Forgot to mention, I’m running Core 13.0-U5.3.

winnielinnie · April 28, 2024, 1:08am

My guess is that your other (seemingly “unrelated”) Replication Task includes the root dataset?

If this is the case, it’s likely that the “save pending snapshots” option is being applied recursively to all children datasets, including your “Test” dataset.

But it’s only a guess.

What is your other replication task? Does it indeed include the root dataset “RAIDZ1”?

MikeyG · April 28, 2024, 1:16am

Here is the replication task that is running:

It does use the same parent dataset, but this is a one-off replication task, and I did not select “save pending snapshots” on it.

You can see here that it does include the root dataset, which is RAIDZ1. My test dataset was RAIDZ1/Test.

I do not understand why this would be an issue, though, even if “save pending” was selected. A replication task runs on one dataset, and I would think snapshot settings on a completely different dataset should be independent.

Also, I have another completely separate pool that seems to be exhibiting the same behavior. It’s supposed to destroy snapshots after weekly replication is complete (which has been completing) but like I mentioned, that has not happened since 4/17 according to the log. I will need to set up another snapshot task on it to double check though.

MikeyG · April 28, 2024, 2:41am

Confirmed this happens with the other pool as well.

Only existing replication task for that pool:

Test snapshot task:

Snapshots are still retained an hour and a half later.

They are on the same pool, but the replication task that exists is on Main/Files. The test is being done on Main/Test. They share the parent dataset of Main, since the pool is called Main and there’s no way to avoid this.

To me, this is not expected behavior for a snapshot task that has nothing to do with any other tasks.

MikeyG · April 28, 2024, 3:31am

I ended the replication task since it was going to take forever and I figured out a way to make it much faster. My test snapshots immediately started being destroyed automatically, as they were supposed to.

This implies that an ongoing replication task stops all snapshot retention policies from being applied, and that the problem was not caused by a different replication task having “save pending snapshots” selected.

This is not what I would expect as a user. Snapshot retention, aside from snapshots being held intentionally, should not be blocked by replication tasks, especially if those snapshots are not even selected to be part of a replication task. I’m guessing that whatever is in charge of replication is also in charge of destroying old snapshots, and it’s getting held up while a replication task is running. I also assume that I could run a zfs send/receive manually to get around this if I think it will run for a while.

Could someone confirm if this is all expected behavior? Doesn’t really make sense to me.