Suppose I had to roll back one of my datasets to a much earlier version, which deleted the newer snapshots of that dataset.
So if I have “full filesystem replication” checked on my replication job, does it cause a problem that snapshots are now missing on the source side?
I’m going to find out real soon. But if I do run into problems, that seems problematic, because I shouldn’t have to tweak the replication job when I haven’t done anything “wrong.” Or have I?
I’d like to think that as long as the destination snapshots were not deleted, this works fine.
NEW: I looked at the log, which tells the story. It replicates using an incremental from the most recent common snapshot that both sides still have.
So if you delete a snapshot, it will use the snapshot before it as the common base. It will not try to do an incremental off of two different snapshot sets.
Logs
[2024/10/03 22:22:01] INFO [Thread-23] [zettarepl.paramiko.replication_task__task_4] Connected (version 2.0, client OpenSSH_9.2p1)
[2024/10/03 22:22:01] INFO [Thread-23] [zettarepl.paramiko.replication_task__task_4] Authentication (publickey) successful!
[2024/10/03 22:22:02] INFO [replication_task__task_4] [zettarepl.replication.pre_retention] Pre-retention destroying snapshots: []
[2024/10/03 22:22:02] INFO [replication_task__task_4] [zettarepl.replication.run] For replication task 'task_4': doing push from 'main' to 'main' of snapshot='auto-2024-10-03_22-22' incremental_base='auto-2024-10-02_22-22' include_intermediate=False receive_resume_token=None encryption=False
[2024/10/03 22:22:03] INFO [replication_task__task_4] [zettarepl.paramiko.replication_task__task_4.sftp] [chan 5] Opened sftp connection (server version 3)
[2024/10/03 22:22:03] INFO [replication_task__task_4] [zettarepl.transport.ssh_netcat] Automatically chose connect address 'backup.skirsch.com'
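For anyone following along, zettarepl uses its own ssh+netcat transport here, but that “incremental_base” line boils down to an ordinary incremental send between the two snapshots it names. A rough plain-ZFS equivalent (the receiving dataset name is a placeholder, not what the job actually runs) would be:

# incremental send from the common base snapshot to the newest one, received on the backup box
zfs send -i main@auto-2024-10-02_22-22 main@auto-2024-10-03_22-22 | \
    ssh backup.skirsch.com zfs recv backup/main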
Yes, if there are multiple datasets involved (such as children, recursively).[1]
If you roll back the source to a snapshot earlier than the destination’s earliest snapshot (i.e., the destination is now “too new”), then you cannot do an incremental replication anymore.
To be more clear, this can break the “Replication Task” if the parent or one of the children no longer shares a common base snapshot with its counterpart on the destination pool. (Because a Replication Task works as a “unit”.) With vanilla ZFS, you can issue per-dataset replications (send/recv), independently of an issue with the parent dataset. (But this granularity can cause further hiccups if you get too complex or nuanced.)
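For example (a sketch only; the pool, dataset, and host names are made up), with plain ZFS you can replicate a single child dataset on its own, as long as that child still shares a base snapshot with its copy on the destination:

# incremental send of one child, independent of its parent or siblings
zfs send -i tank/data/child@auto-2024-05-01_00-00 tank/data/child@auto-2024-10-03_22-22 | \
    ssh backuphost zfs recv backup/data/child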
But then why create this “split” between source and destination? Why would you want to have a “rolled back” source dataset, but not a “rolled back” destination dataset?
This goes against the principle of doing regular ZFS sends → receives.
EDIT: Two things, just to be safe (“safer”).
Create a checkpoint before you do anything drastic.
Find out what the “earliest” snapshot on the destination is, and then don’t roll back the source any earlier than that. (Both steps are sketched in the example commands below.)
So if auto-2024-05-01 is the earliest snapshot on the destination, that is the furthest back you can roll your source. If you were to roll back your source to something like auto-2024-04-01, you would essentially break the ability to continue with an incremental replication, and would have to do a full (“from scratch”) replication to the destination… all over again. All of it.
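Here is a sketch of both steps (the pool names “main” and “backup” and the hostname are just examples borrowed from this thread; adjust to your own layout):

# 1) Checkpoint the source pool before doing anything drastic;
#    discard the checkpoint later once you are happy with the result.
zpool checkpoint main
zpool checkpoint -d main    # discard when done

# 2) List the destination's snapshots oldest-first, so you know
#    how far back you can safely roll the source.
ssh backup.skirsch.com zfs list -r -t snapshot -o name,creation -s creation backup/main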
It’s hard to know exactly what happened, but based on how long the replication job took, it appears it used the most recent snapshot in which everything matched across all datasets as the base for the replication job. So if you roll back just ONE dataset like I did, the base for the replication job has to move earlier for ALL datasets.
In theory, it doesn’t have to do that if all you care about is matching the contents of the latest snapshot, since you can do an incremental on each dataset separately based on the last snapshot the two sides have in common.
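As a rough illustration (the dataset paths and temp files are hypothetical), you can find the last snapshot a source/destination pair still has in common by intersecting their snapshot lists; with date-stamped names like auto-2024-10-03_22-22, lexical sort order matches chronological order:

# snapshot names only (strip the 'dataset@' prefix) on each side, sorted for comm
zfs list -H -t snapshot -o name main/mydata | sed 's/.*@//' | sort > /tmp/src-snaps
ssh backup.skirsch.com zfs list -H -t snapshot -o name backup/mydata | sed 's/.*@//' | sort > /tmp/dst-snaps
# the last line of the intersection is the newest common snapshot, i.e. the incremental base
comm -12 /tmp/src-snaps /tmp/dst-snaps | tail -n 1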