Questions Re ZFS Replication

Hi All,

I’ve been offered a small amount of rack space at a colo for free (call it a workplace benefit) and am planning to move my offsite backup there. This gives me the chance to rethink my backup strategy.

I was previously rsyncing different datasets bi-weekly/monthly, but I’m considering swapping this for encrypted ZFS replication. The host will be on its own isolated VLAN with either a WireGuard or OpenVPN tunnel back to my house for remote access/replication.

I understand replication tasks (zfs send/receive) are block-level and tied to snapshots, but there are two things I’m not too sure about:

  • If a dataset takes daily snapshots with a max age of 1 month, and is set to replicate every Saturday, does that replicate all snapshots or just the latest?
  • In the same example, if the last replication happened on (let’s say) 1 Jan and replication then failed for 6 weeks (until 14 Feb), how will it work out which blocks need to be copied if there’s no snapshot tying it to the past (since they expire after 4 weeks)?

The original replication will replicate all the data. Each subsequent replication re-syncs only the delta.

Replication replicates from the latest replicated snapshot. So if there are no replicated backups for X weeks, the backlog will be re-synced.
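For a concrete picture, this is roughly what that looks like at the zfs CLI level. A sketch only; the pool, dataset, host, and snapshot names are made up for illustration:

```shell
# Initial replication: send the full dataset up to the first snapshot.
zfs send tank/data@snap-2024-01-01 | ssh backuphost zfs receive backup/data

# Later replications: send only the delta (-i) between the last
# snapshot that exists on both sides and the newest local snapshot.
zfs send -i tank/data@snap-2024-01-01 tank/data@snap-2024-01-08 \
  | ssh backuphost zfs receive backup/data
```

The incremental send only works if the base snapshot (`@snap-2024-01-01` here) still exists on both the source and the destination.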


Sorry, I might not have been clear. I understood it would only be the delta, but is that from the latest point in time (including the snapshots in between), or just the delta from the last replicated snapshot to the current one?

Example: snapshot a is the latest ‘replicated’ one, b/c/d/e are in the middle, and f is the latest snapshot ready to be replicated. Would b through f be individually replicated, or would b-f be combined into one snapshot on the backup side?

Fantastic. Any more details on how the actual re-sync process works?

All intermediate snapshots, in order.
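In zfs send terms, that’s the difference between the `-i` and `-I` flags. Using the hypothetical a–f snapshots from above:

```shell
# -i sends only the delta between two snapshots: the destination ends
# up with @f but not @b..@e.
zfs send -i tank/data@a tank/data@f | ssh backuphost zfs receive backup/data

# -I sends every intermediate snapshot as well: the destination ends
# up with @b, @c, @d, @e and @f, in order.
zfs send -I tank/data@a tank/data@f | ssh backuphost zfs receive backup/data
```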

But if no common snapshot is found, it has to replicate from scratch, i.e. wipe the destination dataset and replicate everything anew. You want to have some snapshots with a long retention policy to prevent that.
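You can check for a common base yourself by comparing the snapshot lists on both sides. A rough sketch, with placeholder host and dataset names:

```shell
# List snapshot names (the part after "@") on each side.
zfs list -H -t snapshot -o name tank/data | awk -F@ '{print $2}' | sort > /tmp/src.txt
ssh backuphost zfs list -H -t snapshot -o name backup/data | awk -F@ '{print $2}' | sort > /tmp/dst.txt

# Snapshots present on both sides; any of these can serve as the base
# for an incremental send. If this prints nothing, only a full
# replication from scratch is possible.
comm -12 /tmp/src.txt /tmp/dst.txt
```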


That makes sense and is what I thought might be the outcome. It’s given me a few things to think about.

Appreciate the prompt replies =)

If you allow the source to control snapshot deletion on the destination, it should only delete after sending the latest snapshots…

And if you set retain snapshots until replicated…

So, basically it should just work, because the base snapshots on the destination won’t get deleted while the destination is down, until it comes back and all the backlog snapshots are transmitted.

I think :slight_smile:


Sometimes even I don’t know about that…

Weren’t there users whose source snapshots were destroyed because their backup server was unavailable beyond the timeframe of snapshot expiration?


Last I checked, this is what zettarepl does under the hood. I call it the “passing the baton” replication method.


Certainly how it used to work.

The key is to have the source control deletion on the destination, not the destination itself.

But it’s a bit tricky to test this stuff :slight_smile:

So for ‘science’ I did some playing around this morning.

Host1:/Pool1/Dataset1
Host2:/Pool1/Backups/Dataset1

Snapshots every 5 minutes with a lifetime of 1 hour.

After running a full initial replication, I let 2 hours lapse and tried again. As suggested, it does indeed fail, as the source snapshots are gone and there is no common link.

So I re-tested with “Save Pending Snapshots” enabled and repeated the test. Replication completed successfully, and a short time later the snapshots on the source had been removed.

Interestingly enough (more out of curiosity), I randomly nuked a few snapshots on the destination, and it appears to take the latest common link and base its delta off that.

I would be interested if anyone has specifics on how the “Save Pending Snapshots” function actually works (I assume it sets some flag to stop the automated cleaning routine from removing them).


As far as I understand, “Save Pending Snapshots” really sets the “hold” flag on a snapshot to prevent it, and any more recent snapshot, from being destroyed until the hold has been removed with the “release” command.
Also, I believe setting the “hold” flag multiple times increments an internal counter, and each “release” decrements that same counter. Deletion will only occur once the snapshot has expired and there are no holds in place.
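The hold mechanics can be poked at directly with the zfs CLI. One detail worth noting: each hold carries a user-defined tag rather than a bare counter, and a snapshot with any outstanding tag can’t be destroyed. Dataset and tag names below are placeholders:

```shell
# Place a hold tagged "replication" on the snapshot.
zfs hold replication tank/data@snap-2024-01-01

# List the tags currently holding a snapshot.
zfs holds tank/data@snap-2024-01-01

# Destroying a held snapshot fails with "dataset is busy".
zfs destroy tank/data@snap-2024-01-01

# Release the hold by tag; once the last tag is released, the
# snapshot can be destroyed (or expire) normally.
zfs release replication tank/data@snap-2024-01-01
```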

Ahh yes, I ran into a thread yesterday about the hold/release functions in ZFS and assumed that was the case.

Thanks for confirming this, as it’s given me a few ideas for modifying our current backup scripts as well.