Integral replication of Time Machine datasets

Recently I was working on a Time Machine backup dataset and caught a misconfiguration of my own that I wanted to share.

The Time Machine dataset was being snapshotted and replicated to another pool for resiliency. While testing restores, I realized that the backups contained within weren’t reliable. Sometimes, for various unrelated reasons, the snapshot was taken while a Time Machine backup was still locked or in progress; sometimes the underlying sparsebundle was still being manipulated.

A case of everything working as intended, despite the fact that the outcome would have been bad.

Now the fix.

Time Machine signals when it has completed a backup to an SMB destination by sending a specific command to the server. TrueNAS’s Samba VFS module tmprotect (shoutout to @awalkerix) then automatically snapshots the dataset, but for this to work the SMB share must be configured as a “Multi-user Time Machine” share.

Next, I created a new dedicated replication task, configured as follows:

- Also include snapshots with the name: “Matching naming schema”
- Also Include Naming Schema: aapltm-%s

I also prefer to enable the “Save pending snapshots” option, and for the replication schedule I select “Run automatically”. tmprotect sets snapshot retention to 7 days, but in my testing it respects the “Save pending snapshots” feature. [Edit: turns out this isn’t the case; TrueNAS prunes these snapshots automatically and it isn’t configurable. I didn’t catch this because of the way I had replication configured. That doesn’t change anything in terms of the ability to replicate “integral” aapltm snapshots, but it does mean they’re time-limited. If you go 7 days without replicating such a snapshot, and you don’t have other periodic snapshots, you may have to do a full replication from scratch. There is discussion about this further down the thread.]

With this configuration a new snapshot is created automatically when Time Machine actually completes a backup (thanks to tmprotect), and that snapshot is then replicated to the destination. The naming schema matches the names of the snapshots created by tmprotect.
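If you want to sanity-check that this is working end to end, something like the following (the pool and dataset names here are only examples) shows whether the tmprotect snapshots exist on both sides:

```
# Example paths only; substitute your own pool/dataset names.
# Newest aapltm snapshots on the source dataset:
zfs list -H -t snapshot -o name,creation -s creation tank/timemachine | grep aapltm | tail -5

# ...and on the replication destination:
zfs list -H -t snapshot -o name,creation -s creation backup/timemachine | grep aapltm | tail -5
```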

You may want to add Time Machine datasets to the exclusions for any other snapshot and replication tasks that apply to those datasets, so that there is no confusion over which snapshots contain a complete backup.

As with anything backup (and data) related, make sure you test your backups and assumptions before relying on them!

That kinda explains why mine never did auto-snapshots on completion. I have a “Basic time machine share”. Looks like a bug/flaw to me.

@swc-phil That is actually where this whole thing really began for me! There is a post that I’ve lost (and will try to find) from the old forums about that. As I recall from the post, the way to activate tmprotect for basic Time Machine shares was to add auxiliary parameters, which of course is no longer supported.

It would be nice to have support for this on any Time Machine share. I suspect the issue is that with the basic share people were pointing (or could point) multiple machines at one dataset. If more than one backup is running, one finishes, and tmprotect snapshots the dataset, then the other backups are not yet finished and could have silent integrity issues. With Multi-user Time Machine shares (in the code this is called “ENHANCED”), TrueNAS at least forces each SMB user into its own dataset, creating one if necessary. It would still be possible to use one share for multiple machines, but the Multi-user preset seems to try to force you into doing it right: create a Time Machine service account for each machine, for example, and that problem goes away.
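To make that concrete, here is a hypothetical layout (all names made up) of what a Multi-user Time Machine share might end up looking like with one service account per machine:

```
# Hypothetical layout: one child dataset per machine account.
$ zfs list -r -o name tank/timemachine
NAME
tank/timemachine
tank/timemachine/tm-macbook
tank/timemachine/tm-imac
```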

This makes me want to test and contribute some documentation enhancements or make a community resource or something. :slight_smile:

Sadly, both of my Time Machine shares are basic, so it looks like I need to redo all that. Way back when, I think I set them up separately with the mistaken belief that each basic TM could be quota’d from TN.

My Time Machine sees the quota value for the dataset.

That was a long time ago (CORE!) and likely superseded by now several times over. I should give multi-user TM a go and see if it works well enough. The potential benefit of well-done TM is huge re: replication data consistency.

FWIW, MUTM works well for me.

The trick is to use per-machine accounts, not user accounts.

Then set the quota per dataset… and set the alarms high, i.e. 95% or so. (Still working on this.)

That’s precisely what I did. I also set a reservation on the parent dataset so that I can budget Time Machine from a high level, and issue quotas to specific machines.
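A minimal sketch of that budgeting with plain zfs commands (the pool, dataset, and account names are made up, and the sizes are just placeholders):

```
# Reserve a fixed budget for all Time Machine data at the parent dataset...
zfs set reservation=2T tank/timemachine

# ...and cap each machine's per-account child dataset individually.
zfs set quota=500G tank/timemachine/tm-macbook
zfs set quota=750G tank/timemachine/tm-imac
```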

It’s perhaps serendipitous that I posted this just a day ago. I wanted to update anyone who has interacted with this thread.

I said:

> tmprotect sets snapshot retention to 7 days, but in my testing it respects the “Save pending snapshots” feature.

Which I’m not entirely sure is true. That is, I’m not sure if tmprotect is respecting the “Save pending snapshots” feature.

Overnight I had a large, long-running replication fail. The replication had been ongoing for over a week. I’ve been troubleshooting that, and along the way I checked the aapltm snapshots on my Time Machine dataset and noticed that I only had 7. Given the timeline of the long-running replication job, I should have had 8 snapshots for the Time Machine dataset.

This may or may not be a bug. tmprotect is not well documented, and I haven’t examined the part of the code which manages snapshot retention. In the meantime I put a “manual” hold on those snapshots with zfs hold and my own hold tag, and will return to this when I have addressed the failed replication. (It’s also worth pointing out that tmprotect is a nice feature but it doesn’t appear to be a “promised” feature, so I’m not trying to be overly critical here. Just wanted to update those of you who have interacted with this post and might be relying on something I said that I now can’t back up. Better to be honest, and all that. @swc-phil, @Constantin, @Stux, @Jeverett)
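For anyone who wants to put a similar manual hold in place, this is the general shape of it (the snapshot name below is illustrative; tmprotect names its snapshots per the aapltm-%s schema):

```
# Place a named hold so the snapshot can't be destroyed until the hold is released.
zfs hold my-manual-hold tank/timemachine@aapltm-1718900000

# Verify the hold, and release it later when it's no longer needed.
zfs holds tank/timemachine@aapltm-1718900000
zfs release my-manual-hold tank/timemachine@aapltm-1718900000
```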

Well, it never worked for me. I’m just taking some weekly snapshots in case the backup becomes corrupted.

Periodic snapshots, snapshot trimming, and snapshot replication are all implemented in zettarepl.

Basically, zettarepl deletes snapshots after it replicates them, and that’s how it knows not to delete snapshots that are held.

But tmprotect is not zettarepl and is deleting snapshots independently (I think).

I’d suggest using periodic snapshots as well as the tmprotect ones, and replicating both.
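In other words (a conceptual sketch; the periodic schema below is just the TrueNAS default, used here as an example), one replication task would carry both sets of snapshots:

```
# Replication task, "Also include snapshots matching naming schema":
#   auto-%Y-%m-%d_%H-%M   <- periodic snapshot task's schema (example/default)
#   aapltm-%s             <- tmprotect's snapshots
```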

I think you’re right. I didn’t find any (obvious?) calls between zettarepl and Samba. So…

Caveat to anyone randomly coming through: Don’t do this at home unless you know what you’re doing.

In an effort to save my existing snapshots from a full re-replication and to trace where the Time Machine snapshots are managed, I wrote a shell script that I added as a daily cron job. It builds the list of Time Machine snapshots (zfs list -H -t snapshot | grep aapltm) and loops through the results. For each snapshot it places a hold tagged with the current Unix epoch timestamp (to prevent hold tag collisions - time monotonically increasing is very convenient :smile:).
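For reference, this is roughly what that cron job looks like (the hold tag prefix is mine; as the caveat above says, don't run this unless you understand zfs holds, since every hold has to be released before a snapshot can be destroyed):

```
#!/bin/sh
# Daily cron job: put a uniquely tagged hold on every aapltm snapshot so that
# nothing (tmprotect included) can destroy it out from under a pending replication.
TAG="manual-aapltm-$(date +%s)"   # epoch-based tag, unique per run

zfs list -H -t snapshot -o name | grep aapltm | while read -r snap; do
    zfs hold "$TAG" "$snap"
done
```

Note that holds accumulate (one new tag per snapshot per run), so they all have to be released with zfs release before those snapshots can ever be destroyed.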

I figured this would generate an error that I could find by exporting a debug archive, and it did, in log.smbd.

tmprotect is indeed deleting these snapshots itself. You can see this in action in smb_libzfs.c here.

It would be really great if this were controllable, either in the SMB share config or the SMB service config; or perhaps in some future iteration zettarepl could act more as a “snapshot management server” for components outside the services exposed in the Data Protection screen.

I might get around to writing up a cogent feature request thread to be voted on.

Thanks, I actually just disabled the multi-user share to get rid of the Apple snapshots. I back up to a local USB drive and then make another backup to TrueNAS; I didn’t need the extra space that they were using. Good to know if I live long enough to need offsite replication!
Rob