I’m currently trying to figure out how to back up my data to a backup server.
Everything is set up and the SSH credentials are working.
But I’ve hit one roadblock.
I can’t guarantee (and it is not intended) that the backup server is online 24/7.
So I want to prepend a pre-script that starts the server via IPMI in case it is offline.
Sadly I’ve seen that this isn’t possible via the GUI. There is a feature request for it in Jira, but nothing really came of it.
How can I best replicate the settings I’ve chosen in the GUI as a custom cron job? I want to duplicate the data with ACLs intact, and after the first backup I want to transfer only the incremental changes.
My issue is that I’m not experienced in writing ZFS commands myself. To give some context: I’m already failing at manual snapshot creation.
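(For illustration, such a pre-script could look roughly like this, assuming the backup server’s BMC is reachable with ipmitool; host, user, and password here are placeholders:)

    #!/bin/sh
    # power the backup server on via IPMI if it is currently off
    if ipmitool -I lanplus -H backup-bmc -U admin -P secret chassis power status | grep -q off; then
        ipmitool -I lanplus -H backup-bmc -U admin -P secret chassis power on
    fi
    # wait until SSH on the backup server answers before the backup starts
    until ssh -o ConnectTimeout=5 backup true; do sleep 15; done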
Ah so it isn’t “zfs snapshot pool”?
Yeah, that’s why it wouldn’t work.
I tried replicating the whole pool - does that mean you always have to specify the dataset as well? In the end it doesn’t make a huge difference in my case.
A pool is physical storage. Datasets are logical storage.
Snapshots are a logical entity, so they go with datasets.
Each pool comes with a root dataset of the same name (which may not help with the distinction), but it is advised NOT to use this dataset for storage: create one or more of your own datasets in there, and snapshot this/these dataset(s) of yours.
For replication, create the snapshots with a snapshot task and use a replication task—no manual creation involved.
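(To make the pool/dataset distinction concrete, with a hypothetical pool named tank:)

    zfs create tank/mydata                 # a dataset of your own inside the pool "tank"
    zfs snapshot tank/mydata@2024-01-01    # snapshots are taken of datasets, not of the pool device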
Yeah, that makes sense. It is already set up like that. Snapshots have been running for about a year now without any issues and have even saved me a few times already. It’s been too long since I last dealt with it, so I guess that’s when you try stupid stuff like that.
The only thing I’m missing is the replication step.
What will happen if the other TrueNAS machine is offline when the replication task is due?
If this is PUSH replication, the (always on?) source machine will log an error.
If this is PULL replication (initiated by the target), nothing will happen.
In either case, replication will catch up the next time the replication task is run—provided that there is still a common snapshot between the two machines.
So if the destination is occasionally off, it will just work. If the destination can remain off for months, or for longer than the retention time of snapshots, Bad Things™ will happen.
This is something I wasn’t quite sure about yet. Nothing is set up on either end so far.
The main server has all the datasets with data, and the backup server has a pool with two datasets into which I want to copy the two pools of the main server.
I read somewhere that I have to set it up twice anyway, or did I misunderstand that as well?
A single replication task can include multiple datasets and multiple snapshot schemes in one go; quite possibly from multiple pools as well—but if you have multiple pools and multiple datasets for different storage needs, you may have different backup requirements as well.
However, you do set up twice in the sense that there is one task for taking snapshots and one other task for replicating said snapshots.
Sure: That’s what PULL replication does. The snapshot task on the source creates snapshots. The replication task on the destination requests these snapshots from the source (fails if source is down, and catches up if a next run is successful).
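(To illustrate the PULL direction with plain ZFS commands, run from the backup machine; dataset names are hypothetical:)

    # on the destination: ask the source for a snapshot stream and receive it locally
    ssh source zfs send tank/mydata@2024-01-01 | zfs receive backuppool/mydata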
No, it (almost) is; I don’t know what I was thinking above. But you still need a name for the snapshot, and if you want it to include all datasets in the pool, you’d need the -r flag: zfs snapshot -r pool@name.
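(For example, with a hypothetical pool named tank:)

    zfs snapshot -r tank@manual-2024-01-01    # one snapshot of tank and of every dataset below it
    zfs list -t snapshot -r tank              # verify: each dataset now has a snapshot with that @name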
I read a bit more in the ZFS documentation from Oracle, trying to understand it.
Does it make a difference whether you pipe or use < to feed the ssh command for the receiving end?
Since I already have periodic snapshots running, I don’t really have to add a new snapshot right before I send the backup, right?
I noticed in the snapshots I already have that even though they are taken recursively, when I need a lost file the sub-datasets aren’t included in the main snapshot but in the snapshot of the respective sub-dataset. If I then zfs send those to the other server, will they work in the same way?
You’d be better off in the OpenZFS docs, though Oracle’s seem to be ranked higher in Google.
Not unless you want to maintain those snapshots on a different schedule or something. But a snapshot’s a snapshot; there’s nothing magical about snapshots created by the TrueNAS middleware.
They will, so if you want that to be recursive, you’d use the appropriate flag for that (-R, IIRC).
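(A sketch of such a recursive send; names are hypothetical, and the flags are worth checking against the OpenZFS man pages:)

    # -R sends the named dataset plus everything below it, preserving the hierarchy;
    # -F on the receiving side rolls the target back to the most recent common state
    zfs send -R tank/mydata@2024-01-01 | ssh backup zfs receive -F backuppool/mydata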
Ok. So this has to be included for send as well, not just when taking the snapshot?
What happens with zfs send out of the box if two snapshots are pretty similar? Does it always send all the data, or does it automatically send just the latest changes? I saw that you can specify from which snapshot it should send incremental data, but that seems hard to automate when it runs in a cron job.
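(Without -i/-I, zfs send transmits a full stream, so incrementals need the starting snapshot named explicitly. A minimal sketch of automating that, assuming hypothetical dataset names and that a common snapshot still exists on both sides:)

    # newest snapshot the destination already has
    FROM=$(ssh backup zfs list -H -d 1 -t snapshot -o name -s creation backuppool/mydata | tail -1 | cut -d@ -f2)
    # newest local snapshot
    TO=$(zfs list -H -d 1 -t snapshot -o name -s creation tank/mydata | tail -1)
    # send all intermediate snapshots between the two
    zfs send -I "tank/mydata@${FROM}" "${TO}" | ssh backup zfs receive -F backuppool/mydata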
I think I almost got it working now, but with the receive command I specified I keep getting a “too many arguments” error.
It looks similar to this: ssh backup zfs receive -s '/mnt/Pool Name/dataset/subdataset'
The send command should work without specifying a snapshot - the OpenZFS manual suggests this should work here as well? Are the quotation marks (because of the whitespace) the reason why this command is failing?
Edit: Send doesn’t work either - it says “unsupported flag with filesystem or bookmark”. So I have to specify the snapshot even though I don’t know which one it is?
Edit 2: I think I finally get that this isn’t allowed without snapshots. But the error on the receive is still there.
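(Two things are likely going on here. ssh joins its arguments into one string that the remote shell parses again, so the local quotes are stripped and the space in 'Pool Name' splits the path into two arguments, hence “too many arguments”. And zfs receive expects a dataset name, not the /mnt/... mount path. A sketch of both fixes, with hypothetical names on the sending side:)

    # quote the whole remote command so the inner quotes survive the remote shell;
    # the receive target is the dataset name, without the /mnt prefix
    zfs send tank/mydata@2024-01-01 | ssh backup "zfs receive -s 'Pool Name/dataset/subdataset'"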
Errors of the replication are directly accessible via replication.query as well, along with the result of the last run and whether it is currently running.
Just run the query without any filter and you will see.
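(On a TrueNAS shell that would look something like this; piping through jq is optional and just for readability:)

    midclt call replication.query | jq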