I’m COMPLETELY lost on the topic of replication and backup on TrueNAS/ZFS. The best I can think of is to describe the scenario and ask the questions that come to mind.
I have the following basic scenario:
I have a pool with, let’s say, 5 datasets. Each has between a few GB to TB of data.
Each dataset has a daily snapshot task that keeps snapshots for 2 weeks, so each dataset will have 14 snapshots at a time.
With light use, these snapshots would usually be anywhere from a few KB to a few GB each, since each one only stores the differences from the previous snapshot.
What is the size of the first snapshot? Is it essentially empty?
So when I use replication to a different pool (in my case a 16TB external hard drive):
I copy snapshots from one pool to the other
TrueNAS suggests I copy all 14 snapshots. Why? I only want to back up my data at a certain time.
If I copy the snapshots: I assume they are between a few KB and at most a few GB in size. Is that even a backup? Can I restore data from those snapshots without the original data?
TrueNAS suggests that I delete the snapshots after a period. Why do I want a period? I just want 1 backup from the time of the replication task. Do snapshots not work for that?
What happens if I don’t back up for 2 weeks? Will my snapshots get deleted from my backup?
What I want in the most basic case:
I have my pool with datasets
I want these datasets to be replicated to my backup drive in case anything happens.
I want to be able to restore all data in case of a catastrophic failure
Can someone explain this to me? Not a single YouTube Video or Blog post I read makes any of this clear to me.
This is because the snapshot names match the pattern of a Periodic Snapshot Task or naming schema. You can specify a more granular selection in the Replication Task’s options, but it’s really not worth it.
See my first comment.
You can keep snapshots forever if you want. TrueNAS probably suggests pruning so that you don’t fill up your pool if your snapshots contain large amounts of unique data. It’s also nice to keep things tidy.
Supposedly TrueNAS tries to prevent this by not pruning common base snapshots from the destination before a successful incremental replication. Supposedly.
I can confirm that under 25.04.1 TrueNAS replication does not delete the most recent snapshot of a replicated dataset, even if the dataset is deleted on the source.
Because of the way ZFS replication works. The first replication copies a specific snapshot from the source to the destination; this is the equivalent of an old-school FULL backup. The next replication copies the differences between the first snapshot and the second snapshot, which will just be the blocks that changed between the time of the first snapshot and the time of the second. This is the equivalent of an old-school INCREMENTAL backup.
You need both the original snapshot (FULL) and all subsequent snapshots (INCREMENTALs) to have a complete copy of the data. ZFS is smart enough that if you delete the original snapshot, it knows the second snapshot still refers to data that was part of the original snapshot, so it does not delete that data.
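A toy Python model can illustrate that last point (made-up block IDs, not real ZFS internals): blocks are freed only when no remaining snapshot references them, so destroying the base snapshot does not lose data that later snapshots still refer to.

```python
# Toy model of ZFS block references (illustrative only, not real ZFS code).
# Each snapshot is modeled as the set of block IDs it references.
snap1 = {"a", "b", "c"}        # FULL: everything as of snapshot 1
snap2 = {"a", "b", "d"}        # block "c" was rewritten as "d" before snapshot 2

# An incremental send transfers only blocks the destination doesn't have yet.
incremental = snap2 - snap1    # just {"d"}

# Blocks kept on the pool = union of everything any snapshot still references.
live = snap1 | snap2

# If snapshot 1 is destroyed, blocks still referenced by snapshot 2 survive;
# only the blocks unique to snapshot 1 are actually freed.
freed = live - snap2           # just {"c"}
print(sorted(incremental), sorted(freed))
```

So the second snapshot plus the shared blocks is still a complete filesystem, which is why pruning old snapshots is safe.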
To reduce the amount of space used. Each snapshot will keep the space that is unique to it from being freed up for other uses.
It sounds like you are saying you only want one FULL backup and don’t care about the time it takes to create it. On my home system it takes over 24 hours to complete the initial replication (the first FULL copy). Each incremental copy takes a few minutes to replicate, unless I made huge changes :-). (I take a snapshot every hour and keep them for 2 weeks; I have other snapshots I keep for longer.)
The way businesses typically describe data backup is in terms of how often and how long to keep. For example:
Backup every hour and keep for 2 weeks
Backup once per day (at midnight) and keep for 2 months
Backup once per month (1st day of month) and keep for 2 years
So the restore resolution will depend on how old the data is. Data within 2 weeks can be restored to hourly resolution. Data more than 2 weeks old but less than 2 months old can be restored to daily resolution. Data more than 2 months old but less than 2 years can be restored to monthly resolution.
How the system accomplishes the above does not matter as long as you can get the restore resolution you need. So the details of how many and which snapshots are saved and replicated really do not matter as long as you can get your data back with the time resolution you need.
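As a sketch, the tiered policy above can be written as a small Python function (illustrative only; the thresholds 14, 60, and 730 days approximate the example’s 2 weeks, 2 months, and 2 years):

```python
def restore_resolution(age_days: int) -> str:
    """Resolution at which data of a given age can be restored under the
    example policy: hourly for 2 weeks, daily for 2 months, monthly for
    2 years (month approximated as 30 days). Illustrative sketch only."""
    if age_days <= 14:
        return "hourly"
    if age_days <= 60:
        return "daily"
    if age_days <= 730:
        return "monthly"
    return "unavailable"   # older than the longest retention tier

# Examples: 3-day-old data restores hourly, 30-day-old daily, 400-day-old monthly.
print(restore_resolution(3), restore_resolution(30), restore_resolution(400))
```

The point stands regardless of mechanism: as long as you can hit these resolutions, it does not matter which snapshots are kept where.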
So how would you describe your backup / restore needs in terms of the above; how often and for how long?
The backup drive is 99% intended to be a cold backup. For hot backup (restoring changes up to 14 days old) I already have the snapshots on the original pool itself.
Ideally I want to back up the current state of the source dataset to the backup dataset/pool at least once a week, probably every 2 or 3 days, to reduce load.
Why the timing feels like such an issue to me: ideally I want to specify a number of snapshots. In the minimal case I want the last “1” snapshot backed up to the backup pool, containing all my up-to-date data.
Doing, for example, daily backups and keeping them for 1 day sounds like a race condition or instability waiting to happen that deletes the backup.
Why I don’t want, let’s say, 14: I can restore data from snapshots directly from the source dataset’s snapshots, so any snapshot beyond the latest on the destination/backup pool is wasted space on the backup drive.
I have the same issue with regular snapshots in TrueNAS. Ideally I want to keep the last 50 snapshots of a dataset I change regularly; I don’t care how old they are. If I changed one file a year ago and another file yesterday, I think in terms of number of changes, not in how long ago each change happened.
To summarize my goal:
I have 14 rolling snapshots on the origin dataset (2 weeks of daily snapshots, which I would love to change to “number of non-empty snapshots”, but TrueNAS does not support that; separate topic).
I want to use the fewest snapshots and the least load to keep an up-to-date backup on the backup pool, which I only access/read from in case of a critical failure.
(I feel like doing a nightly rsync would be the easiest way, but I would prefer to use the idiomatic ZFS solution if possible. I also assume it’s better performance- and load-wise.)
That is not how TrueNAS-managed snapshots work, and it is not how businesses deal with snapshot copies of data, so it is unlikely that TrueNAS will change to support that methodology; but I don’t speak for TrueNAS.
Define “up to date”: daily, weekly, hourly?
You ignored my questions and instead restated what you already said. You are obsessing over number of snapshots for no good reason.
You might be approaching ZFS, snapshots, and replications with a traditional understanding of file-based solutions.
Some of your questions seem to pass over the benefits of block-based solutions and ZFS altogether.
Why does this matter to you? What if each snapshot consumes only a few KB or MB of otherwise deleted or modified data blocks that do not exist elsewhere?
Other than a long listing of snapshots, what is the real harm? You’re only saving daily snapshots. At most, that’s 365 snapshots per year if you do not prune anything at all. You have probably dealt with folders that contain hundreds of files; it doesn’t affect your day-to-day operation.
Imagine someone says to you that with the snap of a finger, they can create an immutable, foolproof USB drive of your entire filesystem.
They can do this any time you want. One snap of a finger, and another indestructible USB drive exists with the current state of your entire filesystem.
After a year or so, you have all these USB drives. From any one of them you can browse and recover deleted or modified files that do not exist on your current filesystem.
Now imagine he says he can make these drives weightless and not take up any physical space.
You might be approaching ZFS, snapshots, and replications with a traditional understanding of file-based solutions.
Some of your questions seem to pass over the benefits of block-based solutions and ZFS altogether.
I know there are benefits, that’s why I said “but I would prefer to use the idiomatic zfs solution if possible. I also assume it’s better performance- and load-wise)”
The whole thread is because I want to understand it better.
Why does this matter to you? What if each snapshot consumes only a few KB or MB of otherwise deleted or modified data blocks that do not exist elsewhere?
My understanding is that the snapshots already exist on the origin. When I only need the backup for catastrophic-failure recovery, I don’t also need to retrieve a file from a few days before the failure. At least that’s my assumption.
I also want to keep the size predictable. When I have 10TB of data on my origin pool, I want to be sure it fits on the 16TB backup pool. With a number of snapshots it’s 10TB + whatever the snapshots take, and I’m not sure you can easily look up the total space used by the snapshots. Apart from that, it’s mostly about using as little space as possible on my space-limited backup pool, e.g. when many large file changes happen in a short time, like media transcoding.
if you do not prune anything at all.
What do you mean by “not prune anything at all”? Do you mean, for example, manually deleting snapshots out of the 365 when they take up too much space?
Just recently I moved files around a lot and transcoded some movies, which resulted in terabytes worth of snapshots. Things like that are a concern. Apart from that, I agree with you.
so it is unlikely that TrueNAS will change to support that methodology
Makes sense. As a developer, not an IT admin, I think more like Git, which is change-based (commits), not time-based (nightly snapshots, etc.).
You ignored my questions and instead restated what you already said.
Sorry, didn’t mean to.
I want to back up probably daily and keep historic data ideally for the same time, but keeping data for up to a week (e.g. 1 week of snapshots) is also probably fine if that is the intended way. My only concern is space on the backup pool when multiple big changes to files happen (e.g. transcoding many media files).
No worries, it is easy to lose track of the original question.
So what I am hearing is you need data that is up to date within a day and you need to keep it for a week. For example, when you go to restore you get to choose Monday’s data or Tuesday’s data.
If I remember right, you said that you are already taking snapshots on the production side every day and keeping them for 2 weeks. Then set up a replication job that runs daily and keeps the replicated copies for 1 week.
You mention that you are space constrained on the backup system. Remember that ZFS performance starts dropping off when any given zpool is above 70% full. The rate of the drop varies with the type of data, but once you hit 90% it will be very noticeable. Having said that, I ran my backup system until it was over 90% full when moving lots of data around on my production system. When that happened, I manually deleted copies of datasets I had moved (and had backups of the new copies) and snapshots I no longer needed. That is always an option for when you make large changes to production data.
Just to make sure, since I’ve read something contradictory elsewhere.
My assumption is that I have 10TB of data plus 14 snapshots of, e.g., 40MB each on the origin pool.
When I replicate the 14 snapshots to the backup pool (14 × 40MB), how exactly are the 10TB stored on the backup pool?
Do they become part of the oldest snapshot?
Does it replicate the 10TB plus the 14 snapshots on top?
A snapshot references blocks of data and metadata that form a complete filesystem.
The first time you run any ZFS replication, whether or not you use TrueNAS, it cannot be an “incremental” stream. It must send the entire filesystem of the snapshot you select and all its referenced blocks (i.e., a point-in-time state of your filesystem).
All subsequent sends can then simply transfer the differences of whatever two snapshots are specified, so long as the target shares a common snapshot with the source.
TrueNAS decides which two snapshots to do this with automatically under the hood. With plain ZFS, you must manually specify the two snapshot names.
Dataset BOX
Snapshot BOX@1 taken at 09:00
Snapshot BOX@2 taken at 10:00
Snapshot BOX@3 taken at 11:00
Snapshot BOX@4 taken at 12:00
At 10:30 you launch a replication job for this dataset. The job will transfer BOX@1, which includes all of the data in BOX.
At 12:30 you launch another replication job for this dataset. That job will transfer BOX@2, BOX@3, and BOX@4.
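A small Python sketch of the selection logic in this example (hypothetical; `to_send` is a made-up helper, not TrueNAS’s actual code): the first run does a FULL send of the oldest snapshot, and later runs send everything newer than the newest snapshot both sides already share.

```python
def to_send(source_snaps: list[str], dest_snaps: list[str]) -> list[str]:
    """Pick which snapshots to transfer, given snapshots on the source and
    destination in oldest-first order. Hypothetical sketch of the example's
    behavior, not the real TrueNAS/ZFS implementation."""
    common = [s for s in source_snaps if s in dest_snaps]
    if not common:
        return source_snaps[:1]    # destination empty: initial FULL send
    base = common[-1]              # newest shared snapshot = incremental base
    return source_snaps[source_snaps.index(base) + 1:]

# 10:30 run: BOX@1 and BOX@2 exist on the source, destination is empty.
print(to_send(["BOX@1", "BOX@2"], []))
# 12:30 run: all four snapshots exist, destination already holds BOX@1.
print(to_send(["BOX@1", "BOX@2", "BOX@3", "BOX@4"], ["BOX@1"]))
```

The key requirement is only that source and destination share a common base snapshot; which side is “ahead” beyond that base determines what gets sent.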
In terms of space used, snapshots have two different values: USED and REFER. USED is the amount of unique data (plus metadata) that comprises this snapshot. REFER is the total amount of data (plus metadata) that this snapshot refers to.
If BOX is 1TB and each snapshot USED 100GB, then BOX plus all its snapshots would occupy 1.4TB (1TB + 0.1TB + 0.1TB + 0.1TB + 0.1TB).
The replication at 10:30 would transfer 1.1TB; the replication at 12:30 would transfer 0.3TB.
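The arithmetic from this example, in a few lines of Python (GB units, numbers taken directly from the example above):

```python
# Space accounting from the BOX example, in GB.
refer_box = 1000        # live dataset (1 TB)
used_per_snap = 100     # unique data per snapshot (100 GB USED each)
num_snaps = 4           # BOX@1 .. BOX@4

# Total occupied on the source pool: live data plus each snapshot's unique data.
total_on_pool = refer_box + num_snaps * used_per_snap      # 1400 GB = 1.4 TB

# 10:30 replication: FULL send of BOX@1 (live data + its unique 100 GB).
first_replication = refer_box + used_per_snap              # 1100 GB = 1.1 TB

# 12:30 replication: incrementals BOX@2..BOX@4, unique data only.
second_replication = 3 * used_per_snap                     # 300 GB = 0.3 TB

print(total_on_pool, first_replication, second_replication)
```

This is why the per-snapshot cost on the backup pool is only each snapshot’s USED value, not another full copy of the dataset.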
If each daily snapshot is consuming a lot of space because of how often you delete large amounts of data, then you should consider changing your workflow and dataset layout.