UPDATE: This is not the actual feature request, since there is now a dedicated category for feature requests. I’ll try to write up a proper, comprehensive feature request soon.
I had never really known anything about checkpoints, or even heard of them, until recently. After being intrigued enough to read up on what a checkpoint is and how it can save you from a self-inflicted catastrophe, I think this is a good topic and suggestion to revisit.
I think it would be an excellent idea to have a button at the top of the Storage section that opens a page where checkpoint operations can be performed from the GUI.
That page could include a short explanation of why and when a checkpoint should be used, how it differs from a snapshot (with a link to snapshots for people who are in the wrong place), and the relevant commands as they apply to TrueNAS, presented as checkboxes to create or destroy a checkpoint.
Maybe also add an option to auto-remove the checkpoint after a reasonable time period. After all, a checkpoint is just a safety net for when you are doing something potentially risky with a server. It would be handy to have this tool in the GUI, but since a checkpoint goes stale fast, you don’t want to keep it around.
Just recently, a new forum user accidentally deleted all, or at least a lot, of the data on their server, none of which had backups, by running the command rm -rf ${DB_DATA_LOCATION}/*
which they had found on the web. They forgot to fill in the path, so with the variable empty the command expanded to rm -rf /* and Bash started deleting everything. They may also not have realized the potentially destructive power of rm. The person at least knew enough to recognize the mistake and come to the forum for advice and help before making the issue worse.
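For what it's worth, this kind of accident can also be guarded against in the shell itself. The following is only a minimal sketch: safe_clear is a hypothetical helper name, and the checks are an assumption about what a reasonable guard looks like. It does not catch every dangerous path, so it is no substitute for backups or a checkpoint.

```shell
#!/usr/bin/env bash
# safe_clear is a hypothetical helper, not part of any tool mentioned here.
# It empties a directory but refuses to run on an empty path or on "/".
safe_clear() {
    local dir="${1-}"
    if [ -z "$dir" ]; then
        echo "refusing: path is empty or unset" >&2
        return 1
    fi
    if [ "$dir" = "/" ]; then
        echo "refusing: path is the filesystem root" >&2
        return 1
    fi
    if [ ! -d "$dir" ]; then
        echo "refusing: $dir is not a directory" >&2
        return 1
    fi
    # "${dir:?}" is a second guard: the expansion itself fails if dir is empty,
    # so this line can never turn into "rm -rf /*".
    rm -rf -- "${dir:?}"/*
}

# Usage (DB_DATA_LOCATION is the variable from the original command):
# safe_clear "${DB_DATA_LOCATION-}"
```

With the variable left unset, the call above prints a refusal and deletes nothing, instead of expanding to the root of the filesystem.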
If there were more education about what a checkpoint is, more awareness of when it should be used, and, most importantly, an easy way to make and manage a checkpoint in the GUI, it would have been much easier, and the odds much higher, to guide the person toward getting their deleted data back.
You can say, “Well, they should have known what the command did or been more careful.” Yeah, you can say that, but I think many of us have done something boneheaded at one point or another and wished there was a way to rewind the mistake. If the product I used saved my skin, I would be really happy about it and would very likely recommend it to others as the best one to use.
I kept hearing mention in the forum of this thing called a checkpoint, so I looked up some info on the forum and an article by Serapheim Dimitropoulos. I read your guide and feature request, and personally I think it is a very good tool to have. One thing I couldn’t find, though, is how much space a checkpoint takes and how long it may take to generate. On my test VM it took a small amount of space and finished quickly. On a system in use, with 32TB of data and 80TB of total space, it was taking a while, so I stopped the process. The time and space needed to generate one might be something people should be aware of.
That’s interesting. Every time I created one, it happened instantly.
Sure, my main pool is only consuming 5 TiB, but creating a checkpoint happens as fast as creating a snapshot.
I don’t see why 32 TiB would “take a while”, let alone so long that you needed to abort the process.
Like a snapshot, it takes up no extra space, initially. As time goes on, and things are changed and destroyed, the checkpoint will hold on to the data that would otherwise be forever gone without it.
Unlike a snapshot, you don’t want to sit on a checkpoint for too long anyway. One reason is that it loses its main function after much time has passed: rewinding to a checkpoint reverts your pool to its state at that point in time, and you lose everything written afterwards.
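For anyone who wants to try this from a shell, the lifecycle described here maps onto a handful of OpenZFS commands. This is a rough sketch, not a TrueNAS-endorsed procedure: the pool name tank is hypothetical, and the ZPOOL override exists only so the functions can be dry-run with echo. On TrueNAS the middleware normally manages pool import and export, so treat the rewind step in particular as plain OpenZFS usage.

```shell
#!/usr/bin/env bash
# ZPOOL can be overridden (for example ZPOOL=echo) to print the commands
# instead of running them. The override is part of this sketch, not of ZFS.
ZPOOL="${ZPOOL:-zpool}"

# Take a checkpoint of the whole pool.
checkpoint_create() { $ZPOOL checkpoint "$1"; }

# Show the space currently held by the checkpoint ("-" if there is none).
checkpoint_space() { $ZPOOL get -H -o value checkpoint "$1"; }

# Discard the checkpoint once the risky change has gone well.
checkpoint_discard() { $ZPOOL checkpoint -d "$1"; }

# Rewind the pool to the checkpoint. Everything written after the checkpoint
# was taken is lost, and the pool has to be exported and re-imported.
checkpoint_rewind() {
    $ZPOOL export "$1" && $ZPOOL import --rewind-to-checkpoint "$1"
}

# Example, with a hypothetical pool named "tank":
# checkpoint_create tank
# checkpoint_space tank
```

Checking checkpoint_space a few times after creation should show the number growing only as data on the pool is changed or deleted.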
Checkpoints are a feature that every user should consider using before making a major NAS change or upgrade, be it hardware or software. Arguably, snapshots do the same thing, but checkpoints are superior in that they cover the entire pool rather than individual datasets, so you know that everything is as it should be as of the time of the checkpoint.
But even if there’s a lot of activity, a snapshot (done at a dataset level) is still nearly instantaneous. I would assume the same for a pool checkpoint.
I never heard of someone creating a snapshot for a dataset undergoing much activity, only for them to “abort” the snapshot creation because it’s taking too long.
I thought it was quick, and it was, on an EE test system with very little data. I then created a checkpoint on a production system that was not under much active use and took a look at its size on disk. When I went to delete it a couple of minutes later, I saw that the space used had increased. I waited a couple more minutes and the space used had increased again, so it was obviously still working on creating it. I went ahead and issued the delete command; the delete went well and was almost instant. I won’t be at the server for several days, so I can’t try again or do a complete test until then (well, I could, but I don’t want to do it remotely).
I love checkpoints and I fully support the suggestion that they should find their way into the TrueNAS UI.
The key part with checkpoints is knowing when to take and discard them, as you can only ever have one at any one time. Having said that, I see no reason why you couldn’t have some automation that monitors the size of your checkpoint and, when it exceeds a certain threshold, discards it and takes a new one. That threshold could either be predefined as a percentage of your current free space or perhaps be something the end user could customise based on their requirements.
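The threshold idea could be sketched roughly like this. The function name and the percentage are made up for illustration. The commented-out lines assume that zpool get -Hp reports the checkpoint and free properties in bytes, and "-" for the checkpoint when none exists, which is worth verifying on your own system.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) when the checkpoint uses more than
# PCT percent of the pool's current free space.
checkpoint_over_threshold() {
    local ckpt_bytes="$1" free_bytes="$2" pct="$3"
    # No checkpoint is reported as "-"; treat anything non-numeric as
    # "not over the threshold".
    case "$ckpt_bytes" in
        ''|*[!0-9]*) return 1 ;;
    esac
    # Integer form of: ckpt_bytes > free_bytes * pct / 100
    [ $(( ckpt_bytes * 100 )) -gt $(( free_bytes * pct )) ]
}

# Possible wiring, with a hypothetical pool "tank" and a 10% threshold:
# ckpt=$(zpool get -Hp -o value checkpoint tank)
# free=$(zpool get -Hp -o value free tank)
# if checkpoint_over_threshold "$ckpt" "$free" 10; then
#     echo "checkpoint on tank is over the threshold"
# fi
```

Keeping the comparison in a small function separates the policy (the percentage) from the action taken, so a user-defined threshold would only change one argument.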
I personally manage this via cron and take a checkpoint every week but each user would need to consider their personal requirements.
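A weekly cron job along these lines might look like the sketch below. This is not the poster's actual script: the pool name, schedule, and script path are hypothetical, and the -w flag (wait for the discard to finish) is assumed to be available, as it is in OpenZFS 2.0 and later to the best of my knowledge.

```shell
#!/usr/bin/env bash
# ZPOOL can be overridden (for example ZPOOL=echo) for a dry run. The
# override is part of this sketch, not of ZFS.
ZPOOL="${ZPOOL:-zpool}"

# Replace a pool's checkpoint with a fresh one.
rotate_checkpoint() {
    local pool="$1"
    # Discard the old checkpoint and wait for the discard to finish. The
    # command fails when no checkpoint exists, so its error is ignored here.
    $ZPOOL checkpoint -d -w "$pool" 2>/dev/null || true
    # Take the new checkpoint. The function returns this command's status.
    $ZPOOL checkpoint "$pool"
}

# Example crontab entry (Sundays at 03:00), assuming this file ends with
# rotate_checkpoint "$1" and is saved at a hypothetical path:
# 0 3 * * 0 /root/rotate_checkpoint.sh tank
```

Note that the old checkpoint is gone as soon as the job runs, so anything that went wrong shortly before the scheduled time can no longer be rewound.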
I’d love to see it become a default that just happens after pool creation as it would save so many people.
I think this would make the best compromise to benefit everyone:
TrueNAS defaults to creating a new checkpoint once per week
Can be changed to a user’s preference (once per day, once per month, twice a week, …)
Can be disabled for a pool
There will still be a button to manage, view, delete, and create a pool’s checkpoint
Here’s the rationale behind this:
You don’t want to set its lifespan too long, since after a good amount of time has passed, rewinding back to a checkpoint loses some of its practicality.
You don’t want to set its lifespan too short, since there’s the possibility of it getting “replaced” by the time a user decides to rewind their pool after an emergency or disaster. (Let’s say you only set its lifespan for “one day”. That means a new checkpoint replaces the old one every day. In the real-world example that I linked, the user’s checkpoint might unfortunately be replaced the next day, before they had a chance to use it for their emergency rewind. A default lifespan of one week increases their chances of a “pre-disaster” checkpoint existing.)
Finally, this would only be the default. (A new checkpoint every week.) The user can still change this to a different frequency, or even just disable the “auto checkpoint” feature.
Whether or not a user opts for the “auto checkpoint” feature, they can still manually manage it via a button in the pool’s page.
Checkpoints should not be auto-cleared, because they can be another tool to prevent ransomware from overwriting ‘good’ snapshots. Clearing a checkpoint should require a higher level of authorization than SMB access or the like.
There’s more nuance to this than how we think of snapshots.
The obvious thing to accept is that having no checkpoints is always worse for the user, in terms of recovery, than having “auto checkpoints”, which are replaced on a regular basis.
Then there’s the idea that one can decide if they want “auto checkpoints” (as described above), or if they would rather manage a pool’s checkpoint manually. If done manually, it’s up to them to discard/replace a checkpoint with a new one.
So having TrueNAS periodically create a new checkpoint once a week is already more beneficial for safeguarding a user’s data (and mitigating an unrecoverable “mistake”) than having no automatic checkpoints.
If a checkpoint does get discarded and replaced with a new one, so that they can no longer rewind to the “good” checkpoint, then it’s no different from having no checkpoints (which is already the situation with TrueNAS).
EDIT:
Take this “non-GUI” method of “auto checkpoints”. It’s at least something. Had @Berboo taken weekly checkpoints, at least they’d have a chance to safely rewind their pool back to a state before they recursively deleted all folders.
You might ask, “But what if the weekly task replaced a good checkpoint from a few days ago with a new one?”
In that case, they wouldn’t be able to rewind their pool back to a “good” state. But how is that any worse for them as compared to having no checkpoints at all? They’d be in the same bad situation.