But we could solve that by discarding the checkpoint and creating a new one, and setting it to run, say, once a day. It seems that there are no drawbacks to setting things up like this?
It means that in all cases I’ll have one checkpoint that is at most a few hours old.
You’ll also have to remember to disable it if you want to modify your vdevs, such as offlining disks.
You’ll also need to disable the task when an emergency happens. The reason is that the task could replace the “good” checkpoint with a “bad” one if it gets triggered after the destructive events.[1] For this reason, you’ll want it to only create a checkpoint during an hour when you’re not really using the NAS, nor is it under much activity.
First things first: create a Periodic Snapshot Task for every important dataset. This is your first line of defense.
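(For context, a Periodic Snapshot Task boils down to a recursive zfs snapshot on a schedule; a rough CLI sketch, with a hypothetical pool and dataset name:

zfs snapshot -r mypool/important@auto-$(date +%Y-%m-%d_%H%M)

The GUI task also handles naming and retention for you, so prefer it over a hand-rolled cron job.)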
“Oh no! Bad thing happened. Thankfully I have a checkpoint from yesterday which I can rewind to! Uh oh! The task just created a new checkpoint now! Looks like I cannot rewind to the good checkpoint anymore.” ↩︎
I just realized you might have also deleted some system files that were not “busy” or “read-only” protected.
Nothing to do with your storage pool, but a fresh install might give you peace of mind. (Export your config file first so you can import it again after installation.)
Up to you.
@HoneyBadger, do the system files and directories get re-loaded from a read-only snapshot after every reboot? Is there a chance that there were some casualties from the rm -rf / command that involved files on the boot-pool?
I’ve just set some snapshots following the Stux YT videos.
I indeed did a fresh install. The boot drive I’m using is the one I used to install Windows and test the Klennet scan operation.
Afterwards I did a fresh install of TrueNAS from which we did the recovery.
But I will have to do it again, because I read this after typing:
dmesg -w
root@truenas[~]# 2024 Dec 1 13:27:54 truenas Device: /dev/sda [SAT], WARNING: A firmware update is available for this drive.
2024 Dec 1 13:27:54 truenas It is HIGHLY RECOMMENDED for drives with specific serial numbers.
2024 Dec 1 13:27:54 truenas See the following web pages for details:
2024 Dec 1 13:27:54 truenas https://www.corsair.com/en-us/force-series-ls-60gb-sata-3-6gb-s-ssd
2024 Dec 1 13:27:54 truenas https://www.smartmontools.org/ticket/628
I checked, and my drive is in the range of drives that need a firmware update.
So I will export the pool, power off the system and take that drive and upgrade its firmware.
But I think that now that I’ve recovered everything and set a checkpoint, this is a piece of cake!
Important note for future readers: OP got lucky. Don’t rely on luck as your backup strategy!
Much as we all love ZFS, this wasn’t some magic feature of ZFS; it was more a quirk of how it operates that saved the day.
The whole purpose of a checkpoint is to recover from a disaster (or change) that goes beyond datasets and snapshots.
What @HoneyBadger essentially did was rewind the pool to a common TXG that was not deliberately or cleanly saved via a checkpoint. It was a last resort, which did not guarantee success, and it takes longer to rewind. (It’s not something anyone should ever rely on.) Nor will this “emergency TXG” persist, whereas a checkpoint will hold, no matter how many TXGs have happened since its creation.
If @Berboo had an automatic “midnight” checkpoint that was saved anywhere from November 16 to 18, he could have simply exported the pool and reimported it with zpool import -R /mnt --rewind-to-checkpoint.
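(Spelled out, and assuming the pool is named mypool as in the later examples, that recovery would be just two commands:

zpool export mypool
zpool import -R /mnt --rewind-to-checkpoint mypool

Note that zpool import needs the pool name as its final argument.)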
EDIT: I can’t speak for @Berboo, but I would guess he’s fine losing everything that would have existed after a November 18 “midnight” checkpoint, if it means he can rewind back to midnight on November 18. Considering that the alternative is losing everything.
When the checkpoint feature was first implemented and released you could not remove a checkpoint except by exporting and reimporting your pool.
Is this not the case any longer?
If the answer is “yes, you can remove a checkpoint from a live pool”, I guess I’ll go investigate creating a midnight checkpoint every day, in addition to all the snapshots and off-site replication I already do.
From following this discussion I guess you really can do that and I somehow missed the memo.
To discard, it’s as simple as zpool checkpoint -d mypool[1]
EDIT: Just keep in mind that you cannot issue a new checkpoint if one already exists. So you’ll have to discard the current one before you can create a new one.
A command that runs at 03:00 every night could look like this:
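(A sketch, assuming the pool is named mypool as above; as a root cron entry:)

0 3 * * * zpool checkpoint -d -w mypool ; zpool checkpoint mypool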
For an automatic job/script, it might be better to use -w with -d, so the discard command will only exit after completion. Then the checkpoint creation should work with a simple ; between both commands. ↩︎
Agreed, but note that in this case even local snapshots would have saved the day as there would have been something to revert to.
One should get the basics right first.
Level 0
Create pool and at least one dataset.
Level 1
Snapshots.
Scheduled SMART tests and scrub.
Level 2
At least one form of backup.
Level 3
Full 3-2-1 backup scheme, with external and offline backups.
A daily checkpoint may be a nicety, but is not even a satisfactory alternative to completing Level 1.
Just for future reference, you might want to do some error checking on whether the variable is NOT empty next time:

if [[ -n "${DB_DATA_LOCATION}" ]]; then
    echo "Deleting data in '${DB_DATA_LOCATION}/' ..."
    # the glob must sit outside the quotes, or rm looks for a file literally named '*'
    rm -rf "${DB_DATA_LOCATION}"/*
    RESULT=$?
    if [ $RESULT -eq 0 ]; then
        echo "Successfully deleted contents of '${DB_DATA_LOCATION}/'"
    fi
else
    echo "The variable is empty!"
    exit 1
fi
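A related belt-and-braces trick worth knowing (not used in the snippet above): bash’s ${VAR:?} parameter expansion makes the shell abort the command with an error if the variable is unset or empty, so even without the outer check, something like

rm -rf "${DB_DATA_LOCATION:?}"/*

can never expand to /* on an empty variable.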