How to recover from rm -rf /* ? (lost everything)

Ok I understand.

But we could solve that by discarding the checkpoint and creating a new one, and setting that to run, say, once a day. It seems that there are no drawbacks to setting things up like this?

It means that in all cases I’ll have one checkpoint that is only a few hours old.

Or am I missing something ?

Edit :

Yes exactly

Only through a non-middleware, non-GUI script.

You’ll also have to remember to disable it if you want to modify your vdevs, such as offlining disks.

You’ll also need to disable the task when an emergency happens, because the task could replace the “good” checkpoint with a “bad” one if it gets triggered after the destructive event.[1] For this reason, you’ll want it to create a checkpoint only during an hour when you’re not really using the NAS and it isn’t under much activity.

First things first: create a Periodic Snapshot Task for every important dataset. This is your first line of defense.


  1. “Oh no! Bad thing happened. Thankfully I have a checkpoint from yesterday which I can rewind to! Uh oh! The task just created a new checkpoint now! Looks like I cannot rewind to the good checkpoint anymore.” ↩︎
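To make the “don’t replace a good checkpoint after a disaster” idea concrete, a scheduled script could refuse to touch the checkpoint unless the pool currently reports healthy. This is only a sketch of mine, not a TrueNAS feature; the function name and pool handling are made up:

```shell
# Sketch only: refresh the pool checkpoint, but skip the discard/recreate
# cycle entirely if `zpool status -x` does not report the pool as healthy,
# so an existing "good" checkpoint survives a disaster.
refresh_checkpoint() {
    pool="$1"
    if zpool status -x "$pool" | grep -q "is healthy"; then
        zpool checkpoint -d -w "$pool" 2>/dev/null  # discard the old one, wait until done
        zpool checkpoint "$pool"                    # then create the new one
    else
        echo "pool $pool is not healthy; keeping the existing checkpoint" >&2
        return 1
    fi
}
```

On a healthy pool, `zpool status -x tank` prints a line like `pool 'tank' is healthy`, which is what the grep keys on; if the pool is degraded or faulted, the script leaves the current checkpoint alone.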

1 Like

I just realized, you might have also deleted some system files that were not “busy” or “read-only” protected.

Nothing to do with your storage pool, but a fresh install might give you peace of mind. (Export your config file beforehand so you can import it again after the installation.)

Up to you.

@HoneyBadger, do the system files and directories get re-loaded from a read-only snapshot after every reboot? Is there a chance that there were some casualties from the rm -rf / command that involved files on the boot-pool?

1 Like

I’ve just set some snapshots following the Stux YT videos.

I indeed did a fresh install. The boot drive I’m using is the one I used to install Windows and test the Klennet scan operation.

Afterwards I did a fresh install of TrueNAS from which we did the recovery.

But I will have to do it again, because I read this after typing:

root@truenas[~]# dmesg -w
2024 Dec  1 13:27:54 truenas Device: /dev/sda [SAT], WARNING: A firmware update is available for this drive.
2024 Dec  1 13:27:54 truenas It is HIGHLY RECOMMENDED for drives with specific serial numbers.
2024 Dec  1 13:27:54 truenas See the following web pages for details:
2024 Dec  1 13:27:54 truenas https://www.corsair.com/en-us/force-series-ls-60gb-sata-3-6gb-s-ssd
2024 Dec  1 13:27:54 truenas https://www.smartmontools.org/ticket/628

I checked, and my drive is in the range of drives that need a firmware update.

So I will export the pool, power off the system and take that drive and upgrade its firmware.

But now that I’ve recovered everything and set a checkpoint, this is a piece of cake! :sweat_smile:

2 Likes

The boot-pool is by definition located on the boot drive, right? So after the fresh install of TrueNAS this has already been dealt with, right?

Important note for future readers: OP got lucky. Don’t rely on luck as your backup strategy!
Much as we all love ZFS, this wasn’t some magic feature of ZFS; it was more a quirk of how it operates that saved the day.

5 Likes

Correct. :+1:

If you did a fresh install, then this isn’t an issue.

1 Like

I thought -N meant don’t mount?

sudo zfs mount TankPrinicpal may fix it.

Don’t think so, but reactivating a previous boot environment then re-upgrading would resolve that.

The issue is you can only have one…

Which means that, to make another, you are always uncheckpointed at some point.

Seems, the best thing is to have that point be some time when you’re not actively messing up your system :wink:

The other issue is that on a multi-user system, I’m not sure how you could ever use a checkpoint… maybe as part of a rollback from backup.

Replicated snapshots are the real backup.

Agreed.

The purpose of a checkpoint is to recover from a disaster (or change) that goes beyond datasets and snapshots.

What @HoneyBadger essentially did was rewind the pool to a common TXG that was not deliberately or cleanly saved via a checkpoint. It was a last resort, which did not guarantee success and took longer to rewind. (It’s not something anyone should ever rely on.) Nor will this “emergency TXG” persist, whereas a checkpoint will hold no matter how many TXGs have happened since its creation.

If @Berboo had an automatic “midnight” checkpoint that was saved anywhere from November 16 to 18, he could have simply exported the pool and reimported it with zpool import -R /mnt --rewind-to-checkpoint.
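Spelled out, that recovery would have been roughly the following. This is a sketch of the commands mentioned above, not a tested procedure, and the pool name is a placeholder:

```shell
# Sketch: export the pool, then re-import it rewound to its checkpoint.
# --rewind-to-checkpoint discards everything written after the checkpoint
# was taken, so only use it when that is exactly what you want.
rewind_to_checkpoint() {
    pool="$1"
    zpool export "$pool" || return 1
    zpool import -R /mnt --rewind-to-checkpoint "$pool"
}
```

The export is required first: a pool cannot be rewound to its checkpoint while it is imported.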

EDIT: I can’t speak for @Berboo, but I would guess he’s fine losing everything that would have existed after a November 18 “midnight” checkpoint, if it means he can rewind back to midnight on November 18. Considering that the alternative is losing everything.

1 Like

When the checkpoint feature was first implemented and released you could not remove a checkpoint except by exporting and reimporting your pool.

Is this not the case any longer?

If the answer is “yes, you can remove a checkpoint from a live pool”, I guess I’ll go investigate creating a midnight snapshot checkpoint every day. In addition to all the snapshots and off-site replication I already do.

From following this discussion I guess you really can do that and I somehow missed the memo.

Kind regards,
Patrick

1 Like

You meant a checkpoint right ?

Oops - yes, of course.

1 Like

To discard, it’s as simple as zpool checkpoint -d mypool [1]

EDIT: Just keep in mind that you cannot issue a new checkpoint if one already exists. So you’ll have to discard the current one before you can create a new one.

A command that runs at 03:00 every night could look like this:

zpool checkpoint -d -w mypool; zpool checkpoint mypool

  1. For an automatic job/script, it might be better to use -w with -d, so the discard command will only exit after completion. Then the checkpoint creation should work with a simple ; between both commands. ↩︎
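As a concrete scheduling sketch, assuming root’s crontab and a pool literally named mypool, the 03:00 job could be installed as a single crontab line. (On TrueNAS you would normally create this as a Cron Job task in the UI rather than edit crontab by hand.)

```shell
# m  h  dom mon dow   command (pool name is a placeholder)
0  3  *   *   *       zpool checkpoint -d -w mypool; zpool checkpoint mypool
```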

2 Likes

Agreed, but note that in this case even local snapshots would have saved the day, as there would have been something to revert to.
One should get the basics right first.

Level 0

  • Create pool and at least one dataset.

Level 1

  • Snapshots.
  • Scheduled SMART tests and scrub.

Level 2

  • At least one form of backup.

Level 3

  • Full 3-2-1 backup scheme, with external and offline backups.

A daily checkpoint may be a nicety, but is not even a satisfactory alternative to completing Level 1.
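For reference, the Level 1 items map to plain commands like the ones below. All names and devices are placeholders, and on TrueNAS you would schedule the equivalents as Tasks in the UI rather than run them by hand:

```shell
# Illustrative only: the plain ZFS/smartmontools equivalents of the
# "Level 1" tasks. Pool, dataset, and device names are placeholders.
level1_basics() {
    zfs snapshot -r tank/data@daily    # what a Periodic Snapshot Task automates
    zpool scrub tank                   # scheduled scrub
    smartctl -t short /dev/sda         # scheduled short SMART self-test
}
```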

6 Likes

Speaking of levels. Somebody may be interested in

Just for future reference, you might want to add some error checking on whether the variable is NOT empty next time.
if [[ -n "${DB_DATA_LOCATION}" ]]
then
    echo "Deleting data in '${DB_DATA_LOCATION}/' ..."
    rm -rf "${DB_DATA_LOCATION}"/*   # the * must stay outside the quotes, or it won't glob

    RESULT=$?
    if [ $RESULT -eq 0 ]
    then
        echo "Successfully deleted '${DB_DATA_LOCATION}/'"
    fi
else
    echo "DB_DATA_LOCATION is empty!" >&2
    exit 1
fi
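A tighter variant of the same idea (my sketch, not from the thread) leans on the shell’s ${var:?} expansion, which aborts before rm ever runs if the variable is unset or empty, so the glob can never expand relative to “/”:

```shell
# Sketch: ${DB_DATA_LOCATION:?msg} makes the shell abort with an error
# before rm runs when the variable is unset or empty. Note that the *
# stays outside the quotes so it actually expands.
wipe_db_data() {
    rm -rf "${DB_DATA_LOCATION:?DB_DATA_LOCATION is not set}"/*
}
```

With this guard, an unset variable produces an error message and a nonzero exit instead of a silent `rm -rf /*`.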