Deleted datasets and snapshots - am I completely stuffed?

I just managed to almost completely delete my main volume’s datasets.

I wanted to duplicate some nested datasets from my SSD pool to my HDD pool. I snapshotted and then tried to restore that snapshot to the HDDs in a new location, but was told that was impossible because the source and destination volumes were different.

I then tried to set up a replication task using the GUI to do the same thing (a process I recently set up successfully for a single dataset, so I thought I was safe in doing so). I set the source as /mnt/ssd_1/apps and the destination as /mnt/volume_1.

The replication task created a snapshot for the parent dataset (“apps”, containing “plex”, “syncthing” and “qbittorrent”, all within my ssd_1 volume). I thought that the replication task would therefore recreate the same structure (i.e. /mnt/volume_1/apps/). When the GUI warned me that all existing snapshots would be deleted, I thought that it was referring to any potential (but not actually existing) snapshots in that new /apps dataset, so I confirmed and carried on.

The replication started, then failed because one of the datasets was “in use”. I looked, and every other dataset was gone, along with all of the snapshots (as the warning message had told me they would be).

No, I don’t have this data backed up elsewhere. To be fair, some of it is backups from other machines, and therefore is recoverable from those machines and their offline backups. The rest is the media for my Plex server, and therefore incredibly frustrating to lose but ultimately replaceable.

I did some panic googling and came across this old post. I followed its instructions: find the transaction number that started the destruction, then export the volume and attempt to reimport it at the last preceding transaction (i.e. the bad transaction minus one). This failed, saying “one or more devices is currently unavailable”. I tried again with the last visible transaction in the zpool history for the pool and got the same error message.

I’ve shut the server down now. Is there any way I can reconstruct the lost datasets, or do I have to rebuild from scratch?

(Further note: the destroyed pool also contained the datasets for the virtual machine system (which I wasn’t actually using) and the .bhyve dataset, which I think was rendered redundant when I upgraded the server from Core to SCALE a week ago. If I have to rebuild from scratch, how do I get TrueNAS to recreate any of these datasets it deems necessary?)

Thanks in advance.

Update: I’ve been working in a VM to replicate whatever it is I did to cause this mess. I’m pretty sure this is the warning message I got:

Destination Snapshots Are Not Related to Replicated Snapshots

Destination dataset does not contain any snapshots that can be used as a basis for the incremental changes in the snapshots being sent. The snapshots in the destination dataset will be deleted and the replication will begin with a complete initial copy.

I only got this message when the snapshot was of the entire ssd_1 volume, not of any single dataset, and the result in the VM was to completely wipe all of the datasets and snapshots from the target volume, so I must have accidentally selected the entire volume on the real server too. I am, clearly, an idiot.

Let’s slow things down and start over.


Did you or did you not delete your main pool’s datasets?

You wrote that you “almost” deleted them. Then later you wrote:

Are these datasets actually destroyed?


You also managed to destroy an entire pool?


What? What does this mean? Did you do this in the GUI somehow? I honestly have no clue what you tried to do.


To do what? Are you implying you did not use the GUI previously?


On which pool did you lose everything? The SSD pool? The HDD pool? Both?


I’ll have to stop here, because I don’t even know how to ask the right questions. It’s not clear what’s going on or what you did.

I’m misusing the terminology. I have one pool consisting of two mirrored SSDs, and a second pool consisting of three vdevs, each a pair of mirrored HDDs. I meant that the datasets held on the HDD pool were deleted.

I used zpool history in the CLI to examine the transactions: there were many lines, referring to many datasets, using the term “destroy”.

This is my misuse of terms again. The HDD pool remained intact, but only one of the datasets on it was left after the attempted replication threw an error and stopped. No new datasets were created, and I assume no new data was written to the pool.

Using the GUI, in the Datasets screen, I selected one parent dataset (containing three child datasets) in the SSD pool and used the button in the Data Protection widget on the right to create a manual snapshot of the parent dataset, as described in the documentation.

I then opened the Snapshots screen, selected the newly created manual snapshot, and tried to use its Clone To New Dataset function to direct the clone to the other pool: I just changed the dataset name from “ssd_1/blah” to “volume_1/blah”, prompting the message Failed to clone snapshot: cannot create ‘volume_1/blah’: source and target pools differ. I realise now that clones can only be created on the same pool, so this was never going to work.

In the Data Protection screen, Replication Tasks widget, I added a new task using the Replication Task Wizard. Source and Destination locations were both set to “On this system”. I now believe that I accidentally included the ssd_1 volume in the Source selection, instead of just the parent dataset that I intended to select. For the destination, I set the HDD volume, volume_1, because, as I said originally, I thought that the replication would create a new dataset duplicating the source dataset (e.g. ssd_1/parent/ would become volume_1/parent/).

(In my post-disaster testing, I’ve come to understand that the replication process places the contents of the source dataset in the destination location, i.e. volume_1/, and I should have created a new, empty dataset to use as the destination.)
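From my post-disaster testing, I believe the GUI task boils down to something like the following send/receive pairs. This is a sketch using my dataset names and a hypothetical snapshot name; the exact flags TrueNAS’s middleware uses may differ:

```shell
# Roughly what the replication task does under the hood
# (hypothetical snapshot name; do NOT run against a pool you care about).

# Receiving a dataset into a new child path places its contents there,
# which is what I expected to happen:
zfs send -R ssd_1/apps@manual-snap | zfs recv volume_1/apps

# But receiving a whole-pool stream into the pool's root dataset with
# -F (force) rolls the destination back and destroys datasets that are
# not in the stream -- which is what wiped volume_1 when I accidentally
# selected the whole ssd_1 pool as the source:
zfs send -R ssd_1@manual-snap | zfs recv -F volume_1
```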

As I found out in my second post in this thread, by mistakenly including the ssd_1 volume in the source, I enabled the replication process to try to overwrite the entire contents of the volume_1 (HDD) pool. Had I selected only datasets in the source, I believe the replication would have worked as I had intended.

The SSD pool is unaffected. A single dataset remained in the HDD pool, all the others were destroyed/deleted - I’m not sure if those two terms are synonymous, so I may have confused matters further by using the wrong one.

Based on the old post I referred to at the top of this thread, I next tried to use the CLI to undo the damage.

  1. zpool history -il volume_1 to find the transaction number of the first destroy action
  2. zpool export volume_1 to disconnect the pool
  3. zpool import -T 123456 volume_1 to try to import the pool in its pre-disaster state
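For step 1, in concrete terms, that meant fishing the txg number out of the history output, something like the pipeline below. This is a sketch with made-up txg values; the exact format of zpool history -il output varies between OpenZFS versions:

```shell
# Sample of roughly what `zpool history -il volume_1` prints
# (hypothetical txg values for illustration):
history='2024-05-01.10:00:00 [txg:123455] snapshot ssd_1/apps@auto
2024-05-01.10:00:05 [txg:123456] destroy volume_1/media
2024-05-01.10:00:06 [txg:123457] destroy volume_1/cloud'

# Grab the txg of the first "destroy" entry, then subtract one to get
# the last transaction before the damage started:
first_destroy_txg=$(printf '%s\n' "$history" | grep -m1 'destroy' | sed -E 's/.*\[txg:([0-9]+)\].*/\1/')
rollback_txg=$((first_destroy_txg - 1))
echo "$rollback_txg"   # the number to pass to `zpool import -T`
```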

Obviously this failed, hence my begging for advice. Rather than make things any worse, I shut the server down.

Thanks for your response. I hope there’s enough clarification here to go on with.


Other than rewinding to a checkpoint (safe and reliable) or an emergency import (not always reliable), your data and datasets are pretty much gone forever.
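For reference, a checkpoint only helps if it was taken before the accident, e.g. right before a risky replication or destructive change. The commands look roughly like this, as I recall the OpenZFS syntax (don’t run these blindly):

```shell
# Take a pool-wide checkpoint before doing anything risky
# (a pool can hold only one checkpoint at a time):
zpool checkpoint volume_1

# If the risky operation goes wrong, rewind the whole pool to it:
zpool export volume_1
zpool import --rewind-to-checkpoint volume_1

# If everything went fine, discard the checkpoint to free its space:
zpool checkpoint -d volume_1
```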

I’m still trying to understand how you managed to destroy the datasets on the HDD pool.

Did you choose the option “Replication from scratch” in the Replication Task?

If a replication cannot proceed because the target already contains data and there are no common base snapshots, it will simply abort. However, if you chose “Replication from scratch”, then it will go ahead and overwrite the destination.

Is the SSD’s dataset now the only one that exists on the HDD pool? Did you essentially replace everything on the HDD pool (including all child datasets) with a dataset from the SSD pool?

Though I don’t recall exactly what I did, it seems likely that this is it. I won’t blame this on some random UI glitch; I’m reasonably sure it was all me. That said, I’m also reasonably sure that had I not accidentally clicked on the root of the pool in the source selection (something that’s easy to do when aiming for the expansion widget: a few pixels left and you click the checkbox, a few pixels right and you click the pool name, and either action selects it), the replication would have refused to proceed regardless of the other options. At least, that was the only way I was able to reproduce the error in my VM tests.

Not quite. The single remaining dataset was one I’d set up for Windows File History. While the destroy process completed for all of the other datasets on the HDD, this one came back as being “in use” or words to that effect, and the replication process errored out. I guess that my desktop was coincidentally sending FH data to the drive at the time, otherwise it’d have been destroyed along with all the others. As it is, none of the SSD data was actually written to the HDD pool, not that that helps me now, it seems.

I guess I’ll just have to rebuild a new pool and set it all up from scratch, then wait a few years while my TV and Film collection rebuilds itself (mainly by recording broadcast TV). Oh well. I live and learn. Maybe.

Thanks again for the advice.

OK, things are getting a little weird now.

I started the server back up again and found that the HDD pool had been imported, even though when I’d tried to reimport it via the CLI it had reported that one or more devices were unavailable. I turned the server off at that point and hadn’t turned it on again until now.

The HDD pool was still showing that it only had one dataset, the file history one. I recreated a Cloud dataset, for which the Samba share is still in place, but I guess I’ll need to delete and recreate the share because it’s not showing up from my desktop.

However, when I looked at the server’s shares from my desktop, the new Cloud share wasn’t visible. The Backups folder and its previous contents are visible, even though TrueNAS doesn’t show a dataset for it. I tried to recreate the dataset, assuming it would be recreated empty and probably wipe all that data anyway, but TrueNAS refuses to let me, saying that the dataset already exists.

If I look in /mnt/volume_1/, there is indeed a Backups folder, and it has (some of) the content it had before the cockup, so I’ve created a temporary dataset and moved that data into it. Now how do I wipe the records of the old dataset so that I can recreate it?

(Edit: deleting /mnt/volume_1/Backups seems to have done the trick.)

Unfortunately, none of the other datasets seem to have been accidentally preserved in this manner: /mnt/volume_1 only includes the not-screwed-up FileHistory dataset/directory and the newly recreated empty Cloud and Media directories.

Final update, I think.

After a lot of permissions issues, I finally got all my apps working again and able to read/write/modify all the files they needed to. None of the data in that randomly preserved Backups folder seemed to be current: when I ran the appropriate backup tasks on my PCs, they insisted on replacing everything on the server anyway.

So, as expected, I’m back up and running with the exception of all of my TV and movie files, which is a real pain in the backside but not really a problem. I have, at least, learned not to make that particular mistake again.

Here’s to my next mistake!
