Salvage Degraded Data VDEV (Maybe?)

Hello All!
I have been running TrueNAS SCALE for about 2 years at this point. Approximately a year ago I finally got everything running the way I wanted, and then I kind of walked away and dove into other projects I wanted to work on (classic mistake for a tech geek with ADHD like me). Fast forward about 6 months and I had some sort of drive error that put my SSD pool into a DEGRADED state, though it was still running and working fine since it was only one drive. I ordered a replacement drive, but when I went to swap it in and started the system back up, the drive that had supposedly failed was working with no issues again. It ran for a while with everything fine, then after a few shutdowns and restarts, boom, it was acting like it was bad again. I replaced the drive with the new one and kept having issues with the pool randomly going into a DEGRADED state. This is where I made my biggest mistake: since the pool was still functioning, I left the 5th drive of the 5-wide RAIDZ1 (the one that seemed to be the issue) out, intending to cross my fingers and back up the data, but life happened, I never went back to add the drive, and the system kept running without issue.
So here we are another 6 months later and we had an area-wide power outage. TrueNAS failed to come back up automatically, so I manually restarted it, and that is when I noticed another drive showing a DEGRADED state, my apps not functioning, etc. I restarted via the web GUI and it wouldn't come back. Yesterday I spent half the day trying to get it to boot with no success (I may have been using the wrong instance??), but then out of the blue it booted up and I am somewhat back to functioning as I was before the power went out. Most apps are up and running and my VM that starts on boot seems to be running, however some apps are not working and I still have the drive issue.
Basically, I realize I was a total jackass for doing this the way I did, and honestly I will be very surprised if I have not totally lost all my data on the SSD pool. I totally understand that and make no excuses, but now I am trying to figure out how best to move forward, save anything I can, and get my server back to a state where I want it and can keep up with regular updates.
Obviously a 5-wide RAIDZ1 VDEV with only 3 functioning disks would be totally lost, and I know and understand that. The only thing confusing me is the error message I am getting and what things look like in the storage dashboard…
[screenshot: error message and storage dashboard]

Any suggestions, explanations, tips on how to best move forward?

TrueNAS Version: 22.12.3.1
Motherboard: Asus TUF Gaming X570-PLUS (WI-FI)
RAM Qty: 128 GB (4 sticks)
CPU Make/Model: 5950X
NIC: onboard, plus a couple of USB 2.5 GbE external NICs
Hard Drive(s) Make/Model:
5x Team Group EX2 2 TB SSD
5x 10 TB spinning drives
2x 112 GB NVMe drives as a metadata VDEV for the spinners

Replace the unavailable drive and put a time limit (retention period) on your snapshots.
Don’t let a ZFS pool go over 80% full; ZFS does not like it. I’d say 50% is better, as snapshots are not magical (they take space).
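If you want to check how full each pool actually is from the shell, this read-only command (no assumptions beyond having shell access) shows size, usage, and health:

zpool list -o name,size,allocated,free,capacity,health

Keep the SSD pool’s capacity column well under that 80% mark.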

Have no fear :slight_smile:
Cheers

Is your thinking that the degraded drive is somehow related to the snapshot overload issue?

In addition to my last question about your suggestion, I have a couple more if you don’t mind. I just want to make sure I am understanding correctly; I had a very solid understanding a year ago and I feel like I have lost it all!!!

Does it make sense to run a replication task to back up the current data onto my spinning drive pool, before I try to add the new disk back in? My fear is that something will fail during the resilvering process and I won’t be able to get anything back.

Also, I have some reading to do since, again, I have forgotten so much, but a snapshot is not necessarily a backup (an additional copy) of the data, correct? It is basically just a history of changes? Is it typically stored on the same pool/VDEV or a different one?

That is one of the other odd issues I am seeing, all of my snapshot tasks and replication tasks seem to be gone so I don’t remember what was running, when, and to where it was going.

I am just really confused as to how this 5-drive array is not totally down with one drive completely missing and another showing DEGRADED…

Resilvering will read all the data, and so does copying the whole drive elsewhere. So do the copy if your fear tells you to. Fear is not stupid; fear has doubts. But either way the degraded hardware gets about the same level of exposure.
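For reference, a manual copy to the spinning pool boils down to something like this sketch (pool and dataset names are placeholders; a GUI replication task does essentially the same thing under the hood):

zfs snapshot -r <ssdpool>/<dataset>@pre-resilver
zfs send -R <ssdpool>/<dataset>@pre-resilver | zfs recv -u <hddpool>/backup

The -R on the send side includes child datasets and their snapshots, and -u on the receive side keeps the copy from being mounted over anything.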

No

Yes: it is the history, and the data it references keeps occupying space. If you deleted something two weeks ago, …let it go. It is a “just in case” rather than a backup.
And yes, it is kept in the same dataset.
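If you want to see this in action, every dataset exposes its snapshots through a hidden .zfs directory on that same filesystem (the path below uses placeholders):

ls /mnt/<poolname>/<dataset>/.zfs/snapshot/

Each entry there is a read-only view of the dataset as it looked when that snapshot was taken.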

Yep, gotta love TrueNAS. It’s quite good stuff.


These NVMe drives die faster than SATA SSDs, and SSDs faster than HDDs, under heavy usage. Do set up the email alerts so that next time you’re not in such a tight spot. Also, RAIDZ2 is a beautiful thing to have when you have 5 or more drives.

Regarding snapshots and the 50% figure in ZFS: say a crypto virus gets in and overwrites every file… it’s OK, as long as there was enough space to keep both the new and the old versions.

107,000 snapshots!!! Wow!!! Definitely too many! My system, running for many years, has a grand total of 1,176.

Just so you know, if there is a drive error, best to handle it that day. Not wait. That’s what alerts are for.

Also, you always need backups, no matter how you set up your pool(s).

Finally, if it’s truly a missing label, sometimes those can be fixed. I personally have never done so; however, I’ve seen others here say there is a backup label somewhere, so maybe they can help you with that. If that is actually the issue, I’d search the forums for missing labels and see if you can contact the folks who help with that.
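If you do go down that road, the read-only zdb tool can dump the labels from a disk’s ZFS partition so you can see what is actually damaged (the device node below is just an example; check lsblk first, since TrueNAS usually puts the ZFS data on a partition rather than the whole disk):

zdb -l /dev/sdX2

A healthy member disk has four copies of the label; any copy that cannot be read is reported as failed to unpack.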


Can you put this drive back?

Yes, I can stick the drive back in, but the issue is that I was getting inconsistent results with both the original drive that I thought had failed and the replacement drive, which was also acting funny after I swapped it in and rebooted.

The other issue is that all of this happened about 6 months ago and I have been running with only 4 drives since then… so I am not sure whether the array will accept either drive, since its data obviously will not match the current data. I will say, though, that the changes over the past 6 months have been very minor. One last little tidbit, in case it was not in any of the screenshots: only about 1 TB of almost 8 TB is actually being used in this pool.

Anyway, I’ll probably bite the bullet, test the drive I have, stick it back in, and hope the resilver is a success. Not much else I can do.
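In case it is useful to anyone following along, the drive-test part can be done from the shell roughly like this (sdX is a placeholder device node; the replacement/resilver itself is best kicked off from the Storage page in the GUI so TrueNAS handles the partitioning):

smartctl -t long /dev/sdX
smartctl -a /dev/sdX

The first command starts an extended SMART self-test; the second shows the results and overall health once the test finishes.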

Thanks again for all the help and advice! I will report back with my results.

Oh, yeah, one more question I just remembered.

How can I get rid of, remove, whatever I need to do, most of the old snapshots?

The web GUI seems to lock up when I try to look at snapshots, so I am assuming I will need to do it via SSH into the system, but I am not sure of the best and safest way to go about that.

I think @winnielinnie had an easy method to do so with the CLI.

:warning: First and foremost, create a checkpoint for your pool.

zpool checkpoint <poolname>
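If you want to confirm the checkpoint actually exists before going further, zpool status <poolname> should show a checkpoint line, and the pool’s checkpoint property reports the space it consumes:

zpool get checkpoint <poolname>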

Then you’ll want to list out the snapshots in TXG order. (Which is the default, anyways)

Even though you can do this recursively, it’s safer to just do it at a per-dataset, step-by-step process.

Yes, it’s more “manual” and will take more time and require more redundant steps. But when it comes to data, it’s better to go slowly rather than go crazy with recursive, bulk actions.

zfs list -t snap -o name,used,refer,createtxg -s createtxg <poolname>/<path>/<to>/<dataset> | less

The command might appear to “hang” the first time you run it, since it needs to fetch and populate this list.

The oldest will be at the top of the list, and the newest will be at the bottom. This also assumes you’re only looking at the snapshots of a single dataset.

Use PgDn, PgUp, ▼, and ▲ to scroll and assess the list. Press Q to exit the scrollable less view.


After you’ve decided on the “range” of snapshots you wish to prune, you can do a “dry-run” to simulate deleting the sequential batch. (I’m using fake snapshot names and dates for this example.)

For my example, I choose to delete everything from @auto-2022-01-01 to @auto-2024-06-30

zfs destroy -nv <poolname>/<path>/<to>/<dataset>@auto-2022-01-01%auto-2024-06-30

The % symbol instructs the zfs destroy command to delete all snapshots from auto-2022-01-01 to auto-2024-06-30.

:warning: If you feel comfortable, remove the -n flag, which will delete the snapshots this time, without a “dry-run”.

zfs destroy -v <poolname>/<path>/<to>/<dataset>@auto-2022-01-01%auto-2024-06-30

You can repeat this again for the other datasets that need pruning of their snapshots.


Shortly after, if everything looks good and you’re happy with the results, you can safely remove the checkpoint.

zpool checkpoint -d <poolname>

I always found the Space bar the simplest way to page down.

I have been slowly working through some of this stuff to ease myself back into working with my system and I have run into a few related questions.

First, I do already have a backup on my bigger data pool. I just did a quick ls -l of both the existing data and the backup data (I tried to include the output, but for some reason it would not let me add images to my response), and at a very high level they look the same, which obviously does not say much. Is there a command I could run that would compare these filesystems and show the differences between the two? That way I could get an idea of how accurate this existing backup is.

Second, and really the more important and relevant question: how should I approach cleaning up these older snapshots when I have a huge chunk of them at every single level of the filesystem? Do I literally have to go through dir by dir and repeat the above process (zfs list, zfs destroy), or is there a way to (1) show all snapshots within a filesystem and (2) clean them up at that same higher level? Hopefully that makes sense.

This is one of the things I always struggled with. Even now, when I go back and look at creating snapshots or running replication tasks (which basically rely on snapshots), I get confused about the various levels at which individual snapshots are taken, which is obviously part of what got me into this mess of well over 100k snapshots.

Any further direction or suggestions would be greatly appreciated!

“Every single level of the file system”? What do you mean by this? Each dataset is a filesystem. Snapshots are a property of a dataset.

Did you instead mean “every single snapshot that exists across all datasets in the pool”?


“Dir by dir”? Do you mean “dataset by dataset”? If so, please refer to this from my previous reply:

I don’t want to share how this can be done “recursively” (it’s easy to find out yourself), because destructive recursive operations are very dangerous, and it’s not far-fetched for someone to accidentally destroy something they “didn’t mean to”.

That’s why when it comes to destroying snapshots, it’s much safer to approach this “dataset by dataset”, even if it’s more tedious.
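If you only want to see where the snapshots are piling up (a read-only count, nothing destructive), a one-liner like this tallies snapshots per dataset across the whole pool (pool name is a placeholder):

zfs list -H -t snap -o name -r <poolname> | cut -d'@' -f1 | sort | uniq -c | sort -rn | head -20

The datasets at the top of that output are the ones worth pruning first, one at a time as described above.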


This is possible with the zfs diff command.

Find the newest snapshot on the destination dataset:

zfs list -H -t snap -o name -s createtxg <backpool>/<dataset> | tail -n 1

Then use that snapshot’s name (let’s pretend it’s @auto-2024-05-01) as the first and only argument for the zfs diff command, to compare this snapshot (on the main pool) to the live filesystem (also on the main pool):

:warning: Try to avoid writing new data to the filesystem as you’re doing this.

zfs diff -FHh <mainpool>/<dataset>@auto-2024-05-01

If you want to save the differences as a text file:

zfs diff -FHh <mainpool>/<dataset>@auto-2024-05-01 > /path/to/differences.txt

You can now inspect differences.txt.

+ means “added since”
- means “deleted since”
M means “modified since”
R means “moved/renamed since”
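With the -F flag there is also a column for the file type (F for a regular file, / for a directory). A made-up example of what a few lines of the tab-separated output might look like:

M    /    /mnt/<mainpool>/<dataset>/
+    F    /mnt/<mainpool>/<dataset>/new-report.txt
-    F    /mnt/<mainpool>/<dataset>/old-report.txt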


:warning:
BIG FAT WARNING: Pay close attention to “spaces” in your commands. A wayward “space” can be the difference between deleting a subfolder or deleting an entire directory tree.

Create a pool checkpoint before you partake in any destructive actions. You can remove the checkpoint later.

rm -r Documents/ temp/old
rm -r Documents/temp/old

Spot the difference? If not, say goodbye to all your documents! :wave:


I demo a number of uses for snapshots and my tiered snapshot strategy in this video

It might help your understanding


Snapshots are a dataset property.

If you want a non-technical visual on how they work at a per-dataset level, you can check this out.

Another update along with a few more questions. Thanks again for bearing with me on this!!!

First, I replaced the missing disk and resilvered, and I seem to be running fine with 5 active, working disks now; however, one is still showing DEGRADED…

[screenshot: pool status showing one disk still DEGRADED]

Not really sure how to fix this issue or what is actually wrong with the disk/data. My only thought is to pull the disk, format it, and put it back in to resilver… but I am fairly sure that is not the right way to go about this, and it would be a last resort.

So any suggestions or advice on that would be great!

Second, when it comes to the snapshot issue, the quantity has gone down dramatically from where it was, but the alert is still showing 46,125.

I have made an attempt to go through and trim some snapshots, dataset by dataset, but the numbers were not adding up. Based on the structure of my pool there still seem to be way more snapshots hiding somewhere.

The problem seems to be that there are snapshots at basically every level of every dataset.

For example, if I look at flash/ix-applications/releases/code-server/charts there are multiple pages of snapshots, and that theme continues throughout the pool. At every level there are basically 2 weeks’ worth of hourly snapshots (that is how I had my snapshot tasks set up), which, as you can imagine, adds up very quickly.

Going through level by level and eliminating these is extremely time consuming, borderline impossible. However, I did change the snapshots to run only once a day for now, so I am assuming the quantity should go down significantly within 2 weeks? Rather than fighting to remove all of them manually, should I just wait it out and see if the issue resolves itself?
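For reference, a read-only way to count how many snapshots are hiding under a branch like that (assuming the pool is named flash, as in the path above):

zfs list -H -t snap -o name -r flash/ix-applications | wc -l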

Third, maybe my real core issue comes down to understanding what I am doing wrong to end up with so many snapshots. Is it possible that my datasets are designed/set up poorly, or is it just that recursive hourly snapshots kept for two weeks is overboard and that is the main issue?

Just for a little context here is a screenshot of my datasets for this pool.

IMO, hourly for 2 weeks is too many. Again, see my video on tiered snapshots.
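To put rough numbers on it: hourly snapshots kept for two weeks is 24 × 14 = 336 snapshots per dataset, and a recursive task over ix-applications can easily cover a few hundred nested datasets (that count is just an assumption for illustration), so 336 × ~300 ≈ 100,000. The combination of recursion over the apps datasets and the hourly schedule is the multiplier, not your dataset layout itself.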

Regarding your degraded pool… we need a bit more info to determine the right course of action.

Can you paste the result of sudo zpool status?