Help wanted for salvaging data from a faulted disk in a mirrored pool

lailakas · September 23, 2024, 5:44am

Hi all. I have a pool of four vdevs, each a 2-disk mirror. A few days ago, one of the drives failed and changed to the “Removed” state. I bought a new drive, swapped out the failed one and clicked “Replace”. The pool started resilvering, but soon I was notified that some sectors were not readable on the other drive in the mirror, and that drive is now in the “Faulted” state.

Now I have a completely failed drive, a damaged drive with data on it, and a new drive with no data.

After the resilvering finished, the new drive is reporting millions of checksum errors. I guess that means the data was not actually copied onto it? The resilver report says only 18G of data was resilvered, while I clearly have more than that. I tried to scrub the pool, but I just see the number of checksum errors increasing.
Here is a screenshot of the current pool status:

My questions are:

How do I copy all the data off the faulted drive? The disk has bad sectors, so who knows how long it will last. I want to get the data out of there ASAP.
How do I track down which files are corrupted? I do have additional backups of the important files, but not of everything.
How do I remove those corrupted files to clear the errors?

Johnny_Fartpants · September 23, 2024, 5:57am

Hello and welcome to the forums.

Can we see the output of zpool status and also smartctl -a /dev/sdc?

lailakas · September 23, 2024, 6:04am

Sure, sorry for using screenshots. My keyboard doesn’t have an Insert key.

Johnny_Fartpants · September 23, 2024, 6:09am

How is this system cabled? Is there a chance that when you replaced the failed drive you could have accidentally nudged a cable to this other drive? The situation doesn’t look good atm so I would check cables first.

lailakas · September 23, 2024, 6:19am

Thank you for your reply. The system is cabled within a server chassis. The chassis has two SATA backplanes, each holding 4 drives in the front-panel drive bays. Each mirror pair is split across the two backplanes. I did pull out the working drive to look at its serial number (while the system was powered off), but I don’t think that affected the cables. I’ll try cleaning out the dust, maybe it will help. Thank you for your suggestion.

What is the next best thing to do in your opinion?

Johnny_Fartpants · September 23, 2024, 6:25am

If you are able to access your data, back it up ASAP (if you haven’t already) before anything else.
Check the cabling etc.
Run zpool clear and then run a scrub.
If things look better after that try to replace your failed drive again.
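
For reference, a minimal sketch of the clear-and-scrub steps above. The pool name tank is a placeholder (the thread never states the real one), and run is a hypothetical helper that only prints each command unless DRY_RUN=0 is set, so nothing touches the pool by accident:

```shell
POOL="${POOL:-tank}"      # placeholder pool name, not taken from the thread
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands; 0 = really run them (as root)

# Hypothetical helper: print a command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

run zpool clear "$POOL"       # reset the pool's error counters
run zpool scrub "$POOL"       # start a scrub, which re-reads and verifies every block
run zpool status -v "$POOL"   # check progress; -v also lists files with permanent errors
```

Running zpool status -v again once the scrub has finished shows whether the checksum counters stay at zero. Any files it lists under permanent errors are the ones to restore from backup.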

Worst-case scenario here, you have lost one drive in a mirror and the other is having issues. The saving grace is that the SMART output for your possibly failing drive looks fine to me, which is why I’m thinking it could be a cable issue. Needless to say, if two drives in a mirror have issues, your data is at risk.

A final option, which may not be possible: could you move these drives to another system to rule out hardware issues, and perform the zpool clear / scrub and resilver in that environment?

lailakas · September 23, 2024, 6:28am

Oh, sorry for the confusion. The damaged drive is sdh. sdc is the new drive that I bought and put into the system. The issue is that resilvering is somehow not copying the data to it. The original failed drive has been removed from the system.

Johnny_Fartpants · September 23, 2024, 6:31am

Ah OK, my bad. That does change things, then.

So you had a faulted drive and swapped it for what is now sdc, started the resilver, and during that process sdh failed before the resilver could complete?

lailakas · September 23, 2024, 6:32am

Yes, that is correct. I think it happened almost right after running the resilver.
Below is the output for sdh:

Johnny_Fartpants · September 23, 2024, 6:35am

In that case there is not much else you can do. Essentially you have lost two drives in a mirror and that results in a loss of the pool. Unfortunately we often see this scenario where one drive fails and another drive about the same age fails during resilver.

Can you still access the pool data? Do you have a backup?

lailakas · September 23, 2024, 6:39am

I haven’t found any corrupt data yet. That led me to think most of the data may be salvageable. I have backups for the important files, but not everything. The pool is too large (~36TB), and I do not have another machine with equal or greater capacity.

Johnny_Fartpants · September 23, 2024, 6:43am

OK, I think it’s safe to say this pool is toast, so it’s now a data recovery situation. I appreciate the limitations in size, but you need to take action now and back up everything you can that you care about. Once that is complete, the pool will need to be destroyed and recreated.
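
As a rough sketch of that rescue copy, assuming rsync is installed, the pool is mounted at /mnt/tank, and the destination is a separate disk mounted at /mnt/rescue (both paths are placeholders, not from this thread). rsync normally reports files it cannot read, carries on with the rest, and exits with code 23 if anything was skipped, so the error log doubles as a list of damaged files. rescue_copy is a hypothetical helper name, and it only prints the command unless DRY_RUN=0 is set:

```shell
SRC="${SRC:-/mnt/tank/}"      # placeholder: where the damaged pool is mounted
DST="${DST:-/mnt/rescue/}"    # placeholder: a separate disk with enough free space
DRY_RUN="${DRY_RUN:-1}"       # 1 = only print the command; 0 = really run it

rescue_copy() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ rsync -a $SRC $DST"
        return 0
    fi
    # -a preserves permissions, ownership and timestamps
    rsync -a "$SRC" "$DST" 2>>rsync-errors.log
    rc=$?
    # Exit code 23 means some files could not be transferred; the log names them
    if [ "$rc" -eq 23 ]; then
        echo "some files were unreadable, see rsync-errors.log"
    fi
    return "$rc"
}

rescue_copy
```

With a ~36TB pool and limited spare capacity, pointing SRC at the most important directories first is probably the sensible order.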

Perhaps RAIDZ2 next time?

lailakas · September 23, 2024, 6:44am

Will deleting the vdev help? I didn’t go for Z2 because I couldn’t afford equal-capacity drives. But well…

Johnny_Fartpants · September 23, 2024, 6:49am

Nope. No vdev, no pool. I appreciate the reasoning around not using Z2, but hopefully you can see why it’s the most common vdev layout.

lailakas · September 23, 2024, 6:52am

Thanks a lot for your help

Johnny_Fartpants · September 23, 2024, 6:52am

No worries, best of luck.

NugentS · September 23, 2024, 11:33am

Actually…

Technically, as it’s a pool of mirrors, you can delete a vdev. The trouble is that ZFS will attempt to evacuate the data from the vdev by reading it and then writing it to the other vdevs, which presumably won’t actually work in this situation due to the errors.

dan · September 23, 2024, 12:15pm

…but the pool has to be online and healthy for that to work.

etorix · September 23, 2024, 8:08pm

Sub-issue: Temperature 61°C!
You’re cooking your drives, and that could be part of the problem.
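
A quick way to keep an eye on this, sketched under the assumption that the drive is SATA and reports its temperature as SMART attribute 194 (or 190 on some models), with the raw value in the tenth column of smartctl's attribute table. drive_temp is a hypothetical helper name:

```shell
# Read `smartctl -A` output on stdin and print the raw temperature in °C.
# Assumes the usual attribute table layout, where column 10 is RAW_VALUE.
drive_temp() {
    awk '$1 == 194 || $1 == 190 { print $10; exit }'
}

# Usage (as root); /dev/sdh is the faulted drive named earlier in the thread:
#   smartctl -A /dev/sdh | drive_temp
```

Checking the value before and after improving airflow shows whether the change actually helped.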