Replacing new disk that is taking forever to resilver

Hi, we had a failing disk in a pool and have spare bays, so we chucked in a random spare we had, started the replacement, and the resilver began. However, it's been running for a week now and shows no estimated completion time. Is it safe to remove the new disk we put in (leaving the old failing one) and put in another disk we know is better? Will it replace and resilver as normal?

What’s your setup? RAIDZX? How many drives, how big, how much data…

Please add your detailed hardware setup including how the drives are connected.

What does the disk IO show in the reporting section?

Edit: what is the output of zpool status?


See above, around 28 disks.

Used capacity is around 55TB.
Disk I/O is very slow on the new disk we have put in:
[image: disk I/O graphs; top is the old disk, bottom is the replacement disk]

That is awfully slow. You can monitor the progress, but it does not seem like it’s actually stuck.
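For anyone wanting to track the progress mentioned above without re-reading the whole status output each time, here is a minimal sketch that pulls the percentage and ETA text out of the resilver line. The sample text is illustrative only (made-up numbers modelled on the scan section of `zpool status`), the function name is hypothetical, and the exact wording of the output differs between OpenZFS versions, so treat the pattern as an assumption to adjust.

```python
import re

# Illustrative text only, modelled on the "scan:" section of `zpool status`.
# The numbers are made up and real output differs between OpenZFS versions.
SAMPLE_STATUS = """\
  scan: resilver in progress since Mon Jan  1 09:00:00 2024
        1.10T scanned at 4.50M/s, 950G issued at 3.90M/s, 4.20T total
        948G resilvered, 22.08% done, no estimated completion time
"""


def resilver_progress(status_text):
    """Return (percent_done, eta_text), or None if no progress line is found."""
    # Assumed line shape: "<amount> resilvered, <N>% done, <eta text>"
    match = re.search(r"([\d.]+)% done, (.+)", status_text)
    if match is None:
        return None
    return float(match.group(1)), match.group(2).strip()


print(resilver_progress(SAMPLE_STATUS))
```

In practice the text would come from running `zpool status` against the pool at intervals; if the percentage keeps climbing between runs, the resilver is slow rather than stuck.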

Do you have a known-good, i.e. properly burned-in, drive you could try instead? Maybe the other spare drive?

I don’t know how to abort the resilver process off the top of my head. But with 13 (!!) mirrored vdevs you need to act quickly unless you have backups.

What is the exact drive model you used as a replacement and what are the existing drives?

Yeah, it is progressing but, like you say, very slowly. We have another drive coming in today - can't remember off the top of my head the model of the drive we put in; it was an old one we had spare at the time. With mirror 3, is it possible to add in the extra drive we have coming and remove the one we added (the old slow one)? We’d still have the one drive in the mirror that is working and online, as well as the original faulty one in the system - we'd just be removing the new slow one?

That’s what I would suggest. What about the other spare, isn’t that burned in / tested?

I do not want to ill-advise you with so much data at risk, so I’d wait for someone else to chime in. Maybe @joeschmuck knows how to do this?

We’ve had historical issues with those that I'm not fully clued up on, so we're avoiding using that one.

Having spares attached to such a large pool of mirrored vdevs when those spares have had issues and are to be avoided is a recipe for disaster.

Get replacement drives (CMR), burn them in and replace the spares.

Just to be clear: if a single vdev fails, the whole pool dies.

Yep, it's what I've been left with, so I'm in the process of attempting to get it all sorted.

These are the symptoms that Blocks and Files and ServeTheHome publicized when WD surreptitiously replaced CMR drives in their Red ‘NAS’ product family with SMR drives. Ridiculously long resilvers ensue. Other OEMs have followed suit, shoveling SMR drives into all the product families they ship, since SMR drives are 20% less expensive to produce per TB than CMR drives.

Big data centers don’t care because their systems are designed around this (i.e., IIRC, Backblaze has multiple copies of your data across multiple pools, so if one pool goes down for a 2-day resilver, they just use the other pools to serve you).

It is possible that the faulty disk, if not offlined, could cause the resilver onto the new drive to fail.
Until the resilver is complete, ZFS will keep trying to finish resilvering the disk. It can cause replication to stop.
When the resilver restarts, which is automatic, ZFS will check where it left off and resume from there. I'm not sure whether there is a log that indicates when ZFS restarts a resilver.
If the resilver is constantly being reinitialized, it might be better to offline the faulty drive, but this is not without risk due to your pool configuration.


@chuck32 Thanks for the vote of confidence; however, I have no personal experience with large arrays of drives. But it does sound a bit like SMR.

To check if you have an SMR drive, run this script. It is nice to have, and worth running at least once to make sure you are good.


Just to update on this: detaching the new (faulty) disk stopped the resilvering process. We then added a brand-new CMR disk which we knew was working and began the replacement again from the old faulty drive to the new one. This is now nearly complete, so it all works fine doing it that way.
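For anyone landing here later, the sequence described above can be sketched as a dry run that only prints the commands rather than running them. This is a sketch under assumptions, not the exact commands used in this thread: the pool name `tank` and the device names `da27`, `da5` and `da28` are hypothetical, the function name is made up for illustration, and on TrueNAS the same steps would normally be done through the UI, where devices may appear under gptid labels instead.

```python
import shlex


def replacement_plan(pool, slow_new_disk, old_faulty_disk, good_new_disk):
    """Build the command sequence as argv lists; nothing is executed here."""
    return [
        # Detach the slow replacement disk, which cancels the in-progress replace.
        ["zpool", "detach", pool, slow_new_disk],
        # Start a fresh replace from the old faulty disk onto the known-good disk.
        ["zpool", "replace", pool, old_faulty_disk, good_new_disk],
        # Check that the new resilver has started.
        ["zpool", "status", pool],
    ]


# Hypothetical pool and device names, not taken from this thread.
plan = replacement_plan("tank", "da27", "da5", "da28")
for argv in plan:
    print(shlex.join(argv))
```

Printing the plan first makes it easy to double-check each device name against the current pool status before touching anything, which matters when a mistake in a mirror pool can cost a whole vdev.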


Mirror resilvers are typically VERY fast; it’s actually an advantage of mirrors. I would wager you had an SMR drive.
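To put rough numbers on that, here is a back-of-the-envelope estimate using the figures from this thread (about 55TB used across 13 mirror vdevs). It assumes the data is spread evenly across the vdevs and that the resilver is limited only by the sustained write rate of the new disk; the 150 MB/s and 5 MB/s rates are assumptions picked for illustration, not measurements from this pool.

```python
def resilver_hours(used_bytes, mirror_vdevs, write_mb_per_s):
    """Rough hours to resilver one mirror member at a sustained write rate."""
    # Assumes data is evenly spread, so one vdev holds used_bytes / mirror_vdevs.
    per_vdev_bytes = used_bytes / mirror_vdevs
    seconds = per_vdev_bytes / (write_mb_per_s * 1e6)
    return seconds / 3600


USED_BYTES = 55e12   # about 55TB used, as reported in the thread
MIRROR_VDEVS = 13    # 13 mirrored vdevs, as reported in the thread

# Both write rates below are assumed values for illustration.
for label, rate in (("assumed 150 MB/s", 150), ("assumed 5 MB/s", 5)):
    hours = resilver_hours(USED_BYTES, MIRROR_VDEVS, rate)
    print(f"{label}: {hours:.1f} h")
```

Under these assumptions a healthy drive finishes in well under a day, while a drive stuck at a few MB/s takes on the order of ten days, which is in line with the week-long resilver described at the top of the thread.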
