Resilver after resilver … how to get out of this loop

I have 8 x 12 TB drives in RAIDZ2. I recently added 3 other HDDs as a SPARE vdev. I am starting to regret that move.

A drive reported an error, a spare swapped in automatically, and a resilver started. That triggered another drive error and another swap, and then another error.

Anyway … 4 days of resilvering has given me this …

zpool status screenshot

previous zpool status screenshot

I suppose it is getting better.

Suggestions? Just wait. And wait?

I wonder why the drives were degraded / removed. And yeah - it's a matter of waiting and hoping.

You do have a backup, don't you?


Check SMART reports of all drives to decide which ones are healthy.
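Something like this gives a quick pass over them (a sketch only, assuming the disks show up as /dev/sda, /dev/sdb, etc. - adjust the device paths to suit your system):

    smartctl -H /dev/sda                               # quick PASSED/FAILED health summary
    smartctl -a /dev/sda                               # full SMART attributes and error log
    for d in /dev/sd[a-z]; do smartctl -H "$d"; done   # loop the quick check over every drive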

Thx for the replies … and that flow chart. I will look at that.

Yes - I have a backup. I have several backups of the critical items.
And I run nightly SMART reports and tests.

7 hrs for this resilver to finish … and the next one to start :).

I once had 30 drives vanish from a 90-bay JBOD that had 4 hot spares in the server head. The output looked similar to yours, but once I fixed the issue (a faulty JBOD) the server resilvered one drive at a time and all was fine.

I’d say give it time and try to identify the underlying issue. It may just be bad drives, but it may also be something else.


Reseat all your cables (with the system off, of course). I had similar behaviour with a loose HBA connector.

Resilver finished. It has been done for 40 mins now without kicking off another one. Here is the current status …

current zpool status

Now … if I am reading this correctly,

  • bff (first 3 letters of its GUID) is mirrored with elc
  • elc is dead … and I can pull it / replace it
  • 5eo is mirrored with d25
  • spare bff is in use
  • spare d25 is in use

So … I think my next step is to replace elc with bde. Is that what I should be doing?
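From what I have read, the command would be something like this (just a sketch - 'tank' is a stand-in for my pool name, and the real GUIDs/device paths have to come from zpool status):

    zpool replace tank <guid-of-elc> <bde-device>   # resilver bde in place of elc
    # elc is detached automatically once the replace finishes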

BTW - you really can't trust these HDD names, can you? It's either the vdev name, PARTUUID, or serial number … and they keep shifting. My GUI says there is one disk available to be added to a pool (sdd), but that vdev is associated with 5e1599a4, which is online … the elc (removed) disk is nowhere to be seen.

Yes, we say this a lot. You cannot reliably track a drive by its assigned device name; it can change when the computer is rebooted.

Yeah - I read that a lot too. Annoying … but we really need a way of tying all of these names together. I have a script (HDDInfo) that runs twice a month, collects all their ‘names’, what pool they belong to, and whether they are a spare (I also add their physical location to their description), and emails that info to me.

I thought I could trust that. It now looks like I can't. I think I need to revisit the logic I am using for the pool name.
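For what it is worth, the core of what a script like that gathers comes down to a few commands along these lines (a rough sketch, not the actual HDDInfo script):

    lsblk -d -o NAME,SERIAL,MODEL,SIZE   # kernel device name -> serial number and model
    ls -l /dev/disk/by-partuuid/         # PARTUUID -> current sdX name
    zpool status -P                      # pool membership shown by full device path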

Something weird is going on with my spares. I have 10 disks: 8 in RAIDZ2 with 2 hot spares. For some reason, it was showing me 3 hot spares. I had to detach the ‘free’ hot spare and then use that disk to replace the removed disk.

Resilver in progress.
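At the command line, the equivalent would be roughly this ('tank' and the device names are placeholders; the exact command depends on how the spare is currently listed in zpool status):

    zpool remove tank <free-spare>                    # if the spare shows as AVAIL, drop it from the spares list
    zpool replace tank <removed-disk> <free-spare>    # then resilver it in as the permanent replacement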

How do you remove the spares from those ‘mirrors’?

Just in case, how is the power supply to those disks?
I had some weird issues when trying to run 12 disks from one PSU ‘line’.
When I did 6 per line, most issues were gone.

Adding 3 drives and suddenly seeing errors appear is not just a coincidence.

I have 4 x 4-HDD cages. I believe that each cage is on a different SATA power cable from the PSU.
Hmm … it might be 2 cages per feed from the PSU. I will check that out.

Resilver finished. Status is now healthy. I still had 2 spares attached in the SPARE vdev, one in use and one available. I removed the available one.

GUI reports 2 drives available to add to a pool (the spare I just removed) and another that is actually part of a pool (see screenshot).

disks in pool and disks available to add to pool

Why does sdf appear in both groups?

Can we see the output of zpool status?

I’ve resolved this duplicate name issue. I replaced one of them with a genuine spare disk. Once I got rid of all of the mirrors, I removed the hot spare vdev.

I am not doing that again (hot spare vdevs) … in future, the system can just be degraded until I can manually swap a fresh disk in.
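For the record, clearing those spare ‘mirrors’ and then the spares themselves comes down to commands along these lines (a sketch; ‘tank’ and the device names are placeholders):

    zpool detach tank <in-use-spare>   # drop a spare out of the temporary spare/disk mirror
    zpool remove tank <idle-spare>     # remove an unused spare from the spares list entirely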

2 Likes

“Just in case, how is the power supply to those disks? I had some weird issues when trying to run 12 disks from one PSU ‘line’. When I did 6 per line, most issues were gone.”

I checked on the SATA power cables to the HDD cages this morning. I had it set up as 1 power cable to 2 cages (8 HDDs max).

I have changed that to 1 power cable per cage (4 HDDs max). My 2 x SSDs (boot) get their own power cable (which is probably overkill).