Can I replace a drive during resilvering?

Yes, but what I mean is: why did TN use the two spares, one in mirror-1 (with one drive failed) and the other in the healthy mirror-2 (no drive failures), instead of using one in mirror-0 and keeping that vdev online?

It looks like drives failed first in mirror-1 and mirror-2.
Do you know the timing of the failures?

I managed to recover the pool without using the snapshots; these are the steps I went through.

Starting conditions:
mirror-0 = I/O suspended and offline, 1 drive failed, 1 sane
mirror-1 = 1 drive failed, 1 sane, spare-1 kicked in
mirror-2 = 2 sane drives, 1 taken offline and spare-2 kicked in
Resilvering was in progress, with an expected time to completion of 1+ month
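(For reference, the pool state and resilver progress can be checked from the shell; SATA-pool is the pool from the status output further down, nothing else is assumed:)

# Show pool health, degraded vdevs, and resilver progress/ETA
sudo zpool status -v SATA-pool

# Re-print the status every 60 seconds to watch the resilver advance
sudo zpool status SATA-pool 60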

1st priority: get mirror-0 accessible to replace the drive
From the shell I ran zpool clear, which made the sane mirror-0 disk accessible so it could be put online again.
With the vdev accessible, I offlined the faulty mirror-0 drive (Offline > Replace), swapped it with another one, and the resilver started (a rough CLI sketch of the same flow follows below).
The resilver of mirror-0 took 45 minutes.
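Roughly the CLI equivalent of that Offline > Replace flow; the pool name is real, but the disk identifiers are placeholders, not the actual GUIDs:

# Clear the error state so the suspended pool/vdev becomes accessible again
sudo zpool clear SATA-pool

# Take the faulty mirror-0 member offline (placeholder identifier)
sudo zpool offline SATA-pool <faulty-disk-guid>

# Replace it with the newly installed disk; the resilver starts automatically
sudo zpool replace SATA-pool <faulty-disk-guid> /dev/disk/by-partuuid/<new-partuuid>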

2nd priority: replace the mirror-1 faulty drive
The vdev was accessible (it never went offline), so I offlined the faulty mirror-1 drive (Offline > Replace), swapped it with another one, and the resilver started.
The resilver of mirror-1 took 45 minutes.
At the end, TN returned spare-1 to its spare role without manual intervention.

On mirror-2 there were no faulty drives and no changes were made.
At the end, TN returned spare-2 to its spare role without manual intervention.
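This matches standard ZFS hot-spare behavior: once the resilver onto the repaired mirror completes, the spare detaches on its own. If a spare ever stays attached, it can be returned manually (the spare GUID below is a placeholder):

# Manually return a hot spare to AVAIL once the mirror is healthy again
sudo zpool detach SATA-pool <spare-guid>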

Currently the pool is online, with no data loss and no errors reported.


It's probably better if I start looking at other vendors… WD? Any suggestions?

Your choice is between Seagate, Toshiba and WD—full stop. There’s no reason to exclude one because its SMART reports are less readable; the critical parameters require no decoding.

That was over 12 days ago. I can't say the drive is good based on data that old (yes, even 12 days old).

Why? I have yet to see any proof that the drives actually failed. Data corruption, yes, but actual failure, nope.

On the drives you plan to replace, you should run a SMART long test on each and see if they pass or fail. If you really wanted to, run badblocks on each drive (after you have replaced them, of course) to determine whether they are going in the trash or are ready to become your cold spares.
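As a rough sketch of that sequence (device names like /dev/sdX are placeholders; note that badblocks in write mode destroys all data, so only run it on a drive already pulled from the pool):

# Start a SMART extended (long) self-test; it runs inside the drive in the background
sudo smartctl -t long /dev/sdX

# When it finishes (can take hours on large drives), check the verdict and key
# attributes such as Reallocated_Sector_Ct, Current_Pending_Sector, Offline_Uncorrectable
sudo smartctl -a /dev/sdX

# Destructive write-mode surface test; DATA LOSS, drive must be out of the pool
sudo badblocks -wsv /dev/sdX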

Of course, if you feel like replacing the drives regardless, that is fine as well; it's just that many people here can't afford to buy new drives when the old ones are not actually bad.

That is EXCELLENT news!!! Well done.

Thanks, I hope I'll be just as lucky the next time it happens … :stuck_out_tongue:

admin@TNscale20bay[~]$ sudo zpool  status SATA-pool   
  pool: SATA-pool
 state: ONLINE
  scan: resilvered 330G in 00:39:12 with 0 errors on Sun Dec  1 15:26:47 2024
config:

        NAME                                      STATE     READ WRITE CKSUM
        SATA-pool                                 ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            c8f6b750-b6e4-49cd-bd6a-900abf16f428  ONLINE       0     0     0
            fb142f4c-b40a-487c-9952-bffc8830cebe  ONLINE       0     0     0
          mirror-1                                ONLINE       0     0     0
            1e25c7b8-0c84-42a9-83f3-a31699907349  ONLINE       0     0     0
            53ceb564-4db0-47b2-92e6-a2b9725ff3a6  ONLINE       0     0     0
          mirror-2                                ONLINE       0     0     0
            04281099-6984-4e3a-8a38-a61f2ff04bb3  ONLINE       0     0     0
            58d0ad6e-3b1d-4bc9-81a9-6c3f9239b10b  ONLINE       0     0     0
        spares
          09bfcea9-1165-4b6a-8b04-ef0b5607e38d    AVAIL   
          25e2daa8-f839-4cb6-ad95-bf9bed968eaf    AVAIL   

errors: No known data errors
admin@TNscale20bay[~]$ 

This time I’ve been lucky!