Through a mixture of neglect and ignorance I have created a pool with two vdevs, which started out as two mirror vdevs. I added a hot spare, which has subsequently been put into use, and I have removed the related failed drive. This is the status of the pool as it stands now:
root@kore[~]# zpool status heracles
  pool: heracles
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: resilvered 64.6G in 00:08:33 with 0 errors on Thu May  2 08:32:47 2024
config:

        NAME                                      STATE     READ WRITE CKSUM
        heracles                                  ONLINE       0     0     0
          spare-0                                 ONLINE       0     0     0
            bd2928b9-abb0-46b8-95cb-f6a2b775017e  ONLINE       0     0     0
            32f92237-eb6f-403f-9135-a1d4b54ab9c6  ONLINE       0     0     0
          mirror-1                                ONLINE       0     0     0
            a2b236c0-20c3-4c16-a5cc-f0ecdf951440  ONLINE       0     0     0
            55c81f95-608c-46d7-b2f7-1e9d9865329b  ONLINE       0     0     0
        spares
          32f92237-eb6f-403f-9135-a1d4b54ab9c6    INUSE     currently in use

errors: No known data errors
I am awaiting delivery of new disks, as the "INUSE" spare has had a few errors in the past. I can only add a single disk at a time due to SATA port occupancy.
Is there ANY way I can 'convert' the vdev 'spare-0' back into 'mirror-0'? Is replacing the "INUSE" spare with a new drive going to get me back to where I need to be?
What you have here is a stripe of a mirror and a single-disk vdev. Are you sure you removed the right drive?
The single disk had a failure and is being replaced by the spare. From there, ZFS expects decisions by the administrator.
You can extend the single drive with a new drive to make the vdev a mirror and remove the spare.
Storage > Pool > (gear) > Status > (disk) > (…) > Extend
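For reference, a rough CLI equivalent of that GUI "Extend" action is `zpool attach` followed by a detach of the spare. This is a sketch only: the new-disk device name below is a placeholder, and on TrueNAS it is safer to stay in the GUI so it remains in step with the pool.

```shell
# Sketch only -- verify before running anything; /dev/sdX is a placeholder
# for the new disk. GUIDs are from the zpool status output above.
# 1. Attach a new drive to the surviving data disk, making that vdev a mirror:
zpool attach heracles bd2928b9-abb0-46b8-95cb-f6a2b775017e /dev/sdX
# 2. Once the resilver completes, detach the in-use spare so it returns
#    to the "spares" list:
zpool detach heracles 32f92237-eb6f-403f-9135-a1d4b54ab9c6
```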
DISCLAIMER: I am NOT a ZFS expert. Get corroborating answers before doing any ZFS actions I suggest.
I actually see only 4 drives here, because the spare is listed twice. However, I would have expected the 5th drive to be shown as UNAVAIL, something like:
root@kore[~]# zpool status heracles
  pool: heracles
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: resilvered 64.6G in 00:08:33 with 0 errors on Thu May  2 08:32:47 2024
config:

        NAME                                      STATE     READ WRITE CKSUM
        heracles                                  ONLINE       0     0     0
          spare-0                                 ONLINE       0     0     0
            bd2928b9-abb0-46b8-95cb-f6a2b775017e  ONLINE       0     0     0
            xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  UNAVAIL      0     0     0  cannot open
            32f92237-eb6f-403f-9135-a1d4b54ab9c6  ONLINE       0     0     0
          mirror-1                                ONLINE       0     0     0
            a2b236c0-20c3-4c16-a5cc-f0ecdf951440  ONLINE       0     0     0
            55c81f95-608c-46d7-b2f7-1e9d9865329b  ONLINE       0     0     0
        spares
          32f92237-eb6f-403f-9135-a1d4b54ab9c6    INUSE     currently in use

errors: No known data errors
What does the TrueNAS Storage GUI show under "Manage Disks" and "Manage Devices"? Can you post screenshots, please?
That said, the pool looks good, and hopefully you will not have any difficulty getting it back into the correct shape. I think we should see your screenshots before you do anything, but in principle here is what I would do:
I think the first thing to do is to see what happened to the failed drive. Can TN still see the 5th drive and if so what is its status? Can you run a short / long SMART test on it? What errors does it report?
Next you need to restore this pool to its normal state without a spare. According to the Solaris ZFS docs, in the CLI you would do this with zpool detach heracles xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx, and this would also switch spare-0 back to mirror-0. However, I am not sure what you should do here, since the failed drive is not shown in the status output. In any case, you should try to do this via the GUI in order to keep the GUI in step with the pool definitions (and avoid an export/import).
Then replace the failed drive with a new one and, once it is visible in TrueNAS, run some tests on it, e.g. a SMART long test; if that is OK, add it to the pool as a spare, again using the GUI. (Or do it the other way around, so that you have the spare available as quickly as possible.)
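As a sketch of the CLI version of that test-then-add sequence (the device name is a placeholder, and the GUI route above is still preferable on TrueNAS):

```shell
# Sketch only -- /dev/sdX is a placeholder for the new disk.
smartctl -t long /dev/sdX     # start a long SMART self-test
smartctl -a /dev/sdX          # review attributes and test results once it finishes
# If the disk looks healthy, add it to the pool as a hot spare:
zpool add heracles spare /dev/sdX
```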
Yes, I have physically removed the UNAVAIL drive. I can't give any update until the replacement drive(s) arrive early this week, and today I'm hosting a 1st birthday party, so I'm not permitted to abscond to the study…
@etorix your picture is what I recall from before my ignorant REMOVE action.
I’ll give more detail when I have fresh drives to add…
Do you mean that you have physically removed the drive?
What does zpool status show now? What does the GUI show now?
I recommend that you do NOT make ANY further changes without getting advice here - it is pretty easy to make a mistake and lose your data if you just do things without being absolutely 100% certain that they are the right things to do.
But it is your NAS and it is your data, so you do also have the absolute freedom to do as you wish - but you will have to live with the consequences good or bad.
I have five drive bays (and SATA ports) available in total. My biggest mistake (so far?) has been to “REMOVE” from the pool the totally failed drive in state UNAVAIL. I believe now that I could have physically replaced it with a new drive, then replaced it in the pool with the new drive.
I do not understand the meaning of the ‘spare-0’ element. Clearly the drive I had designated as a spare has been combined with ‘bd2928b9…’ in some way - is it replacing it or mirrored with it? My current guess is: it’s a mirror, where the hot spare has replaced the now-removed drive.
My current plan of last resort is to reanimate my previous TrueNAS server and migrate all data onto it, then remove the pool completely and recreate it with four good drives (yes, I'm sticking with a stripe of two mirrors).
@etorix your original suggestion still sounds feasible (extend the vdev to turn it back in to a mirror) so I’ll see if that works first (and seek further advice on dealing with the spare if needed). Still waiting for a courier to arrive…
@Stux I guess my limited experience with other SAN systems led me to the expectation that the spare would automatically become the replacement in the case of a failure. I believe I’ve cut myself off from the option to use it in that way this time, though. I will certainly be considering burning in, thanks for the advice.
@Protopia I do have freedom. It’s not always the best thing, as I’m proving. Reanimation of my old TrueNAS, replication, recreate pool, replication is my next step as the alternatives seem to be unworkable…
…largely because I did a stupid thing.
ZFS will use a hot spare to automatically replace a drive (resilver onto the hot spare) if it is configured properly and is at least as large as the drive it is replacing. When I was using very old drives, and mostly playing around, this happened about every month for me, lol. It does not, however, do anything with the old drive; that part is manual. That's what a hot spare is: a cold spare is all manual.
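For illustration, this is roughly how such a pool would be set up from the CLI. The pool name and device names here are hypothetical, not from this thread, and the automatic spare activation depends on the ZFS event daemon (zed) running:

```shell
# Hypothetical example: a two-way mirror with a hot spare attached.
zpool create tank mirror /dev/sda /dev/sdb spare /dev/sdc
# If sda or sdb fails, zed (when running and configured) resilvers onto
# the spare automatically; detaching the old drive remains a manual step.
```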
The spare is mirroring the failing drive but not replacing it. ZFS does not make decisions about replacing drives: decisions are to be made by the administrator.
Possible choices include:
Trusting the failing drive back into the array, and returning the spare to spare.
Replacing the failing drive, and returning the spare to spare.
Making the spare a permanent data drive, and removing the other drive.
Making the spare a permanent data drive, and turning the other drive into a spare.
Making the spare a permanent data drive, and bringing in another drive as spare.
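As I understand the spare mechanics (a sketch only; please verify before running anything), which drive you detach from the spare-0 group determines the outcome:

```shell
# GUIDs taken from the zpool status output earlier in the thread.
# Return the spare to "spares", keeping the original data drive:
zpool detach heracles 32f92237-eb6f-403f-9135-a1d4b54ab9c6
# Or promote the spare to a permanent data drive by detaching the original:
zpool detach heracles bd2928b9-abb0-46b8-95cb-f6a2b775017e
```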
Of these, the only feasible options seem to me to be those that involve “Making the spare a permanent data drive”. In the GUI I cannot see any obvious way to do that. To clarify, this is what I currently see:
I have identical-model drive(s) now available, one of which is currently 'warming' in the vacated slot of the failed drive. Having got myself into this corner, I'm not willing to risk hitting the 'Replace' button on 'sdd'.
“Remove”, “Detach”, or “Replace” on sdd or sdc, as appropriate, should make any option possible.
Ideally, new drives should be burnt-in for some time before being put into production.
And, given the particular situation, the most useful button is "Extend", to make the vdev a 2-way mirror again. However, I'm not sure what the best, or even the correct, order of steps is here: extend and then remove, or remove and then extend.