Advice on replacing disks with hot spare active

Hey everyone,

I have already searched a bit, but couldn’t really find good advice for my specific situation.

I have a pool where one disk failed. I took it offline, and the spare that was configured automagically kicked in.

I then went and physically replaced the drive, so I now have a spare drive in the system again. The system we are talking about is a Dell R740xd2 with TrueNAS Scale.

The current situation looks like this:

INSERT IMAGE HERE

As far as I read there are basically two ways to handle the situation now:

  1. Replace the offline drive with the new drive that was installed, which leads to another resilver and automatically recreates the spare drive.
  2. Detach the “OFFLINE” drive that had failed and recreate the spare.
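
For reference, here is a minimal sketch of how the two options map onto plain ZFS commands. It only builds and prints the command lines and does not run anything. The pool name `tank` and the device names are hypothetical placeholders, and on TrueNAS these operations would normally be done through the web UI rather than the shell.

```python
import shlex

def replace_in_place(pool, failed, new_disk):
    """Option 1: resilver onto the new disk. Once that resilver finishes,
    the in-use hot spare goes back to the list of available spares."""
    return [["zpool", "replace", pool, failed, new_disk]]

def promote_spare(pool, failed, new_disk):
    """Option 2: detach the failed disk so the in-use hot spare becomes a
    permanent member, then add the new disk as the pool's spare."""
    return [
        ["zpool", "detach", pool, failed],
        ["zpool", "add", pool, "spare", new_disk],
    ]

# Hypothetical names, for illustration only.
POOL, FAILED, NEW = "tank", "old-disk-id", "new-disk-id"

for label, plan in (
    ("Option 1", replace_in_place(POOL, FAILED, NEW)),
    ("Option 2", promote_spare(POOL, FAILED, NEW)),
):
    print(label)
    for argv in plan:
        print("  " + shlex.join(argv))
```

Option 1 needs a second resilver; option 2 does not, because the spare already holds the data.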

I would prefer option 1, because the spare drive sits in the back of the device and is not just part of the regular 24 drive enclosure.

Are there any reasons to prefer option 2? For example, that the spare was already used and so shouldn’t become a spare again, or that no further resilver is required?

Thanks!

The first step would be to assess why the spare kicked in, and what’s the condition of the drives (long SMART tests).
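
As a rough sketch of the SMART part, the snippet below builds the `smartctl` invocations for starting a long self-test and for reading the results afterwards. The device paths are hypothetical placeholders, and the commands are only printed, not executed.

```python
import shlex

def long_test_commands(devices):
    """Build one 'smartctl -t long' command per device (starts the self-test)."""
    return [["smartctl", "-t", "long", dev] for dev in devices]

def report_command(device):
    """Build the command that prints all SMART information, including
    the self-test log, once the long test has finished."""
    return ["smartctl", "-a", device]

# Hypothetical device paths, for illustration only.
devices = ["/dev/sda", "/dev/sdb"]

for argv in long_test_commands(devices):
    print(shlex.join(argv))
print(shlex.join(report_command(devices[0])))
```

A long test on a 16 TB drive can take many hours, so the report is only meaningful after the test has completed.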

My preference would go to option 2, with the old drive becoming a spare if it is healthy, or being discarded if not.

Thank you for the reply! I couldn’t upload an image for some reason.

The spare kicked in because one of the drives failed due to its age. It accumulated errors and thus was removed from the pool. I offlined it after I had 3 consecutive reports about errors during a SMART test.
The offlining, I guess, led to the spare kicking in.

The spare has only been running for a week, so I am not that concerned about its health; I just wanted to know what’s best practice, or to be preferred.

We have regular scrubs and SMART testing set up, so the rest of the drives are healthy.

We’ve recently moved to advising the second route in documentation, and afaik our Enterprise support has done so for a while, because there is a non-zero risk of data loss should a second drive fail during the second resilver (which puts additional stress on the disks). Not to mention the additional downtime, which can be significant in large Enterprise deployments.

That said I understand your reasons for wanting to replace the drive in place rather than use the hot spare. You’ll have to weigh the risks, preferably after considering the age of the other drives and running long tests if you haven’t recently, as @etorix suggested.

Then make the hot spare a permanent member and buy another spare… in anticipation that further drives may fail due to old age. You may even consider this first failure as an early warning and preventively replace all the remaining old drives before they actually fail.

Thank you for the replies!

To give a little more context, the storage pool consists of 3 vdevs, with 8 drives in RaidZ2 each. The drives are 16 TB Seagate Exos X16 drives, all from 2022 or 2023. So I guess the risk of all of them starting to fail now is rather slim, especially since the storage server is not under what I would consider a huge load.
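
For context, the layout works out as below. This is back-of-the-envelope arithmetic only: it ignores ZFS metadata, padding, and the TB versus TiB difference, so the real usable figure will be lower.

```python
VDEVS = 3            # RaidZ2 vdevs in the pool
DRIVES_PER_VDEV = 8  # drives per vdev
PARITY = 2           # RaidZ2 keeps two drives' worth of parity per vdev
DRIVE_TB = 16        # nominal drive size in TB

total_drives = VDEVS * DRIVES_PER_VDEV                   # fills the 24 front bays
raw_tb = total_drives * DRIVE_TB                         # capacity of all drives
data_tb = VDEVS * (DRIVES_PER_VDEV - PARITY) * DRIVE_TB  # capacity minus parity

print(f"drives in pool: {total_drives}")
print(f"raw capacity: {raw_tb} TB")
print(f"capacity before ZFS overhead: {data_tb} TB")
print(f"drive failures tolerated per vdev: {PARITY}")
```

The 24 pool drives fill the front enclosure, which is why the spare ended up in a rear bay.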

But you’re right, I think I wouldn’t want the additional resilver and I’ll probably just have to remember that Bay0 is actually in the back of the server, not in the front with the rest of the drives.

Thanks very much again!

You can move the drives around as you want.

Oh right, that’s an option I hadn’t even considered. So I just power down the system, then swap the drive to the corresponding drive bay in the front and bring the new spare into the back? No resilvering etc. involved?

No, ZFS tracks drives by UUID and does not care about location or how they are connected.

Sure, I figured as much, but is the process I described correct? Because if I pull the drive while the system is running, I guess I’ll run into an immediate resilver.

Yes.
If you trust your hot-swap bays and do it “live” with no hot spare available, the pool will be degraded but nothing else will happen; a short resilver will occur upon plugging the drive back in elsewhere.
