Spare Drive not "Activating"

seanmcg182 · August 10, 2024, 1:41am

I had a drive fail a week ago, and I just found out today (a week later).
I don’t think my Spare drive “activated”? Based on what I’ve seen on other posts, it should say “INUSE” but mine just says “ONLINE”… is there a way to check if it’s actually being used currently (like it should be)?

Just want to make sure i actually still have my redundancy until i can procure a new drive…

Johnny_Fartpants · August 10, 2024, 6:16am

Correct it doesn’t appear like it has. How did your drive fail? Assuming you didn’t offline it and the system kicked it out of the pool then the hot-spare should have activated.

seanmcg182 · August 10, 2024, 6:51am

To the best of my knowledge, I’m think the drive controller failed.

I removed the drive from the server and plugged it into a SATA-to-USB adapter into my desktop for testing, and the drive doesn’t spin up, and nothing is detected by the computer.

And no, I did not offline it or remove it from the pool (you can still see it in the pool in the above screenshot). I hadn’t logged into TrueNAS in at least a month.

Now that I’m thinking about it, however, on that day (Date indicated in the Alert I noticed in TrueNAS), I did have a power outage. My Server is on dual UPSs, and I did gracefully shut down my server.

I’m guessing this drive just didn’t spin back up after it had stopped.

I assume if the controller is dead, on boot TrueNAS just read the (actually) Failed drive as simply “Removed”… and thus not triggering the Spare…

I’m not sure if that “should” have triggered the Spare as far as the program is concerned (obv in practice, it should have, but if the system didn’t detect an actual failure, i get it)…

Thoughts?

Johnny_Fartpants · August 10, 2024, 7:14am

Yeah not but I wouldn’t be surprised if the drive vanished during your outage and never came back that may well not be enough for the hot-spare to activate. What version of TN are you running?

seanmcg182 · August 10, 2024, 7:23am

TrueNAS-13.0-U6.1

It’s just crazy to me that it was working fine before a Graceful proper shutdown, and just never started again. Died in its sleep.

Davvo · August 10, 2024, 9:16am

Unless it’s a software remove the hotspare should kick in.

Johnny_Fartpants · August 10, 2024, 9:32am

Might be worth just checking that autoreplace is turned on.

zpool get autoreplace poolname

seanmcg182 · August 10, 2024, 3:16pm

Looks like Autoreplace is NOT enabled.
I can’t imagine that I would have disabled this… and also unsure of how/where to enable it.

Could it be because the Pool was created originally without the spare? The spare was added later.

Johnny_Fartpants · August 10, 2024, 3:43pm

Sorry ignore me autoreplace is also off on my pools and I know hot-spares work so they must be using something else to monitor.

Arwen · August 11, 2024, 4:36am

The only thing I can think of, is that the Hot Spare is too small.

This can happen if a different vendor is used. Even 1 Sector too small is enough to make it fail. Check the size of both the Hot Spare and the Mirror vDev with “da1”.

As for a storage device failing on power off, (graceful or not), it happens. But, to be clear, ZFS won't corrupt any existing data on a pool during such an event. (Unless you loose redundancy...) ZFS was specifically designed to support graceless and crash based power losses without pool damage, as part of COW, (Copy On Write).

seanmcg182 · August 11, 2024, 8:33am

Good thought, and I actually had never thought of checking that.
It is true that my Spare drive is a different vendor than the 8 drives I actively use (4-way stripe, mirrored)
Upon initial inspection it appears they are identical in media size:

HOWEVER, while running the command (gpart list da[*] ) I did notice something which looks like a promising lead onto my issues here… but I have no idea how to troubleshoot further… da7 is reading “state: corrupt” while also having a “Last” value smaller than da1 ?

Doesn’t make sense to me how they can be the same size and sectorsize, but not the same amount of sectors?

Any thoughts?

Arwen · August 12, 2024, 2:06pm

My highly un-educated guess is that the partition table for “da7” is corrupt.

My highly un-scientific suggestion is remove the hot spare “da7” from the pool, wipe it with simply zeros, then re-add it to the pool as a hot spare. The wiping will take a long time…

etorix · August 12, 2024, 8:49pm

If it is the partition table, just nuking the first few megabytes will do the trick.

seanmcg182 · August 18, 2024, 10:30pm

Ended up removing and wiping the drive with zeroes, and re-adding.

the “last” values line up, and the Spare drive doesn’t read as Corrupt anymore.

No clue what caused it in the first place, but hopefully it actually does it’s job next time