Failed disk and frustrated with how immensely difficult it is to identify it

ElectricEel-24.10.2

So here I am with my happy-go-lucky NAS on my R730XD and everything has been fine. But now I have a failed disk. Fine, I think, as I unpack a replacement. Then I start to look at how to identify the failed disk. The “Storage → Manage Devices” screen only shows me the failed disk with some weird number:

“17514897773373683542, UNAVAIL, No errors”

But clearly it has errors, otherwise it wouldn’t have ejected it and marked it as “bad”, right? Right?? Either way, no way to identify it in the chassis.

So then I go to Storage → Disks, where it shows me some sort of serial number, and also happily shows me the Linux kernel device ID, but still no way to identify it in the chassis.

Like, where did the “blink” option go?

And as if that isn’t enough fun, that view marks all the disks (including the one it ejected from the pool for being bad) as healthy!

I’m so confused. I’m trying really hard not to take an outage for a bunch of stuff just to shut the system down so I can manually pull each disk out one by one and identify which one it is.

Anyone got any advice?

*small edit: I realized that the error message does actually contain the serial number but I’m still trying to understand how to map that to a physical device.

Critical

Pool DataPool state is DEGRADED: One or more devices could not be used because the label is missing or invalid. Sufficient replicas exist for the pool to continue functioning in a degraded state.

The following devices are not healthy:

  • Disk HUH721010AL5205 4DH4ZNUZ is UNAVAIL

I don’t know who manufactured your disk or what model it is, but on my Toshiba MG09, there is a sticker with a serial number of some sort on top that matches the one in the “Serial” column in Storage - Disks.
Confusingly there is another serial number at the bottom (top? the side opposite of the connectors) that doesn’t match with anything in the UI.
It depends on your enclosure, but mine is a regular PC case and I can just shine a light into the drive cage and juuuuust about see that sticker, without even taking the drive out.

Yeah, this fleet of R730XDs is rack-mounted with hot-swappable backplanes. There is no way to see inside. I’m going to see if I can find the info from DRAC.

This often depends on what hardware you’re using.

What do you get from sas3ircu 0 display?
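If that prints a device list, a small filter can boil it down to one line per drive. This is only a sketch: sas3ircu_slots is a made-up helper name, and it assumes the per-drive field labels are "Enclosure #", "Slot #", "State" and "Serial No", which may differ between sas3ircu versions, so check it against your own output.

```shell
# Print "enclosure:slot serial state" for every drive found in the output of
# `sas3ircu <controller> display`, read from stdin.
# Assumption: the field labels below match what your sas3ircu version prints.
sas3ircu_slots() {
    awk -F ':' '
        function trim(s) { gsub(/^[ \t]+|[ \t\r]+$/, "", s); return s }
        /^[ \t]*Enclosure #/ { encl = trim($2) }
        /^[ \t]*Slot #/      { slot = trim($2) }
        /^[ \t]*State/       { state = trim($2) }
        /^[ \t]*Serial No/   { print encl ":" slot, trim($2), state }
    '
}
```

Usage would be sas3ircu 0 display | sas3ircu_slots. If the serial from the alert shows up, sas3ircu 0 locate <Enclosure:Slot> ON should light that bay, assuming the HBA and backplane support it. A drive that has dropped off the bus entirely may not be listed at all.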

How many disks are we talking? Can you check them one by one, or are the serial numbers not on the drives? Does your HBA/backplane UI have any sort of self-test? (I think on LSI cards you can press Ctrl+H during POST and identify drives that way.) As tannisroot mentioned, I usually pull them one by one and look for the last 3 digits; anything under 20 disks wouldn’t take long (if you can switch off the server).

If you don’t have visibility to the drive labels without removing caddies, then for future reference a best practice is to label the caddies on the outside with the SNs. A little work up front but makes things easier in a disaster.

Here is a link to a few script ideas to blink the activity LED based on drive SN;

You may be able to use the dd command in the post after the one linked, using the sdX device name of the failing drive. The Disks screen will show you that relation.
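For anyone who can't follow the link, the general idea can be sketched like this. It is not the command from the linked post, just a guess at the technique: blink_by_reading is a hypothetical helper that reads from a device in bursts so its activity LED flashes in a pattern you can spot. It only reads (output goes to /dev/null), and it will not help if the failed disk no longer responds to reads at all.

```shell
# Read from a device in on/off bursts so its activity LED blinks in a
# recognisable pattern. Read-only: dd writes to /dev/null.
# Arguments: device path, number of bursts (default 10),
#            pause between bursts in seconds (default 1).
blink_by_reading() {
    dev=$1
    bursts=${2:-10}
    pause=${3:-1}
    i=0
    while [ "$i" -lt "$bursts" ]; do
        # skip moves forward on each burst so the reads are less likely
        # to be served from cache instead of the disk itself.
        dd if="$dev" of=/dev/null bs=1M count=256 skip=$((i * 256)) 2>/dev/null || return 1
        sleep "$pause"
        i=$((i + 1))
    done
}
```

Something like blink_by_reading /dev/sdh 20 would then flash that drive for a while. If the dead disk is silent, running it against the healthy disks and looking for the bay that stays dark is the inverse of the same trick.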

In your message HUH721010AL5205 is the drive model and 4DH4ZNUZ is probably the serial number.

I’m not shutting down a server that is a core component that would require a shutdown of 20 other systems just to look at the label on a drive. Every other commercial NAS system on the market can do this.

Label drives? I’m not buying enterprise hardware to get home-user level solutions.

You didn’t buy an Enterprise iXSystems server, you are using a free community edition.

…on a server that was first sold in 2014.

Roger, you didn’t say much about the system. Does the server have activity lights? You could run a manual SMART test on each drive and identify them based on the activity. There are a few other suggestions in this thread if it’s any help: SOLVED - how to find physical hard disk | TrueNAS Community
Otherwise I’m out of ideas. Good luck!

To summarise:

  • You have 20 clients and shutting the server down to locate the failed drive is too disruptive, 24/7 expectations without the HA setup
  • You didn’t have the foresight to test the procedure in advance
  • You have the idea that every NAS sold has this functionality when combined with software that wasn’t specifically adapted for the hardware

I am not sure what to tell you other than that your expectations are unreasonable.

You are fortunate to have received some creative suggestions on how to identify the failed drive despite the above, make the most of that or find your own way I guess.

iX offers the features you expect if you use their server hardware. Offering “it just works” LED integration in TrueNAS with all chassis on the market is a pipe dream. There are too many implementations to test and validate. Even using something like ledctl would need to be verified on your specific hardware setup. It’s too late for that if you are unable to pull drives without causing catastrophic disruption in your prod environment.

To your point, in my system, blinking an LED would be impossible, as Lian Li backplanes do not feature any LEDs. At the same time, my He10’s feature the last 4 of the S/N as a sticker on the back and so it’s trivial for me to identify a bad drive. I far prefer this system over hot swap bays, which usually only do one thing well: bake drives in their own juices.

At least some recently-released systems purchased from iXsystems can show via the GUI which drive has gone bad. That likely requires cabling consistency and a hardware dependency. It’s one of the cool features you get for paying a little more to buy an iXsystems-built NAS.

To the OP, my suggestion would be to carefully go through the STORAGE → Disks GUI, followed by the SMART menu to determine the bad drive enumeration (for example: sda or sdh) followed by using the SMART menu to determine the S/N of the drive you just identified as bad. Then hunt down the drive with the S/N.
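The same lookup can be done from a shell if the GUI gets tedious. A minimal sketch, assuming smartctl is available on the box: smart_serial is a made-up helper that pulls the serial out of smartctl -i output. SATA drives print "Serial Number:" and SAS drives print "Serial number:", so the match ignores case.

```shell
# Extract the serial number from `smartctl -i` output read on stdin.
# Matches the field name case-insensitively to cover SATA and SAS output.
smart_serial() {
    awk -F ':' 'tolower($1) == "serial number" {
        gsub(/^[ \t]+|[ \t\r]+$/, "", $2)
        print $2
        exit
    }'
}

# Example loop (commented out so nothing runs on its own). Note that
# /dev/sd? only covers sda..sdz, not sdaa and beyond.
# for dev in /dev/sd?; do
#     printf '%s %s\n' "$dev" "$(smartctl -i "$dev" | smart_serial)"
# done
```

A serial that appears in the alert but not in that listing would suggest the drive has dropped off the bus entirely.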

I would not take the risk of pulling another drive while the NAS is running (even though w/a Z3, I should be able to do that here), as that would be inviting disaster. Instead, I would take the system down briefly on a weekend, label all the hot swap bays and perform an upgrade or two at the same time, i.e. make the most of the opportunity.

In the meantime, I’d slide your pre-qualified spare into an empty slot and fix the pool. Then schedule the shutdown / upgrade for a convenient time.

My old stack of cheap 8-bay NAS cases, which I was using as JBODs connected to a 9400-16e SAS controller, did not have any easy way to detect failed drives, only basic power/activity LEDs. With that build, I identified the bad drive by looking for the one that didn’t have a flashing activity light. If you need an easy way to generate some traffic, you could start a scrub or something.

When I built my new setup in Supermicro 12-bay chassis, with a couple of 9400-16e SAS controllers connected to each Supermicro backplane, I figured I’d be forward-thinking by creating a spreadsheet that had each disk’s serial number in a cell that corresponds with the disk’s physical location in the shelves. When one of my disks failed, it turned out the LED for the drive changed colors. Nice. I didn’t really need the spreadsheet for that, but it’s still useful for tracking which add-in card or slot the different NVMe disks are on.

Never tried TrueNAS Community Edition with a Dell server.

My Oyen Digital Mobius 5 units start beeping and the flashing LED indicates which drive has gone bad. The buzzer is really annoying and the parity recovery process is basically a black box for the average user, as you just have to wait for the hard drive array activity to finally die down after a disk replacement.

The JM Micro software that does allow some insight into what the array may be doing has not been updated for the last 10 years or so. On the Mac, it’s strictly a 32-bit app, so it cannot be run beyond Mojave.

All the hardware DAS OEMs seem to now be eschewing hardware RAID in favor of software RAID, it likely reduces their customer support requirements and lowers cost also.

I have no doubt something could be rigged to allow a better backplane experience in my Lian Li. But thanks to HGST, there is no need for all that since the drives already feature fully-visible S/N stickers.

find /sys/class/enclosure/*/*/device/block/sd[a-z]

That will give you the enclosure and slot id. From there:

echo 1 > /sys/class/enclosure/[enclosure id]/[slot id]/locate

That will make the slot light blink. Writing 0 to the same file turns it off again.
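The two steps above could be wrapped up roughly like this. It is a sketch, not something tested on an R730XD: slot_for_device and set_locate are made-up names, the optional sysfs root argument exists only so the functions can be tried against a fake directory tree, and it all assumes the backplane actually shows up under /sys/class/enclosure. Writing to locate needs root.

```shell
# Print the enclosure slot directory that holds a given block device.
# Arguments: device name (e.g. sdh), optional sysfs root
# (default /sys/class/enclosure; the override is only there for testing).
slot_for_device() {
    dev=$1
    root=${2:-/sys/class/enclosure}
    for slot in "$root"/*/*/; do
        if [ -e "${slot}device/block/$dev" ]; then
            printf '%s\n' "${slot%/}"
            return 0
        fi
    done
    return 1
}

# Turn the locate LED of the slot holding a device on (1) or off (0).
# Arguments: device name, 1 or 0, optional sysfs root.
set_locate() {
    slot=$(slot_for_device "$1" "${3:-/sys/class/enclosure}") || return 1
    printf '%s\n' "$2" > "$slot/locate"
}
```

So set_locate sdh 1 would light the bay and set_locate sdh 0 would clear it. A disk that has vanished from the system will not match any slot; in that case the slot with no block device under it is the likely suspect.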

If you wanted something comparable to a commercial NAS system, you should have bought one–iXSystems sells them, naturally running TrueNAS. But you decided to cheap out, use ten-year-old hardware, and now you’re surprised when you don’t get the same features? As Neo says, your expectations are unreasonable.

I wrote down the serial of each drive and what slot I put it in at installation… and I only have four drives.

As my mother always told me, if you fail to plan you can plan to fail.

I just took down one of my TrueNAS boxes to check out the disk layout.
Posting this here for anyone who may have a disk layout similar to this Dell system.
I matched the serial numbers shown under Disks in TrueNAS to the physical disks in the R710 server. This is NOT matching the disk IDs, but the serial numbers.

Dell R710 Server - 6 Bay.

Thank you, that is considerate of you.
Unfortunately the specific links are not going to be of much use for anyone else because they don’t have your specific system and serial numbers.

The device names on the left can and do change every time you reboot.
P5H3DDBV was called sde when you took that screenshot but may not be anymore.

The idea of recording how physical placement corresponds with serial numbers in your own setup, however, THAT is useful, and is what the OP should have done. Kudos for thinking ahead and doing this.
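For anyone wanting a head start on such a record, here is one possible sketch. inventory_csv is a hypothetical helper that turns lsblk output into a CSV with an empty bay column to fill in by hand next time the chassis is open. The device name is kept only as a snapshot, for the reason given above.

```shell
# Turn `lsblk -dn -o NAME,SERIAL` output (stdin) into CSV rows with an empty
# "bay" column to fill in by hand. Rows without a serial (zvols, loop
# devices and the like) are skipped.
inventory_csv() {
    printf 'serial,device_at_capture,bay\n'
    awk 'NF >= 2 { print $2 "," $1 "," }'
}
```

Run it as lsblk -dn -o NAME,SERIAL | inventory_csv > disk-inventory.csv and keep the file somewhere other than the NAS itself.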
