Failed disk and frustrated with how immensely difficult it is to identify it

@neofusion has already addressed the issue of assuming this data will not change; as he said, it can and will change when you don’t want it to.

Here is my solution, and it works for me. In the Description field, specify the location of the drive; you can enter anything in this field. See my screen capture for an example. This value survives in the TrueNAS configuration file and across system upgrades.


I believe you both misunderstood what I stated in my post.
I updated my post to make sure people read the Serial Numbers, not the Disk IDs.
I posted my specs because the asker stated this:
R730XD
That is a Dell system, the same as mine, just a few generations newer than my R710.

Go by the order of the Serial Numbers, not the Disk IDs.
They do not change.

So, my picture shows the drive serial numbers in the same slot order; the drive IDs are of course different, as you both point out.
But the Serial Numbers stay in the same order.

Thanks, Joe, for the information on the Description field; that will come in handy.

I’m going to address this as a TrueNAS CORE (FreeBSD) user rather than as a TrueNAS SCALE user, even though Manjaro (Arch-based) Linux is my desktop OS. I’m hoping that the TrueNAS SCALE web UI is sufficiently similar that you should still be able to find the disk that has failed.

Originally, on CORE, I had to put together a bunch of BSD commands to find a failed disk whenever I got a notification that “wasn’t so obvious.” As a result I ended up building a table of disk description (model and type), serial number, GPTID, GUID, and associated pool and VDEV. It combines information from a script for associating drive names that I found on what is now the archived forum.

Run the script below after SSH’ing in and switching to the superuser:

myname@freenas.local:> su
Password:

#!/bin/sh
# List each mounted drive with its description, serial number, and GPTID.
echo
echo "$(basename $0) - Mounted Drives on $(hostname)"
cat /etc/version
date
echo
# glabel status maps gptids to partitions; strip the partition suffix to get the device.
diskinfo="$(glabel status | tail -n +2 | awk '{split($3,a,"p"); print a[1],$1}')"
echo "+========+==========================+==================+============================================+"
echo "| Device | DISK DESCRIPTION         | SERIAL NUMBER    | GPTID                                      |"
echo "+========+==========================+==================+============================================+"

for d in $(echo "$diskinfo" | cut -d" " -f 1)
do
  # diskinfo -v reports the model ("Disk descr.") and serial ("Disk ident.") lines.
  diskinf=$(diskinfo -v $d | grep '# Disk ')
  diskdescription=$(echo "$diskinf" | grep '# Disk desc' | cut -d# -f 1 | xargs)
  diskserialno=$(echo "$diskinf" | grep '# Disk ident' | cut -d# -f 1 | xargs)
  diskgptid=$(echo "$diskinfo" | grep "^$d" | cut -d" " -f 2)
  printf "| %-6s | %-24s | %-16s | %-42s |\n" "$d" "$diskdescription" "$diskserialno" "$diskgptid"
  echo "+--------+--------------------------+------------------+--------------------------------------------+"
done

One caveat: glabel and diskinfo are FreeBSD utilities, so this script will not run as-is under SCALE; Linux ships different tools for the same information. Either way, it gives you a table to start with.
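Under SCALE (Linux), a rough equivalent of that table can come straight out of lsblk; a minimal sketch, assuming the whole-disk serial and per-partition UUID are the fields you want to correlate:

# Linux sketch: device name, model, and serial per disk, plus partition UUIDs
lsblk -o NAME,MODEL,SERIAL,PARTUUID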

Now the really neat command that shows the disks and their relationship to the ZFS structure is this one:

zdb -C -U /data/zfs/zpool.cache

On CORE, this has to be run as root or with elevated permissions. It prints the MOS Configuration: an indented structure, reminiscent of an XML file, carrying lots of information and, in effect, the very layout of the ZFS pool database. By matching entries from this output against the output of the script above, you’ll get just about all the “references” by which a zpool failure status will be reported to you.
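For orientation, the interesting part of that output looks roughly like this (heavily abbreviated, with a placeholder pool name, GUID, and gptid):

MOS Configuration:
        version: 5000
        name: 'tank'
        ...
        vdev_tree:
            type: 'root'
            children[0]:
                type: 'raidz'
                children[0]:
                    type: 'disk'
                    guid: 1234567890123456789
                    path: '/dev/gptid/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'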

So much for doing it the “hard way.”

Now the “quick” way that may well get you the relationship of a serial number to a GUID or a device location: go to the Dashboard and click the small “>” arrow on the right side of the “Pool” box, which replaces it with a VDEV box. To the right of each VDEV you’ll see a similar “>” that takes you to info for each disk, including the GUID and serial number of each disk that is online. If you already know something about the physical disks and you have a GUID, you’ll be able to spot the failed one, as it won’t be online. The relationship to gptids can be found by running zpool status and reading the table it presents.
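The shape of that zpool status table, with placeholder names throughout, is roughly this; note that a failed member is listed by its bare GUID once the gptid path is gone:

  pool: tank
 state: DEGRADED
config:

        NAME                                        STATE     READ WRITE CKSUM
        tank                                        DEGRADED     0     0     0
          raidz2-0                                  DEGRADED     0     0     0
            gptid/xxxxxxxx-xxxx-...                 ONLINE       0     0     0
            1234567890123456789                     UNAVAIL      0     0     0  was /dev/gptid/yyyyyyyy-yyyy-...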

As far as finding the failed disk in your trays, that is a result of “good housekeeping” that should start when you put the hardware together. My project consists of 10 data disks. Each bay visibly carries a label I printed on mailing labels created with the GNOME program glabels. Each label contains the serial number, gptid, GUID, and date of installation. On the edge of the tray enclosure I have a small label that indicates the common device name; in BSD parlance, da0, da1, ada0, ada1…

My system is an old HP ProLiant Gen7 that “grew” over the years. The original unit had SATA (CAM) device names. As it grew, I added an HBA and then had both daX and adaX devices. I added a couple of 4-bay SAS cages from Athena Power that Newegg was handling on and off. Realizing the SATA controller on the ProLiant was holding me back, I added a SAS expander, moved the cage built into the ProLiant onto it, and now have the ability to add 8 more drives if my media collection grows substantially. With this, I had to learn how to read SAS configurations using the mpsutil commands, which can help you map a physical location to information like the serial number.
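In case it helps, the mpsutil pattern is roughly this (a sketch; mpsutil ships with FreeBSD/CORE for LSI mps-driven HBAs, and controller unit 0 is an assumption):

# Show the HBA itself, then every attached device with its enclosure and slot numbers
mpsutil -u 0 show adapter
mpsutil -u 0 show devices

Pairing the enclosure/slot columns from that listing with the serials from the earlier table ties each physical bay to a drive.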


Did you write down the serial AND label of the disk before using them? Did you get the ID of the disk once it was recognized by TrueNAS? These are often different, especially for refurbished drives. Did you document EXACTLY which serial/label you wired up to which specific controller/port?

This is a basic task every engineer should complete when purchasing and deploying disks. I label all of my disks before using them and document everything about the process. I suggest you do the same in the future. It makes handling the situation you are in 100x easier.

Depending on your HBA, you may be able to find a CLI command to blink the drive by serial number or ID.
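On LSI SAS2 HBAs, for example, the sas2ircu utility can do exactly that; a sketch, where controller 0 and enclosure:bay 2:5 are placeholders you would read out of sas2ircu 0 display first:

# Blink the locate LED on enclosure 2, bay 5 of controller 0, then turn it off
sas2ircu 0 locate 2:5 ON
sas2ircu 0 locate 2:5 OFF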


Have you been in the iDRAC yet? Under Storage it will show you serials and likely already says which bay has an error. You can also blink drives from there using just the bay number:
0-11 for 3.5"
0-23 for 2.5"
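The same blink is scriptable through racadm; a sketch, where the disk FQDD is a placeholder you would copy from the pdisk listing:

# List physical disk FQDDs, then blink the LED for one of them
racadm storage get pdisks
racadm storage blink:Disk.Bay.5:Enclosure.Internal.0-1:RAID.Integrated.1-1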


Since this appears to be a Linux-based TrueNAS, you can identify the drive as /dev/sdXX. If needed, correlate the zpool disk label with the list of devices in /dev/disk/by-id, as that should help. Then use the Linux command hdparm -tT on the disk device /dev/sdXX; this will cause the drive in question to sustain a lot of read activity, blinking the drive light in a way that should help you identify it.
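Concretely, something along these lines (a sketch; sdc is a placeholder for whichever device the by-id correlation points at):

# by-id names embed model and serial, mapping serials to sdX device names
ls -l /dev/disk/by-id/ | grep -v part
# timed sustained reads; watch for the steadily lit activity LED
hdparm -tT /dev/sdc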

I am late to reply to this, but it is super easy to figure out which drive is in which slot on Dell servers (as long as you have an iDRAC). It lists out all the drives and slot numbers!


Let this be a lesson to anyone coming across this thread: a stitch in time saves nine. I always bust out the labeler and print serials for each drive, sticking them in an easy-to-spot location so I don’t run into this. GUID, /dev/yomomma, I don’t care what trickery is done to the drives. The serials and physical locations NEVER change.

It has come in handy once, and that’s enough reason to do it on every build. I don’t have time to turn my nose up at pleb solutions for commercial hardware.