Failed disk and frustrated with how immensely difficult it is to identify it

What works on one system may or may not work on another. There are a few variables, so if you find something that works for you, great, but you shouldn’t assume it’s a silver bullet for everyone.

sas2ircu, sas3ircu, storcli, lsblk are just some of the tools you can use to help identify drives in your chassis.
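For the Linux side, here is a minimal sketch of turning `lsblk` output into a serial-to-device lookup. It assumes your `lsblk` supports JSON output (`-J`) and the `SERIAL` and `WWN` columns; the function names are made up for illustration, and controller-specific tools like sas2ircu or storcli would need their own parsing.

```python
import json
import subprocess

# -J: JSON output, -d: whole disks only (no partitions), -o: columns to print
LSBLK_CMD = ["lsblk", "-J", "-d", "-o", "NAME,SERIAL,WWN,SIZE,TYPE"]


def read_lsblk():
    """Run lsblk and return its JSON output as text."""
    return subprocess.run(
        LSBLK_CMD, check=True, capture_output=True, text=True
    ).stdout


def disks_by_serial(lsblk_json):
    """Map serial number -> current kernel device name for whole disks."""
    data = json.loads(lsblk_json)
    result = {}
    for dev in data.get("blockdevices", []):
        # Skip optical drives, loop devices, and disks reporting no serial.
        if dev.get("type") == "disk" and dev.get("serial"):
            result[dev["serial"]] = dev["name"]
    return result
```

Calling `disks_by_serial(read_lsblk())` on a live system gives you the serial of each disk next to whatever name the kernel assigned it on this boot, which is the piece you match against the label on the drive or tray.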

Well, not really–while Robin is grossly oversimplifying things in saying iX are “simply using this issue to sell iX NAS Enclosures” (which they don’t sell anyway AFAIK; they sell complete servers), it certainly is something they’ve chosen to reserve for their hardware. Even if a straightforward, generalizable solution were found, I’d say the odds of iX accepting that PR would be zero. And if past experience is any indication, they’d probably close a thread here describing how to do it; they’ve done so with at least one other thread (I think from @NickF1227?) describing how to enable some “Enterprise” feature without the Enterprise license.

Of course, a straightforward, generalizable solution is quite elusive. I have a solution (linked up-thread) that works for my hardware, and probably works on other Supermicro chassis with SAS controllers/backplanes. I’d say there’s a fair chance it would work on other manufacturers’ servers with SAS controllers/backplanes, but I don’t think I have had any feedback from anyone else who’s tried it.

Maybe it’s the case that a good general solution could be developed for the SAS controller/backplane arrangement, which could then cover many Supermicro/Dell/HPE/Lenovo servers. In theory, it seems you ought to be able to auto-discover the backplane layout in this way, and auto-populate it with the disks installed in each slot. From there, blinking a light (in the tool’s UI and/or on the physical machine) would be relatively trivial. But while this may be possible in theory, I haven’t seen that anyone’s done it yet, which indicates that the appropriate combination of skill and motivation is lacking.

What could be generalized to pretty much any hardware, though, would be a tool that works something like this:

  • User draws drive layout as one or more grids of rows/columns to represent disk physical locations
  • Via drag-and-drop or other pretty GUI means, user places disks in the appropriate slots
  • Tool monitors disk status and shows appropriate blinkenlights
  • (bonus feature, but would be nice) Tool handles disk replacements using the TrueNAS API, prompting user to place the replacement appropriately in the defined enclosure (otherwise, the user needs to update this tool any time a disk is changed, meaning its information will likely get outdated in a hurry).
  • For even more bonus points, allow full storage management using this tool, again via the TrueNAS API. Adding vdevs (or removing if possible), expanding RAIDZ vdevs, creating new pools, etc., all could be done in this way.
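To make the first three bullets concrete, here is a rough sketch of the data model such a tool might use. Everything in it (the `Enclosure` class, its method names, the text rendering) is hypothetical; it does not touch the TrueNAS API and only shows the user-defined grid, the manual placement of disks by serial, and a crude stand-in for the blinkenlights.

```python
from dataclasses import dataclass, field


@dataclass
class Enclosure:
    """A user-drawn grid of drive slots; disks are placed by serial number."""
    name: str
    rows: int
    cols: int
    slots: dict = field(default_factory=dict)  # (row, col) -> serial

    def place(self, row, col, serial):
        """Put a disk in a slot, removing it from any slot it was in before."""
        if not (0 <= row < self.rows and 0 <= col < self.cols):
            raise ValueError(f"slot ({row}, {col}) is outside the grid")
        self.slots = {pos: s for pos, s in self.slots.items() if s != serial}
        self.slots[(row, col)] = serial

    def locate(self, serial):
        """Return the (row, col) of a disk, or None if it is not placed."""
        for pos, s in self.slots.items():
            if s == serial:
                return pos
        return None

    def render(self, faulted=()):
        """Draw the grid as text, marking faulted disks with '!'."""
        lines = []
        for r in range(self.rows):
            cells = []
            for c in range(self.cols):
                serial = self.slots.get((r, c))
                if serial is None:
                    cells.append("[ empty ]")
                else:
                    mark = "!" if serial in faulted else " "
                    cells.append(f"[{mark}{serial[-6:]}{mark}]")
            lines.append(" ".join(cells))
        return "\n".join(lines)
```

The set of faulted serials would have to come from pool status, which is where the API integration in the last two bullets comes in.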

I think I know enough about the relevant moving pieces to be confident that such a tool is possible, but I have nowhere near the skill to develop it.


Plus a sufficient collection of various hardware for the developer(s) to directly test a reasonably comprehensive set of situations.
If the code is not accepted by iX, it could be of use to zVault :wink:

But that would be a tedious and thankless endeavour.

And here comes the rub: I suspect that most requesters want TrueNAS to automagically draw their hardware and place everything without any user input. If they had the patience to draw and arrange (and keep track of changes), they would have the patience to put visible stickers on their drives or their trays with (some relevant part of) the serial number. The ultra-low tech, 100% effective, solution…


Plus - doesn’t even Linux tend to shuffle device names around?

FreeBSD for sure does if disks fail completely and the system is rebooted. Which is probably why TrueNAS uses GUIDs in the first place.


Linux does reshuffle drives on reboot. Not to mention adding or removing drives: I have very bad memories of GRUB losing its marbles because I dared add further drives to my storage while the boot device was also a SATA device.

So a design decision has to be made: Should the wonderful graphical assistant tie drive serial numbers to the designated position (user is responsible for updating when shuffling drives around) or should it then associate the position to some invariant hardware identifier (ACPI port device?) and then automatically track which drive pops up at the position (user is responsible for updating when shuffling cables around)?
Let us not be bothered by mere implementation details…
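The two options can be put side by side in a few lines. This is only an illustration of the trade-off, with made-up slot names, port names, and serials; how you would actually obtain a stable port identifier is exactly the implementation detail being waved away above.

```python
def occupants_by_serial(slot_to_serial, present_serials):
    """Option 1: each slot is tied to a serial number.

    Reports the recorded serial if that drive is present at all, else None.
    It cannot notice that a drive was physically moved to another bay.
    """
    return {
        slot: (serial if serial in present_serials else None)
        for slot, serial in slot_to_serial.items()
    }


def occupants_by_port(slot_to_port, port_to_serial):
    """Option 2: each slot is tied to an invariant port identifier.

    Reports whichever drive currently shows up on that port, so moved
    drives are tracked, but re-cabling silently breaks the mapping.
    """
    return {
        slot: port_to_serial.get(port)
        for slot, port in slot_to_port.items()
    }
```

If two drives are swapped between bays, the first function keeps reporting the old positions until someone updates the record, while the second follows the drives automatically.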


While FreeBSD and Linux “shuffle devices around” at boot*, SunOS/Solaris does not. SunOS keeps track of which physical device path has been assigned which device name in a file named /etc/path_to_inst. There are a few cases where you should (need to?) go into /etc/path_to_inst and remove entries. A very old system which has had lots of hardware changes can end up with lots of unused device names.

*As far as I know, every modern OS probes all of its hardware at boot, loading the correct device drivers as needed. It does this in a pre-defined order, so if nothing changes the device names stay the same.



Giving us the lovely device names like c0t0d0s0 :slightly_smiling_face:

Admittedly that makes a lot of sense:

  • controller 0
  • target 0
  • disk 0
  • slice (partition) 0
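As a small illustration of how regular that naming scheme is, the four parts can be pulled out with one regular expression. This sketch only handles the `cXtYdZ[sN]` form shown above; names without a target component, or with long WWN-style targets, are not covered, and `parse_ctds` is just a name picked for this example.

```python
import re

# controller, target, disk, and an optional slice
_CTDS = re.compile(r"^c(\d+)t(\d+)d(\d+)(?:s(\d+))?$")


def parse_ctds(name):
    """Split a Solaris-style device name such as c0t0d0s0 into its parts."""
    match = _CTDS.match(name)
    if match is None:
        raise ValueError(f"not a cXtYdZ[sN] device name: {name!r}")
    controller, target, disk, slice_ = match.groups()
    return {
        "controller": int(controller),
        "target": int(target),
        "disk": int(disk),
        "slice": None if slice_ is None else int(slice_),
    }
```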

But I never had to edit device mappings. If there were major changes you performed a reconfiguration boot to force a new detection and enumeration from zero:

touch /reconfigure
init 6

Sure. But things do change. On one boot HDD1 may initialise in 2.3345 seconds, and on the next it may take 2.634 seconds, so HDD2 finishes before HDD1…

If you have an HBA that lets you control the spin-up order then sure, you can control this to some degree, but that will be of no help for those running drives off the motherboard SATA controller.

Have you seen this?

Every system I have touched takes longer to load the kernel than the drives to spin up (unless you have staggered start enabled, in which case the drives stagger their start in the same order every time). The drives generally start spinning up when power is applied.

I think the whole deal with device names changing is shrouded in mystery. I am trying to shed light on it and explain why devices are named in the order they are and why the device names change. Note that UUID, WWN, serial number, and path-based names do not change (path-based names might if you added or removed an HBA).
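On Linux, those stable names show up as symlinks under `/dev/disk/by-id` (and `by-path`, `by-uuid`) that point at whatever kernel name the disk received on this boot. A minimal sketch of listing them, assuming the usual udev layout; the function name and the choice to skip `-part` entries are mine:

```python
import os


def stable_names(by_id_dir="/dev/disk/by-id"):
    """Map current kernel device name -> list of stable symlink names."""
    mapping = {}
    for entry in sorted(os.listdir(by_id_dir)):
        if "-part" in entry:
            continue  # partition links; only whole disks are of interest here
        # Resolve the symlink to the device node it currently points at.
        target = os.path.realpath(os.path.join(by_id_dir, entry))
        mapping.setdefault(os.path.basename(target), []).append(entry)
    return mapping
```

Run it before and after a reboot: the keys (sda, sdb, ...) may move around, but the WWN and serial-based names attached to each physical disk stay the same.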

A failed drive will also change the device naming, since it may no longer be detected, and that is a change.

I have seen too many people lose data due to not understanding why something in their system changed.

Yes, I have had drives swap device names (sda and sdb swapping places) without anything changing physically in the configuration other than an ordinary reboot.


I apologize, I was unclear. I meant: have you seen device names change due to differences in the spin-up times of drives?

Do you know why sda and sdb swapped names?

Do they do that every time you reboot or only some times?

No, I can’t prove what the underlying cause was, as I lack the equipment and know-how to determine that empirically.

It doesn’t necessarily happen every time, but I don’t reboot often enough to put a percentage on it.

There was a thread (not mine) on the old Forums for how to turn FibreChannel on in CORE that I remember, but that’s all I can recall. How TrueNAS Enterprise does slot mappings is on GitHub for anyone to view. It may be possible to write a similar script extension for your own specific hardware. However, there be dragons.

Here’s my brief exploration of this topic

As for this thread, I had suggested using NetBox as a source of truth for drive mappings…and I still think that’s a valid solution to this problem. Maybe one of these days I’ll do a writeup for NetBox because it’s really quite useful to have an IPAM/DCIM tool in your homelab.

Also @dan I just saw your script. zpscan-scale/zpscan-scale.sh at master · danb35/zpscan-scale · GitHub Very nice. Obviously not a product endorsement, just looks cool. I wonder if you could do the same thing without using encled. I think sg_ses can do it.
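For reference, a hedged sketch of what the `sg_ses` route might look like. `sg_ses` (from sg3_utils) accepts `--index`, `--set=ident`, and `--clear=ident`, but which index corresponds to which physical bay depends on the enclosure, so the device path and slot number below are placeholders you would have to work out for your own backplane. The wrapper functions are hypothetical.

```python
import subprocess


def ident_command(ses_device, slot_index, on=True):
    """Build the sg_ses command line to turn a slot's locate LED on or off."""
    action = "--set=ident" if on else "--clear=ident"
    return ["sg_ses", f"--index={slot_index}", action, ses_device]


def blink(ses_device, slot_index, on=True):
    """Run the command; needs root and an enclosure that supports SES."""
    subprocess.run(ident_command(ses_device, slot_index, on), check=True)
```

For example, `ident_command("/dev/sg3", 7)` produces the equivalent of `sg_ses --index=7 --set=ident /dev/sg3`. Whether slot 7 is the bay you expect is something to verify with a healthy drive first.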

It’s been posted here several times already, but the method I found that best suits my installs is a spreadsheet of drive serial numbers along with other basic drive info like capacity, when the drive was purchased, installed, replaced. From that list, I make a simple visual installed location map on the same spreadsheet of the chassis slot where each drive in the server is installed that shows the serial number. I can look at the location chart for the serial number of the failed drive and pull that drive out of the chassis. Before I replace the failed drive with a new one, I change the serial number and other associated info in the list which will also change it in the location map automatically.

With this sheet I always know the physical location of a drive, and the sheet can be printed out and hung in a sleeve on the server/rack if necessary. The sheet is not system or server dependent. The map can be re-ordered to match any server, including when the drives are reused in a different chassis. It eliminates caddy stickers that can fall off, be smudged, fade, or fail to get updated on a drive change. It does not depend on the drive ordering of any operating system, nor on lights, blinks, etc. It is immune to everything except someone physically reordering drives without updating the sheet. But then that person should not be working for you anyway.
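If the sheet is exported as CSV, looking up a failed drive can be scripted in a few lines. This is only a sketch: the column names (`serial`, `slot`, and so on) are assumptions about how such a sheet might be laid out, not anything prescribed above.

```python
import csv
import io


def load_inventory(csv_text):
    """Read the exported sheet into a dict keyed by serial number."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["serial"]: row for row in reader}


def slot_for(inventory, serial):
    """Return the recorded chassis slot for a serial, or None if unknown."""
    row = inventory.get(serial)
    return row["slot"] if row else None
```

Feed it the serial reported for the faulted disk and it returns the slot recorded in the sheet, which is only as accurate as the last update to the sheet.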


I use both… a spreadsheet with drive information including chassis location, as well as Brother P-Touch laminated 9mm labels. The laminated labels do not fade or smudge. They do fall off if you don’t clean the drive carriers first.


Yes, but that’s susceptible to:

For it to be most useful, it should incorporate pool management as well, which would include replacing drives in the matrix. Definitely a bigger ask though.

I’m pretty confident it can be done. Whether I can do it is a different question. It’d be nice to get rid of that dependency though.

This is what I use.

Last 4 digits of the serial where it’s visible. That generally means on the drive cage.

Part of replacing a drive is updating the sticker.

Maintaining a spreadsheet. That didn’t last.


Started making labels to do this, so I won’t have to consult a text file, and realized I’d wired the drive bays to the motherboard deliberately for identification. :smile:

ada# is connected to the bay labeled as # (from the case’s included label stickers) – so the serials really don’t matter.

…assuming the drives never reshuffle.
Belt and suspenders, and all that sort of thing.


:thinking:
Curious what causes reshuffles and how often they occur. If it were an HBA card I could see that happening if I moved it between slots, but these are the motherboard’s own ports.