I think the biggest challenge here is the non-uniformity of hardware.
For folks with real SAS implementations, sg_ses does exist for this kind of thing; folks with Supermicro 2U/4U servers, for example, can flash the lights on a drive with those tools. But a lot of folks don’t even have SAS backplanes, and none of this works with SATA.
Then let’s wrap in Dell/HP/Cisco/Lenovo, whose backplanes may not be as straightforward and may use proprietary communications for these types of functions.
I had some HP servers that were 12-drive LFF but only had a single SFF-8087 (4-lane) SAS connector and a SAS expander. When I tried to use sesutil, it would flash the wrong drives or not light any drives at all.
A similar story can be told for disk shelves. You should be able to query voltage, power supply status, fan speeds, etc. over an external SAS cable. But vendor implementations differ.
Here are a couple of enclosures, for example.
The output of the first lists several distinct element types (array device slots, power supplies, cooling fans, and temperature sensors).
root@rawht[~]# sg_ses -p 2 /dev/sg14 | grep -E "Element [0-9]+ descriptor" | wc -l
59
If I compare that to my EMC enclosure, I see almost double the number of elements that TrueNAS would have to parse and scrape.
root@rawht[~]# sg_ses -p 2 /dev/sg27 | grep -E "Element [0-9]+ descriptor" | wc -l
108
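A raw count hides which element types account for the difference. As a rough sketch only, assuming the status page text has already been captured from sg_ses -p 2, a per-type breakdown could look like this in Python. The function name count_elements and the trimmed SAMPLE text are illustrative, not anything TrueNAS ships.

```python
import re
from collections import Counter

# Header lines look like:
#   Element type: Cooling, subenclosure id: 3 [ti=20]
TYPE_RE = re.compile(r"Element type:\s*(?P<type>[^,]+),\s*subenclosure id:\s*\d+")
# Individual element lines look like:
#   Element 0 descriptor:
ELEM_RE = re.compile(r"Element \d+ descriptor")


def count_elements(status_text):
    """Count individual element descriptors per element type.

    "Overall descriptor" entries are skipped on purpose, matching the
    grep pattern used in the commands above.
    """
    counts = Counter()
    current_type = None
    for line in status_text.splitlines():
        header = TYPE_RE.search(line)
        if header:
            current_type = header.group("type").strip()
            continue
        if current_type is not None and ELEM_RE.search(line):
            counts[current_type] += 1
    return counts


# Trimmed, illustrative sample in the same layout as sg_ses status output.
SAMPLE = """\
Element type: Cooling, subenclosure id: 3 [ti=20]
  Overall descriptor:
    Predicted failure=0, Disabled=0, Swap=0, status: OK
  Element 0 descriptor:
    Predicted failure=0, Disabled=0, Swap=0, status: OK
  Element 1 descriptor:
    Predicted failure=0, Disabled=0, Swap=0, status: OK
Element type: Temperature sensor, subenclosure id: 3 [ti=21]
  Overall descriptor:
    Predicted failure=0, Disabled=0, Swap=0, status: OK
  Element 0 descriptor:
    Predicted failure=0, Disabled=0, Swap=0, status: OK
  Element 1 descriptor:
    Predicted failure=0, Disabled=0, Swap=0, status: OK
"""

if __name__ == "__main__":
    for element_type, n in count_elements(SAMPLE).items():
        print(f"{element_type}: {n}")
```

Running the same function over both shelves would show whether the extra elements are drive slots, sensors, or something vendor-specific.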
Let’s home in on something easy, say temperature. I can see that both shelves report temperatures in similar ways, and have the same number of temperature sensors.
root@rawht[~]# sg_ses -p 2 /dev/sg27 | grep -E "Temperature="
Temperature=37 C
Temperature=37 C
Temperature=27 C
Temperature=27 C
Temperature=32 C
Temperature=23 C
Temperature=25 C
Temperature=32 C
Temperature=25 C
Temperature=23 C
Temperature=27 C
Temperature=23 C
root@rawht[~]# sg_ses -p 2 /dev/sg14 | grep -E "Temperature="
Temperature=33 C
Temperature=35 C
Temperature=47 C
Temperature=31 C
Temperature=33 C
Temperature=43 C
Temperature=46 C
Temperature=36 C
Temperature=35 C
Temperature=70 C
Temperature=36 C
Temperature=35 C
root@rawht[~]#
root@rawht[~]# sg_ses -p 2 /dev/sg14 | grep -E "Temperature=" | wc -l
12
root@rawht[~]# sg_ses -p 2 /dev/sg27 | grep -E "Temperature=" | wc -l
12
But if you look at the full temperature output, it’s really not clear what each sensor measures or where it sits. One of these sensors reports 70 degrees. Should I be worried? I have no idea!
Element 0 descriptor:
Predicted failure=0, Disabled=0, Swap=0, status: OK
Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
UT warning=0
Temperature=33 C
Element 1 descriptor:
Predicted failure=0, Disabled=0, Swap=0, status: OK
Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
UT warning=0
Temperature=35 C
Element 2 descriptor:
Predicted failure=0, Disabled=0, Swap=0, status: OK
Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
UT warning=0
Temperature=47 C
Element 3 descriptor:
Predicted failure=0, Disabled=0, Swap=0, status: OK
Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
UT warning=0
Temperature=31 C
Element 4 descriptor:
Predicted failure=0, Disabled=0, Swap=0, status: OK
Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
UT warning=0
Temperature=32 C
Element 5 descriptor:
Predicted failure=0, Disabled=0, Swap=0, status: OK
Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
UT warning=0
Temperature=42 C
Element 6 descriptor:
Predicted failure=0, Disabled=0, Swap=0, status: OK
Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
UT warning=0
Temperature=47 C
Element 7 descriptor:
Predicted failure=0, Disabled=0, Swap=0, status: OK
Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
UT warning=0
Temperature=36 C
Element 8 descriptor:
Predicted failure=0, Disabled=0, Swap=0, status: OK
Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
UT warning=0
Temperature=35 C
Element 9 descriptor:
Predicted failure=0, Disabled=0, Swap=0, status: OK
Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
UT warning=0
Temperature=70 C
Element 10 descriptor:
Predicted failure=0, Disabled=0, Swap=0, status: OK
Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
UT warning=0
Temperature=36 C
Element 11 descriptor:
Predicted failure=0, Disabled=0, Swap=0, status: OK
Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
UT warning=0
Temperature=35 C
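About the only hint the page itself gives is the set of over/under-temperature flags on each element. As a hedged sketch, assuming the text layout shown above and that the enclosure firmware actually sets those flags, a parser could keep the flags next to each reading instead of guessing at thresholds. The names parse_temperature_elements and needs_attention are made up for illustration.

```python
import re

ELEM_RE = re.compile(r"Element (\d+) descriptor")
FLAG_RE = re.compile(r"(OT failure|OT warning|UT failure|UT warning)=(\d)")
TEMP_RE = re.compile(r"Temperature=(-?\d+) C")


def parse_temperature_elements(text):
    """Return one dict per 'Element N descriptor' block in the text."""
    sensors = []
    current = None
    for line in text.splitlines():
        start = ELEM_RE.search(line)
        if start:
            current = {"index": int(start.group(1)), "flags": {}, "temp_c": None}
            sensors.append(current)
            continue
        if current is None:
            continue  # ignore anything before the first element
        for name, value in FLAG_RE.findall(line):
            current["flags"][name] = value == "1"
        temp = TEMP_RE.search(line)
        if temp:
            current["temp_c"] = int(temp.group(1))
    return sensors


def needs_attention(sensor):
    """True only if the enclosure itself raised a temperature flag."""
    return any(sensor["flags"].values())


# Two elements copied from the output above.
SAMPLE = """\
Element 8 descriptor:
Predicted failure=0, Disabled=0, Swap=0, status: OK
Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
UT warning=0
Temperature=35 C
Element 9 descriptor:
Predicted failure=0, Disabled=0, Swap=0, status: OK
Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
UT warning=0
Temperature=70 C
"""

if __name__ == "__main__":
    for sensor in parse_temperature_elements(SAMPLE):
        print(sensor["index"], sensor["temp_c"], needs_attention(sensor))
```

With this sample the 70 C sensor raises no flag, so the shelf apparently considers it normal, but that still says nothing about what the sensor measures or where it is.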
The other shelf reports this differently and seems to group its elements logically into subenclosures, whereas the shelf above puts everything in subenclosure 0.
Here’s subenclosure 3. Where’s that? I have no idea, but the fan is at 2700 RPM and the temperature is 25 degrees?
Element type: Cooling, subenclosure id: 3 [ti=20]
Overall descriptor:
Predicted failure=0, Disabled=0, Swap=0, status: OK
Ident=0, Do not remove=0, Hot swap=0, Fail=0, Requested on=1
Off=0, Actual speed=2700 rpm, Fan at third lowest speed
Element 0 descriptor:
Predicted failure=0, Disabled=0, Swap=0, status: OK
Ident=0, Do not remove=0, Hot swap=0, Fail=0, Requested on=0
Off=0, Actual speed=2700 rpm, Fan at third lowest speed
Element 1 descriptor:
Predicted failure=0, Disabled=0, Swap=0, status: OK
Ident=0, Do not remove=0, Hot swap=0, Fail=0, Requested on=1
Off=0, Actual speed=2700 rpm, Fan at third lowest speed
Element type: Temperature sensor, subenclosure id: 3 [ti=21]
Overall descriptor:
Predicted failure=0, Disabled=0, Swap=0, status: OK
Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
UT warning=0
Temperature=25 C
Element 0 descriptor:
Predicted failure=0, Disabled=0, Swap=0, status: OK
Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
UT warning=0
Temperature=32 C
Element 1 descriptor:
Predicted failure=0, Disabled=0, Swap=0, status: OK
Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
UT warning=0
Temperature=25 C
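To at least see how a shelf carves itself up, the "Element type" header lines can be grouped by subenclosure id. This is only a sketch over the header format shown above. The function name group_by_subenclosure is hypothetical, and the HEADERS sample reuses header lines that appear in this post.

```python
import re
from collections import defaultdict

# Matches e.g.: Element type: Cooling, subenclosure id: 3 [ti=20]
HEADER_RE = re.compile(
    r"Element type:\s*(?P<type>[^,]+),\s*"
    r"subenclosure id:\s*(?P<sub>\d+)\s*\[ti=(?P<ti>\d+)\]"
)


def group_by_subenclosure(status_text):
    """Map subenclosure id -> list of (element type, type index)."""
    groups = defaultdict(list)
    for line in status_text.splitlines():
        m = HEADER_RE.search(line)
        if m:
            entry = (m.group("type").strip(), int(m.group("ti")))
            groups[int(m.group("sub"))].append(entry)
    return dict(groups)


# Header lines as reported by the shelf that uses subenclosures.
HEADERS = """\
Element type: Cooling, subenclosure id: 3 [ti=20]
Element type: Temperature sensor, subenclosure id: 3 [ti=21]
Element type: Power supply, subenclosure id: 3 [ti=22]
Element type: Power supply, subenclosure id: 4 [ti=25]
"""

if __name__ == "__main__":
    for sub_id, entries in sorted(group_by_subenclosure(HEADERS).items()):
        print(sub_id, entries)
```

That gives a map of which element types live in which subenclosure, but the ids are still just numbers. Tying subenclosure 3 to a physical part of the chassis takes vendor knowledge the status page does not carry.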
What about power supplies? Well, I can see one of my shelves reports the two as unique elements, while the other (which also has two power supplies) reports them only as one.
root@rawht[~]# sg_ses -p 2 /dev/sg14 | grep -E "Power supply"
Element type: Power supply, subenclosure id: 0 [ti=1]
root@rawht[~]# sg_ses -p 2 /dev/sg27 | grep -E "Power supply"
Element type: Power supply, subenclosure id: 3 [ti=22]
Element type: Power supply, subenclosure id: 4 [ti=25]
root@rawht[~]# sg_ses -p 2 /dev/sg14 | grep -E -A 10 "Power supply"
Element type: Power supply, subenclosure id: 0 [ti=1]
Overall descriptor:
Predicted failure=0, Disabled=0, Swap=0, status: OK
Ident=0, Do not remove=0, DC overvoltage=0, DC undervoltage=0
DC overcurrent=0, Hot swap=0, Fail=0, Requested on=0, Off=0
Overtmp fail=0, Temperature warn=0, AC fail=0, DC fail=0
Element 0 descriptor:
Predicted failure=0, Disabled=0, Swap=0, status: OK
Ident=0, Do not remove=0, DC overvoltage=0, DC undervoltage=0
DC overcurrent=0, Hot swap=0, Fail=0, Requested on=1, Off=0
Overtmp fail=0, Temperature warn=0, AC fail=0, DC fail=0
root@rawht[~]#
Then we see Requested on=0 on all but 1 of the 4 power supplies, despite them all saying “status: OK” and AC fail=0. What does that mean?
root@rawht[~]# sg_ses -p 2 /dev/sg14 | grep -E -A 10 "Power supply"
Element type: Power supply, subenclosure id: 0 [ti=1]
Overall descriptor:
Predicted failure=0, Disabled=0, Swap=0, status: OK
Ident=0, Do not remove=0, DC overvoltage=0, DC undervoltage=0
DC overcurrent=0, Hot swap=0, Fail=0, Requested on=0, Off=0
Overtmp fail=0, Temperature warn=0, AC fail=0, DC fail=0
Element 0 descriptor:
Predicted failure=0, Disabled=0, Swap=0, status: OK
Ident=0, Do not remove=0, DC overvoltage=0, DC undervoltage=0
DC overcurrent=0, Hot swap=0, Fail=0, Requested on=1, Off=0
Overtmp fail=0, Temperature warn=0, AC fail=0, DC fail=0
root@rawht[~]#
root@rawht[~]# sg_ses -p 2 /dev/sg27 | grep -E -A 10 "Power supply"
Element type: Power supply, subenclosure id: 3 [ti=22]
Overall descriptor:
Predicted failure=0, Disabled=0, Swap=0, status: OK
Ident=0, Do not remove=0, DC overvoltage=0, DC undervoltage=0
DC overcurrent=0, Hot swap=1, Fail=0, Requested on=0, Off=0
Overtmp fail=0, Temperature warn=0, AC fail=0, DC fail=0
Element 0 descriptor:
Predicted failure=0, Disabled=0, Swap=0, status: OK
Ident=0, Do not remove=0, DC overvoltage=0, DC undervoltage=0
DC overcurrent=0, Hot swap=1, Fail=0, Requested on=0, Off=0
Overtmp fail=0, Temperature warn=0, AC fail=0, DC fail=0
--
Element type: Power supply, subenclosure id: 4 [ti=25]
Overall descriptor:
Predicted failure=0, Disabled=0, Swap=0, status: OK
Ident=0, Do not remove=0, DC overvoltage=0, DC undervoltage=0
DC overcurrent=0, Hot swap=1, Fail=0, Requested on=0, Off=0
Overtmp fail=0, Temperature warn=0, AC fail=0, DC fail=0
Element 0 descriptor:
Predicted failure=0, Disabled=0, Swap=0, status: OK
Ident=0, Do not remove=0, DC overvoltage=0, DC undervoltage=0
DC overcurrent=0, Hot swap=1, Fail=0, Requested on=0, Off=0
Overtmp fail=0, Temperature warn=0, AC fail=0, DC fail=0
root@rawht[~]#
The unfortunate truth here is that handlers would have to be written for each and every piece of hardware in existence in order to have a functioning enclosure view for other vendors’ equipment.
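To make that concrete, here is a rough Python sketch of what per-hardware handlers might look like: a registry keyed on the enclosure’s vendor and product strings, with a generic fallback. Everything in it is hypothetical, including the vendor/product IDs, the handler names, and the hard-coded power supply count. It is not how TrueNAS actually does it.

```python
from typing import Callable, Dict, Tuple

Handler = Callable[[str], dict]
_HANDLERS: Dict[Tuple[str, str], Handler] = {}


def register(vendor: str, product: str):
    """Decorator that registers a handler for one (vendor, product) pair."""
    def decorator(fn: Handler) -> Handler:
        _HANDLERS[(vendor.upper(), product.upper())] = fn
        return fn
    return decorator


def count_psu_headers(status_text: str) -> int:
    """Count 'Element type: Power supply' header lines in a status page."""
    return sum(
        1 for line in status_text.splitlines()
        if line.strip().startswith("Element type: Power supply")
    )


def generic_handler(status_text: str) -> dict:
    # Fallback: report only what the page says, with no vendor knowledge.
    return {"handler": "generic", "psu_count": count_psu_headers(status_text)}


@register("VENDOR-A", "SHELF-1")  # hypothetical IDs
def shelf_with_one_entry_per_psu(status_text: str) -> dict:
    # This imagined shelf lists one Power supply entry per physical PSU.
    return {"handler": "shelf-1", "psu_count": count_psu_headers(status_text)}


@register("VENDOR-B", "SHELF-2")  # hypothetical IDs
def shelf_with_single_entry(status_text: str) -> dict:
    # This imagined shelf lists a single entry for two physical PSUs, so the
    # real count has to be hard-coded from knowledge of the hardware.
    return {"handler": "shelf-2", "psu_count": 2}


def describe(vendor: str, product: str, status_text: str) -> dict:
    """Dispatch to a registered handler, or fall back to the generic one."""
    key = (vendor.upper(), product.upper())
    return _HANDLERS.get(key, generic_handler)(status_text)


if __name__ == "__main__":
    page = "Element type: Power supply, subenclosure id: 0 [ti=1]\n"
    print(describe("vendor-b", "shelf-2", page))
    print(describe("unknown", "box", page))
```

The dispatch part is trivial. The expensive part is that every registered handler encodes facts somebody had to learn by having that exact shelf on the bench.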