View Enclosure Screen for non-iX hardware

zmweske · September 9, 2024, 1:22am

Problem/Justification

(What is the problem you are trying to solve with this feature/improvement or why should it be considered?)

It would be great to be able to create custom view enclosure screens for non-iX hardware. It might be possible to create a simple temporary solution where existing iX chassis can be used as a filler. I don’t know if the enterprise solution has features that can show drives with errors, etc., but even if drive serial numbers are manually entered by the user, it may be relatively simple to implement. The “user story” below describes a more suitable long-term solution for custom enclosure design.

Impact

(How is this feature going to impact all TrueNAS users? What are the benefits and advantages? Are there disadvantages?)

This would be a great addition to the community edition. It would alleviate the need to keep a separate spreadsheet with serial numbers and such. It would be a simple and clean way to view devices while also allowing for a much better way to view failing or errored drives.
The only disadvantage I can think of with this would be added complexity. For full compatibility, it would be great to be able to customize front, back, and top load chassis with variable grid sizes/orientation, and drive sizes (2.5/3.5/NVMe maybe?). The images used could be basic images or just other iX chassis.

User Story

(Please give a short description on how you envision some user taking advantage of this feature, what are the steps a user will follow to accomplish it)

Below is taken directly from the Jira Feature Request, posted by a deactivated user so I am unable to give credit.

I know that being able to use the full enclosure feature requires specific hardware, but would there be a possibility to allow users not using that hardware to make use of the same display widget, with a section in the configuration somewhere where you could specify which drive is in what slot and how your slots are laid out?
For example, I have one server running in a Norco chassis that is 6 rows and 4 columns for a total of 24 drives. So if under storage there is the enclosure widget and you selected “View Enclosure”, you would be presented with the details, and a settings cog in the top right corner would let you specify that you want to use “Manual Enclosure Management”, which would be a checkbox or similar.
This would then let you specify either front only, front and back, or top-load, and add the rows and/or columns for each configuration. You would then be able to add the appropriate information and associate it to a disk on the system. Ideally, once this is done it would provide some of the enclosure functionality that is used on the supported hardware, but at the very least it would give administrators a way to document and reference what drive is in what slot should they need to manipulate a drive or drives in and out of the system.
There is definitely more to this than just making some pages and a table, but I wanted to try and express the idea in more detail, though I don’t know if this is able to be done or if it would be a feature most people that do not use enclosure hardware would want.

For additional information, context, and a graphic mockup created by another user on GitHub, check out discussion 6771 (Devices Page Mockup · truenas/webui · Discussion #6771 · GitHub)

joeschmuck · September 9, 2024, 9:48am

While I think that would be a nice handy visual tool, I also doubt that iXsystems would do that, since the feature is a selling point for their own systems.

Meanwhile, for the rest of us, there is a method other than using a spreadsheet in TrueNAS, and it has been around for a long time. From the CORE GUI, go to Storage -> Disk -> Edit and edit the Description to enter the location. I have to do that with my NVMe system, as the serial numbers are on the backside of the drives, so they are not visible unless you remove them.

Protopia · September 9, 2024, 7:17pm

If iXsystems hardware doesn’t have enough distinguishing features to sell itself, I doubt that Enclosures being unique will close that gap.

Besides, there are going to be ample opportunities for Enclosures to be better for iX hardware than for general hardware, e.g. auto-configuration and hardware integration with error lights.

Davvo · September 9, 2024, 9:11pm

We got answers in the past that iX is not willing to develop this kind of feature due to the difficulty of reliably implementing it.

NickF1227 · September 11, 2024, 2:16am

I think the biggest challenge here is the non-uniformity of hardware.

For folks with real SAS implementations, sg_ses does exist for this kind of thing. Folks with Supermicro 2/4U servers, as an example, can flash the lights on a drive with those tools. But a lot of folks don’t even have SAS backplanes, and none of this works with SATA.

Then let’s wrap in Dell/HP/Cisco/Lenovo, whose backplanes may not be as straightforward and may have proprietary communications for these types of functions.

I had some HP servers which were 12 drive LFF but only had a single 8087 (4 lane) SAS connector and a SAS expander. Sesutil would flash the wrong drives or wouldn’t even light any drives when I tried to use it.

A similar story can be said for disk shelves. You should be able to query voltage, power supply status, fan speeds, etc over an external SAS cable. But vendor implementations differ.

Here’s a couple of enclosures for example.
The output mentions several unique elements (array device slots, power supplies, cooling fans, and temperature sensors).

root@rawht[~]# sg_ses -p 2 /dev/sg14 | grep -E "Element [0-9]+ descriptor" | wc -l
59

If I compare that to my EMC enclosure, I see almost double the values that TrueNAS would have to parse and scrape.

root@rawht[~]# sg_ses -p 2 /dev/sg27 | grep -E "Element [0-9]+ descriptor" | wc -l
108

Let’s home in on something easy, say temperature. I can see that both shelves report temperatures in similar ways, and have the same number of temperature sensors.

root@rawht[~]# sg_ses -p 2 /dev/sg27 | grep -E "Temperature="
        Temperature=37 C
        Temperature=37 C
        Temperature=27 C
        Temperature=27 C
        Temperature=32 C
        Temperature=23 C
        Temperature=25 C
        Temperature=32 C
        Temperature=25 C
        Temperature=23 C
        Temperature=27 C
        Temperature=23 C
root@rawht[~]# sg_ses -p 2 /dev/sg14 | grep -E "Temperature="

        Temperature=33 C
        Temperature=35 C
        Temperature=47 C
        Temperature=31 C
        Temperature=33 C
        Temperature=43 C
        Temperature=46 C
        Temperature=36 C
        Temperature=35 C
        Temperature=70 C
        Temperature=36 C
        Temperature=35 C
root@rawht[~]#
root@rawht[~]# sg_ses -p 2 /dev/sg14 | grep -E "Temperature=" | wc -l
12
root@rawht[~]# sg_ses -p 2 /dev/sg27 | grep -E "Temperature=" | wc -l
12

But if you look at the full output of the temperature, it’s really not very clear what or where that temperature is. One of these sensors reports being 70 degrees. Should I be worried? I have no idea!

      Element 0 descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: OK
        Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
        UT warning=0
        Temperature=33 C
      Element 1 descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: OK
        Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
        UT warning=0
        Temperature=35 C
      Element 2 descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: OK
        Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
        UT warning=0
        Temperature=47 C
      Element 3 descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: OK
        Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
        UT warning=0
        Temperature=31 C
      Element 4 descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: OK
        Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
        UT warning=0
        Temperature=32 C
      Element 5 descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: OK
        Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
        UT warning=0
        Temperature=42 C
      Element 6 descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: OK
        Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
        UT warning=0
        Temperature=47 C
      Element 7 descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: OK
        Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
        UT warning=0
        Temperature=36 C
      Element 8 descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: OK
        Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
        UT warning=0
        Temperature=35 C
      Element 9 descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: OK
        Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
        UT warning=0
        Temperature=70 C
      Element 10 descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: OK
        Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
        UT warning=0
        Temperature=36 C
      Element 11 descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: OK
        Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
        UT warning=0
        Temperature=35 C
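To illustrate the point, here is a minimal Python sketch (not anything TrueNAS ships) that parses output shaped like the above into (subenclosure, element index, temperature) tuples. The function name and regular expressions are my own assumptions, written against the text format pasted in this thread only. Even with clean parsing, all you get back is an element index, not a physical location.

```python
import re

# Two element descriptors copied from the output pasted above.
SAMPLE = """\
      Element 8 descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: OK
        Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
        UT warning=0
        Temperature=35 C
      Element 9 descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: OK
        Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
        UT warning=0
        Temperature=70 C
"""

# Patterns are assumptions based on the text format shown in this thread.
TYPE_RE = re.compile(r"^\s*Element type: .*subenclosure id: (\d+)")
ELEMENT_RE = re.compile(r"^\s*Element (\d+) descriptor:")
TEMP_RE = re.compile(r"Temperature=(-?\d+) C")


def parse_temperatures(text):
    """Return a list of (subenclosure, element_index, degrees_c) tuples.

    "Overall descriptor" readings are skipped. The subenclosure is None
    when no "Element type:" header precedes the element.
    """
    readings = []
    subenclosure = None
    element = None
    for line in text.splitlines():
        type_match = TYPE_RE.match(line)
        if type_match:
            subenclosure = int(type_match.group(1))
            element = None
            continue
        if "Overall descriptor" in line:
            element = None
            continue
        element_match = ELEMENT_RE.match(line)
        if element_match:
            element = int(element_match.group(1))
            continue
        temp_match = TEMP_RE.search(line)
        if temp_match and element is not None:
            readings.append((subenclosure, element, int(temp_match.group(1))))
    return readings


for subenclosure, element, degrees in parse_temperatures(SAMPLE):
    print(f"subenclosure={subenclosure} element={element} temp={degrees} C")
```

The sketch can tell you that element 9 reads 70 C, but nothing in the output says whether element 9 is a drive bay, an expander chip, or a power supply.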

The other shelf reports this differently and seems to group them logically into subenclosures, whereas the above shelf puts everything in subenclosure 0.

Here’s subenclosure 3. Where’s that? I have no idea, but the fan is at 2700RPM and the temperature is 25 degrees?

    Element type: Cooling, subenclosure id: 3 [ti=20]
      Overall descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: OK
        Ident=0, Do not remove=0, Hot swap=0, Fail=0, Requested on=1
        Off=0, Actual speed=2700 rpm, Fan at third lowest speed
      Element 0 descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: OK
        Ident=0, Do not remove=0, Hot swap=0, Fail=0, Requested on=0
        Off=0, Actual speed=2700 rpm, Fan at third lowest speed
      Element 1 descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: OK
        Ident=0, Do not remove=0, Hot swap=0, Fail=0, Requested on=1
        Off=0, Actual speed=2700 rpm, Fan at third lowest speed
    Element type: Temperature sensor, subenclosure id: 3 [ti=21]
      Overall descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: OK
        Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
        UT warning=0
        Temperature=25 C
      Element 0 descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: OK
        Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
        UT warning=0
        Temperature=32 C
      Element 1 descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: OK
        Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
        UT warning=0
        Temperature=25 C

What about power supplies? Well, I can see one of my shelves reports the two as unique elements, while the other (which also has two power supplies) reports them only as one.

root@rawht[~]# sg_ses -p 2 /dev/sg14 | grep -E "Power supply"
    Element type: Power supply, subenclosure id: 0 [ti=1]
root@rawht[~]# sg_ses -p 2 /dev/sg27 | grep -E "Power supply"
    Element type: Power supply, subenclosure id: 3 [ti=22]
    Element type: Power supply, subenclosure id: 4 [ti=25]

root@rawht[~]# sg_ses -p 2 /dev/sg14 | grep -E -A 10 "Power supply"
    Element type: Power supply, subenclosure id: 0 [ti=1]
      Overall descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: OK
        Ident=0, Do not remove=0, DC overvoltage=0, DC undervoltage=0
        DC overcurrent=0, Hot swap=0, Fail=0, Requested on=0, Off=0
        Overtmp fail=0, Temperature warn=0, AC fail=0, DC fail=0
      Element 0 descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: OK
        Ident=0, Do not remove=0, DC overvoltage=0, DC undervoltage=0
        DC overcurrent=0, Hot swap=0, Fail=0, Requested on=1, Off=0
        Overtmp fail=0, Temperature warn=0, AC fail=0, DC fail=0
root@rawht[~]#

Then we see Requested on=0 on all but 1 of the 4 power supplies, despite them all saying “status: OK” and AC fail=0. What does that mean?

root@rawht[~]# sg_ses -p 2 /dev/sg27 | grep -E -A 10 "Power supply"
    Element type: Power supply, subenclosure id: 3 [ti=22]
      Overall descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: OK
        Ident=0, Do not remove=0, DC overvoltage=0, DC undervoltage=0
        DC overcurrent=0, Hot swap=1, Fail=0, Requested on=0, Off=0
        Overtmp fail=0, Temperature warn=0, AC fail=0, DC fail=0
      Element 0 descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: OK
        Ident=0, Do not remove=0, DC overvoltage=0, DC undervoltage=0
        DC overcurrent=0, Hot swap=1, Fail=0, Requested on=0, Off=0
        Overtmp fail=0, Temperature warn=0, AC fail=0, DC fail=0
--
    Element type: Power supply, subenclosure id: 4 [ti=25]
      Overall descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: OK
        Ident=0, Do not remove=0, DC overvoltage=0, DC undervoltage=0
        DC overcurrent=0, Hot swap=1, Fail=0, Requested on=0, Off=0
        Overtmp fail=0, Temperature warn=0, AC fail=0, DC fail=0
      Element 0 descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: OK
        Ident=0, Do not remove=0, DC overvoltage=0, DC undervoltage=0
        DC overcurrent=0, Hot swap=1, Fail=0, Requested on=0, Off=0
        Overtmp fail=0, Temperature warn=0, AC fail=0, DC fail=0
root@rawht[~]#

The unfortunate truth here is that handlers would have to be written for each and every piece of hardware in existence in order to have a functioning enclosure view for other vendors’ equipment.

Captain_Morgan · September 16, 2024, 4:11am

That’s accurate. If there was an industry standard solution, we would adopt it.
As @NickF1227 points out, the “non uniformity” of hardware is challenging.

When we do it for our own appliances, it takes effort for each appliance, but we sell 100s of the same units and get a return for that effort.

The Edit Disk option allows information about location to be entered.

zmweske · September 16, 2024, 4:01pm

Thank you for your detailed and in-depth reply. However, the main thing I am referring to is a graphical representation of the location of the drives. Additional features like power draw are well beyond the basic use case here. This would basically be a graphic grid with green, yellow, or red boxes depending on the state/health of each drive. Drive location would be manually entered and then mapped to the graphical mockup, so if a drive is starting to fail, you can see in the GUI which position it is in. The only automatic part of the interface would be SMART health, which is standardized, and nothing else. Whether or not you should be able to query voltage, power supplies, or anything else is far beyond the scope of this feature request. The mockup in the GitHub link shows the simple nature of this GUI feature, as opposed to a fully integrated API.
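As a rough illustration of how small the grid idea is, here is a hedged Python sketch that renders a text version of it. The health labels ("ok", "warning", "failed"), the bay numbering, and the function name are all hypothetical. How the health value would be derived from SMART data is deliberately left out and assumed to be supplied by the caller.

```python
# Hypothetical mapping from a per-drive health label to a box color.
COLORS = {"ok": "green", "warning": "yellow", "failed": "red"}


def render_grid(rows, columns, bay_health):
    """Render bays as a text grid, numbered left to right, top to bottom.

    bay_health maps a 1-based bay number to a health label. Bays with no
    entry (or an unknown label) are shown as "empty".
    """
    lines = []
    for row in range(rows):
        cells = []
        for column in range(columns):
            bay = row * columns + column + 1
            color = COLORS.get(bay_health.get(bay), "empty")
            cells.append(f"[{bay:>2}:{color:<6}]")
        lines.append(" ".join(cells))
    return "\n".join(lines)


# A 2 x 3 chassis where bay 2 has failed and bay 6 is unpopulated.
print(render_grid(2, 3, {1: "ok", 2: "failed", 3: "ok", 4: "warning", 5: "ok"}))
```

A real implementation would draw boxes in the web UI rather than print text, but the input is the same: a grid size plus a per-bay health value.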

Davvo · September 16, 2024, 4:35pm

To me those are basic features compared to a graphical representation. Different outlooks.

Protopia · September 16, 2024, 4:45pm

I agree - a generic enclosure needs to have only the following IMO:

One or more enclosures, each of which can be named by the user (for those large users who have multiple JBOD racks).
Within each enclosure, one or more slot groups each of which has a 2D size (in slots horizontally and vertically) and a horizontal/vertical flag to show the orientation of the slots.

With the above two requirements you can create a graphic representation of the physical disk layout.

For each slot, a manufacturer/model/serial number field (which can be populated using a dropdown), and once the serial number is populated, the device name, a SMART status indicator “light” (green, yellow, red, and possibly some other bigger warning messages when pool status is not good), a pool/vDev label (showing pool name, vDev name and vDev type), and possibly controls for e.g. starting the process of removing or replacing the drive from a pool.

IMO these are the basic requirements, and requirements which can be developed once and which should pretty much cover the needs of almost all users, small, medium and large. However these are the personal views of one person, and others may have additional requirements they would like to see - and if so please add them.
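The requirements above map onto a very small data model. The following Python sketch is only one way to express it. Every class and field name is hypothetical and not taken from TrueNAS, and the per-slot status fields (device name, SMART state, pool/vDev label) are omitted because they would be looked up at runtime from the serial number.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Slot:
    # Serial number entered by the user or picked from a dropdown.
    serial: Optional[str] = None


@dataclass
class SlotGroup:
    columns: int                   # slots horizontally
    rows: int                      # slots vertically
    vertical_slots: bool = False   # orientation flag for drawing the slots
    slots: List[Slot] = field(default_factory=list)

    def __post_init__(self):
        # Create one empty slot per grid position if none were supplied.
        if not self.slots:
            self.slots = [Slot() for _ in range(self.columns * self.rows)]

    def slot_at(self, row: int, column: int) -> Slot:
        """Return the slot at a 0-based (row, column) position."""
        if not (0 <= row < self.rows and 0 <= column < self.columns):
            raise IndexError("slot position outside the grid")
        return self.slots[row * self.columns + column]


@dataclass
class Enclosure:
    name: str                      # user-chosen, e.g. one per JBOD shelf
    groups: List[SlotGroup] = field(default_factory=list)


# The 6-row by 4-column, 24-bay chassis from the original request.
chassis = Enclosure("Norco", [SlotGroup(columns=4, rows=6)])
chassis.groups[0].slot_at(0, 0).serial = "EXAMPLE-SERIAL-1"  # made-up serial
print(len(chassis.groups[0].slots))
```

A front-and-back or top-load chassis would simply be an enclosure with more than one slot group.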

zmweske · September 16, 2024, 5:44pm

You did a much better job of spelling out and trying to explain what it would look like than I did, thanks. I pretty much completely agree with the way you described it- its main functionality besides some basic mapping would just be the graphical representation.

etorix · September 16, 2024, 6:17pm

You realise that is a not-insignificant amount of work for the user… and one to be redone every time internal cables are moved, don’t you? In my opinion, this manual work is where the whole thing breaks down, and those who understand they have to draw and maintain the map are exactly those who already have visible serial numbers on each drive and/or a spreadsheet with bay labels, serial numbers, date of purchase, and invoice number nicely filled in.

Protopia · September 16, 2024, 7:55pm

In my proposal the user would need to:

Say how many disk enclosures they have - for the majority of users this will be “1”.
Say how many bays the enclosure has (for most users this will be e.g. 5 x 1: 5 bays across, 1 bay high). The default slot orientation would be vertical if the enclosure is one bay high, otherwise horizontal, but the user could change this.

TrueNAS would then create a graphical representation of the bays and request the user to define the serial numbers of the disk in each bay.

If the disks are already in the bays when the system was powered on, TrueNAS will present a dropdown list of disks to select from for each bay. Or the user can manually enter the serial number for the disk.

TrueNAS knows from querying each drive what the serial numbers are, and so can work out what the mapping is (for this boot) between devices and slots i.e. which device /dev/sdX relates to each slot - and if you power off the system and switch the cables around, when you reboot TrueNAS will work out this mapping afresh.

This doesn’t seem to be a significant one-off effort on the part of the user.

I guess a requirement I missed is for TN to flag up if a serial number in a slot is not present so the user can fix it, and for TN to handle one-by-one drive changes done through the UI.
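The per-boot mapping described above amounts to a dictionary lookup keyed on serial number. Here is a minimal Python sketch of it. The function name, slot labels, and serial numbers are made up for illustration, and the discovered serial-to-device table is assumed to come from whatever disk inventory the system already has (for example, something like `lsblk -o NAME,SERIAL` on Linux).

```python
def resolve_slots(slot_serials, discovered):
    """Match user-entered slot serials against the drives seen this boot.

    slot_serials: {slot label: serial number}, entered once by the user.
    discovered:   {serial number: device name}, rebuilt on every boot.

    Returns (slot -> device mapping, slots whose serial was not found,
    serials that are present but not assigned to any slot).
    """
    mapping = {}
    missing = []
    for slot, serial in slot_serials.items():
        device = discovered.get(serial)
        if device is None:
            missing.append(slot)
        else:
            mapping[slot] = device
    unassigned = sorted(set(discovered) - set(slot_serials.values()))
    return mapping, missing, unassigned


# Hypothetical data: bay 3's drive is gone and a new drive has appeared.
slots = {"bay1": "SN-AAA", "bay2": "SN-BBB", "bay3": "SN-CCC"}
seen = {"SN-BBB": "/dev/sda", "SN-AAA": "/dev/sdb", "SN-DDD": "/dev/sdc"}

mapping, missing, unassigned = resolve_slots(slots, seen)
print(mapping)
print(missing)
print(unassigned)
```

Because the lookup is by serial number, swapping cables only changes the device names in the result. The slot assignments stay valid, and the missing and unassigned lists are what the UI would flag for the user to fix.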

zmweske · September 16, 2024, 10:19pm

Why would this be redone any time internal cables are moved? And why are you rewiring your server so much? Either way, it would have nothing to do with how it is physically wired- it would be a manual mapping of serial number 1 to bay 1, serial number 2 to bay 2, and so on. If drive 1 with serial number 1 has some smart errors or faults, it would display those errors in the GUI in the spot for bay 1. You can easily see what physical drive has those issues by looking at the GUI. Then you can replace that drive easily and enter serial number 3 in the spot for bay 1 and visually see the mapping of the health of all your drives.

I already have a spreadsheet with basic info about my drive layout, but the whole point would be to not have to maintain that and to be able to see visually with the GUI the health of everything. Additionally, some servers, like my own, don’t make it easy to see labels on the front of drives unless you open the drive caddy and remove the drive, which makes the whole thing pointless.

How many drives do you have that it would be a not-insignificant amount of work, given that it happens once? I currently have 12 drives and it would probably take less than 5 minutes to do that initially. Then it wouldn’t have to be touched again until a drive fails.

Lastly, if someone is keeping track of purchase orders and invoice numbers with each drive, I would argue that it is likely that they are probably operating at the level where they would use an enterprise iX system regardless. If not, they probably have enough that it would help them to see what enclosure and bay has a failed drive rather than going back and forth double checking the serial numbers anyways.

NickF1227 · September 16, 2024, 10:41pm

If you are requesting something that requires the user to manually input the data as to what-drive-is-in-what-slot then I have less of a technical concern and more of a practical one. The whole reason why Enclosure View is such a nice Enterprise feature is because you do not need to have prior knowledge of what hard drive serial number is in what drive slot.

Having to do it manually would be far less useful (with a high risk for error), and such a need is better served in a single pane of glass of your entire infrastructure with tools like Netbox, which will let you document all of the assets you have in one place.

That’s not to say that this isn’t a valid request, I just don’t personally see the value.

zmweske · September 17, 2024, 11:53am

Fair enough, I think I would disagree. Also, does NetBox support any SMART monitoring? I didn’t see anything about that after some quick searching, but I could be wrong. I’ll probably check it out at some point regardless cuz it seems neat, but showing SMART stats is one of the biggest selling points for the feature request on my end personally.

It also comes down to the fact that the enclosure view for iX hardware would still have features at the enterprise level, maintaining a level of differentiation between the free and enterprise versions. Having it automatically populate would be a benefit of purchasing official hardware, but the free version would still have some level of MVP available, even if it isn’t 100% convenient and automatic like in the enterprise version.

Protopia · September 17, 2024, 12:54pm

Enclosure View is not an enterprise feature - it is a feature of all TrueNAS SCALE installations that is only visible when you are running it on a supported iXsystems hardware system.

That said, I do appreciate that for iX hardware, it is a nice feature, but for anyone running TrueNAS SCALE on e.g. a general purpose rack server with JBOD disk shelves, it won’t appear.

Netbox is primarily aimed at documenting networks. To quote from the documentation introduction “NetBox is the leading solution for modelling and documenting modern networks.” As far as I can tell from the Netbox documentation it does not support documentation of disk storage physical layouts.
As far as I can tell from the documentation nor does it support any form of monitoring (not even network monitoring much less SMART monitoring of disks on a box that happens to be on its network).
As far as I can tell from the Netbox documentation, it does NOT include any form of autodiscovery. To quote from the documentation “The simplest and most direct way of populating data in NetBox is to use the object creation forms in the user interface. [i.e. Manually] … NetBox supports the bulk import of new objects, and updating of existing objects using CSV-formatted data.” So apparently, not even any form of network autodiscovery. TBH, for homes / family businesses / small businesses, network tools like Fing will do a better job of autodiscovering and documenting the network topology.

Thus, to suggest that the problem with my suggested Enclosure View is that it is overly manual to configure and that Netbox is a better solution doesn’t appear to be supported by any form of evidence.

So the reality is that for non iX hardware of any size, small or large, you need to maintain some form of records of what disk is in what slot, and then hope that when something goes wrong then with this reference record you can manage to work out the right slot and pull the correct drive rather than getting it wrong and degrading your pool further or perhaps even turning your RAIDZ1 pool into toast.

IMO, a generic Enclosure Screen that you populate when you build your NAS and which then gives you a visual indicator of which drive to pull has got to be a better solution than a spreadsheet.

NickF1227 · September 17, 2024, 4:01pm

So are all of the enterprise features. They merged TrueNAS and FreeNAS code bases many years ago. All of the Enterprise features are in every TrueNAS SCALE installation but only work on Enterprise hardware. I’m not sure why we’re splitting these hairs?

It’s primarily aimed at documenting infrastructure, including networks, servers, racks, PDUs, everything, including visualizations of the entire rack.

Because it’s a tool to document your infrastructure, not a monitoring tool.

This is entirely my point. If you are proposing a feature which requires manual documentation anyway, I feel that it would be more appropriate to document the physical inventory of your servers inside of a software designed for physical, manual inventory.

We’re allowed to disagree here, this is just my opinion. No hard feelings my dude. IPAM/DCIM tools have existed separately alongside SIEM in the Enterprise for decades. Suggesting Netbox isn’t a wacky, obscure reference.

DjP-iX · September 17, 2024, 7:27pm

As long as we’re pedantically splitting hairs, there is a minor difference in that most Enterprise features are enabled/disabled based on the presence of an active Enterprise license, while the Enclosure screen is accessible from a non-licensed system as long as it is iX hardware. For example a Mini without an Enterprise license still has the Enclosure screen.

(I don’t think this distinction really affects either of your positions fwiw)

Protopia · September 17, 2024, 8:38pm

And my point is that an Enclosure View can provide monitoring status whilst Netbox cannot. That is the entire point in a nutshell as to why you SHOULDN’T use Netbox instead of Enclosure View.

@NickF1227 I am not sure why YOU are splitting hairs. I am not splitting hairs because Enterprise licenses are chargeable, the Enterprise features are only available if you pay for an Enterprise License and Enterprise Support, whilst the Enclosure View is available for free WITHOUT an Enterprise License on any iXsystems hardware box. And that is a non-trivial distinction.

However, it is becoming clear to the rest of us that you have your faith-based view that Netbox is the solution to everything, and it is not for me to try to shake your faith by using … um … facts.

NickF1227 · September 17, 2024, 8:50pm

I think we can just agree to disagree here. This is really not a hill I’m willing to die on. Just sharing my opinions based on my understanding of the situation and my professional experiences in the field. I’ll leave this thread for discussion from other folks.