TrueNAS Scale is unhelpful when replacing failed disks

I’d like to preface this by saying it’s 4 AM where I’m at. I’ve mentioned this a couple of times over the years, and I keep being shut down by TrueNAS veterans, and I never understood why.

I have failed drives in my TrueNAS Scale box. I think there are 9 physical disks in there. Some have recently failed.

  1. There is currently no view where I’d see both the devices connected to the system and the devices assigned to a pool, so I could clearly tell which devices were disconnected from the pool (for whatever reason) and replaced by a hot-spare unit. All I can see is a table that gives me a list of some of the disks connected to the system, with serial number and name. So I can see that the drive with serial 1EGYE2LN is sdj and part of the tank pool. No indication of what is broken, missing, replaced, spare, nothing. Just a list of disks. And the list is incomplete: there are more devices connected to the system than are visible in this table.

  2. I have received an email saying that a device has failed and that the failure might be related to the device known as sdi. Nothing in the GUI navigates me to this. The device table has no sdi device on any line, presumably because the disk is dead.

  3. If I want to act on this, I have to go into the shell (wtf) and type zpool status, which shows me an incomprehensible list of GUIDs that I can do nothing with. It does tell me there is a disk with status UNAVAIL, but I have no way to tell what serial number that disk has, though.
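To spell out the join I’m apparently expected to do by hand, here’s a rough sketch with made-up UUIDs and serials. On a real SCALE box the two lists would come from `zpool status` (member IDs) and from `lsblk -o NAME,PARTUUID,SERIAL` (where SERIAL sits on the parent-disk row and PARTUUID on the partition rows, so they’d need pairing first); here, sample data stands in for both:

```shell
# Made-up sample data: pool members (as zpool names them) and a
# partuuid -> serial map (as lsblk/smartctl could provide it).
printf '%s\n' '3a1f02aa-0001 1EGYE2LN' '3a1f02aa-0002 1EH24DNN' > /tmp/uuid2serial.txt
printf '%s\n' '3a1f02aa-0001' '3a1f02aa-0002' '3a1f02aa-0003'   > /tmp/members.txt
# Join them: any member with no live device is the dead-disk candidate.
joined=$(awk 'NR==FNR { serial[$1] = $2; next }     # first file: the map
              { print $1, (($1 in serial) ? serial[$1] : "NO LIVE DEVICE") }' \
              /tmp/uuid2serial.txt /tmp/members.txt)
printf '%s\n' "$joined"
rm -f /tmp/uuid2serial.txt /tmp/members.txt
```

Which is exactly the kind of bookkeeping the GUI could do for me.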

I have this box backed up, so I’m giving up and going to bed. What I am asking is: how is it possible that a NAS software, after decades of development, does not have a completely idiot-proof GUI to guide me through the one thing that I am expected to do over the life span of the NAS box? Why is this so hard to accomplish? Why does it not just tell me that the unit with serial number 1EH24DNN has failed, and to click a fking button when I’m done replacing it?

And the answers being given here are totally unusable as well. What use is it to me that I have a physical note of which serial is where in the box? Why do veterans keep poking people with this remark? Yes, I do have a damn sticky note that tells me where everything is, but what use is it when TrueNAS can’t tell me which serial number has gone bad? Why am I presented with a completely impotent UI that keeps telling me weird, useless stuff that I either already know or have no way to act upon? What is the philosophy behind this approach?

I can see with my own eyes that there are 9 disk units up and connected in this box. The Disks table shows 7. zpool status shows 7, with one unavailable. Windows Storage Spaces is just about the worst tool I’ve ever come across for managing data, but when a disk fails, the process to replace it is incredibly convenient and transparent; they nailed this on the first attempt.

My system has a disk info report if I look at the devices.

Is the issue that the disk is dead and no longer reports its serial number?

Did you go to page 2?

Storage and then on the pool there’s a button “manage devices”

There you can see your configuration and should also see the serial number in the box on the right.

After all, it’s an enterprise solution; for some aspects of admin work, it can be reasonably expected that the person hired to work with TrueNAS can use the shell.


Yes, but if the device is dead, how is the system to know that it’s connected? There’s no way for Linux to know it has a potato plugged into a SATA port, and certainly not for it to determine the potato’s serial number. The view you mention does show you every live (even if failing) device connected to the system, but not those that have rung down the curtain and joined the choir invisible. Whether that’s “incomplete” is, I think, a matter of opinion.

But what it sounds like you’re asking is that the pool status page show the serial numbers, not only for drives that are currently pool members, but also for those that were pool members and have been taken offline (either by direct admin action or by failure). Interesting, and not a suggestion I recall hearing before, but I can see how it’d be beneficial. How feasible it would be is something I’d leave to the Captain and his crew.

But, of course, there’s always good old-fashioned process of elimination. You know the serials for the not-dead disks that are part of your pool, so whichever disk’s serial isn’t on that list is going to be the dead one. Tedious, but simple enough–and with only eight disks in the pool, not even that tedious. You could probably sort it out by some grepping in the kernel boot logs as well, but that’s definitely getting more fiddly.
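That elimination can even be scripted. A rough sketch in POSIX shell, with invented serials; on a real box, `known` would come from the sticky note and `live` from the Disks page (or the serials the OS still reports):

```shell
# known: every serial on the sticky note; live: serials still visible to the OS.
# All values are made-up placeholders.
known='1EGYE2LN 1EH24DNN 1EH3XKP0'
live='1EGYE2LN 1EH3XKP0'
missing=''
for s in $known; do
  case " $live " in
    *" $s "*) ;;                          # still visible: not our disk
    *) missing="$missing $s" ;;           # unreported: candidate to pull
  esac
done
echo "missing/dead:$missing"              # prints: missing/dead: 1EH24DNN
```

Tedious by hand, trivial for a script.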

Because such a thing isn’t possible–the universe keeps making better idiots.

That you don’t like them doesn’t make them unusable. Leaving aside the “I shouldn’t have to do this” factor[1], a separate list mapping serial-to-location, combined with what the GUI tells you about the live disks still in the pool, gives you all the information you need to pull the right failed disk (and only the right one) and replace it.

Because we’re about working with the product as it is, not about working with the product as we wish it would be. We aren’t devs. We can’t change the product. We can suggest changes, and have iX sit on those suggestions for eight years without action (in some cases, e.g., web-based file browser). Or we can tell you how to work with the product as it is to accomplish what you want to accomplish.

I don’t think this is reasonable here–TrueNAS is designed to be managed through the GUI. If the GUI is inadequate for basic management functions (and identifying and replacing a failed disk is a very basic management function for a NAS), that’s a failure on the part of the product. Whether there’s such a failure here is a separate question.

  1. which may even be valid ↩︎

Hence I added the link to the documentation; as far as I’m concerned, it’s manageable via the GUI in that respect. I just wanted to state my personal opinion that I don’t find it unreasonable to have to fall back to the CLI for some aspects of TrueNAS management.

I stand to be corrected, but I think getting the extended output of the SMART results, for example, is a feature that seems to be CLI-only. At least I didn’t find it in the GUI. One could argue, though, that even that should be pullable from the GUI.

It is, although at least some SMART data is now available through the GUI. I think the GUI gives you enough that you don’t absolutely need the CLI (it’ll warn you about temps, bad sectors, and failed self-tests, which is probably all you really need), though most of us are still going to ask for it.

But I think I’d also take issue with the idea that, since TrueNAS is an enterprise product, the CLI should be expected, and that’s because TrueNAS isn’t just an enterprise product. Wherever the paying customers might be (and recognizing that we’re not them), they’re clearly targeting a broader market than the enterprise. One of the clearest indicators of that, I think, is the plugins/apps ecosystems–those surely wouldn’t be used in the large enterprise sector; they’re aimed at the home and perhaps SMB markets.

You’re right that there are points of managing a TrueNAS system that pretty much require the CLI–but it’s IMO a legitimate objection that it shouldn’t be that way, at least for routine management.


Thanks for pointing that out / giving me another point of view!

If you’ve bought a hardware system from iX the GUI actually shows the bays and what’s in there.
If you’re using the free TrueNAS with a home-built NAS, TrueNAS has no way to know what your chassis is or how the drives might be wired, so you’re left managing this all by yourself.


I had to double-check I’m not typing in Chinese or something. I created this entire setup from scratch; the system knew there were 9 units present, and I dragged them into the pool. 8 were in a raidz2 vdev, 1 was a hot spare. This configuration was confirmed, formatted, resilvered, the whole thing. Now I walk up to the thing and TrueNAS tells me there are 7 units connected, of which 1 is unavailable. I complained about 2 things:

  1. The system behaves like the other two disks never even existed. They simply disappeared.
  2. It doesn’t communicate serial numbers consistently. It gives me serial numbers on screens that don’t say whether anything is wrong, and then it gives me incomprehensible GUIDs on screens that do indicate which unit is problematic, without telling me how to join the GUID to the serial number.

Like, what is this? How are you not getting it? Am I not explaining myself clearly? The system knew full well what drives were in it when it booted up; now it’s missing 2, knows 1 other is broken, and your response is “well, they are not connected, so how would the system know they are missing?” THE SYSTEM OPERATES THE FKING POOL. How does it not know that 3 entire units of this pool are gone, and why doesn’t it tell me in plain fking English which ones they are? Why is it such unobtainable magic to keep track of the devices involved in the pool?

Cut the attitude. Your post was reasonably clear, and my post reasonably responded to it, as I further do below. Now you’re flying off the handle.

The system doesn’t show what used to be connected to it, it shows what’s connected right now. If two disks have completely died, the system has no way of communicating with them, and therefore, as far as it can tell, those disks aren’t connected any more. You thus aren’t going to see them in the list of disks.

Now you have an important question to consider: do you want help resolving this issue, or do you just want to rant? If the latter, well, knock yourself out, I guess. If the former, some actual information would be helpful. Screen shots of the disks and pool status pages would be a good place to start.

Not to speak for @emsicz but I believe that’s the issue. When a drive is failed or removed from a pool, it doesn’t show the “previous serial number” in any topology view.

In the example below, I actually pulled a disk (yes, from a RAIDZ1 - don’t worry, there’s nothing irreplaceable on it) and it shows as absent/unassigned.

I think there are several things to be considered and understood:

  • The GUI for enterprise TrueNAS has visible indicators that relate a disk to its bay. This allows you to easily pinpoint exactly which drive has failed.
  • The GUI for non-enterprise has no such thing. It is up to you to match a vdev member (kernel ID, GPTID, PARTUUID) to the corresponding serial, which will then allow you to visibly locate (inside the chassis / case) the drive in question, by reading the serial numbers on the stickers.
  • Different systems build / import pools with different naming conventions. TrueNAS Core uses GPTID. TrueNAS SCALE uses PARTUUID. Some systems use MODEL-SERIAL. Some systems use the kernel IDs.
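As a concrete illustration of that matching: Linux (SCALE included) maintains udev symlink directories that translate between these namespaces, e.g. `/dev/disk/by-partuuid/` maps PARTUUIDs to kernel names and `/dev/disk/by-id/` carries model+serial names. A tiny simulation of the lookup, done in a temp directory so the sketch runs anywhere (the UUID is made up):

```shell
# Simulate a udev entry like /dev/disk/by-partuuid/<uuid> -> ../../sdj
tmp=$(mktemp -d)
ln -s ../../sdj "$tmp/3a1f02aa-0001"
target=$(readlink "$tmp/3a1f02aa-0001")   # yields ../../sdj
kernel_name=${target##*/}                 # strip the path: sdj
echo "PARTUUID 3a1f02aa-0001 is currently $kernel_name"
rm -rf "$tmp"
```

On a real system, `ls -l /dev/disk/by-partuuid/` and `ls -l /dev/disk/by-id/` give you the whole table at once.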

Maybe it would benefit some users for TrueNAS to build / import the member devices in a pool’s vdev based on MODEL-SERIAL? (Because the pool’s status would then show “last seen as” for the failed / missing disk.) However, this might introduce unforeseen problems, such as collisions in serial numbers (e.g., USB caddies) or devices that don’t report a serial number.

Using MODEL-SERIAL could preemptively foot-shoot things like partition-based pools, NVMe namespaces, multi-actuator support … none of that is officially supported now, but it would introduce problems that don’t exist with GPTID/PARTUUID, and I personally know people who are doing these things today.

@emsicz Genuinely do you have a suggestion/mockup of what you’d like the UI to look like to indicate this failure? We can’t show physical drive location on non-iX hardware because we don’t know what’s physically mapped where, but would it be as straightforward as “show the ‘last known S/N’ of a missing/failed device in the UI”?


I think this is feasible, since TrueNAS already saves the serial numbers of attached disks in its database. (Currently and previously attached.)

In my .db file, there are serial numbers for previously connected/disconnected USB drives.

I’m not sure of the logic behind how TrueNAS decides to save old serial numbers. (Maybe if you don’t explicitly “remove all associated shares” when exporting a pool, it will retain the serial numbers for all relevant disks, even after exporting the pool?)

In essence, because iXsystems are not idiots. As you have pointed out, they have decades of experience with ZFS, and indeed have made major contributions to the OpenZFS code over those decades, and because of this experience they know that:

  1. It is impossible to predict all the thousands of different issues that could crop up with a file system - or indeed millions of combinations of issues.

  2. Creating a UI that can handle the workflow of each of these millions of different issues is going to be at least thousands of times as much effort as coding for the normal workflows.

  3. It is therefore impossible to diagnose with certainty what the issue(s) are and what you should do about them.

  4. The last thing you want the UI to do is to make the wrong recommendation and make things worse and turn a recoverable situation into an unrecoverable one.

Finally, we are glad that iXsystems has been kind enough to provide you with this forum to rant and let off steam (rather than, e.g., taking a hammer to your NAS in frustration, lol), but you might find you get more help with your issue if you post screenshots and command-line output, generally say what you have already done, and ask what you need to do next.



Regarding a UI, how about somewhere you can say what shape array of drive bays you have (and whether they are horizontal or vertical), generating a UI that shows those drive bays and numbers / names them (you might want some options about how they are named), plus an ability to then map a serial number to a bay.

(Most NAS appliances / servers have a rectangular array of slots. Occasionally you might have more than one rectangular array.)

(E.g. I have a 5-slot TerraMaster appliance, so I would configure a 1x5 array of vertical slots and want my slots numbered 1-5. A Dell rack server might have an array of horizontal slots 2 high and 6 wide, etc. A tower PC might have a stack of 5 horizontal slots behind the fan and another 4 vertical slots at the front. iXsystems appliances could switch to predefined arrays and, since you know how they map to devices, pre-populated serial numbers. You could even give the community the ability to add predefined slot counts and sdx-to-slot mappings for known non-iXsystems hardware.)

Once you have this done then - assuming that the error is clear-cut - you can automate telling a user to e.g.:

  1. Power down the NAS.
  2. Remove disk in e.g. slot 5 and replace it with a new one.
  3. Power up the NAS box.
  4. Authorise TrueNAS to resilver using the new disk.

Of course, many or most errors are not clear-cut.

On reflection there might be a problem with storing the mappings between serial numbers and slots.

I suspect that where the SATA connections are fixed (like in my laptop and on my NAS unit), you want to map sdx to a slot; i.e., you might ask the user what serial is in each slot, but use this to determine the sdx-to-slot mappings.

Where SATA connections are cables, that mapping can easily be changed by swapping cables around, so a physical mapping may NOT be appropriate, and it would be better to retain the mapping of serial number to slot.
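For the fixed-wiring case, there is already a stable handle in Linux: udev’s `/dev/disk/by-path/` names encode the controller and port, so the name follows the slot rather than the disk. A parsing sketch over one sample entry (the path and device name below are illustrative, not from a real box):

```shell
# A sample /dev/disk/by-path entry, trimmed to the interesting part:
entry='pci-0000:00:17.0-ata-3 -> ../../sdj'
port=${entry%% -> *}      # the physical port: survives disk swaps
dev=${entry##*/}          # the kernel name currently plugged into it
echo "port $port holds $dev"
```

A slot-mapping UI could anchor on those by-path names instead of sdx, and the mapping would survive disk replacements (though not recabling).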

I also think the alerts could be more descriptive or even maybe include a link to give you more information.

Since I currently have no failing disks to demonstrate this with (knocks on wood), I used a Google image search to grab some alerts from unknown TrueNAS installs; they might not reflect the current wording:

Since device names are volatile, why use them in alerts? They may not represent the current situation if the system has rebooted since the alert was generated, and acting on a stale alert might cause a user to pull the wrong drive. Note especially the last image, which shows an alert that uses a more descriptive way to identify a failing device, while the next message goes with the device name; I much prefer the first, wordier variant. It might also be useful to provide a link directly in the alert that takes the user to a more detailed view of the failing device.

Another thing I’ve noticed on my Cobia 23.10.2 system is that the “S.M.A.R.T. Test Results” page, the one that lists previous tests, is of marginal use.

  1. First of all, why no dates? If something caused the tests to stop being run 3 months ago, how would I see that by looking at this page, unless I happened to know the current Lifetime value by heart?
  2. The IDs are inverted: the newest test is at the top yet has the lowest ID. Am I the only one who thinks that’s unintuitive?
  3. The Lifetime column, which for now is the only way to figure out whether you’re looking at new or outdated tests, isn’t shown by default.
  4. Very little actual info, just SUCCESS or (presumably) FAIL. Perhaps add a button to show the current detailed (-a or -x) SMART output if the user wants to examine a drive more thoroughly without having to dive into a shell?
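For point 4, the data such a button would need is already what `smartctl -x` (from smartmontools, which SCALE ships) prints; the GUI would effectively just filter it. A sketch of extracting the self-test history, with captured sample text standing in for a live drive (the hours and entries are invented):

```shell
# Sample fragment of `smartctl -x /dev/sdX` output (values invented):
report='SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)
# 1  Short offline       Completed without error       00%     41022
# 2  Extended offline    Completed without error       00%     40990'
# Self-test log entries all start with "# ", so keep only those lines:
tests=$(printf '%s\n' "$report" | awk '/^# /')
printf '%s\n' "$tests"
```

Wiring that (or the full -x dump) behind a per-disk “details” button seems like a small lift.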

It’s certainly an interesting topic: how best to identify and replace failed drives.

If someone wanted to create a best practice resource, that would be very useful. From there, there might be an obvious improvement that could be made to TrueNAS.

As was indicated for the Enterprise solutions, we know the exact hardware and can manage the enclosures, so we can identify slots well. There is no need for a user to track serial numbers.

For starters, a nice legend of what has happened with my pool would be helpful. I don’t care about what is currently connected. I don’t care about GUIDs. I don’t care that if name/serial were used, it could present issues in some obscure configuration that isn’t even implemented today. I knew what the array looked like when I set it up 3 years ago: it had 8 units in 1 raidz2 vdev and 1 hot-spare unit. I want to see THIS exactly: 9 units, each sitting in its own little container, either being used as data or as hot spare. Where are they? Are they connected? Are they used? Are they failed? Failed how? SMART-failed? Or completely dead? Have they been decommissioned because they were deemed dead, and are now connected again but the pool isn’t using them?

In Storage Spaces, if anything happens to a unit, you’ll see it in the overview. Either it’s completely gone, or it’s in “Failing” status, meaning it works but could be reporting SMART issues, or Windows has detected issues trying to read/write. As I have mentioned before, I have 3 disks dead out of 9. TrueNAS hides the first two as if they never existed, but they are connected and running. The third just says “unavailable.” When I pulled them out, I found there is nothing wrong with them. SMART is clean, they work, benchmarks passed. I de-dusted the system and preemptively replaced the miniSAS-to-SATA breakout cables. Now when I boot the system up, TrueNAS still doesn’t show them, because they were retired and are now in some sort of void where TrueNAS pretends they don’t exist. Like, WTF is that? How do I reintroduce these perfectly good drives back?

These lapses in implementation do not require some visual tool to stack drives or anything complicated like that. You’re basically talking about

  1. IDs (and I mean ID by serial number; it’s the easiest, most straightforward identifier that every HDD has). We currently do not have IDs.
  2. Current and historical state relative to when the pool was created. We currently don’t have that either.
  3. Options to manipulate disks as they become broken. This is also not currently implemented.

Second, TrueNAS’ behavior is totally unpredictable. Sometimes it just sends out a notification; other times it automatically starts a rebuild/resilver. All I get is a bunch of emails, none of which clearly say what triggered them and what is being done about it. Those email notifications are ready for a complete overhaul.

And third, retiring a broken disk must be dead-stupid simple. I am baffled by this. TrueNAS is a product released to the public, and disks dying is 1) a completely expected event in a NAS and 2) a terrifying thing to happen to anyone who doesn’t work with NAS ops 24/7. Instead, I am given cryptic emails, no ability to identify exactly which unit has died, and even if I do figure it out somehow, I can’t retire the disk. There is no button I can press to say “I have pulled this disk out, and I don’t have any spare, so shuffle everything around to restore resiliency” or “I have physically replaced the disk with another (let me pick) and now you can start resilvering”.

It’s one thing that iX isn’t stupid, but it’s another that if they had spent the time and effort implementing this, instead of repeatedly saying “well, this is how we use it and if you don’t like it, don’t use it”, it would have been implemented already. It’s not like I’m asking for new features; I looked into their open-source code. Someone familiar with the code base could probably whip up what I described in an afternoon. Hell, I would probably directly fund this effort if it means I don’t have to scratch my head at 4 AM after spending 2 hours trying to identify which fking disk is the problem.