Request: Better ZFS Pool Import Troubleshooting Tools - in CLI and/or GUI

This is a few years old, and like many improvement opportunities, this one was NOT ACCEPTED by the team. I still consider it a valid improvement opportunity, hence I am bringing it up again.

Problem/Justification
The ZFS Graphical User Interface (GUI) and Command Line Interface (CLI) could be far more helpful when it comes to a failed pool import. Specifically, my ask is for a better indication of which disks/HBAs TrueNAS was able to detect and which are “missing”.

Impact
This information would help troubleshoot everything from hardware failures to electrical issues that prevent a pool import. Whenever a pool import fails, a “verbose” setting in the CLI or a simple listing in the GUI should show which drives (if any) are causing the failure.

This information would be extremely helpful for isolating a bad HBA, an electrical issue, or a drive problem, for example. Some of it may already be accessible via the CLI, but even there a simple listing of what caused the import failure would be better than what I got, which was (paraphrasing) “the pool is dead, rebuild and restore from backups” when all I actually had was a loose electrical connector.

There is ZERO downside to giving administrators more information that they can use as part of the troubleshooting process. On a pool import failure, the CLI and GUI should spit out the following info:

Hey, your [name of pool] pool import failed.

  • The following drives are present:
    • Drive Type / Role | VDEV | SATA Port | Capacity | Serial Number xxxxx
  • The following drives are missing:
    • Drive Type / Role | VDEV | SATA Port | Capacity | Serial Number xxxxx
  • The following drives report that they have failed (using previously-collected SMART data):
    • Drive Type / Role | VDEV | SATA Port | Capacity | Serial Number xxxxx
  • The following HBAs that the system used last time are missing or not working:
    • HBA Model | S/N | PCIe Bus Location

Etc.

Ideally, add some diagnostic info, such as the minimum set of drives needed to get the pool back up. How much easier would troubleshooting be if the user knew which drives were causing a pool import to fail? A rough sketch of how such a report could be assembled from tools that already exist follows below.
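For what it's worth, most of the raw ingredients for such a report already exist in `zpool import`, `lsblk`, and `smartctl`. Here is a rough sketch (mine, not anything TrueNAS ships today) of how the present/missing half of that table could be assembled on a Linux/SCALE-style host; the parsing and the device-name mapping are best-effort assumptions, not a proposal for a final format:

```python
#!/usr/bin/env python3
"""Sketch only: build a present/missing table for a pool that will not import.

Assumptions (mine, not anything TrueNAS ships today): a Linux/SCALE-style host,
root privileges, and `zpool` plus `lsblk` on the PATH. The text parsing of
`zpool import` output is best-effort and may need adjusting per OpenZFS release.
"""
import subprocess

# vdev/device states that can appear in the `zpool import` config tree
STATES = {"ONLINE", "DEGRADED", "FAULTED", "UNAVAIL", "OFFLINE", "REMOVED"}


def disk_inventory():
    """Map whole-disk kernel names (sda, nvme0n1, ...) to model/serial/size via lsblk."""
    out = subprocess.run(["lsblk", "-dn", "-o", "NAME,MODEL,SERIAL,SIZE"],
                         capture_output=True, text=True, check=True).stdout
    inv = {}
    for line in out.splitlines():
        parts = line.split()
        if parts:
            inv[parts[0]] = " ".join(parts[1:]) or "no model/serial reported"
    return inv


def importable_pool_members():
    """Yield (device, state) pairs from the config tree that `zpool import` prints.

    The pool-level rollup line is included too, which is arguably useful; vdev
    grouping lines (mirror-0, raidz2-1, ...) are skipped.
    """
    out = subprocess.run(["zpool", "import"], capture_output=True, text=True).stdout
    for line in out.splitlines():
        parts = line.split()
        if (len(parts) >= 2 and parts[1] in STATES and ":" not in parts[0]
                and not parts[0].startswith(("mirror", "raidz", "draid",
                                             "special", "log", "cache", "spare"))):
            yield parts[0], parts[1]


if __name__ == "__main__":
    inv = disk_inventory()
    for dev, state in importable_pool_members():
        # Mapping the name zpool prints back to an lsblk entry is simplified here;
        # partitions and by-id paths would need real handling in anything shipped.
        details = inv.get(dev.split("/")[-1], "not visible to the OS right now")
        marker = "PRESENT" if state == "ONLINE" else "MISSING / PROBLEM"
        print(f"{marker:18} {dev:28} state={state:9} {details}")
```

The point is not this particular script, it is that the information is already lying around and just needs to be surfaced in the failure message.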

Similarly, if a single drive fails to mount as part of a pool during boot, how much more helpful would it be to report which drive is causing the pool to mount in a degraded fashion? Have that come up as part of the DEGRADED message - i.e. list the problem drive(s) without making the admin hunt for commands or GUI features - for example, include a link to the disks list in the GUI. That list should use color to show which drives are GOOD, which are BAD, and which are FAILING. (A small sketch for this degraded-at-boot case follows below.)
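For the degraded-at-boot case, even something as small as the following sketch (again mine, assuming a Linux host with `zpool` on the PATH) would be enough to name the offending member in the DEGRADED alert:

```python
#!/usr/bin/env python3
"""Sketch only: after boot, name any pool member that is not ONLINE.

Assumes a Linux host with `zpool` on the PATH. `zpool status -x` limits output
to pools with problems and `-P` prints full device paths; the line parsing is
best-effort and mine, not existing TrueNAS behaviour.
"""
import subprocess

BAD = {"DEGRADED", "FAULTED", "UNAVAIL", "OFFLINE", "REMOVED"}

out = subprocess.run(["zpool", "status", "-xP"], capture_output=True, text=True).stdout
for line in out.splitlines():
    parts = line.split()
    # device rows look like: "/dev/disk/by-id/...  UNAVAIL  0 0 0  cannot open"
    if len(parts) >= 2 and parts[1] in BAD and parts[0].startswith("/"):
        print(f"Pool member needs attention: {parts[0]} is {parts[1]}")
```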

User Story
When I upgraded from TrueNAS CORE 12u6.1 to 12u7, the system had to restart, and when it came back up, the pool was disconnected. Attempts to reimport the pool failed, and a review of the Storage/Disks table in the WebUI showed that all three mirrored sVDEV drives were no longer connected to the system.

Dropping into the CLI, I attempted to import the pool; that failed too, with the CLI suggesting I destroy the pool and rebuild from backups. Attempts at forced imports, etc. failed with ZERO feedback (just an empty line).

Once I manually power-cycled the three sVDEV drives, they registered with the system again and the pool imported flawlessly. They needed to be power-cycled to restore their interfaces; a restart of the system was not enough. Yes, this was an edge case, but there is no good reason for the TrueNAS GUI to give bad advice rather than make a number of helpful suggestions that fall short of nuking the pool and starting over.

On another occasion, I had a single-point electrical failure cause a similar nuke-the-pool message. Once power was restored to the drives, TrueNAS imported the pool without a single error.

2 Likes

As you increase the set of anticipated failures and the automation around them, the system's response to those anticipated failures gets better. However, it comes at a price: the handling of unanticipated failures gets worse.

One simple example is SMART. In almost all cases it is quite easy to pinpoint a faulty drive by looking at the SMART output, so identifying the failed drive can be reasonably scripted for those cases. However, the unlucky person with an unusual failure will get a false indication of “all normal”, which is arguably worse than no indication at all.
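To make that concrete, here is a deliberately naive sketch (my illustration, not how Multi-Report or TrueNAS actually checks drives) that trusts only the overall SMART self-assessment; it will happily print “all normal” for a drive whose reallocated/pending sector counts are climbing, which is exactly the false reassurance described above:

```python
#!/usr/bin/env python3
"""Illustration of the "false all-normal" risk (my sketch, not Multi-Report's
or TrueNAS's actual logic).

A naive check that trusts only the overall SMART self-assessment can report a
drive as healthy while reallocated/pending sector counts are climbing, so any
scripted "which drive failed?" logic has to look deeper than this.
"""
import subprocess


def naive_is_healthy(dev: str) -> bool:
    # `smartctl -H` prints a line like
    # "SMART overall-health self-assessment test result: PASSED"
    # and many dying drives still report PASSED here.
    out = subprocess.run(["smartctl", "-H", dev], capture_output=True, text=True).stdout
    return "PASSED" in out


if __name__ == "__main__":
    dev = "/dev/sda"  # placeholder device for illustration
    print(f"{dev}: {'all normal' if naive_is_healthy(dev) else 'FAILING'}")
```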

This only gets more fun as fault trees grow more complicated: you can add a fallback for some rare case, and that rare case improves, but some even rarer case gets worse.

Then you might arguably think, “okay, to hell with the very unlucky guy who hit the very rare case, let him crash”, and it turns into a political/responsibility issue, not a technical one.

1 Like

I will agree with you but only to a point.

The reason is that the import process already has a ledger of the drives it expects to use (by UUID) and what their specific roles are. The UUID in turn allows a drive to be pinpointed by model # and S/N.
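For anyone curious what that ledger looks like, `zdb -l` will dump the on-disk vdev label of any pool member; the sketch below (mine, and deliberately crude about the parsing) pulls out the pool name, pool GUID, per-vdev GUIDs and recorded device paths:

```python
#!/usr/bin/env python3
"""Sketch only: peek at the ledger ZFS keeps on each member disk.

`zdb -l <device>` dumps the on-disk vdev label, which records the pool name,
pool GUID, and per-vdev GUIDs/paths. The grep-style filtering below is
deliberately crude, and label field names can vary between OpenZFS releases.
"""
import subprocess
import sys


def label_summary(device: str) -> None:
    out = subprocess.run(["zdb", "-l", device], capture_output=True, text=True).stdout
    wanted = ("name:", "pool_guid:", "guid:", "path:", "state:")
    for line in out.splitlines():
        stripped = line.strip()
        if stripped.startswith(wanted):
            print(stripped)


if __name__ == "__main__":
    # e.g. run against a pool member partition: python3 label_summary.py /dev/sda1
    label_summary(sys.argv[1] if len(sys.argv) > 1 else "/dev/sda1")
```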

If the system were importing pools as good while drives are missing or broken, that would be a level of failure I doubt happens at this point. The code is simply too sophisticated for that; the edge cases you're describing have already been uncovered.

So from my perspective the problem is a different one, i.e. how TrueNAS signals to the user why the absence or malfunction of certain drives is causing a pool import to fail.

My Oyen Digital Mobius 5 DAS lacks a sophisticated UI, but does make liberal use of a beeper and individual LEDs to let me know which drive in the RAID array has failed and needs to be replaced.

My improvement suggestion is no different, except that it would make diagnosing failed imports quicker than the current multi-step process while also accommodating the potentially huge, complex pools that TrueNAS can manage.

It is precisely this complexity that calls for a better resolution process for a failed import than the “pool dead, rebuild from scratch” advice given by TrueNAS / ZFS, which I have now experienced multiple times, for different reasons. Known-bad advice should not be given.

@Constantin - Don’t forget to Vote for your own Feature Request!

I do think something needs to be done. We (the free users) are seeing more failures to import pools, some involving less experienced users.

To be fair, some have really obvious causes: ignoring a failing disk until, well, it has really failed; using Proxmox to virtualize TrueNAS without proper passthrough of the device(s); or the failure of one disk in a striped pool.

1 Like

I like the suggestion; however, I doubt it would be implemented unless it helps out an Enterprise system.

Right now I believe the Enterprise systems already have the capability to identify the failing drive, but I could be wrong. The TrueNAS GUI has the ability to flash the drive lights from the GUI, so why would iXsystems want to include that “type” of feature in the free version?

That said, I do not believe the Enterprise systems provide helpful troubleshooting information such as suggested above, but does iXsystems want that? Probably not; it could impact maintenance agreements, which is where the big money comes from.

Multi-Report will already identify the drives at fault; however, it does not provide troubleshooting assistance. That is a much larger task than I am willing to accept.

However (just thinking out loud), I do have Troubleshooting Flowcharts for drive troubleshooting (@Alexey is possibly going to provide me some recommended updates in the near future), and those flowcharts presently help the average semi-knowledgeable person figure out which drive to replace, if any. At the beginning they cover ZFS pool issues, but only to determine whether the issue is caused by a failing drive or not. I could add a few slides on how to troubleshoot pool issues if someone provides those to me. Of course all credit will go to that person. This sounds like a good alternative for now.

Opinions?

1 Like

Thank you for the feedback. But isn’t the purpose of feature requests to gauge community interest in a feature rather than to discuss the pros and cons of the business strategy around it? If you like the feature request, why not vote for it? :heart_eyes:

I am not asking for Multi-Report to do this sort of troubleshooting. Multi-Report is an awesome tool, but its purpose is quite different from the feature request I’m making. What I am asking for requires a deep knowledge of the import process, how to parse the pool info, etc., none of which is directly related to Multi-Report.

Yes, the most common reason a pool didn’t import is that someone never set up SMART tests, or didn’t act on SMART health alerts, etc. But I doubt that’s your user base: the kinds of admins who care enough to install Multi-Report are typically the kind of folk who also act when Multi-Report says something is about to break.

I agree that a flowchart is a very good step in the right direction for resurrecting pools. At the same time, I remain an advocate for making it easier for admins to troubleshoot a pool issue by highlighting the drive(s) in question as part of the error message, rather than having the admin hunt through multiple menus for the necessary info.

Enterprise customers could go a step further by highlighting the exact drive that needs swapping, etc., but the blink-the-right-drive feature is something few consumer rigs will offer unless they have a sophisticated case / backplane.

I’d be happy to add some slides in the near future if you can point me to your presentation. I’d focus on not taking the default “nuke your pool and start over” message at face value and instead reviewing electrical and logical issues like the ones I have encountered several times with my pool.

You twisted my arm. Unfortunately there are still some feature requests with a lot of votes (48) that have not been adopted. Lots of interest, but it has gone nowhere yet.

1 Like