This request is a few years old and, like many improvement suggestions, it was NOT ACCEPTED by the team. I still consider it valid, hence I am bringing it up again.
Problem/Justification
The ZFS Graphical User Interface (GUI) and Command Line Interface (CLI) could be far more helpful when it comes to a failed pool import. Specifically, my ask is for a better indication of which disks/HBAs TrueNAS was able to detect and which are “missing”.
Impact
This information would help troubleshoot everything from hardware failures to electrical issues that prevent a pool import. Whenever a pool import fails, a “verbose” setting in the CLI or a simple listing in the GUI should show which drives (if any) are causing the failure.
This information would be super helpful to isolate a bad HBA, an electrical issue, or drive issues, for example. Some of this information may be accessible via the CLI, but even there a simple listing of what caused a pool import failure is better than what I got, which is (paraphrasing) “the pool is dead, rebuild and restore from backups” when all I had was a loose electrical connector.
There is ZERO downside to giving administrators more information that they can use as part of the troubleshooting process. On a pool import failure, the CLI and GUI should spit out the following info:
Hey, your [name of pool] pool import failed.
- The following drives are present:
  - Drive Type/Role | VDEV | SATA PORT | Capacity | Serial Number xxxxx
- The following drives are missing:
  - Drive Type/Role | VDEV | SATA PORT | Capacity | Serial Number xxxxx
- The following drives are reporting they have failed (using previously-collected SMART data):
  - Drive Type/Role | VDEV | SATA PORT | Capacity | Serial Number xxxxx
- The following HBAs the system used the last time around are missing or not working:
  - HBA Model, S/N, PCIe Bus location
Etc.
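To make the ask concrete, here is a rough Python sketch of how such a report could be assembled. It is not how TrueNAS or ZFS works today. The `Drive` record, the `build_import_report` function, and the idea that the system keeps a saved list of drives from the last good import are all my assumptions for illustration. The HBA section is left out for brevity but would follow the same pattern.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drive:
    """Hypothetical record of a pool member saved at the last good import."""
    role: str       # e.g. "data", "special", "log"
    vdev: str
    port: str
    capacity: str
    serial: str


def build_import_report(pool, expected, detected_serials, smart_failed_serials):
    """Return a plain-text report for a failed import of `pool`.

    expected: drives the pool had at its last successful import
    detected_serials: serial numbers the system can see right now
    smart_failed_serials: serials flagged by previously collected SMART data
    """
    def row(d):
        return f"  - {d.role} | {d.vdev} | {d.port} | {d.capacity} | S/N {d.serial}"

    present = [d for d in expected if d.serial in detected_serials]
    missing = [d for d in expected if d.serial not in detected_serials]
    failed = [d for d in expected if d.serial in smart_failed_serials]

    lines = [f"Import of pool '{pool}' failed."]
    sections = (
        ("The following drives are present:", present),
        ("The following drives are missing:", missing),
        ("The following drives previously reported SMART failures:", failed),
    )
    for title, group in sections:
        lines.append(f"- {title}")
        if group:
            lines.extend(row(d) for d in group)
        else:
            lines.append("  - (none)")
    return "\n".join(lines)


# Example with made-up drives: both special vdev members have dropped off.
expected = [
    Drive("data", "raidz2-0", "ata1", "10 TB", "AAA111"),
    Drive("special", "mirror-1", "ata7", "1 TB", "SSD001"),
    Drive("special", "mirror-1", "ata8", "1 TB", "SSD002"),
]
report = build_import_report("tank", expected, {"AAA111"}, set())
print(report)
```

Even a listing this simple would have pointed me straight at the special vdev drives instead of at my backups.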
Ideally, add some diagnostic info, such as which drives, at minimum, are needed to get the pool back up. How much easier would troubleshooting be if the user knew which drives were causing a pool import to fail?
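The "minimum drives needed" hint could be worked out from the pool layout alone. The sketch below is only an illustration under simple assumptions: a mirror survives as long as one member is present, a raidzN vdev tolerates N missing members, and a single-disk or striped vdev tolerates none. The function names and the layout dictionary are hypothetical, and the sketch ignores log, cache, and spare devices.

```python
def tolerance(vdev_type, width):
    """How many members a vdev can lose and still be usable."""
    if vdev_type == "mirror":
        return width - 1
    if vdev_type.startswith("raidz"):
        # "raidz" means raidz1; "raidz2" and "raidz3" carry their parity level
        return int(vdev_type[5:] or 1)
    return 0  # single disk or stripe


def drives_needed(vdevs, missing_by_vdev):
    """Return {vdev name: number of drives that must come back}.

    vdevs: {name: (type, width)}
    missing_by_vdev: {name: count of members not detected}
    An empty result means every vdev still has enough members.
    """
    needed = {}
    for name, (vdev_type, width) in vdevs.items():
        missing = missing_by_vdev.get(name, 0)
        shortfall = missing - tolerance(vdev_type, width)
        if shortfall > 0:
            needed[name] = shortfall
    return needed


# Made-up layout resembling my case: all three special vdev drives are gone.
layout = {"raidz2-0": ("raidz2", 8), "special-mirror": ("mirror", 3)}
print(drives_needed(layout, {"special-mirror": 3}))  # {'special-mirror': 1}
```

A message along the lines of "at least 1 drive of special-mirror must be reconnected" is far more actionable than advice to destroy the pool.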
Similarly, if a single drive fails to mount as part of a pool during boot, how much more helpful would it be to report which drive is causing the pool to mount in a degraded state? Have that come up as part of the DEGRADED message: list the problem drive(s) so the user does not have to hunt for commands or GUI features, for example by including a link to the disks list in the GUI. Said list should make use of color to show which drives are GOOD, which ones are BAD, and which ones are FAILING.
User Story
When I upgraded from TrueNAS CORE 12u6.1 to 12u7, the system had to restart, and when it came back up, the pool was disconnected. Attempts to reimport the pool failed, and a review of the Storage/Disks table in the WebUI showed that all three mirrored sVDEV drives were no longer connected to the system.
Dropping into the CLI, an attempt to import the pool failed, with the CLI suggesting I destroy the pool and rebuild from backups. Attempts at forced imports, etc. failed with ZERO feedback (empty line).
Once I manually power-cycled the three sVDEV drives, they successfully registered with the system and the pool could be imported flawlessly. They needed to be power-cycled to restore their interfaces; a restart of the system was not enough. Yes, this was an edge case, but there is no good reason for a TrueNAS GUI to give bad advice rather than make a number of helpful suggestions that fall short of nuking the pool and starting over.
On another occasion, I had a single-point electrical failure cause a similar nuke-the-pool message. Once power was restored to the drives, TrueNAS imported the pool without a single error.