Restarts after TrueNAS reaches login prompt

I’ve got an issue with a known-good TrueNAS system that I’ve recently upgraded, and it is doing my head in. The system gets to the login prompt and then the whole machine reboots, but only when the SATA SSDs are plugged in. If I boot without the SSDs, everything is fine. If I boot without the SSDs and plug them in afterwards, everything is fine. If I pull all the drives and put them into another machine, everything is fine. I have no idea what could cause this kind of problem, but it’s probably hardware related (possibly the HBA). I’ve tried almost every combination at this point, and the only thing that consistently reproduces the issue is having the SSDs plugged in at boot.

Hardware list:

AMD Ryzen 5800G
Asrock B550 Pro4
128GB of DDR4 RAM at 3600MHz overclock (known good ram, known stable configuration and tested to death)
1000W Superflower PSU

There is a lot plugged into the PCIe slots, which is one thing I am suspicious of (one of the PSU rails might be going over capacity). PCIe config:

1st slot (16x):
8x8x bifurcation with Lenovo 430-16i flashed with Lenovo’s 24.00.07.00 firmware and an Intel X710 dual 10GbE SFP+ NIC

3rd slot (4x):
NVME adaptor

5th slot (1x):
The venerable 1080Ti

Storage:
4x Seagate Exos 2x14 Mach.2 (ST14000NM0001)
4x Dell 400GB SSDs (LB406M; these don’t cause any issues and do not affect boot)
1x Patriot M.2 P300 128GB NVMe (the OS drive)
1x KINGSTON OM8PGP4512Q-A0 NVMe
1x CT500P2SSD8 NVMe
And the rest (this is the pile that causes issues):
3x Silicon Power Ace A55 1TB SATA SSD
1x CT1000MX500SSD1
1x Ediloca ES106 1TB
1x Samsung SSD 860 EVO 1TB

There are two pools: one with the HDDs in two Z1 vdevs plus the two 512GB NVMe drives, and a second made up of mirrored pairs of cheap SSDs of the same size.

Any ideas why it would fail at this point, why there is nothing in the logs, and why it succeeds if I just wait a minute and plug in the drives afterwards?

I walked through it and systematically tested every SSD in the array. One of them (an ultra-budget Silicon Power A55) was faulty. Instead of the pool just degrading, it caused the entire system to lock up with an error. Replacing that drive has fixed the issue, but it’s still a less than ideal failure mode.
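
For anyone repeating the one-drive-at-a-time test, here is a rough sketch of how each attached SSD could be checked from the shell. It is not what was actually run here; it assumes smartmontools 7.0 or later (for `smartctl -j` JSON output) and that the device paths are passed on the command line. The function names are made up for illustration. Note that a drive can pass its SMART self-assessment and still misbehave, so this only supplements physically swapping drives in and out.

```python
import json
import subprocess
import sys


def health_from_smartctl_json(text):
    """Return (model, passed) from the JSON printed by `smartctl -j -H -i`.

    `passed` is True/False, or None if the drive reported no SMART status.
    """
    data = json.loads(text)
    model = data.get("model_name", "unknown")
    passed = data.get("smart_status", {}).get("passed")
    return model, passed


def check_drive(device):
    """Run smartctl against one device path (hypothetical helper)."""
    # smartctl uses non-zero exit codes as status flags, so the return
    # code is deliberately not treated as a failure here.
    result = subprocess.run(
        ["smartctl", "-j", "-H", "-i", device],
        capture_output=True,
        text=True,
    )
    return health_from_smartctl_json(result.stdout)


if __name__ == "__main__":
    # Device paths are assumptions; pass whatever the drive shows up as.
    for device in sys.argv[1:]:
        try:
            model, passed = check_drive(device)
        except (OSError, json.JSONDecodeError) as exc:
            print(f"{device}: could not query ({exc})")
            continue
        status = {True: "PASSED", False: "FAILED", None: "no SMART status"}[passed]
        print(f"{device}: {model}: {status}")
```

Saved under a name of your choosing and run as root with the device paths as arguments, it prints one line per drive. Hot-plugging a single SSD after boot and checking it before adding the next keeps a bad drive from taking the whole system down mid-test.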
