I’m experiencing a storage meltdown on TrueNAS Scale and hoping someone can tell me how to proceed. In one pool, every drive is showing a red LED, and other pools are now reporting errors too.
TrueNAS Scale (last updated about a year ago)
Case: SuperMicro 4U CSE-846 (24-bay)
Backplane: BPN‑SAS2‑846EL1
MB: X9DRi-F
Intel Xeon E5-2620
80GB DDR3 ECC (enabled I think)
All drives are SATA, 21/24 bays occupied
Here’s a breakdown of events in order:
A drive started failing SMART tests, with the fail count increasing steadily. I attached a replacement and resilvered. I left the old drive attached but not in a pool (the dashboard kept giving me alerts about it anyway).
Over a week later I noticed a vdev had gone degraded. The dashboard said the drive had been removed by an administrator. I moved some files around to evacuate important data while the pool was degraded. Then I powered down and reseated the drive. At the same time, I checked connections and removed the aforementioned SMART-failing drive. When I powered back on, expecting to resilver again, the “missing” drive was present, but drives all over the pool were labeled with errors. One more restart later, some drives were even labeled as Faulty. Soon errors started appearing on other pools.
In the BIOS, under Avago Technologies Config Utility, SAS Topology shows my LSI SAS2X36 (the expander chip) under Device Identifier. It shows me a list of bays, but every 10 seconds it (very briefly) blanks and flashes “Refreshing display, please wait…”. I’ve detached every drive and tried it with only one drive attached, and the refresh still happens, just faster than a blink. All drives in that vdev are showing red lights on the case. Is the backplane the culprit now? Or is there something else going on?
I replaced the SAS cables, then replaced the HBA. Same result. I’ve removed the backplane and I can’t see anything suspicious. The Molex connectors all look and smell fine on both ends. I can’t see a bulging capacitor or detect a burnt smell. This failure happened during the coldest months here.
How should I proceed? I’m worried about a faulty backplane poisoning drives or feeding them irregular power and killing them. They are older drives, around 5 years old, but I can’t believe all of them decided to die at once. I’ve ordered a replacement backplane (BPN‑SAS3‑846EL1), but after that I’m not sure how to proceed without doing something dumb and losing data.
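
One thing I could do before swapping the backplane is record each pool's per-drive error counters, so I can compare them before and after. Below is a minimal sketch of that idea in Python. It assumes the usual `zpool status` column layout (NAME, STATE, READ, WRITE, CKSUM). The sample text, pool name, and device names are made up for illustration and are not from my system, and the function names are hypothetical.

```python
import re

# Hypothetical sample in the usual `zpool status` layout.
# The pool and device names here are invented for illustration.
SAMPLE = """\
  pool: tank
 state: DEGRADED
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz2-0  DEGRADED     0     0     0
            sda     ONLINE       0     0     0
            sdb     FAULTED     12     3     0
            sdc     ONLINE       0     0    47

errors: No known data errors
"""

# One config row: indented name, upper-case state, three counter columns.
ROW = re.compile(r"^\s+(\S+)\s+([A-Z]+)\s+(\S+)\s+(\S+)\s+(\S+)\s*")


def parse_error_counts(status_text):
    """Return {name: (state, read, write, cksum)} for rows with plain integer counters.

    Rows whose counters are abbreviated (for example with a K suffix) are
    skipped by this sketch rather than guessed at.
    """
    rows = {}
    in_config = False
    for line in status_text.splitlines():
        if line.strip().startswith("NAME"):
            in_config = True  # header row of a config table
            continue
        if line.startswith("errors:"):
            in_config = False  # end of this pool's config table
        if not in_config:
            continue
        m = ROW.match(line)
        if m:
            name, state, r, w, c = m.groups()
            if r.isdigit() and w.isdigit() and c.isdigit():
                rows[name] = (state, int(r), int(w), int(c))
    return rows


def devices_with_errors(rows):
    """Keep entries that are not ONLINE or have any non-zero counter."""
    return {n: v for n, v in rows.items() if v[0] != "ONLINE" or any(v[1:])}


if __name__ == "__main__":
    for name, info in devices_with_errors(parse_error_counts(SAMPLE)).items():
        print(name, info)
```

On a real system the text would come from saving the output of `zpool status` to a file and reading it in, instead of the `SAMPLE` string. Saving one snapshot before and one after the backplane swap would show whether the counters stop climbing.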