RCA attempt for unexpected reboots and subsequent pool corruption

Hi,

I had a lot of fun yesterday (not) moving from “find out why the box crashes on data deletion” to fixing a “TrueNAS crashes on boot” loop.

Long story short, it seems that the metadata devices on my main pool have some issues, so heavy I/O to them (data deletion) caused the system to reboot.

After multiple attempts at isolating the root cause (swapping the HBA, moving drives to different places in the chassis/backplanes, swapping PSUs), one of my three metadata vdevs (3x 2-way mirror) decided to act up even more and now causes a reboot on import already (instead of only on write).

At the moment I have imported the pool read-only and am copying off as much as I can before recreating it (of course I don’t have a backup, since I am/was in the middle of deduplicating 3 backup servers into one new big main pool, which involves a lot of disk shuffling and left me with no backups. Not smart in hindsight).

Pool layout

pool: tank18t
state: ONLINE
scan: resilvered 2.09M in 00:00:00 with 0 errors on Sun Apr 19 21:17:52 2026
config:

    NAME                                      STATE     READ WRITE CKSUM
    tank18t                                   ONLINE       0     0     0
      raidz2-0                                ONLINE       0     0     0
        7d9a5851-c812-4f77-991e-eeef004b70c7  ONLINE       0     0     0
        08ce322c-423b-4620-a298-5caf1083f623  ONLINE       0     0     0
        82be3c4b-6def-4c30-83be-20e224d10a27  ONLINE       0     0     0
        9af43f3a-ea81-4559-925a-74eb744cddbe  ONLINE       0     0     0
        fcfff8fa-9218-49cf-9081-9b2470be5908  ONLINE       0     0     0
        79d77d63-6fc0-46b2-8cbb-8c96553ededa  ONLINE       0     0     0
        619ce353-29c0-498e-bb8f-8c8564a7387b  ONLINE       0     0     0
        4a2a1378-9bf7-4ea9-bacf-92b4094605e8  ONLINE       0     0     0
    special
      mirror-1                                ONLINE       0     0     0
        a65f45a9-daf9-437e-aeb4-3a8ddc20db7a  ONLINE       0     0     0
        921db10e-a1f8-4bc4-b48f-f2b535002af5  ONLINE       0     0     0
      mirror-2                                ONLINE       0     0     0
        7f48692b-2ff3-435e-ae56-e9f3bd6eb852  ONLINE       0     0     0
        7b00d031-35bb-48eb-93e2-51a16ad87ad3  ONLINE       0     0     0
      mirror-3                                ONLINE       0     0     0
        244f25cc-be3a-409d-a670-fb21d0ed49fc  ONLINE       0     0     0
        da55dab6-ce08-4d64-b449-fdc1d3f9ba7d  ONLINE       0     0     0

errors: No known data errors
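To pull out just the metadata (special) mirrors and their members from that output, something like the following could work. It is only a sketch: `special_members` is a name I made up, and it assumes the members are listed by UUID the way they are above.

```python
import re

# Assumption: members are shown as UUID-style names, as in the output above.
UUID_RE = re.compile(r"^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$")


def special_members(status_text):
    """Return {mirror_name: [member, ...]} for the 'special' section."""
    members = {}
    in_special = False
    current = None
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        name = fields[0]
        if name == "special":
            in_special = True
            continue
        if not in_special:
            continue
        if name.startswith("mirror-"):
            current = name
            members[current] = []
        elif UUID_RE.match(name) and current is not None:
            members[current].append(name)
        else:
            # Any other token (e.g. "errors:") ends the special section.
            in_special = False
    return members
```

Feeding it the text of `zpool status tank18t` should give three mirrors with two members each. On Linux the UUIDs can usually be resolved to device nodes through `/dev/disk/by-partuuid/`, which helps when matching them to physical drives.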

Now, with that background, on to my actual point of discussion: is there any explanation other than defective drives for the system rebooting on writes without a log entry?

I’ve

  • swapped out the PSU for a bigger one (500 W to 1 kW); it probably was a bit underpowered (10 SATA spinners, 25 SATA SSDs, an A2000, X12SCZ-F/1290p, an X710 NIC, and an M.2 NVMe, in an 847 chassis)
  • The PDB is the same, but it should be fine here; it’s designed to run 36 spinners after all
  • I moved the drives between the front and back backplanes (EL1s) and into different slots
  • I swapped out the HBA (9300-16i to a 24i, then back to the 16i)
  • I moved everything to another system (846 on older hw) but only after the reboot loop started

The drives (Micron 5100, 3.84 TB) were used in a previous build for a year without anything noteworthy. All drives have the same firmware, and I have no known issues with any of the others I own.

Short SMART tests come back fine. I will run long tests after I have evacuated the data.

Looking for ideas why this happens…

Bad luck putting two defective drives in a single mirror vdev? Or some other issue that merely triggered the chain of events…

Edit: I need a way to identify why those drives acted the way they did, so I can test the others for the same issue before I use them as metadata devices in the next pool…

Thanks

P.S. I am on 25.04.2.6; an initial 25.10 attempt didn’t go so well, so I decided to let that mature some more (although the A2000 doesn’t seem to work anymore on the latest update either, but that’s another issue).

Hi, you have already done a solid job ruling out most system-level issues, so this really points toward a hardware fault under load. In my experience, sudden reboots with no logs are more often caused by SSD firmware hangs or controller timeouts than by obvious PSU limits.

Since the issue started during heavy writes and now even happens on import, a failing SSD in the metadata vdev is still the most likely cause. It’s possible one (or even two) drives are triggering a bus reset, which can bring the whole system down.
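If the journal from the boot before a crash survived (for example via `journalctl -k -b -1`, which needs persistent journaling), it may be worth scanning it for reset and abort messages even if nothing obvious stands out. A minimal sketch: the pattern list is only my guess at what is worth looking for and is not exhaustive, and `suspicious_lines` is a made-up helper name.

```python
import re

# Illustrative patterns only; real kernel messages vary by driver and version.
PATTERNS = {
    "link reset": re.compile(r"hard resetting link", re.I),
    "task abort": re.compile(r"attempting task abort", re.I),
    "controller reset": re.compile(r"mpt3sas.*reset", re.I),
    "I/O error": re.compile(r"I/O error", re.I),
    "machine check": re.compile(r"machine check|\bmce\b", re.I),
}


def suspicious_lines(lines):
    """Return (label, line) pairs for lines matching any watched pattern."""
    hits = []
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((label, line.rstrip()))
                break  # one label per line is enough
    return hits
```

An empty result does not clear the drives: a hard reset can happen before anything is flushed to disk, which would match the "no log entry" symptom.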

I’d run long SMART tests, check error logs (not just health status), and stress test each drive individually in another system if possible. Also worth checking for interface errors or firmware updates. Your plan to evacuate data is definitely the right call here.
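As a rough illustration of looking beyond the health status: recent smartmontools can emit JSON (`smartctl -j -a /dev/sdX`), and a small script can flag a non-empty error log or non-zero raw values on a few attributes. This is a sketch under assumptions: the watched attribute IDs are the commonly used ones, their meaning can differ per vendor, and `smart_findings` is a hypothetical helper name.

```python
import json

# Assumption: commonly used attribute IDs; vendors may define them differently.
WATCHED = {
    5: "reallocated blocks",
    187: "reported uncorrectable",
    199: "interface CRC errors",
}


def smart_findings(report):
    """Return a list of human-readable findings from a parsed smartctl JSON report."""
    findings = []
    summary = report.get("ata_smart_error_log", {}).get("summary", {})
    count = summary.get("count", 0)
    if count:
        findings.append(f"error log entries: {count}")
    for attr in report.get("ata_smart_attributes", {}).get("table", []):
        label = WATCHED.get(attr.get("id"))
        raw = attr.get("raw", {}).get("value", 0)
        if label and raw > 0:
            findings.append(f"{label}: raw={raw}")
    return findings


# Usage idea (not executed here):
#   out = subprocess.run(["smartctl", "-j", "-a", "/dev/sdX"],
#                        capture_output=True, text=True).stdout
#   print(smart_findings(json.loads(out)))
```

Running the same check on every candidate drive before and after a sustained write test, and comparing the results, might show which ones accumulate interface errors under load. The long self-test itself is started with `smartctl -t long /dev/sdX`.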

Thanks :slight_smile:

Will be a while before everything is copied off, just happy that the pool is still usable for now.

Found some partial older backups too, so not as close a call as I thought, but a good reminder that even a working pool needs regular backups to be really safe.