Bringing back a SUSPENDED pool that doesn't show up with `zpool status`

Rebooting will probably mean the pool doesn’t import, as we never changed the on-disk state of the pool; we’re just browsing it at the previous state.

@coolnodje if you know of any very recently modified files, you can check whether they are present, but without `zpool status` being able to identify the damaged files here, it may not be possible to track them down.

If you export the pool with `zpool export subramanya` and then re-import it without `-o readonly=on`, it will roll the pool back on-disk to that transaction time. Once that happens, reboot the system (which will clear the tunables) and see if it imports automatically. If it does (which it hopefully will), run a full scrub and look at the output of `zpool status -v` when it completes.
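
Roughly, the sequence looks like this (a sketch assuming the pool is currently imported read-only as `subramanya`; adjust names to your setup):

```
# Export the pool that was imported read-only
zpool export subramanya

# Re-import read-write; this makes the rollback to the earlier transaction permanent on disk
zpool import subramanya

# Reboot to clear the tunables, then confirm the pool came back on its own
# (after the reboot:)
zpool status subramanya

# Kick off a full scrub and review the output when it completes
zpool scrub subramanya
zpool status -v subramanya
```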

2 Likes

Unable to look up specific files, and anyway it doesn’t feel like I would be able to recover them even if I was able to!

Procedure worked, pool was imported after reboot.

Scrub launched.

Looks like a success so far :metal:

Everything is back on track.
And if I lost anything, I still can’t tell.

Quite conveniently my main docker host VM started like a charm and everything is up and running.

I’m quite sorry for providing such a convoluted case; that was due to multiple hardware issues. The recovery procedure was straightforward in the end and should’ve taken a lot less energy.
But the hardware issues are now solved, and everything is running on newly assembled, more powerful, more reliable hardware, providing one additional 6-wide 2×-mirrored pool. So all in all, maybe that crash was a good thing.

I’ll now review all my backup plans (as promised to myself); with a new pool and an old box still (maybe) available, that will be a lot easier.

Massive thanks to everyone involved, thank you for the immediate and kind assistance.

2 Likes

Remember to take a look at temperatures after a week.

1 Like

I’m sure monitoring them daily for now, what with the summer temperature increase!

But an R730XD is quite a different beast from what I was using before. The airflow is really good.
I’ve had to put a fancontrol script in place, because with a non-Dell controller the fans spin so high it’s like a plane taking off.
It’s working for now: no drives over 35°C, even during a scrub or `zpool import`!
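
For reference, the general idea is just a loop that polls drive temperatures and sets the fan duty cycle over raw IPMI. Here's a minimal sketch of that kind of script (not my actual one); it assumes the commonly documented Dell PowerEdge raw fan commands, and the device glob, threshold, and duty-cycle values are purely illustrative:

```bash
#!/bin/bash
# Sketch of a fan control loop for a Dell PowerEdge host with a non-Dell storage controller.
# The raw IPMI codes are the widely documented Dell ones; thresholds are illustrative only.

HOT=40      # degrees C above which we hand control back to the iDRAC
DUTY=0x14   # ~20% manual fan duty cycle

# Find the highest temperature across all drives, via smartctl attribute 194
max_temp=0
for dev in /dev/sd?; do
    t=$(smartctl -A "$dev" | awk '/Temperature_Celsius/ {print $10}')
    [ -n "$t" ] && [ "$t" -gt "$max_temp" ] && max_temp=$t
done

if [ "$max_temp" -ge "$HOT" ]; then
    # Too warm: re-enable the iDRAC's automatic fan control
    ipmitool raw 0x30 0x30 0x01 0x01
else
    # Cool enough: take manual control and pin the fans to a quiet duty cycle
    ipmitool raw 0x30 0x30 0x01 0x00
    ipmitool raw 0x30 0x30 0x02 0xff "$DUTY"
fi
```

Run from cron every minute or so, something like this keeps the fans quiet while handing control back to the iDRAC if anything gets warm.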

2 Likes

Following up on the JIRA ticket, it seems that a raidz2 with one REMOVED and one FAULTED drive should remain online, as expected, and the likely cause is then a third failure due to excess temperature.
So that would be a reported case of the Silverstone DS-380 cooking a pool to near death :fire: :scream:

My hypothesis is that heat is the cause here. A story, written in ASCII, of six drives:

A B C D E F

At some point, drive A overheats, decides it needs to step out for some air, and stops responding to commands.

a B C D E F

Pool can stay online because RAIDZ2. Eventually drive A cools off enough from ignoring the commands and returns to the pool, but it’s rebuilding the data it was missing while it stepped out.

a B C D E F

This rebuild means drive A is now writing, and drives B-F are busy reading. More heat happens. Maybe drive C decides “that’s enough, I need a moment” - but drive A isn’t rebuilt yet, and is still in lowercase.

a B c D E F

Again - pool is still live, because RAIDZ2, but drive A isn’t back yet. Maybe it recovers in time, but as soon as it does, drive C is still rebuilding and drive D takes a nap:

A B c d E F

At this point, we’re just basically doing the “spinning plates” trick, until at some point drives C and D aren’t back, and drive F decides to press F to pay respects:

A B c d E f

At this point, you’ve got drives C and D at various degrees of both “rebuilt” and “reported internal temperature” and as such can’t provide the current state of the pool, because they were “behind” the transaction number of drives A B E F - and the pool suspends.

4 Likes

Please keep in mind that only one of the disks reported Time in Over-Temperature, and it wasn’t among the ones that reported errors (FAULTED or REMOVED).

Moreover, wouldn’t TN report errors for any disk that stopped functioning due to heat? In this case, only two errors were reported.
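
(For anyone who wants to check the same attribute on their own drives, a quick sketch assuming smartctl and ATA drives that expose the device statistics log:)

```bash
# Print the Time in Over-Temperature device statistic for each drive that reports it
for dev in /dev/sd?; do
    echo "== $dev =="
    smartctl -x "$dev" | grep -i 'Over-Temperature'
done
```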

As @PK1048 suggested, other hardware parts may have been impacted by the heat produced by the drives (which the drives themselves could withstand).

As further evidence, once I had the drives back at normal temperature, I still had other unexplained hardware issues during `zpool import` dry-run operations.
These were drives becoming inaccessible:

```
spa_load(subramanya, config untrusted): FAILED: unable to open vdev tree [error=2]
1749267364   ffff9155ee2f8000 vdev.c:219:vdev_dbgmsg_print_tree():   vdev 0: root, guid: 15005074635607672362, path: N/A, can't open
1749267364   ffff9155ee2f8000 vdev.c:219:vdev_dbgmsg_print_tree():     vdev 0: raidz, guid: 1843460257378462166, path: N/A, can't open
1749267364   ffff9155ee2f8000 vdev.c:219:vdev_dbgmsg_print_tree():       vdev 0: disk, guid: 12560554049367260037, path: /dev/disk/by-partuuid/69e33d70-2e29-4440-bafd-0f183720274b, can't open
1749267364   ffff9155ee2f8000 vdev.c:219:vdev_dbgmsg_print_tree():       vdev 1: disk, guid: 3504886088678984596, path: /dev/disk/by-partuuid/90aea466-1527-4960-ad48-98bc5f0ffd21, can't open
1749267364   ffff9155ee2f8000 vdev.c:219:vdev_dbgmsg_print_tree():       vdev 2: disk, guid: 1026282999581038360, path: /dev/disk/by-partuuid/ddb020b8-2290-495d-8343-089f94d31ad4, healthy
1749267364   ffff9155ee2f8000 vdev.c:219:vdev_dbgmsg_print_tree():       vdev 3: disk, guid: 4494896413352636629, path: /dev/disk/by-partuuid/4e570c79-6ff8-4ab4-94e3-0fc804c1a3a7, healthy
1749267364   ffff9155ee2f8000 vdev.c:219:vdev_dbgmsg_print_tree():       vdev 4: disk, guid: 18426049172646806820, path: /dev/disk/by-partuuid/e6d7bead-7310-43d8-8236-9aba92acd3c8, can't open
1749267364   ffff9155ee2f8000 vdev.c:219:vdev_dbgmsg_print_tree():       vdev 5: disk, guid: 10710542068254203083, path: /dev/disk/by-partuuid/33870927-1683-4d16-a597-7efee3e7b405, healthy
1749267364   ffff9155ee2f8000 spa_misc.c:429:spa_load_note(): spa_load(subramanya, config untrusted): UNLOADING
```

I really don’t know how it works, but I suppose it could be caused by the disk controller or other electronic components.

Heat in general is poor for a lot of components, be it the drives themselves or the storage controller they’re connected to.

Time in Over-Temperature as a SMART value is more likely there for the manufacturer to deny a warranty claim for a unit run out of spec; it’s not a binary “drive will be perfectly fine up to this point, and spontaneously combusts above it”. There are multiple studies showing a relationship between failure rates and elevated temperatures, and it’s possible that the drives without an over-temp condition were still experiencing errors or difficulty - just not enough to fully knock themselves off the bus.

Root-causing this will be rather difficult, since reproducing the thermal conditions isn’t something we’re able to directly control.

2 Likes