Server crashed. ZFS kernel panic. I think I have narrowed down the problematic pool. Now how do I restore it?

Two days ago my server suddenly went “poof” in the middle of the night: a nice Christmas present while I was away from home with family. When I came back home I started troubleshooting, but my Linux/ZFS knowledge is limited and I would love some advice before blindly following whatever Gemini tells me to and probably screwing stuff up more instead of fixing it.

I was running 25.10.1 when the server crashed. When I started it, it would boot up to the middleware “applying kernel variables” step and then shut off and reboot, over and over.
When selecting an old install to boot (25.04.2.6) I got greeted by this kernel panic:

I have 3 pools:

  • boot-pool
  • Klemmers (my spinning rust data tank)
  • apps (mirror of a sata SSD and an external USB SSD for my apps)

Things i tried:

  • Memtest, 6 passes: all tests passed
  • Booting with only the boot-pool attached (everything else unplugged): failed at the same point
  • Booting a fresh installation on a brand-new boot NVMe: worked!
  • Uploading my previous config to that fresh install: same boot loop (even with all the other cables unattached)
  • Importing pool “apps” on a fresh install without restoring the old config file first: system crashed, rebooted, and showed my 2 unimported pools again.
  • Importing pool “Klemmers” on a fresh install without restoring the old config file first: pool was successfully imported.
  • Importing my configuration after the successful import of the Klemmers pool, with the apps pool disconnected: also worked. (Why the hell didn’t this work on a clean install with all cables disconnected?)

So I’m guessing that something with the apps pool is screwed up. I have a daily rsync backup of my apps pool on the Klemmers pool (and the whole Klemmers pool is backed up daily to Backblaze, excluding all my media), but now I’m unsure how to proceed. I’m not looking forward to having to reconfigure all my apps, users, folder structures, etc., so I was wondering if there is a way to salvage the apps pool.

You might be able to import the pool readonly and copy data off of it.

Check if you can import the pool readonly. This happens every now and then with spacemap corruption.

zpool import -o readonly=on -R /mnt apps

If it can import, copy important data off of the pool.
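A sketch of the whole copy-off sequence; the rescue destination under Klemmers is a made-up example, so adjust paths to your own layout:

```shell
# Import read-only so nothing is written to the damaged spacemaps;
# -R /mnt mounts the pool's datasets under an alternate root.
sudo zpool import -o readonly=on -R /mnt apps

# Copy everything to a location on the healthy pool.
# /mnt/Klemmers/apps-rescue is an assumed destination; pick your own.
sudo rsync -aHAX --info=progress2 /mnt/apps/ /mnt/Klemmers/apps-rescue/

# Export again once the copy has been verified.
sudo zpool export apps
```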

There’s a good chance you can import it with zfs_recover enabled without any data corruption but it never hurts to be safe.

Skipped over this. Seeing as you have backups, you can see what happens.

As I mentioned, this “removing nonexistent segment from range tree” panic comes up here every so often. It is usually a pretty simple fix.

Try this:

echo 1 | sudo tee /sys/module/zfs/parameters/zfs_recover

Then import via the WebUI, then run a scrub.

I would let this run, and then leave it for some time to allow metaslab condensing to occur (probably worth leaving it to the next day). You can then try rebooting and see if it imports normally. You should be okay to continue using the pool during this time.
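Put together, the whole recovery sequence from the CLI might look like this; the WebUI import should be equivalent, and this is a sketch rather than something guaranteed to fix every pool:

```shell
# Enable the recovery tunable at runtime (resets on reboot).
echo 1 | sudo tee /sys/module/zfs/parameters/zfs_recover

# Import the pool, then kick off a scrub.
sudo zpool import -R /mnt apps
sudo zpool scrub apps

# Check progress; re-run until the scrub finishes, then leave the
# pool running for a while so metaslab condensing can rewrite the
# spacemaps before you try a normal reboot.
zpool status -v apps
```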


Thanks for the advice, but I have a silly question: I can’t plug in the drives while the server is running (well… technically I could with the USB SSD), and when I power down the server and plug in the drives, it no longer boots. Should I export or remove the offline apps pool using the WebUI first, or is there a CLI command for that?

I might do the read-only thing anyway and make a second, more recent backup. Might be paranoid, but it can’t hurt.

If you can’t hotplug the disks, you should be able to just boot with the param set.

On the GRUB boot menu, press e on the boot entry:

Then add zfs.zfs_recover=1 to the line starting with linux (this line is wrapped):

You can then press Ctrl+x to boot (this is a one-time edit and does not save); zfs_recover should be enabled as the pools import.
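For illustration only, the edited linux line ends up shaped roughly like this; the kernel path and the existing arguments vary per install, so don’t copy it verbatim:

```text
linux /ROOT/25.04.2.6@/boot/vmlinuz-... <existing arguments> zfs.zfs_recover=1
```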

Thanks so much for the detailed explanation. I’ll give this a try tomorrow when I have more free time (if the baby allows me any time off).


This totally did the trick. The pool got imported, some errors were corrected at first boot, everything popped right back up, and all my apps were running again. Thanks a lot for your help!
Do you have any idea what could have caused this? Is this a common problem that can be prevented?

I’ve only ever seen this on systems running non-ECC memory. I’ve never seen it reported by the same person twice.

“removing nonexistent segment” means that as ZFS was replaying the spacemap to build the in-memory tree of free space, it tried to process an instruction to “remove” a segment of space that the tree didn’t have a record of. Perhaps you had a bit flip in the offset value 🙂
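If you’re curious, zdb can dump what gets replayed; repeating the -m flag increases detail, up to the individual spacemap records. It’s read-only inspection, but output taken against a live pool can be momentarily inconsistent:

```shell
# Per-metaslab offsets, spacemaps, and free space for the pool.
sudo zdb -m apps

# Four -m's dump every spacemap record (the ALLOC/FREE log that
# gets replayed into the in-memory range tree at import).
sudo zdb -mmmm apps
```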


Well, let’s hope it stays that way!

Sorry to bother you again, but this error just popped up:

root@truenas[/home/admin]# zpool status apps -v
  pool: apps
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Tue Dec 30 17:40:42 2025
        149G / 148G scanned, 4.57G / 148G issued at 103M/s
        0B repaired, 3.09% done, 00:23:46 to go
config:

        NAME                                      STATE     READ WRITE CKSUM
        apps                                      ONLINE       0     0     0
          mirror-0                                ONLINE       0     1     0
            1acc454d-fce4-4ed1-842b-b792957ff6e1  ONLINE       0     1     0
            cef98fd0-12ef-4c21-8942-a717cdc593e5  ONLINE       0     1     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x449>

Does this have anything to do with the previous problem? The scrub yesterday returned 0 errors, and now this kind of worries me.

Now you have a real issue, which suggests that there is some defective hardware. It could be RAM, PSU, motherboard…
Good luck chasing it!

After a scrub it disappeared… I feel like I’m chasing ghosts. My first course of action will be to get rid of the USB SSD and replace it with an NVMe drive.

Also, the permanent error in the metadata is gone?

Yes. Disappeared, rebooted and scrubbed again. No metadata error to be found


Sorry, I completely forgot to respond to this.
AFAIK <metadata> indicates it’s not a user file, so it’s most likely corruption of something used for internal state. Considering the panics you were hitting were due to spacemap corruption, if I had to guess, the scrub picked up that bit of corruption before it was rewritten (by a metaslab condense operation), which would also explain why the new scrub came back clean.