Server crashed. ZFS kernel panic. I think I have narrowed down the problematic pool. Now how do I restore it?

Two days ago my server suddenly went “poof” in the middle of the night: a nice Christmas present while I was away from home with family. When I came back home I started troubleshooting, but my Linux/ZFS knowledge is limited and I would love some advice before blindly following whatever Gemini tells me to and probably screwing stuff up more instead of fixing it.

I was running 25.10.1 when the server crashed. When I started it, it would boot up to the middleware “applying kernel variables” step and then shut off and reboot, over and over.
When selecting an old install to boot (25.04.2.6) I got greeted by this kernel panic:

I have 3 pools:

  • boot-pool
  • Klemmers (my spinning rust data tank)
  • apps (mirror of a sata SSD and an external USB SSD for my apps)

Things i tried:

  • Memtest, 6 passes: all tests passed
  • Booting with only the boot-pool attached (everything else unplugged): failed at the same point
  • Booting a fresh installation on a brand-new boot NVMe: worked!
  • Uploading my previous config to that fresh install: same boot loop (even with all the other cables unattached)
  • Importing pool “apps” on a fresh install without restoring the old config file first: system crashed, rebooted, and showed my 2 unimported pools again.
  • Importing pool “Klemmers” on a fresh install without restoring the old config file first: pool was successfully imported.
  • Importing my configuration after the successful import of the Klemmers pool, with the apps pool disconnected: also worked. (Why the hell didn’t this work on a clean install with all cables disconnected?)

So I’m guessing that something with the apps pool is screwed up. I have a daily rsync backup of my apps pool on the Klemmers pool (and the whole Klemmers pool is backed up daily to Backblaze, excluding all my media), but now I’m unsure how to proceed. I’m not looking forward to having to reconfigure all my apps, users, folder structures, etc., so I was wondering if there is a way to salvage the apps pool.

You might be able to import the pool readonly and copy data off of it.

Check if you can import the pool readonly. This happens every now and then with spacemap corruption.

zpool import -o readonly=on -R /mnt apps

If it can import, copy important data off of the pool.
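A sketch of the whole copy-off sequence; the rescue destination under Klemmers is a made-up example, so adjust paths to your own layout:

```shell
# Import read-only so nothing is written to the damaged spacemaps;
# -R /mnt mounts the pool's datasets under an alternate root.
sudo zpool import -o readonly=on -R /mnt apps

# Copy everything to a location on the healthy pool.
# /mnt/Klemmers/apps-rescue is an assumed destination; pick your own.
sudo rsync -aHAX --info=progress2 /mnt/apps/ /mnt/Klemmers/apps-rescue/

# Export again once the copy has been verified.
sudo zpool export apps
```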

There’s a good chance you can import it with zfs_recover enabled without any data corruption but it never hurts to be safe.

Skipped over this. Seeing as you have backups, you can see what happens.

As I mentioned, this “removing nonexistent segment from range tree” panic comes up here every so often. It is usually a pretty simple fix.

Try this:

echo 1 | sudo tee /sys/module/zfs/parameters/zfs_recover

Then import via the WebUI, then run a scrub.

I would let this run, and then leave it for some time to allow metaslab condensing to occur (probably worth leaving it to the next day). You can then try rebooting and see if it imports normally. You should be okay to continue using the pool during this time.
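Put together, the whole recovery sequence from the CLI might look like this; the WebUI import should be equivalent, and this is a sketch rather than something guaranteed to fix every pool:

```shell
# Enable the recovery tunable at runtime (resets on reboot).
echo 1 | sudo tee /sys/module/zfs/parameters/zfs_recover

# Import the pool, then kick off a scrub.
sudo zpool import -R /mnt apps
sudo zpool scrub apps

# Check progress; re-run until the scrub finishes, then leave the
# pool running for a while so metaslab condensing can rewrite the
# spacemaps before you try a normal reboot.
zpool status -v apps
```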


Thanks for the advice, but I have a silly question: I can’t plug in the drives while the server is running (well… technically I could with the USB SSD), and when I power down the server and plug in the drives, it no longer boots. Should I export or remove the offline apps pool using the WebUI first, or is there a CLI command for that?

I might do the read-only thing anyway and make a second, more recent backup. Might be paranoid, but it can’t hurt.

If you can’t hotplug the disks, you should be able to just boot with the param set.

On the GRUB boot menu, press e on the boot entry:

Then add zfs.zfs_recover=1 to the line starting with linux (this line is wrapped):

You can then press Ctrl+x to boot (this is a one-time edit and does not save); zfs_recover should be enabled as the pools import.
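For illustration only, the edited linux line ends up shaped roughly like this; the kernel path and the existing arguments vary per install, so don’t copy it verbatim:

```text
linux /ROOT/25.04.2.6@/boot/vmlinuz-... <existing arguments> zfs.zfs_recover=1
```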

Thanks so much for the detailed explanation. I’ll give this a try tomorrow when I have more free time (if the baby allows me any time off).


This totally did the trick. The pool got imported, some errors were corrected at first boot, everything popped right back up, and all my apps were running again. Thanks a lot for your help!
Do you have any idea what could have caused this? Is this a common problem that can be prevented?

I’ve only ever seen this on systems running non-ECC memory. I’ve never seen it reported by the same person twice.

“removing nonexistent segment” means that as ZFS was replaying the spacemap to build the in-memory tree of free space, it tried to process an instruction to “remove” a segment of space that the tree didn’t have a record of. Perhaps you had a bit flip in the offset value 🙂
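If you’re curious, zdb can dump what gets replayed; repeating the -m flag increases detail, up to the individual spacemap records. It’s read-only inspection, but output taken against a live pool can be momentarily inconsistent:

```shell
# Per-metaslab offsets, spacemaps, and free space for the pool.
sudo zdb -m apps

# Four -m's dump every spacemap record (the ALLOC/FREE log that
# gets replayed into the in-memory range tree at import).
sudo zdb -mmmm apps
```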


Well, let’s hope it stays that way!

Sorry to bother you again, but this error just popped up:

root@truenas[/home/admin]# zpool status apps -v
  pool: apps
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Tue Dec 30 17:40:42 2025
        149G / 148G scanned, 4.57G / 148G issued at 103M/s
        0B repaired, 3.09% done, 00:23:46 to go
config:

        NAME                                      STATE     READ WRITE CKSUM
        apps                                      ONLINE       0     0     0
          mirror-0                                ONLINE       0     1     0
            1acc454d-fce4-4ed1-842b-b792957ff6e1  ONLINE       0     1     0
            cef98fd0-12ef-4c21-8942-a717cdc593e5  ONLINE       0     1     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x449>

Does this have anything to do with the previous problem? The scrub yesterday returned 0 errors, and now this kind of worries me.

Now you have a real issue, which suggests that there is some defective hardware. It could be RAM, PSU, motherboard…
Good luck chasing it!

After a scrub it disappeared… I feel like I’m chasing ghosts. My first course of action will be to get rid of the USB SSD and replace it with an NVMe drive.

Also, the permanent error in the metadata is gone?

Yes. Disappeared, rebooted and scrubbed again. No metadata error to be found


Sorry, I completely forgot to respond to this.
AFAIK <metadata> indicates it’s not a user file, so it’s most likely corruption of something used for internal state. Considering the panics you were hitting were due to spacemap corruption, if I had to guess, the scrub picked up that bit of corruption before it was rewritten (by a metaslab condense operation), which would also explain why the new scrub came back clean.