Can anyone help me? I am very much a noob at this. I deleted one of my pool snapshots from about 5 days ago, and TrueNAS crashed and is now stuck in a boot loop. During boot, around this stage, the system crashes and reboots. Sometimes it hangs instead, and a hard reboot brings me back to the same error.
(1 of 2) Job ix-zfs.service/start running (1min 20s / 15min 21s)
Sometimes I make it a few seconds past this screen to systemd-sysctl.service, but then it crashes with a kernel panic.
[ OK ] Finished systemd-sysctl.service - Apply Kernel Variables.
[* ] (2 of 2) Job ix-netif.service/start running (1min 25s / no limit)
[ 92.641222] PANIC: zfs: adding existing segment to range tree (offset=b1c8726000 size=1a10000)
[ 92.641241] Kernel panic - not syncing: zfs: adding existent segment to range tree (offset=b1c8726000 size=1a10000)
[ 92.641241] CPU: 1 PID: 1905 Comm: txg_sync Tainted: P IOE 6.6.44-production+truenas #1
I have seen suggestions online about running zpool status -v, but the system never gets far enough for me to reach a usable shell. I have tried removing all of my hard drives and was able to boot and load the WebUI. However, after putting the drives back in, I am back to the same error on boot.
Also, back in December I did something stupid that resulted in all my drives having the exact same number of checksum errors: I was transferring /media from a plain folder into a /media dataset, but reverted halfway through via a snapshot because the process was pushing pool usage close to the maximum capacity of my combined hard drives under RAID-Z1. I haven’t noticed any playback problems with the files listed as having checksum errors, but I have been deleting and replacing those files, and I have been saving up for additional new hard drives.
We really need your system details and pool details. How were they set up?
Expand the Details under my post to see an example.
It sounds like you had snapshots, were moving files around, and ran out of pool space. That’s a guess at this point. Snapshots take up space for every block that changes after they are taken.
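For example, something along these lines would show how much space each snapshot is holding on to (the pool name tank here is only a placeholder for whatever your pool is called):

# space uniquely held by each snapshot (USED column), largest last
zfs list -t snapshot -o name,used,referenced -s used -r tank
# overall pool size, allocated and free space
zpool list tank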
TrueNAS ElectricEel 24.10.2
Intel Core i5-7500T 2.7GHz
ASRock H110M-ITX Motherboard
16GB DDR4 2133MHz 2x8GB
TEAMGROUP T-Force Vulcan Z 240GB SSD - boot
3x 4TB Western Digital Red HDD in RAID-Z1 (7.17TiB usable) - main pool
So back in December I was trying to move the majority of my files around and nearly ran out of space, so I reverted to a prior snapshot (usage went from 78% of 7.17TiB to something like 90% of 8.73TiB while the move was running). The snapshot saved the system, but all 3 of my drives reported the same number of checksum errors. Running zpool status -v would show the files with the checksum errors. Checking playback of those files in Jellyfin, I noticed no change in video or audio quality, so I just ignored the errors. January comes around, the ZFS Health warning messages start bothering me, and the number of checksum errors appears to be growing. Running zpool status -v again, the new checksum errors appear to be on the same files, just inside the snapshots. I thought of fixing it by redownloading the exact same media files from the original sources and swapping out the old files. As I’m doing that, my pool usage starts growing rapidly. Besides the temporary space for the new replacement files, I guessed that the daily snapshots, where tens or hundreds of GB were changing within a day, were contributing to the rapid rise in zpool usage. I went and deleted some older daily snapshots that were around 5+ days old, and then the server crashed and started kernel panicking while running ix-zfs.service on boot.
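Roughly, the commands I was working with looked like this (pool, dataset, and snapshot names here are just placeholders):

zpool status -v                                 # lists the files with permanent checksum errors
zfs list -t snapshot -r tank/media              # shows the daily snapshots and the space each one uses
zfs destroy tank/media@auto-2025-01-05_00-00    # deleting one of the older daily snapshots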
Since then I’ve tried hard rebooting, which results in the same errors. Booting without the hard drives attached let me access the WebUI, but the zpool was offline and there were no datasets. When I added the hard drives back into the machine it recognized them and saw that they had exported pools, but there was no existing pool for me to add the drives to, and I didn’t want to create a new pool and risk losing all my data. I created a fresh TrueNAS boot drive on a USB stick, but importing the pool caused the machine to crash and reboot. Uploading my old config file to the fresh boot drive made the machine kernel panic while running ix-zfs.service, just like the old boot drive. Switching back to the old drive, I saw suggestions online about setting zfs.zfs_recover=1 and zfs.zil_replay_disable=1 in the boot settings and then running zpool import -f -F -X pool_name, but that doesn’t seem to work.
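For reference, what I was trying to follow amounts to appending those options to the kernel line at the GRUB menu and then attempting the rewind import from a shell (the pool name below is a placeholder):

# at the GRUB menu, press 'e' on the boot entry and append to the 'linux' line:
#   zfs.zfs_recover=1 zfs.zil_replay_disable=1
# then boot with Ctrl-X and, from a root shell, try the extreme-rewind import:
zpool import -f -F -X pool_name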
This looks like metaslab/spacemap corruption - unfortunate, but it should be fixable.
Do you have a backup? Generally this would require a pool rebuild, but since spacemaps are rewritten often, you may just be able to import with zfs_recover and leave it for a couple of days. I’ve hit this once myself and have helped recover another pool that hit the same bug (in both cases by rebuilding the pool), but unfortunately I have no idea on the root cause.
If you don’t already have any sort of backup, I would recommend importing the pool read-only and copying the important data off via SFTP (zpool import -Ff -o readonly=on -R /mnt pool_name) - this should work without causing a kernel panic and without needing to change any ZFS parameters.
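As a rough sketch of that path (pool name, dataset path, and destination host below are placeholders - adjust to your layout):

# read-only import; -R /mnt mounts the datasets under /mnt instead of /
zpool import -Ff -o readonly=on -R /mnt pool_name
# confirm the datasets are visible and mounted
zfs list -r pool_name
# copy the important data to another machine, e.g. over SSH (or use sftp/scp if you prefer)
rsync -avP /mnt/pool_name/media/ user@backuphost:/backups/media/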
I’d confirm you’ve actually got zfs_recover enabled before trying the import (cat /sys/module/zfs/parameters/zfs_recover to check, and echo 1 > /sys/module/zfs/parameters/zfs_recover if it isn’t).
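Spelled out, that check looks like this (run as root before attempting the import):

# prints 1 if zfs_recover is enabled, 0 if not
cat /sys/module/zfs/parameters/zfs_recover
# enable it for the running system (does not persist across reboots)
echo 1 > /sys/module/zfs/parameters/zfs_recover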