Kernel Oops after ZFS version upgrade: unable to mount pool anymore

Hello,
Last week I upgraded one of my two NAS systems to 25.04.2.1. I was starting from 23.04 with TrueCharts apps, so I deleted all of them and upgraded to 23.10.2, then to 24.04.2.5, then to 24.10.2.4, and finally to 25.04.2.1. I installed all my apps from the new catalog and imported all the configuration backups, and everything was amazingly fine considering the big step I took.
The other NAS was already on 25.04.2.1, as it’s running just Nextcloud.

So after a week of everything being stable I decided to upgrade the ZFS pool version to the latest one on all the pools, and I didn’t notice anything strange.
Then 25.04.2.3 was released and I applied it on both systems.
After a while I noticed that my main NAS (the one that went through all the upgrades) wasn’t back online, so I checked it and found a kernel message that looked like a panic.
I said “ok, the upgrade failed, I will just reinstall the latest version and restore the config from backup”, so I did a fresh install and restored the configuration, but I got the same error.
I thought that maybe this minor version had a bug, so I reinstalled version 25.04.2.1, but the error was the same.
Then I remembered that I had also upgraded the ZFS pool version on all the pools, so I tried unplugging one pool at a time: I was then able to isolate the error to the “safe” pool, which is a mirror of 2 disks.
When the system boots with this pool connected, its services time out while starting, and then it’s quite unusable (super slow).

Sep 1 15:22:40 micro8 kernel: PGD 0 P4D 0
Sep 1 15:22:40 micro8 kernel: Oops: Oops: 0000 [#1] PREEMPT SMP PTI
Sep 1 15:22:40 micro8 kernel: CPU: 0 UID: 0 PID: 2371 Comm: txg_sync Tainted: P IOE 6.12.15-production+truenas #1
Sep 1 15:22:40 micro8 kernel: Tainted: [P]=PROPRIETARY_MODULE, [I]=FIRMWARE_WORKAROUND, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Sep 1 15:22:40 micro8 kernel: Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
Sep 1 15:22:40 micro8 kernel: RIP: 0010:metaslab_sync+0x34/0x840 [zfs]
Sep 1 15:22:40 micro8 kernel: Code: 57 41 56 41 55 41 54 49 89 f4 55 53 48 89 fb 48 83 ec 38 4c 8b af 88 07 00 00 65 48 8b 04 25 28 00 00 00 48 89 44 24 30 31 c0 <4d> 8b b5 90 00 00 00 49 8b 6e 68 48 89 ef e8 39 97 02 00 44 8b 83
Sep 1 15:22:40 micro8 kernel: RSP: 0018:ffffa7ed0bf2bd48 EFLAGS: 00010246
Sep 1 15:22:40 micro8 kernel: RAX: 0000000000000000 RBX: ffff8e0a40c26000 RCX: 0000000000000000
Sep 1 15:22:40 micro8 kernel: RDX: 0000000000000000 RSI: 0000000003c8b89d RDI: ffff8e0a40c26000
Sep 1 15:22:40 micro8 kernel: RBP: ffff8e0a1918abb8 R08: 0000000000000000 R09: 0000000000000000
Sep 1 15:22:40 micro8 kernel: R10: ffffa7ed0bf2bd60 R11: 0000008000000000 R12: 0000000003c8b89d
Sep 1 15:22:40 micro8 kernel: R13: 0000000000000000 R14: ffff8e0a43dd8000 R15: ffff8e0a40c26000
Sep 1 15:22:40 micro8 kernel: FS: 0000000000000000(0000) GS:ffff8e0cfb800000(0000) knlGS:0000000000000000
Sep 1 15:22:40 micro8 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 1 15:22:40 micro8 kernel: CR2: 0000000000000090 CR3: 0000000140d82005 CR4: 00000000001726f0
Sep 1 15:22:40 micro8 kernel: Call Trace:
Sep 1 15:22:40 micro8 kernel:
Sep 1 15:22:40 micro8 kernel: ? __die+0x23/0x70
Sep 1 15:22:40 micro8 kernel: ? page_fault_oops+0x173/0x5b0
Sep 1 15:22:40 micro8 kernel: ? search_module_extables+0x19/0x60
Sep 1 15:22:40 micro8 kernel: ? search_bpf_extables+0x5f/0x80
Sep 1 15:22:40 micro8 kernel: ? exc_page_fault+0x76/0x190
Sep 1 15:22:40 micro8 kernel: ? asm_exc_page_fault+0x26/0x30
Sep 1 15:22:40 micro8 kernel: ? metaslab_sync+0x34/0x840 [zfs]
Sep 1 15:22:40 micro8 kernel: ? metaslab_sync+0x3fa/0x840 [zfs]
Sep 1 15:22:40 micro8 kernel: ? gethrtime+0x2d/0x60 [zfs]
Sep 1 15:22:40 micro8 kernel: vdev_sync+0x6f/0x190 [zfs]
Sep 1 15:22:40 micro8 kernel: spa_sync_iterate_to_convergence+0x15d/0x200 [zfs]
Sep 1 15:22:40 micro8 kernel: spa_sync+0x30a/0x600 [zfs]
Sep 1 15:22:40 micro8 kernel: txg_sync_thread+0x1ec/0x270 [zfs]
Sep 1 15:22:40 micro8 kernel: ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
Sep 1 15:22:40 micro8 kernel: ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Sep 1 15:22:40 micro8 kernel: thread_generic_wrapper+0x5d/0x70 [spl]
Sep 1 15:22:40 micro8 kernel: kthread+0xd2/0x100
Sep 1 15:22:40 micro8 kernel: ? __pfx_kthread+0x10/0x10
Sep 1 15:22:40 micro8 kernel: ret_from_fork+0x34/0x50
Sep 1 15:22:40 micro8 kernel: ? __pfx_kthread+0x10/0x10
Sep 1 15:22:40 micro8 kernel: ret_from_fork_asm+0x1a/0x30
Sep 1 15:22:40 micro8 kernel:
Sep 1 15:22:40 micro8 kernel: Modules linked in: ib_core(E) ipmi_ssif(E) intel_rapl_msr(E) intel_rapl_common(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) crct10dif_pclmul(E) ghash_clmulni_intel(E) sha512_ssse3(E) sha256_ssse3(E) sha1_ssse3(E) aesni_intel(E) gf128mul(E) crypto_simd(E) cryptd(E) rapl(E) intel_cstate(E) intel_uncore(E) snd_pcm(E) snd_timer(E) snd(E) soundcore(E) pcspkr(E) serio_raw(E) iTCO_wdt(E) mgag200(E) drm_shmem_helper(E) intel_pmc_bxt(E) iTCO_vendor_support(E) drm_kms_helper(E) hpwdt(E) hpilo(E) evdev(E) i2c_algo_bit(E) joydev(E) watchdog(E) ipmi_si(E) ie31200_edac(E) acpi_power_meter(E) acpi_ipmi(E) button(E) ipmi_devintf(E) ipmi_msghandler(E) sg(E) loop(E) drm(E) efi_pstore(E) configfs(E) ip_tables(E) x_tables(E) autofs4(E) zfs(POE) spl(OE) efivarfs(E) hid_generic(E) usbhid(E) hid(E) sd_mod(E) xhci_pci_renesas(E) ata_generic(E) xhci_pci(E) uhci_hcd(E) ehci_pci(E) ata_piix(E) crc32_pclmul(E) ehci_hcd(E) xhci_hcd(E) tg3(E) nvme(E) libata(E) crc32c_intel(E) psmouse(E) libphy(E)
Sep 1 15:22:40 micro8 kernel: lpc_ich(E) usbcore(E) scsi_mod(E) usb_common(E) scsi_common(E) nvme_core(E)
Sep 1 15:22:40 micro8 kernel: CR2: 0000000000000090
Sep 1 15:22:40 micro8 kernel: ---[ end trace 0000000000000000 ]---
Sep 1 15:22:40 micro8 kernel: RIP: 0010:metaslab_sync+0x34/0x840 [zfs]
Sep 1 15:22:40 micro8 kernel: Code: 57 41 56 41 55 41 54 49 89 f4 55 53 48 89 fb 48 83 ec 38 4c 8b af 88 07 00 00 65 48 8b 04 25 28 00 00 00 48 89 44 24 30 31 c0 <4d> 8b b5 90 00 00 00 49 8b 6e 68 48 89 ef e8 39 97 02 00 44 8b 83
Sep 1 15:22:40 micro8 kernel: RSP: 0018:ffffa7ed0bf2bd48 EFLAGS: 00010246
Sep 1 15:22:40 micro8 kernel: RAX: 0000000000000000 RBX: ffff8e0a40c26000 RCX: 0000000000000000
Sep 1 15:22:40 micro8 kernel: RDX: 0000000000000000 RSI: 0000000003c8b89d RDI: ffff8e0a40c26000
Sep 1 15:22:40 micro8 kernel: RBP: ffff8e0a1918abb8 R08: 0000000000000000 R09: 0000000000000000
Sep 1 15:22:40 micro8 kernel: R10: ffffa7ed0bf2bd60 R11: 0000008000000000 R12: 0000000003c8b89d
Sep 1 15:22:40 micro8 kernel: R13: 0000000000000000 R14: ffff8e0a43dd8000 R15: ffff8e0a40c26000
Sep 1 15:22:40 micro8 kernel: FS: 0000000000000000(0000) GS:ffff8e0cfb800000(0000) knlGS:0000000000000000
Sep 1 15:22:40 micro8 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 1 15:22:40 micro8 kernel: CR2: 0000000000000090 CR3: 0000000140d82005 CR4: 00000000001726f0
Sep 1 15:22:40 micro8 kernel: note: txg_sync[2371] exited with irqs disabled
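
For anyone reading the trace: the faulting address (CR2) and the instruction bytes at the `<4d>` marker can be cross-checked by hand. The sketch below is only a reading aid, assuming the excerpt is copied correctly from the log above; the helper names are made up for illustration, and the decoder handles only this one instruction form (REX.W `8b` with a 32-bit displacement). It suggests a load from a NULL base pointer plus 0x90 inside `metaslab_sync`, but it says nothing about why that pointer was NULL.

```python
import re

# Excerpt copied from the log above (timestamp and host prefix dropped,
# Code line shortened to the bytes around the faulting instruction).
OOPS = """\
RIP: 0010:metaslab_sync+0x34/0x840 [zfs]
Code: 48 89 44 24 30 31 c0 <4d> 8b b5 90 00 00 00 49 8b 6e 68
R13: 0000000000000000 R14: ffff8e0a43dd8000 R15: ffff8e0a40c26000
CR2: 0000000000000090 CR3: 0000000140d82005 CR4: 00000000001726f0
"""


def parse_registers(text):
    """Return {name: value} for every 'NAME: <16 hex digits>' pair."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}


def faulting_bytes(text):
    """Return the bytes of the Code: line starting at the <xx> marker (the RIP)."""
    line = next(l for l in text.splitlines() if l.startswith("Code:"))
    tokens = line.split()[1:]
    start = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    return bytes(int(t.strip("<>"), 16) for t in tokens[start:])


def decode_mov_load(code):
    """Decode 'REX.W 8B /r' with a disp32 memory operand and no SIB byte."""
    rex, opcode, modrm = code[0], code[1], code[2]
    assert rex & 0xF8 == 0x48 and opcode == 0x8B  # REX.W + MOV r64, r/m64
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    assert mod == 0b10 and rm != 0b100            # disp32 follows, no SIB
    reg |= (rex & 0b100) << 1                     # REX.R extends the reg field
    rm |= (rex & 0b001) << 3                      # REX.B extends the r/m field
    disp = int.from_bytes(code[3:7], "little", signed=True)
    return reg, rm, disp


regs = parse_registers(OOPS)
dest, base, disp = decode_mov_load(faulting_bytes(OOPS))
# Register lookup by number only works for R8..R15, which is enough here.
fault_addr = regs[f"R{base}"] + disp
print(f"mov r{dest}, [r{base} + {disp:#x}] -> address {fault_addr:#x}, "
      f"CR2 {regs['CR2']:#x}")
```

The computed address matches CR2, which is consistent with a NULL pointer dereference rather than a random wild pointer.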

Of course I have all my data backed up on the other NAS, so I can destroy and re-create this pool, but as it’s a kernel error maybe you could study it and fix it? Thanks.

I’ve only found one topic describing a similar issue, “Reboot loop, Unable to mount the pool as rw”, but I can’t find any solution there.

I’ve reinstalled the system using the latest 25.20 beta1. Trying to import the pool makes the system crash, but I can import it read-only.
What should I do? Open a ticket? Thanks.
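
For reference, a read-only import from the shell can be sketched as below. This is a minimal example, not the exact command used here: the pool name `safe` comes from this thread, the `/mnt` altroot is an assumption about where the pool should appear, and `readonly_import_cmd` is a hypothetical helper. It only builds and prints the command; nothing is executed unless the `subprocess` line is uncommented on a machine with ZFS.

```python
import shlex
import subprocess  # only needed if the command is actually run


def readonly_import_cmd(pool, altroot="/mnt", mount=True):
    """Build a 'zpool import' command that imports the pool read-only."""
    cmd = ["zpool", "import", "-o", "readonly=on", "-R", altroot]
    if not mount:
        cmd.append("-N")  # import the pool without mounting its datasets
    cmd.append(pool)
    return cmd


cmd = readonly_import_cmd("safe")
print(shlex.join(cmd))
# subprocess.run(cmd, check=True)  # uncomment to run; requires root and ZFS
```

With the pool imported read-only, nothing is written to it, so the data can be copied off before the pool is destroyed and re-created.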

I don’t know why it kernel panics when it tries to import the pool, but I doubt there’s much point in trying to recover the pool itself. If I were to guess what caused it, it would be some form of hardware issue.

Test your system for stability (at the very least memtest86) and then consider if you are willing to risk your data again with that system.

You can try posting a ticket before doing the above. Without a support contract, the outcome depends on whether they see a reason to investigate the trigger/cause, but that call is up to the bug clerk.

I don’t think it’s a hardware issue: the system is stable and memtest hasn’t shown any issues. The ZFS pool is mirrored, so even if one disk has an issue the other one should work.

I think it’s a ZFS issue (the kernel shouldn’t panic while importing a pool…), and it seems I’m not the only one: see “Importing corrupted pool causes PANIC: zfs: adding existent segment to range tree” (Issue #13483, openzfs/zfs on GitHub) and “PANIC at range_tree.c:435:range_tree_remove_impl() requiring hard reset” (Issue #15594, openzfs/zfs on GitHub). I will try to open a support ticket.