Kernel Panic with Strange Error

That’s a good idea! I don’t have any extra SSDs, but I have a ton of spinning drives lying around, so I should be able to cobble something together. I’ll update tomorrow. Thanks @winnielinnie and @NickF1227 !

By all means report a bug.

I don’t think our QA testing includes “indirect vdevs” and I assume they are quite uncommon.

I am unable to, since I’m on Core.

@pncv87 can make a bug report, but there’s no way to attach a debug log, since the system hits a kernel panic upon bootup when it supposedly attempts to import the storage pool.

This could indeed be a combination of “Linux kernel < 6.8” + “OpenZFS” + “indirect vdev” in a pool.

Unfortunately, waiting until April (for Fangtooth’s release) just to be able to keep using your pool is not practical.


Uncommon or not, they are fully and officially supported in TrueNAS via the GUI.

Anyone with a pool containing only mirror vdevs can remove a mirror using the GUI. (As I have myself. As did @pncv87. As did the other user in the linked OpenZFS bug ticket who runs SCALE.)
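For anyone who prefers the command line, the GUI action maps onto the standard zpool commands. This is just a sketch with illustrative pool/vdev names (tank, mirror-1), not output from this thread:

zpool status tank            # identify the top-level vdev to remove, e.g. mirror-1
zpool remove tank mirror-1   # evacuates its data; leaves an “indirect” vdev behind
zpool wait -t remove tank    # optionally block until the evacuation completes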


Bug ticket created. This is my first bug ticket ever so hopefully I filled it out correctly.

[NAS-133555] Kernel Panic “PANIC at vdev_indirect_mapping.c:528” - iXsystems TrueNAS Jira


You’d better hurry up and reply to the bug report explaining that you are unable to attach a debug file; otherwise they’ll close it.


To be clear, 24.10.2 is now going through its QA cycle and won’t be changed.

Fangtooth BETA is the earliest vehicle for fixing anything… hopefully it’s automatic.

BETAs are pretty solid for existing functionality…
we have standard regression tests in QA
we have to build new tests for new functions… that lags.

To recover a pool, it’s likely to be the best option… even a nightly can be used.

Unless we can work out when the bug was introduced and go backward…

Hi @Captain_Morgan, I’m assuming this means that if a fix were identified, it wouldn’t be introduced in 24.10.2?

And that the earliest possible release to expect a fix in, if it’s not already fixed, is Fangtooth?

I guess I’m wondering what would happen if this bug were experienced in a real enterprise running 24.10.1? I understand that we are using the CE version of the software, but a bug that takes down an entire system seems like something that should be fixed sooner rather than waiting for the next major release or a beta version.

In a real Enterprise or with a common Community problem, we consider a hot fix if it’s possible. However, if it’s a major OS upgrade, it has to wait for a release cycle.

We haven’t found an issue like that which we couldn’t resolve with some work… sometimes, we have to go back to previous versions.


Hi @Captain_Morgan, just finished attempting to import my pool to a fresh install of Fangtooth MASTER-20250114-005710 and it crashed with the same error. Seems like Fangtooth may have the same issue.

EDIT: I updated the bug ticket as well, and uploaded the debug and kern.log files.


Since @Captain_Morgan has already given an official answer, I’m not going to speak to the direct question here. But I did want to say, and I want to be clear that I am not discounting your issue in any way…

Putting on the sysadmin hat:
In an enterprise setting I would not expect this exact situation to occur. A shop with a proper support contract and an SLA wouldn’t have removed a vdev from their pool; they would have contacted support and had a replacement drive shipped to them by iXsystems or whoever their hardware vendor of choice is. All of this is to say: this feature is likely not used very often outside of homelab-type use cases… and only rarely even there.

That is unfortunate, and strange that you did not encounter the issue in Ubuntu.
When you had it imported in Ubuntu 24.04.1, what command exactly did you use? I’m just trying to determine whether it was actually mounted or merely imported. If the problem is what it seems from your other testing (plus the OpenZFS bug reports linked in your ticket), I wouldn’t have expected it to work on any of the OSes.

Removing a mirror vdev from a pool isn’t only about dealing with failing drives. I removed a mirror vdev to simplify my pool: fewer vdevs and larger capacity drives. I had no need for “replacement drives”, since I was not replacing anything. I was simply removing a mirror that was no longer necessary (nor desired) to have in the pool.

  • Original pool
    - 2-way mirror, 4TiB + 4TiB

  • Later became
    - 2-way mirror, 4TiB + 4TiB
    - 2-way mirror, 8TiB + 8TiB

  • Transition
    - 2-way mirror, 4TiB + 4TiB <— vdev to be removed
    - 2-way mirror, 8TiB + 8TiB <— vdev to be removed
    - 2-way mirror, 18TiB + 18TiB

  • Current pool
    - 3-way mirror, 18TiB + 18TiB + 18TiB <— added extra drive for 3-way mirror

The other drives were recommissioned for other purposes.

It’s a fully supported ZFS feature, and TrueNAS (since FreeNAS 11.3, I believe) supports it in the GUI.
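For illustration, the whole transition above could be done from the shell roughly as follows; the pool name and by-id device paths are placeholders of mine, not my actual setup:

zpool add tank mirror /dev/disk/by-id/ata-18T-A /dev/disk/by-id/ata-18T-B   # add the 18TiB mirror
zpool remove tank mirror-0                                # evacuate and drop the 4TiB mirror
zpool remove tank mirror-1                                # evacuate and drop the 8TiB mirror
zpool attach tank ata-18T-A /dev/disk/by-id/ata-18T-C     # grow the 2-way mirror to 3-way

Each zpool remove is what creates the indirect vdev that this thread’s panic trips over.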


@pncv87 imported it with:
zpool import -d /dev/disk/by-partuuid init-pool

He was able to access his files, and even unlock an encrypted dataset within the pool.

The pool status was healthy, and all vdevs were healthy. The indirect vdev was also present, as seen by zpool status.
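For reference, a removed vdev lingers in zpool status as an indirect-N entry; on a healthy pool the output looks roughly like this (layout illustrative, not his actual listing):

  pool: init-pool
 state: ONLINE
config:
        NAME                STATE     READ WRITE CKSUM
        init-pool           ONLINE       0     0     0
          indirect-0        ONLINE       0     0     0
          mirror-1          ONLINE       0     0     0
            sda2            ONLINE       0     0     0
            sdb2            ONLINE       0     0     0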

Just to further clarify my position again, I’m not saying it shouldn’t work. I’m saying that it’s probably a relatively unused and less-tested feature set across the ZFS community at large.

This is a totally valid, logical upgrade path for a homelab. Not, however, for business use in a production environment.

Speaking from experience here: I was in charge of IT infrastructure for a large public school system for a long time. In a government/corporate situation you have 5-year budget cycles factored into your operating expenses and/or capital investments to purchase new systems. When your service contracts expire (upgrade time!), you either:

* Buy a 3rd-party support contract
* Buy a new system and/or disk shelf/drives
* Write a strongly worded risk statement to the powers that be and roll the dice
* Redeploy for a less sensitive use case, wiping all of the existing configuration and data

Adding additional risk to an older system in that third option would be silly. Wholly reconfiguring the storage subsystem would be a big risk and really poor form IMO, regardless of the filesystem.

Hmmm… That’s such an odd, conflicting piece of information that it really muddies the waters here.


Ok… that probably eliminates the Linux Kernel version as the cause.

We need to find the root cause before we can think about how to fix it.
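One possible stopgap while that happens (my suggestion, not something tested in this thread): the backtraces die in the indirect-mapping condense thread, which should only run on a writable pool, so a read-only import from a rescue environment may let the data be copied off:

zpool import -o readonly=on -d /dev/disk/by-partuuid init-pool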


I am in the same boat

In my case it happened on a 1-month-old installation. I got some old 2TB drives and set them up in a two-disk mirror configuration. Not long after adding a bunch of data they started failing, so I added a new vdev with two 4TB drives and removed the previous one.

It survived a bunch of reboots and data transfers, and then today, out of the blue, the server went down and I ended up with the same boot loop… A bit concerning and worrying.

Thanks for taking the time to file such a complete bug report!

I see that the problem is still not solved; looking forward to some updates, as I would like to continue using my server and get my data back.


I think that I have a relevant piece of info.

Everything worked as expected until I ran a scrub task; around the moment it finished, I was no longer able to boot.


What is the hardware and version of SCALE?

Are you seeing the same error message related to an indirect vdev, as the OP experiences?

Linux kernel → 6.6.44
TrueNAS → 24.10.2 (I think? I checked for updates ~3 days ago and it was good)

Motherboard → Asus P5E-VM HDMI ACPI
CPU → Intel Core 2 Quad Q9550, with 8 GB of DDR2-800 RAM

Here is the call trace:


[ OK ] Finished ix-netif.service - Setup TrueNAS Network.
[ <*] Job ix-zfs.service/start running (1min 55s / 16min 16s)
[ 142.144540] VERIFY3(counts[index] + inner_size <= size) failed (4321280 <= 4263936)
[ 142.144600] PANIC at vdev_indirect_mapping.c:528:vdev_indirect_mapping_increment_obsolete_count()
[ 142.144648] Kernel panic - not syncing: VERIFY3(counts[index] + inner_size <= size) failed (4321280 <= 4263936)
[ 142.144701] CPU: 0 PID: 3534 Comm: z_indirect_cond Tainted: P D E 6.6.44-production+truenas #1
[ 142.144754] Hardware name: System manufacturer P5E-VM HDMI/P5E-VM HDMI, BIOS 0604 07/16/2008
[ 142.144808] Call Trace:
[ 142.144819] <TASK>
[ 142.144836] dump_stack_lvl+0x47/0x60
[ 142.144866] panic+0x339/0x350
[ 142.144891] spl_panic+0xfb/0x120 [spl]
[ 142.144950] ? dnode_rele_and_unlock+0x55/0xe0 [zfs]
[ 142.145526] ? vdev_indirect_mapping_entry_for_offset_impl+0x5c/0xc0 [zfs]
[ 142.146031] vdev_indirect_mapping_increment_obsolete_count+0xd6/0x110 [zfs]
[ 142.146537] load_obsolete_sm_callback+0x20/0x30 [zfs]
[ 142.147033] space_map_iterate+0x19b/0x410 [zfs]
[ 142.147534] ? __pfx_load_obsolete_sm_callback+0x10/0x10 [zfs]
[ 142.148034] vdev_indirect_mapping_load_obsolete_spacemap+0x47/0x90 [zfs]
[ 142.148538] spa_condense_indirect_thread+0xcc/0x1f0 [zfs]
[ 142.149037] ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
[ 142.149097] zthr_procedure+0x12a/0x140 [zfs]
[ 142.151426] ? __pfx_zthr_procedure+0x10/0x10 [zfs]
[ 142.155908] thread_generic_wrapper+0x5e/0x70 [spl]
[ 142.157851] kthread+0xe8/0x120
[ 142.157851] ? __pfx_kthread+0x10/0x10
[ 142.159697] ret_from_fork+0x34/0x50
[ 142.161604] ? __pfx_kthread+0x10/0x10
[ 142.163494] ret_from_fork_asm+0x1b/0x30
[ 142.165356] </TASK>
[ 142.167188] Kernel Offset: 0x30200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 142.169016] Rebooting in 10 seconds...
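For anyone chasing the root cause: zdb runs in userland, so it can walk the exported pool’s on-disk state (including the indirect vdev’s mapping that the VERIFY3 above is tripping over) without risking the in-kernel panic. A starting point might be something like the following; the flags are real, but its usefulness here is my assumption:

zdb -e -p /dev/disk/by-partuuid init-pool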

It’s looking like the same issue as experienced by others.

The “indirect vdev” is the common theme. Which, by the way, is a fully supported ZFS feature that does not require going under the hood of TrueNAS: removing a mirror vdev from a pool can be done entirely in the GUI.