I did not use anything but the GUI… That’s why I was quite surprised when the whole thing just stopped working.
Yeah, that’s the same for me. Since it doesn’t seem like there is any chance of a fix coming in Fangtooth, I ended up using Ubuntu to mount my pool and migrate my data to another NAS.
I highly recommend that anyone who has ever removed a vdev disable block-cloning.
For SCALE, you can use this Pre-Init command to disable it system-wide:
echo 0 > /sys/module/zfs/parameters/zfs_bclone_enabled
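To confirm the setting stuck after a reboot, you can read the parameter back from a shell; it should print 0 (just a sanity check, not part of the Pre-Init step itself):
cat /sys/module/zfs/parameters/zfs_bclone_enabled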
For Core, add this as a SYSCTL tunable:
- Variable: vfs.zfs.bclone_enabled
- Value: 0
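If you want it to take effect immediately rather than only after a reboot, the equivalent one-off command from a root shell on Core should be something like this (a sketch; I’m assuming the tunable is writable at runtime):
sysctl vfs.zfs.bclone_enabled=0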
Here is an update to the bug report, where a patch has been made for upstream OpenZFS.
This patch does not address pools that have already been affected. It mitigates a kernel panic for pools that use block-cloning with indirect vdevs and have not yet tripped this bug.
Thanks for keeping us updated.
But I don’t quite understand how disabling block cloning after removing a vdev would help. For now, disabling block cloning at pool creation should prevent the issue from arising, if one expects to remove vdevs at some point. But once blocks have been cloned, how is that going to help with (past or future) vdev removal?
My understanding is that future vdev removal requires that the patch be applied before a vdev is removed. Block-cloned pools in which a vdev has already been removed potentially have metadata corruption, and the only possible fix appears to be to back up, destroy, and restore as soon as possible.
@mav can you comment?
No blocks have been cloned for me.
Just because the feature is “enabled” does not mean that it has been used.
Here is my pool where I had removed mirror vdevs. You’ll notice it does not say “active”, but rather “enabled”:
NAME PROPERTY VALUE SOURCE
main-pool feature@block_cloning enabled local
The pool has block-cloning enabled, but luckily, I have never once “used” the feature.
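For reference, output like the above comes from querying the pool’s feature flag, along these lines (substitute your own pool name):
zpool get feature@block_cloning main-pool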
Here is a pool where I already cloned blocks by simply using the cp command. You’ll notice it does not say “enabled”, but rather “active”:
NAME PROPERTY VALUE SOURCE
SSD-pool feature@block_cloning active local
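For context, on SCALE an ordinary copy within the pool can flip the feature to “active”, because recent coreutils cp uses copy_file_range(), which OpenZFS 2.2 services with block cloning while zfs_bclone_enabled is 1. The paths here are purely illustrative:
cp /mnt/SSD-pool/data/file.bin /mnt/SSD-pool/data/file-copy.bin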
This SSD pool does not have an indirect vdev.
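If you’re not sure whether a pool still carries an indirect vdev from a past removal, zpool status lists them as indirect-N entries; a quick check would be something like:
zpool status main-pool | grep indirect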
You cannot “disable” feature@block_cloning once it has already been enabled, but you can prevent the feature from being used by disabling it system-wide. This assures me that I won’t accidentally use block-cloning for my main pool, which has an “indirect” vdev, as I had removed a couple of mirrors in the past.
@etorix In part you are right. A pool that never had anything cloned at or after the time of device removal is not affected. The problem is in “after”. As I have repeated many times in my criticism of device removal, one of its design flaws is that once started it might never actually finish, continuing to remap block pointers forever as long as something modifies indirect blocks or dnodes. And it is the remapping process that is incompatible with block cloning. My incoming patch is going to disable remapping for cloned blocks, the same as is already done for deduped and ganged ones for similar reasons. So if, as in @winnielinnie’s case, nothing was ever cloned, then disabling it will keep him safe (though updating would do it too). If there are or were some cloned blocks since device removal, then the pool might already be toast, and disabling further clones may only slightly reduce the issue.
Many thanks for the explanation!
As I am not familiar (euphemism…) with the 6000+ line code beast, I relied on the title of the commit and assumed that it modified accounting only during the removal itself (“on removal”), while it can actually apply at any time after removal.
@pncv87 Can you mark @winnielinnie 's recommendation as the “Solution” so that it is more prominently seen by readers?
I asked in the bug report, but I’ll ask it here too so that others can read it: Will this patch be eligible for inclusion in an update to Core 13.3?
Hi @etorix, I’m not sure marking this as a solution is the right move. The fix provided does not seem to align with the issues that I saw. Specifically, I was able to mount and use the broken pool on Ubuntu as I stated previously, which would indicate that Ubuntu and its implementation of ZFS can recognize removed devices and address the RAM allocation correctly. I’m not sure why disabling block cloning would be an acceptable answer or solution for this, if it doesn’t provide a way to recover an affected pool in TrueNAS, which is possible to do in Ubuntu.
I think the solution is multiple steps.
- Disable block-cloning to prevent more cloned blocks that might be deleted later. Hopefully you do not delete any existing cloned blocks that might trigger the bug before the patch is applied (see the check below for how to tell whether anything has been cloned).
- Wait for the patch to land in an update.
- Re-enable block-cloning if you feel comfortable.
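To check whether a pool has any cloned blocks at all, the pool properties introduced alongside block cloning can be queried (assuming OpenZFS 2.2+; a pool that has never cloned anything should report zero for bcloneused):
zpool get bcloneused,bclonesaved,bcloneratio main-pool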
If you already tripped this bug, try to import your pool by other means and migrate your data to a new pool. Or wait for the more intricate fix (somewhere in the code related to shrinking an indirect vdev?) to be addressed and applied.
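For the “import by other means” route, a read-only import from a recent live environment (such as the Ubuntu system mentioned above) is the usual approach; roughly, with the pool name as a placeholder:
zpool import -o readonly=on -R /mnt main-pool
Importing read-only avoids writing anything further to a possibly damaged pool while you copy the data off.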
@pncv87 I believe the panic you provided is a consequence of the actual problem. I don’t know why it does not happen for you on Ubuntu. Maybe it just hasn’t tried to condense indirect mappings yet to spot the corruption. But once the system has shown symptoms, it is indeed too late to disable cloning.