Upgraded to TrueNAS Community v25.04.1 and still have nasty watchdog crashes :(

Hello :slight_smile: I have a new install of v25.04.0 that has been upgraded to v25.04.1.

I’ve attempted to import my pools into it, but I still get the nasty watchdog bug showing its nose (like on v24.10.x) :frowning:
I’ve taken a photo of the screen report, as I can’t get the full text version on this machine:

How can I get rid of these crashes that stall the system?

So, when did the problem first emerge?

I assume that it’s your “my NAS” system… was it ever fully functional and stable?

Hello :slight_smile:
Yes, it was, until one drive began to fail (it is currently being replaced by a hot spare disk and resilvering, but this bug slows things down). That was on v24.10.x; I hoped the v25 upgrades would cure this, but no :frowning:
What logs do you need to address this bug?

It’s usually one of three things:

Configuration error
Hardware issue/defect
Bug

This update seems to indicate a hardware failure…

The more info on the earlier incident, the more likely someone can guess what it is. I’d roll back to 24.10 and diagnose from there.

Hardware failure?
BTW, I have tested the drive that’s going ‘dead’ in the pool and performed a ddrescue from it to another drive (took about 15 days!), and you know what? ddrescue didn’t find ANY read errors on it, from block 0 to the last one!
But that same disk, connected to the pool, invariably ends up spewing 120k read errors and power-on resets (yet none at all during the 15 days of being read by ddrescue…). It’s after that that the watchdog appears and stalls everything.
I’m kind of puzzled there. I’ve replaced its SATA cable with no change, but haven’t swapped the HBA connector it goes through… (damn, I don’t have another HBA expander of the same model to test with.)
Will try that too :slight_smile:

BTW 2: you said revert to 24.10, OK, but I’ve applied the ZFS pool upgrades of v25; isn’t that dangerous?

Yes, in that case don’t roll back.
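For context, the concern here is ZFS feature flags: once a v25 feature flag becomes *active* on the pool, an older OpenZFS release may refuse to import it. A sketch of how to check, assuming the pool (here `VoxelZ2`, from the output later in this thread) can be imported:

```shell
# List the pool's feature flags. Flags that are merely "enabled" (never used)
# keep the on-disk format compatible with older software; flags that are
# "active" do not, unless the feature is read-only compatible (in which case
# only a read-only import works on older releases).
zpool get all VoxelZ2 | grep 'feature@'

# Compare against the feature flags the running OpenZFS release supports:
zpool upgrade -v
```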

As a general rule, try to handle one issue at a time.

If there’s a hardware issue and then there is a software update, two things are true:

  1. It’s never been tested.
  2. The number of people who understand both, and the potential interactions, is almost zero.

Well, so :slight_smile:

  • I’ve performed a total revamp of the disk cabling so it no longer uses the HBA expander port that was seemingly causing the problem; the cabling is now as follows:
  • 1st port of the HBA goes to the 1st port of the expander card ;
  • 2nd port of the HBA is used by the four 1st disks of the zpool (ID 0-3) ;
  • 2nd port of the expander is not used ;
  • 3rd port of the expander is used by the next four disks of the zpool (ID 4-7) ;
  • 4th port of the expander is used by the next four disks of the zpool (ID 8-11) ;
  • 5th port of the expander is used by the next four disks of the zpool (ID 12-15) ;
  • 6th port of the expander is used by the last four disks of the zpool (ID 16-19) ;
  • ID19 is used by the cache ssd.
    Now booting shows no errors, but the zpool isn’t imported…

So I tried to import it by hand (see the pic), and after some time
I got a panic + reset from the debug kernel I booted on…

Well, what to do next?

No idea…

But the SPA_MaxBlocksize is enormously wrong.

see this for reference.

More info:

min@truenas[~]$ sudo zpool import
  pool: VoxelZ2
    id: 10054978536421108484
 state: DEGRADED
status: One or more devices were being resilvered.
action: The pool can be imported despite missing or damaged devices.  The
        fault tolerance of the pool may be compromised if imported.
config:

        VoxelZ2                                     DEGRADED
          raidz2-0                                  DEGRADED
            aab036a6-7e2b-4632-b16d-cd7866845e5e    ONLINE
            7a8f8da1-836b-4a4c-9eaa-2aa29da2beb7    ONLINE
            a10d9ac9-ef2c-4ede-9e6b-eba01e901361    ONLINE
            9f2dce88-cde7-4923-b0bb-bf4de890ea52    ONLINE
            replacing-4                             DEGRADED
              b74d4b85-0436-4814-8515-b405d7e40e24  UNAVAIL
              181fbb56-496c-4540-a107-d35504e0d61b  ONLINE
            spare-5                                 ONLINE
              96d445ec-e8b2-42ce-8f1a-63ca963eead9  ONLINE
              67741fad-0683-4df2-ad52-c0c30405c419  ONLINE
            b4dbc538-a334-417c-9b6a-7265693bc710    ONLINE
            fd7f4ef7-237a-48c0-8a72-bebaf1cd3fb5    ONLINE
            83e62634-a0a5-4c2f-abb8-10b763833042    ONLINE
            ea3b911a-d478-4341-b4cd-4549385fcdde    ONLINE
            52c0cc19-1576-4186-b41f-8ee005e7ebaf    ONLINE
            ee9542ae-6266-4615-bcaa-6a0f06dafb87    ONLINE
            0b764a40-fd5d-4631-922d-625178717347    ONLINE
            e0df314c-cc1e-460e-9591-842264306d5b    ONLINE
            af112e1e-211c-4061-a0bf-4e7f943903ca    ONLINE
            c2c7b568-7eff-4f85-998b-eb4e1ac51897    ONLINE
            ebd9e063-ebb9-4bdb-9ac6-3302a748390a    ONLINE
        cache
          b7523e48-e07b-4fbb-94c5-c6de97a7d33c
        spares
          67741fad-0683-4df2-ad52-c0c30405c419

Explanation:

  • the ‘UNAVAIL’ disk is the one that died for good on v24.10.x (no spin, no life :wink:). It is no longer present in the NAS (returned to the seller for replacement) and is being replaced; when not crashing, the resilver was at 77% done.
  • the ‘spare-5’ disk replaces a degraded but not yet dead disk (just beginning to fail).

Now, as said, if I issue a ‘sudo zpool import’ (with -f or not) I get the kernel panic after some time (30 to 45 minutes) and a reboot of the TrueNAS :frowning:

What can I do?
Is it possible to ‘verify’ or ‘validate’ a not-yet-imported zpool?

Is there a way to ‘offline’ a disk that is no longer present (that would get rid of the ‘UNAVAIL’ entry)?
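Not an authoritative answer, but a couple of things that can be tried on a not-yet-imported pool (a sketch; the pool name and disk GUID are taken from the `zpool import` output above):

```shell
# Inspect the exported pool's on-disk configuration without importing it
# (-e: operate on an exported pool, -C: display the pool configuration).
sudo zdb -e -C VoxelZ2

# Try a read-only import: it skips the resilver and most writes, so it is
# less likely to trip whatever panics during a normal import.
sudo zpool import -o readonly=on VoxelZ2

# Once (if) the pool is imported, the vanished half of "replacing-4" can be
# dropped with zpool detach, using its GUID from zpool status:
# sudo zpool detach VoxelZ2 b74d4b85-0436-4814-8515-b405d7e40e24
```

A `zdb -e` examination is the closest thing to “validating” a pool before import; there is no full offline scrub of an exported pool.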

Hum? Is this the return of a really old bug from 2015, 10 years later, in the ZFS of v25.04.0 and up (2025)???

Not necessarily; it’s just that whatever happened to your pool triggers a similar error message.


Well, OK.
How can we find out what happened and correct it, so that the pool can be imported (even read-only)?

Since there is no detailed record of the events, I doubt we’ll work out what happened.

I would check whether the DRAM is ECC-protected. Non-ECC RAM is a known source of pool corruption.
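One way to check from the shell whether the installed DIMMs actually run with ECC (a sketch; the exact `dmidecode` wording varies by BIOS):

```shell
# "Error Correction Type" in the DMI memory array info shows whether
# ECC is supported and enabled (e.g. "Single-bit ECC" vs "None").
sudo dmidecode -t memory | grep -i 'error correction'

# On Linux, the EDAC kernel subsystem only registers memory controllers
# when ECC is actually in use; an empty result suggests no active ECC.
ls /sys/devices/system/edac/mc/ 2>/dev/null
```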

I’d suggest you start a new thread with a more useful headline (or change headline).

Kernel Panic on Pool Import - SPA_MaxBlocksize is enormous

And perhaps do a general google/AI search on that ZFS issue. I’ve never seen it before.

Done as the Captain said :slight_smile:
Not really a solution, but please do not reply to this thread anymore, thanks :slight_smile:
So, closing this one.