TrueNAS NFS random crash

I’ve been experiencing the same behaviour. The NFS server just starts freezing randomly, and all the clients hang indefinitely as well. The only way to recover is to reboot the machine.

I’ve tried several releases of Dragonfish, and none of them solved it. I also updated to Electric Eel last week and tried Stebu’s fix of limiting the amount of RAM assigned to the ARC. The system ran stable for about 8 days, and then it happened again.
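
For anyone searching later: Stebu’s exact command isn’t quoted in this thread, but the usual knob is the OpenZFS zfs_arc_max module parameter. A minimal sketch in Python of capping the ARC at 50% of RAM, run as root — and note the value resets on reboot unless reapplied, e.g. from a post-init script:

# Minimal sketch of capping ZFS ARC at 50% of RAM (run as root).
# zfs_arc_max is the standard OpenZFS module parameter; the exact
# command referenced in this thread isn't quoted, so this is
# illustrative only.

ARC_PARAM = "/sys/module/zfs/parameters/zfs_arc_max"

def mem_total_bytes() -> int:
    # /proc/meminfo reports MemTotal in kB
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) * 1024
    raise RuntimeError("MemTotal not found in /proc/meminfo")

limit = mem_total_bytes() // 2
with open(ARC_PARAM, "w") as f:
    f.write(str(limit))
print(f"zfs_arc_max set to {limit} bytes")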

If possible, I’ll try Cobia to see if that fixes it.

If possible, please try the 25.04 (Fangtooth) BETA. It’s running Linux 6.12 and includes possible fixes for the issues reported here.

Unfortunately, Fangtooth also has the same crash issue. I just had an NFS crash today and had to reboot TrueNAS.

If possible, could you post your TrueNAS OS version and the log message associated with the crash?

Edition: Community
Version: 25.04.0
------------[ cut here ]------------
cb_status=-521 tk_status=-10036
WARNING: CPU: 1 PID: 14893 at fs/nfsd/nfs4callback.c:1339 nfsd4_cb_done+0x4e5/0x550 [nfsd]
Modules linked in: tcp_diag(E) inet_diag(E) rpcsec_gss_krb5(E) xt_conntrack(E) nft_chain_nat(E) xt_MASQUERADE(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) bridge(E) stp(E) llc(E) xfrm_user(E) xfrm_algo(E) xt_addrtype(E) nft_compat(E) nf_tables(E) libcrc32c(E) crc32c_generic(E) nfnetlink(E) nvme_fabrics(E) nvme_core(E) overlay(E) vsock_loopback(E) vmw_vsock_virtio_transport_common(E) vmw_vsock_vmci_transport(E) vsock(E) binfmt_misc(E) ntb_netdev(E) ntb_transport(E) ntb_split(E) ntb(E) ioatdma(E) dca(E) ib_core(E) intel_rapl_msr(E) intel_rapl_common(E) crct10dif_pclmul(E) ghash_clmulni_intel(E) sha512_ssse3(E) sha256_ssse3(E) sha1_ssse3(E) aesni_intel(E) gf128mul(E) crypto_simd(E) cryptd(E) rapl(E) vmw_balloon(E) snd_pcm(E) snd_timer(E) snd(E) soundcore(E) pcspkr(E) vmwgfx(E) drm_ttm_helper(E) ttm(E) drm_kms_helper(E) ac(E) vmw_vmci(E) button(E) joydev(E) evdev(E) serio_raw(E) sg(E) loop(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) drm(E) efi_pstore(E) configfs(E) sunrpc(E)
 ip_tables(E) x_tables(E) autofs4(E) zfs(POE) spl(OE) efivarfs(E) hid_generic(E) usbhid(E) hid(E) ahci(E) ata_generic(E) ahciem(E) libahci(E) ata_piix(E) sd_mod(E) mpt3sas(E) raid_class(E) vmw_pvscsi(E) libata(E) scsi_transport_sas(E) uhci_hcd(E) ehci_pci(E) crc32_pclmul(E) ehci_hcd(E) crc32c_intel(E) psmouse(E) vmxnet3(E) usbcore(E) usb_common(E) scsi_mod(E) i2c_piix4(E) i2c_smbus(E) scsi_common(E)
CPU: 1 UID: 0 PID: 14893 Comm: kworker/u10:1 Tainted: P           OE      6.12.15-production+truenas #1
Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
Workqueue: rpciod rpc_async_schedule [sunrpc]
RIP: 0010:nfsd4_cb_done+0x4e5/0x550 [nfsd]
Code: 8b 33 45 89 fe e9 d1 fb ff ff 80 3d e7 da 01 00 00 0f 85 f3 fb ff ff 48 c7 c7 58 0b 28 c1 c6 05 d3 da 01 00 01 e8 8b 74 6b e4 <0f> 0b 8b 73 54 e9 d6 fb ff ff 0f 1f 44 00 00 e9 59 fd ff ff 41 89
RSP: 0018:ffffbbd0c1babdc0 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff934ff65eb288 RCX: 0000000000000027
RDX: ffff9352efd21788 RSI: 0000000000000001 RDI: ffff9352efd21780
RBP: ffff934fd9ecc548 R08: 0000000000000000 R09: 0000000000000003
R10: ffffbbd0c1babc50 R11: ffffffffa74cc0a8 R12: ffff934fd9ecc548
R13: ffff934fe9fb0500 R14: 0000000000000001 R15: ffff9352d1383400
FS:  0000000000000000(0000) GS:ffff9352efd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000c0003e7000 CR3: 0000000118616002 CR4: 00000000000606f0
Call Trace:
 <TASK>
 ? __warn+0x89/0x130
 ? nfsd4_cb_done+0x4e5/0x550 [nfsd]
 ? report_bug+0x164/0x190
 ? handle_bug+0x58/0x90
 ? exc_invalid_op+0x17/0x70
 ? asm_exc_invalid_op+0x1a/0x20
 ? nfsd4_cb_done+0x4e5/0x550 [nfsd]
 ? __pfx_rpc_exit_task+0x10/0x10 [sunrpc]
 rpc_exit_task+0x5f/0x180 [sunrpc]
 __rpc_execute+0xb5/0x490 [sunrpc]
 rpc_async_schedule+0x2f/0x40 [sunrpc]
 process_one_work+0x180/0x3a0
 worker_thread+0x2da/0x420
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xcf/0x100
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x31/0x50
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
---[ end trace 0000000000000000 ]---

Using NFSv4 with Kerberos authentication.

This appears to be a kernel warning about a bad file handle, not a crash.

This same issue is being actively worked on upstream: 219737 – warning in nfsd4_cb_done. We can expect some relief from this in future releases.
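
For anyone decoding the numbers in that WARNING line, here is my reading of them as a small sketch; EBADHANDLE is from the kernel’s include/linux/errno.h, while treating tk_status as a negated NFSv4 status is an assumption on my part:

# Quick decoder for the two statuses in the nfsd4_cb_done WARNING.
# EBADHANDLE = 521 is from include/linux/errno.h ("Illegal NFS file
# handle"); reading tk_status as a negated NFSv4 status
# (NFS4ERR_BADXDR = 10036 per RFC 7530) is my assumption.

KERNEL_ERRNO = {521: "EBADHANDLE (illegal NFS file handle)"}
NFS4_STATUS = {10036: "NFS4ERR_BADXDR (assumed interpretation)"}

def decode(cb_status: int, tk_status: int) -> None:
    print("cb_status:", KERNEL_ERRNO.get(-cb_status, "unknown errno"))
    print("tk_status:", NFS4_STATUS.get(-tk_status, "unknown status"))

decode(cb_status=-521, tk_status=-10036)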

Did the system stop functioning in some manner?

I encountered this warning again, but TrueNAS itself appears to be functioning normally. However, when the warning shows up, one of the NFS clients gets stuck in a “Courtesy” state. Restarting the NFS service (turning it off and back on) seems to resolve the issue temporarily.
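
In case anyone wants to script that workaround, a rough sketch follows; it assumes SCALE’s Debian base exposes the standard nfs-server systemd unit, and the supported route is still the service toggle in the UI:

# Sketch of the "turn NFS off and back on" workaround. Assumes the
# standard nfs-server systemd unit on TrueNAS SCALE's Debian base;
# this is illustrative, not the official method.
import subprocess

subprocess.run(["systemctl", "restart", "nfs-server"], check=True)
print("nfs-server restarted")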

A fix for this has been added to the upstream LTS kernel. It will be available in TrueNAS with the Goldeye (25.10) release.

A narrow fix for this has been backported to the next update release for Fangtooth (25.04.1).

I’ve seen this crash a couple of times in the last few days on my newly built 25.04.1 NAS. Should this backport have fixed it?

The fix included in Fangtooth was narrow, so possibly not. The commit included in 25.04.1 is cedfbb92cf97a6bff3d25633001d9c44442ee854 (“NFSD: fix hang in nfsd4_shutdown_callback”). Other fixes related to the original bug report will be in the Goldeye release (25.10). The original issue remains open, so there is the potential for more updates.

I am also running into the same issue with the “task nfsd:3639 blocked for more than xxx seconds” error message, and clients getting stuck accessing an NFSv4 share.

I was previously running TrueNAS SCALE 24.04, and after reading this thread I updated to 25.04.1. This is running on a Ugreen DXP8800 Plus with an i5-1235U and 96 GB of RAM.

I also ran the command suggested by Stebu to limit the ARC to 50% of my RAM, but that didn’t help either.

If I’m understanding this correctly, there seem to be two different errors mixed together in this thread: the “task hung for more than X seconds” one, and the one that produces a full kernel warning trace.
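
For what it’s worth, the first of those comes from the kernel’s hung-task watchdog, whose threshold is the standard hung_task_timeout_secs sysctl; a quick way to check it:

# The "blocked for more than N seconds" messages come from the
# kernel's hung-task watchdog; this standard sysctl holds the
# threshold N (0 means the check is disabled).
with open("/proc/sys/kernel/hung_task_timeout_secs") as f:
    print("hung_task_timeout_secs =", f.read().strip())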

Are both of these going to be fixed in 25.10? Given that the error still occurs in 25.04.1, can more of the fixes be backported to that branch, or do we wait until 25.10 to see if it resolves the issue?