TrueNAS NFS random crash

Hi,

I have the same issue with NFS (True NAS Dragonfish-24.04.2).

Client mount options (Ubuntu 24.04):

nfs fsc,rw,hard,timeo=30,tcp,rsize=32768,wsize=32768,noatime,nodiratime,auto
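For reference, a complete /etc/fstab entry using these options would look something like the line below; the server address and paths are placeholders, and the fsc option only has an effect if cachefilesd is running on the client:

# Hypothetical fstab entry - replace server, export, and mountpoint with your own
nas.example.lan:/mnt/tank/share  /mnt/nfs/share  nfs  fsc,rw,hard,timeo=30,tcp,rsize=32768,wsize=32768,noatime,nodiratime,auto  0  0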

dmesg output:

[136662.821391] INFO: task nfsd:3639 blocked for more than 1208 seconds.
[136662.821927]       Tainted: P          IOE      6.6.32-production+truenas #1
[136662.822377] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[136662.822837] task:nfsd            state:D stack:0     pid:3639  ppid:2      flags:0x00004000
[136662.823388] Call Trace:
[136662.823891]  <TASK>
[136662.824493]  __schedule+0x349/0x950
[136662.825031]  schedule+0x5b/0xa0
[136662.825656]  schedule_timeout+0x151/0x160
[136662.826367]  wait_for_completion+0x86/0x170
[136662.826990]  __flush_workqueue+0x144/0x440
[136662.827511]  ? __queue_work+0x1bd/0x410
[136662.827976]  nfsd4_destroy_session+0x1ce/0x2b0 [nfsd]
[136662.828575]  nfsd4_proc_compound+0x359/0x680 [nfsd]
[136662.829101]  nfsd_dispatch+0xf1/0x200 [nfsd]
[136662.829636]  ? __pfx_nfsd+0x10/0x10 [nfsd]
[136662.830174]  svc_process_common+0x2f8/0x6f0 [sunrpc]
[136662.830722]  ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
[136662.831234]  ? __pfx_nfsd+0x10/0x10 [nfsd]
[136662.831739]  svc_process+0x131/0x180 [sunrpc]
[136662.832319]  nfsd+0x84/0xd0 [nfsd]
[136662.832877]  kthread+0xe8/0x120
[136662.833343]  ? __pfx_kthread+0x10/0x10
[136662.833987]  ret_from_fork+0x34/0x50
[136662.834470]  ? __pfx_kthread+0x10/0x10
[136662.835061]  ret_from_fork_asm+0x1b/0x30
[136662.835502]  </TASK>
[136662.835926] Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings
[137718.589105] perf: interrupt took too long (3929 > 3923), lowering kernel.perf_event_max_sample_rate to 50750

After this error, NFS is unusable, and the client can't connect after a TrueNAS reboot.

Switched to TrueNAS Core 13.3, so far so good.
Uptime: 18 days, 2:31 as of 16:17
This is the first time uptime has gone longer than a week without NFS dying.

Anyway, until the upstream bug gets solved, avoid TrueNAS Scale; the NFS feature is not usable.
Use TrueNAS Core instead, it works well.


Thanks for posting back with this valuable feedback.

To get the list of ā€˜attachedā€™ NFS clients in SCALE:

midclt call nfs.get_nfs3_clients
midclt call nfs.get_nfs4_clients
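On recent kernels (a sketch; it assumes the /proc/fs/nfsd interface is available, as it should be on the 6.6 kernel SCALE ships) you can also inspect NFSv4 client state directly from the shell:

# Per-client NFSv4 state exposed by nfsd; one directory per attached client
cat /proc/fs/nfsd/clients/*/info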

If not done already and if possible, please open a ticket on this.
Please provide a debug (ixdiagnostic).
Also, if possible, please provide reproduction steps.

Thanks, but I have switched to TrueNAS CORE 13.3 and it works great now, so it may be difficult to get a debug log unless I switch back to TrueNAS SCALE.

I can only reply from memory now.

First, this hang only occurs on NFSv4; NFSv3 works. But NFSv3 causes random bus errors on file read/write, so I have to use NFSv4.
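For anyone trying to reproduce or compare the two behaviours, the protocol version can be pinned explicitly with the standard vers= mount option (the server, export, and mountpoint below are placeholders):

# Force NFSv4.2 or NFSv3 respectively - adjust server/export/mountpoint to your setup
mount -t nfs -o vers=4.2,hard,tcp nas.example.lan:/mnt/tank/share /mnt/nfs/share
mount -t nfs -o vers=3,hard,tcp nas.example.lan:/mnt/tank/share /mnt/nfs/share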

Second, as I recall, I could see the clients in the client list in the GUI; there were three types of client:

  1. Working clients
  2. Dead clients
  3. Connecting clients

At a random point, the NFS server just dies (nfsd enters the D state and logs the trace in dmesg). Then:
Existing clients may keep working, or die at any time. After a client dies, its "last handshake" no longer updates in the server GUI.
No new connections are possible. If I connect to the server from a new client, the client just hangs; I can see it in the server GUI, but it is stuck in a state other than established.
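A quick way to confirm this state on the server (just a sketch, assuming shell access on the TrueNAS box) is to check whether the nfsd threads are stuck in uninterruptible sleep:

# 'D' in the STAT column means uninterruptible sleep, i.e. the hung state described above
ps -eo pid,stat,comm | grep '[n]fsd'
# The corresponding hung-task trace shows up in the kernel log
dmesg | grep -A5 'task:nfsd'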

By the way, it seems this problem is not limited to TrueNAS; I have seen similar bug reports for Debian and other Linux-based operating systems.

Just FYI: I am facing the same issue on TrueNAS Scale 24.04.2.2 with one Ubuntu 24.04.1 client.

loads of these messages…

messages:Oct 29 03:21:32 truenas1 kernel: task:nfsd            state:D stack:0     pid:4358  ppid:2      flags:0x00004000
messages:Oct 29 03:21:32 truenas1 kernel:  nfsd4_destroy_session+0x1ce/0x2b0 [nfsd]
messages:Oct 29 03:21:32 truenas1 kernel:  nfsd4_proc_compound+0x359/0x680 [nfsd]
messages:Oct 29 03:21:32 truenas1 kernel:  nfsd_dispatch+0xf1/0x200 [nfsd]
messages:Oct 29 03:21:32 truenas1 kernel:  ? __pfx_nfsd+0x10/0x10 [nfsd]
messages:Oct 29 03:21:32 truenas1 kernel:  ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
messages:Oct 29 03:21:32 truenas1 kernel:  ? __pfx_nfsd+0x10/0x10 [nfsd]
messages:Oct 29 03:21:32 truenas1 kernel:  nfsd+0x84/0xd0 [nfsd]

It's very annoying, especially since I have to reboot the whole server (which takes forever, as killing the NFS server does not work and the systemd timeout just keeps growing once it is reached).
I mostly end up hard-resetting the server after waiting for about 5 minutes, which I really don't like doing.
I currently only need this for my mediastack VM running Docker. I am going to migrate these containers to EE natively as soon as I update, so I hope I won't see this issue again.

I encountered the same issue as others in this thread when running TrueNAS-SCALE-24.04.2 with NFS shares. After some time, both the NFS clients and the NFS server would hang, and I noticed the nfsd hung_task_timeout_secs message in kernel.log/syslog/dmesg.

With TrueNAS, ARC is configured to use all available memory by default. You can check the current values by running sudo arc_summary in the shell. I adjusted the zfs_arc_max value to 75-85% of my total available memory, and since making this change I haven't experienced the issue again; it's been 33 days without any crashes.

To set the zfs_arc_max, go to System Settings => Advanced => Init/Shutdown Scripts => Add.

Type: Command
Command: echo 51539607552 > /sys/module/zfs/parameters/zfs_arc_max
When: Post Init
Enabled: Check mark

1GB = 1024 x 1024 x 1024 = 1073741824
48GB = 48 x 1GB = 51539607552
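If you would rather derive the value from the installed RAM than hard-code it, a minimal sketch of a Post Init command (assuming a target of roughly 80% of total memory) could look like this:

# MemTotal is reported in kB; convert to bytes and take ~80%
ARC_MAX=$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) * 1024 * 80 / 100 ))
echo "$ARC_MAX" > /sys/module/zfs/parameters/zfs_arc_max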


Stebu, have you submitted a bug ticket to iX Systems on this?

No, I haven't looked into submitting a bug ticket for this. To do so, I'd need to create a new Jira account, and honestly, I've already gone through the process of setting up a forum account just to share what worked for me, hoping it might help others.

I tried, but it seems I don't have permission to create an issue on the Jira.

Create the issue as a bug rather than a defect.

Please upgrade to 24.10. There have been improvements to ARC configuration and management. It might positively affect the issues reported here.

I just wanted to chime in to say that I am experiencing the same issue (AMD EPYC 7453 processor on a Gigabyte MZ32-AR0 Rev. 3 motherboard). I encountered it on both 24.04 and 24.10. Reverting back to Cobia (23.10.2) with an otherwise identical configuration resolves the issue. (Unfortunately, because I reverted I won't have any further useful information for troubleshooting.)