TrueNAS NFS random crash

Hi,

I have the same issue with NFS (True NAS Dragonfish-24.04.2).

Client mount options (Ubuntu 24.04):

nfs fsc,rw,hard,timeo=30,tcp,rsize=32768,wsize=32768,noatime,nodiratime,auto
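For reference, a complete /etc/fstab entry using these options would look something like the line below; the server address and paths are placeholders, and the fsc option only has an effect if cachefilesd is running on the client:

# Hypothetical fstab entry - replace server, export, and mountpoint with your own
nas.example.lan:/mnt/tank/share  /mnt/nfs/share  nfs  fsc,rw,hard,timeo=30,tcp,rsize=32768,wsize=32768,noatime,nodiratime,auto  0  0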

dmesg output:

[136662.821391] INFO: task nfsd:3639 blocked for more than 1208 seconds.
[136662.821927]       Tainted: P          IOE      6.6.32-production+truenas #1
[136662.822377] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[136662.822837] task:nfsd            state:D stack:0     pid:3639  ppid:2      flags:0x00004000
[136662.823388] Call Trace:
[136662.823891]  <TASK>
[136662.824493]  __schedule+0x349/0x950
[136662.825031]  schedule+0x5b/0xa0
[136662.825656]  schedule_timeout+0x151/0x160
[136662.826367]  wait_for_completion+0x86/0x170
[136662.826990]  __flush_workqueue+0x144/0x440
[136662.827511]  ? __queue_work+0x1bd/0x410
[136662.827976]  nfsd4_destroy_session+0x1ce/0x2b0 [nfsd]
[136662.828575]  nfsd4_proc_compound+0x359/0x680 [nfsd]
[136662.829101]  nfsd_dispatch+0xf1/0x200 [nfsd]
[136662.829636]  ? __pfx_nfsd+0x10/0x10 [nfsd]
[136662.830174]  svc_process_common+0x2f8/0x6f0 [sunrpc]
[136662.830722]  ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
[136662.831234]  ? __pfx_nfsd+0x10/0x10 [nfsd]
[136662.831739]  svc_process+0x131/0x180 [sunrpc]
[136662.832319]  nfsd+0x84/0xd0 [nfsd]
[136662.832877]  kthread+0xe8/0x120
[136662.833343]  ? __pfx_kthread+0x10/0x10
[136662.833987]  ret_from_fork+0x34/0x50
[136662.834470]  ? __pfx_kthread+0x10/0x10
[136662.835061]  ret_from_fork_asm+0x1b/0x30
[136662.835502]  </TASK>
[136662.835926] Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings
[137718.589105] perf: interrupt took too long (3929 > 3923), lowering kernel.perf_event_max_sample_rate to 50750

After this error, NFS is unusable, and the client can't connect after a TrueNAS reboot.

Switched to TrueNAS Core 13.3, so far so good.
Uptime: 18 days, 2:31 as of 16:17
This is the first time uptime has gone longer than a week without NFS dying.

Anyway, until the upstream bug gets solved, avoid TrueNAS Scale; the NFS feature is not usable.
Use TrueNAS Core instead, it works well.


Thanks for posting back with this valuable feedback.

To get the list of ā€˜attachedā€™ NFS clients in SCALE:

midclt call nfs.get_nfs3_clients
midclt call nfs.get_nfs4_clients
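On recent kernels (a sketch; it assumes the /proc/fs/nfsd interface is available, as it should be on the 6.6 kernel SCALE ships) you can also inspect NFSv4 client state directly from the shell:

# Per-client NFSv4 state exposed by nfsd; one directory per attached client
cat /proc/fs/nfsd/clients/*/info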

If not done already and if possible, please open a ticket on this.
Please provide a debug (ixdiagnostic).
Also, if possible, please provide reproduction steps.

Thanks, but I have switched to TrueNAS CORE 13.3 and it works great now, so it may be difficult to get a debug log unless I switch back to TrueNAS SCALE.

I can only reply from memory now.

First, this hang only occurs on NFSv4; NFSv3 works. But NFSv3 causes random bus errors on file read/write, so I have to use NFSv4.
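For anyone trying to reproduce or compare the two behaviours, the protocol version can be pinned explicitly with the standard vers= mount option (the server, export, and mountpoint below are placeholders):

# Force NFSv4.2 or NFSv3 respectively - adjust server/export/mountpoint to your setup
mount -t nfs -o vers=4.2,hard,tcp nas.example.lan:/mnt/tank/share /mnt/nfs/share
mount -t nfs -o vers=3,hard,tcp nas.example.lan:/mnt/tank/share /mnt/nfs/share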

Second, as I recall, I could see the clients in the client list in the GUI; there were three types of client:

  1. Working clients
  2. Dead clients
  3. Connecting clients

At a random point, the NFS server just dies (nfsd enters the D state and logs the trace in dmesg). Then:
Existing clients may keep working, or die at any time. After a client dies, its "last handshake" no longer updates in the server GUI.
No new connections are possible. If I connect to the server from a new client, the client just hangs; I can see it in the server GUI, but it is stuck in a state other than established.
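A quick way to confirm this state on the server (just a sketch, assuming shell access on the TrueNAS box) is to check whether the nfsd threads are stuck in uninterruptible sleep:

# 'D' in the STAT column means uninterruptible sleep, i.e. the hung state described above
ps -eo pid,stat,comm | grep '[n]fsd'
# The corresponding hung-task trace shows up in the kernel log
dmesg | grep -A5 'task:nfsd'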

By the way, it seems this problem is not limited to TrueNAS; I have seen similar bug reports for Debian and other Linux-based operating systems.

Just FYI: I am facing the same issue on TrueNAS Scale 24.04.2.2 with one Ubuntu 24.04.1 client.

loads of these messages…

messages:Oct 29 03:21:32 truenas1 kernel: task:nfsd            state:D stack:0     pid:4358  ppid:2      flags:0x00004000
messages:Oct 29 03:21:32 truenas1 kernel:  nfsd4_destroy_session+0x1ce/0x2b0 [nfsd]
messages:Oct 29 03:21:32 truenas1 kernel:  nfsd4_proc_compound+0x359/0x680 [nfsd]
messages:Oct 29 03:21:32 truenas1 kernel:  nfsd_dispatch+0xf1/0x200 [nfsd]
messages:Oct 29 03:21:32 truenas1 kernel:  ? __pfx_nfsd+0x10/0x10 [nfsd]
messages:Oct 29 03:21:32 truenas1 kernel:  ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
messages:Oct 29 03:21:32 truenas1 kernel:  ? __pfx_nfsd+0x10/0x10 [nfsd]
messages:Oct 29 03:21:32 truenas1 kernel:  nfsd+0x84/0xd0 [nfsd]

It's very annoying, especially since I have to reboot the whole server (which takes forever, as killing the NFS server does not work and the systemd timeout just keeps growing once it is reached).
I mostly end up hard-resetting the server after waiting for about 5 minutes, which I really don't like doing.
I currently only need this for my mediastack VM running Docker. I am going to migrate these containers to EE natively as soon as I update, so I hope I won't see this issue again.

I encountered the same issue as others in this thread when running TrueNAS-SCALE-24.04.2 with NFS shares. After some time, both the NFS clients and the NFS server would hang, and I noticed the nfsd hung_task_timeout_secs message in kernel.log/syslog/dmesg.

With TrueNAS, ARC is configured to use all available memory by default. You can check the current values by running sudo arc_summary in the shell. I adjusted the zfs_arc_max value to 75-85% of my total available memory, and since making this change I haven't experienced the issue again; it's been 33 days without any crashes.

To set the zfs_arc_max, go to System Settings => Advanced => Init/Shutdown Scripts => Add.

Type: Command
Command: echo 51539607552 > /sys/module/zfs/parameters/zfs_arc_max
When: Post Init
Enabled: Check mark

1GB = 1024 x 1024 x 1024 = 1073741824
48GB = 48 x 1GB = 51539607552
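If you would rather derive the value from the installed RAM than hard-code it, a minimal sketch of a Post Init command (assuming a target of roughly 80% of total memory) could look like this:

# MemTotal is reported in kB; convert to bytes and take ~80%
ARC_MAX=$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) * 1024 * 80 / 100 ))
echo "$ARC_MAX" > /sys/module/zfs/parameters/zfs_arc_max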


Stebu, have you submitted a bug ticket to iX Systems on this?

No, I haven't looked into submitting a bug ticket for this. To do so, I'd need to create a new Jira account, and honestly, I've already gone through the process of setting up a forum account just to share what worked for me, hoping it might help others.

I tried, but it seems I don't have permission to create an issue on the Jira.

Create the issue as a bug rather than a defect.

Please upgrade to 24.10. There have been improvements to ARC configuration and management. It might positively affect the issues reported here.

I just wanted to chime in to say that I am experiencing the same issue (AMD EPYC 7453 processor on a Gigabyte MZ32-AR0 Rev. 3 motherboard). I encountered it on both 24.04 and 24.10. Reverting back to Cobia (23.10.2) with an otherwise identical configuration resolves the issue. (Unfortunately, because I reverted I won't have any further useful information for troubleshooting.)