VMs stalling on Dragonfish when using NFS to the host

PSA:

A guest VM may randomly stall/lockup on Dragonfish 24.04.0 and 24.04.1 when using NFS to access the TrueNAS host.

But they work fine on Cobia. It’s not swap-related. No fix is planned.

https://ixsystems.atlassian.net/browse/NAS-129154

Can you PM me a debug? I haven’t run into this and I have quite a few VMs.

check your inbox

On my other system, I think I migrated to sandboxes before this became obvious.

So, I think I found the smoking gun

Turns out the stall is due to blocking I/O, and the I/O that is blocking is NFS.

May 29 15:49:29 ubuntu20 kernel: [10226.479017] nfs: server titan OK
May 29 15:49:30 ubuntu20 kernel: [10226.858567] eth0: renamed from vethd414972
May 29 15:49:30 ubuntu20 kernel: [10226.874745] IPv6: ADDRCONF(NETDEV_CHANGE): veth2230572: link becomes ready
May 29 15:49:30 ubuntu20 kernel: [10226.874817] br-mailcow: port 9(veth2230572) entered blocking state
May 29 15:49:30 ubuntu20 kernel: [10226.874819] br-mailcow: port 9(veth2230572) entered forwarding state
May 29 15:50:36 ubuntu20 kernel: [10293.621680] br-mailcow: port 8(veth9feea49) entered disabled state
May 29 15:50:36 ubuntu20 kernel: [10293.621906] vethcac8b5a: renamed from eth0
May 29 15:50:37 ubuntu20 kernel: [10293.678144] br-mailcow: port 8(veth9feea49) entered disabled state
May 29 15:50:37 ubuntu20 kernel: [10293.681354] device veth9feea49 left promiscuous mode
May 29 15:50:37 ubuntu20 kernel: [10293.681362] br-mailcow: port 8(veth9feea49) entered disabled state
May 29 15:50:37 ubuntu20 kernel: [10293.783207] br-mailcow: port 8(veth6743b54) entered blocking state
May 29 15:50:37 ubuntu20 kernel: [10293.783209] br-mailcow: port 8(veth6743b54) entered disabled state
May 29 15:50:37 ubuntu20 kernel: [10293.783354] device veth6743b54 entered promiscuous mode
May 29 15:50:37 ubuntu20 kernel: [10293.783510] br-mailcow: port 8(veth6743b54) entered blocking state
May 29 15:50:37 ubuntu20 kernel: [10293.783512] br-mailcow: port 8(veth6743b54) entered forwarding state
May 29 15:50:37 ubuntu20 kernel: [10294.624874] br-mailcow: port 8(veth6743b54) entered disabled state
May 29 15:51:29 ubuntu20 kernel: [10346.615706] nfs: server titan not responding, still trying
May 29 15:51:29 ubuntu20 kernel: [10346.615729] nfs: server titan not responding, still trying
May 29 15:51:29 ubuntu20 kernel: [10346.615747] nfs: server titan not responding, still trying
May 29 15:51:29 ubuntu20 kernel: [10346.615758] nfs: server titan not responding, still trying
May 29 15:51:29 ubuntu20 kernel: [10346.615762] nfs: server titan not responding, still trying
May 29 15:51:29 ubuntu20 kernel: [10346.615768] nfs: server titan not responding, still trying
May 29 15:51:29 ubuntu20 kernel: [10346.615771] nfs: server titan not responding, still trying
May 29 15:51:29 ubuntu20 kernel: [10346.615810] nfs: server titan not responding, still trying
May 29 15:51:32 ubuntu20 kernel: [10348.825303] nfs: server titan not responding, still trying
May 29 15:51:49 ubuntu20 kernel: [10365.915694] nfs: server titan not responding, still trying
May 29 15:52:37 ubuntu20 kernel: [10413.942782] nfs: server titan not responding, still trying
May 29 15:52:37 ubuntu20 kernel: [10413.942786] nfs: server titan not responding, still trying
May 29 15:52:37 ubuntu20 kernel: [10413.942797] nfs: server titan not responding, still trying
May 29 15:52:38 ubuntu20 kernel: [10414.973157] nfs: server titan not responding, still trying
May 29 15:52:38 ubuntu20 kernel: [10414.979302] call_decode: 7 callbacks suppressed
May 29 15:52:38 ubuntu20 kernel: [10414.979304] nfs: server titan OK
May 29 15:52:38 ubuntu20 kernel: [10414.980113] nfs: server titan not responding, still trying
May 29 15:52:38 ubuntu20 kernel: [10414.981143] nfs: server titan OK
May 29 15:52:38 ubuntu20 kernel: [10414.981791] nfs: server titan OK
May 29 15:52:38 ubuntu20 kernel: [10414.981837] nfs: server titan OK
May 29 15:52:38 ubuntu20 kernel: [10414.982013] nfs: server titan OK
May 29 15:52:38 ubuntu20 kernel: [10414.982082] nfs: server titan OK
May 29 15:52:38 ubuntu20 kernel: [10414.982172] nfs: server titan OK
May 29 15:52:38 ubuntu20 kernel: [10414.982368] nfs: server titan OK
May 29 15:52:38 ubuntu20 kernel: [10414.982402] nfs: server titan OK
May 29 15:52:38 ubuntu20 kernel: [10414.982417] nfs: server titan OK
May 29 15:52:39 ubuntu20 kernel: [10415.863409] nfs: server titan not responding, still trying
May 29 15:52:39 ubuntu20 kernel: [10415.987391] nfs: server titan not responding, still trying
May 29 15:52:39 ubuntu20 kernel: [10416.699607] nfs: server titan not responding, still trying
May 29 15:52:40 ubuntu20 kernel: [10416.799589] nfs: server titan not responding, still trying
May 29 15:52:40 ubuntu20 kernel: [10416.799611] nfs: server titan not responding, still trying
May 29 15:52:43 ubuntu20 kernel: [10419.900659] rpc_check_timeout: 1 callbacks suppressed
May 29 15:52:43 ubuntu20 kernel: [10419.900661] nfs: server titan not responding, still trying
May 29 15:52:43 ubuntu20 kernel: [10419.964720] nfs: server titan not responding, still trying
May 29 15:52:43 ubuntu20 kernel: [10419.964745] nfs: server titan not responding, still trying
May 29 15:52:43 ubuntu20 kernel: [10420.248755] nfs: server titan not responding, still trying
May 29 15:52:45 ubuntu20 kernel: [10421.913333] nfs: server titan not responding, still trying
May 29 15:52:45 ubuntu20 kernel: [10421.913349] nfs: server titan not responding, still trying
May 29 15:52:45 ubuntu20 kernel: [10421.913360] nfs: server titan not responding, still trying
May 29 15:52:45 ubuntu20 kernel: [10421.913364] nfs: server titan not responding, still trying
May 29 15:52:45 ubuntu20 kernel: [10421.913370] nfs: server titan not responding, still trying
May 29 15:52:47 ubuntu20 kernel: [10423.801887] nfs: server titan not responding, still trying
May 29 15:52:49 ubuntu20 kernel: [10426.074555] rpc_check_timeout: 5 callbacks suppressed
May 29 15:52:49 ubuntu20 kernel: [10426.074557] nfs: server titan not responding, still trying
May 29 15:52:50 ubuntu20 kernel: [10427.707147] nfs: server titan not responding, still trying
May 29 15:53:00 ubuntu20 kernel: [10437.498807] nfs: server titan not responding, still trying
May 29 15:53:00 ubuntu20 kernel: [10437.498834] nfs: server titan not responding, still trying
May 29 15:53:15 ubuntu20 kernel: [10452.347779] nfs: server titan not responding, still trying
May 29 15:53:17 ubuntu20 kernel: [10454.647433] nfs: server titan not responding, still trying
May 29 15:53:51 ubuntu20 kernel: [10487.737265] nfs: server titan not responding, still trying
May 29 15:54:11 ubuntu20 kernel: [10508.662046] nfs: server titan not responding, still trying
May 29 15:54:11 ubuntu20 kernel: [10508.662080] nfs: server titan not responding, still trying
May 29 15:54:11 ubuntu20 kernel: [10508.662117] nfs: server titan not responding, still trying
May 29 15:54:11 ubuntu20 kernel: [10508.662123] nfs: server titan not responding, still trying
May 29 15:54:17 ubuntu20 kernel: [10514.298323] INFO: task portainer:3365 blocked for more than 120 seconds.
May 29 15:54:17 ubuntu20 kernel: [10514.298402]       Not tainted 5.4.0-1088-aws #96-Ubuntu
May 29 15:54:17 ubuntu20 kernel: [10514.298443] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 29 15:54:17 ubuntu20 kernel: [10514.298524] portainer       D    0  3365   1980 0x00000320
May 29 15:54:17 ubuntu20 kernel: [10514.298532] Call Trace:
May 29 15:54:17 ubuntu20 kernel: [10514.298565]  __schedule+0x2e3/0x740
May 29 15:54:17 ubuntu20 kernel: [10514.298569]  schedule+0x42/0xb0
May 29 15:54:17 ubuntu20 kernel: [10514.298583]  io_schedule+0x16/0x40
May 29 15:54:17 ubuntu20 kernel: [10514.298588]  wait_on_page_bit+0x11c/0x200
May 29 15:54:17 ubuntu20 kernel: [10514.298591]  ? file_fdatawait_range+0x30/0x30
May 29 15:54:17 ubuntu20 kernel: [10514.298606]  wait_on_page_writeback+0x43/0x90
May 29 15:54:17 ubuntu20 kernel: [10514.298609]  __filemap_fdatawait_range+0x98/0x100
May 29 15:54:17 ubuntu20 kernel: [10514.298614]  file_write_and_wait_range+0xa0/0xc0
May 29 15:54:17 ubuntu20 kernel: [10514.298642]  nfs_file_fsync+0x93/0x1a0 [nfs]
May 29 15:54:17 ubuntu20 kernel: [10514.298647]  vfs_fsync_range+0x49/0x80
May 29 15:54:17 ubuntu20 kernel: [10514.298650]  do_fsync+0x3d/0x70
May 29 15:54:17 ubuntu20 kernel: [10514.298653]  __x64_sys_fdatasync+0x17/0x20
May 29 15:54:17 ubuntu20 kernel: [10514.298658]  do_syscall_64+0x57/0x190
May 29 15:54:17 ubuntu20 kernel: [10514.298661]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
May 29 15:54:17 ubuntu20 kernel: [10514.298669] RIP: 0033:0x403ace
May 29 15:54:17 ubuntu20 kernel: [10514.298676] Code: Bad RIP value.
May 29 15:54:17 ubuntu20 kernel: [10514.298678] RSP: 002b:000000c00056b9c8 EFLAGS: 00000202 ORIG_RAX: 000000000000004b
May 29 15:54:17 ubuntu20 kernel: [10514.298681] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000403ace
May 29 15:54:17 ubuntu20 kernel: [10514.298682] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
May 29 15:54:17 ubuntu20 kernel: [10514.298684] RBP: 000000c00056ba08 R08: 0000000000000000 R09: 0000000000000000
May 29 15:54:17 ubuntu20 kernel: [10514.298685] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
May 29 15:54:17 ubuntu20 kernel: [10514.298686] R13: 0000000000000000 R14: 000000c000683860 R15: 0000000000001000
root@ubuntu20:/var/log# grep 'May 25' kern.log | grep 'nfs: server titan not responding, still trying' | wc -l
0
root@ubuntu20:/var/log# grep 'May 26' kern.log | grep 'nfs: server titan not responding, still trying' | wc -l
0
root@ubuntu20:/var/log# grep 'May 27' kern.log | grep 'nfs: server titan not responding, still trying' | wc -l
1266
root@ubuntu20:/var/log# grep 'May 28' kern.log | grep 'nfs: server titan not responding, still trying' | wc -l
1046
root@ubuntu20:/var/log# grep 'May 29' kern.log | grep 'nfs: server titan not responding, still trying' | wc -l
233
root@ubuntu20:/var/log# grep 'May 30' kern.log | grep 'nfs: server titan not responding, still trying' | wc -l
0
root@ubuntu20:/var/log# grep 'May 31' kern.log | grep 'nfs: server titan not responding, still trying' | wc -l
0

Guess which days I was running Dragonfish on.

(Same result counts when searching for just “not responding, still trying”.)
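
For anyone who wants to run the same check in one pass, here is a sketch that counts those timeout messages per day from the same kern.log (it only relies on the syslog-style date prefix shown above):

# count the "not responding" messages per date instead of grepping each day separately
grep 'nfs: server titan not responding, still trying' kern.log | awk '{print $1, $2}' | sort | uniq -c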

Thanks for the detailed ticket but we do not have the time or resources to investigate the various complexities of running a virtualized guest that accesses a resource on the hypervisor for which it is being hosted on.

Apparently, attempting to mount an NFS or SMB share from TrueNAS in a guest VM is no longer a supported configuration.

Also, virtfs is not supported and the recommended solution is to use NFS or SMB. Go figure.


This is the documented way to access NAS files/directories from a VM, which is apparently not supported:

If you want to access your TrueNAS SCALE directories from a VM, you have multiple options:

  • If you have only one physical interface, you must create a bridge interface for the VM.
  • If your system has more than one physical interface you can assign your VMs to a NIC other than the primary one your TrueNAS server uses. This method makes communication more flexible but does not offer the potential speed of a bridge.

Linux VMs can access TrueNAS storage using FTP, SMB, and NFS.
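
For reference, in a Linux guest that documented setup boils down to an ordinary NFS mount. A sketch only: “titan” is just the server name from the logs above, and the export path and mount point are hypothetical.

# hypothetical /etc/fstab entry in the Linux guest
titan:/mnt/tank/vmdata  /mnt/vmdata  nfs4  hard,_netdev,nofail  0  0
# "hard" (the default) retries forever, which is what produces the
# "not responding, still trying" messages instead of immediate I/O errors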

@Stux just confirming this issue is related specifically to a TrueNAS VM running on the same TrueNAS host, and accessing the host storage as the disk for the VM over NFS?

I’ve been happily running TrueNAS as a VM (with HBA passthrough, of course) and using it (the TrueNAS VM) to serve up NFS shares for VM disks running on the same hypervisor that’s hosting the TrueNAS VM, without issues (for years on CORE, and so far no issues after migrating to Dragonfish)…

That is not confirmed.

It occurs in that scenario; it may occur in others. No further investigation was performed (“closed with no changes”).

Bad news, apparently that’s no longer supported either:

we do not have the time or resources to investigate the various complexities of running a virtualized truenas that has hardware being passed through to it

I have had an app (Handbrake) that accesses an NFS share stall.

I hadn’t noticed any such stalling before upgrading to 24.04 Dragonfish. I had a theory it was happening due to ARC cache issues, but it has happened again since 24.04.1.

I need to do some digging in the logs to determine whether there are any NFS errors like those noted in the attached bug report. It may be that the NFS issues are impacting more than VMs.
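
For anyone else wanting to check their own clients, the messages show up in the guest’s kernel log; something like this should surface them (assuming a systemd-based guest, otherwise grep kern.log directly):

journalctl -k | grep -E 'nfs: server .* not responding|blocked for more than'
# or, on setups that still write /var/log/kern.log:
grep -E 'nfs: server .* not responding|blocked for more than' /var/log/kern.log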


I haven’t found anything on the TrueNAS host side… but I don’t really know what I’m looking for.

From what I see, on the guest side there is definitely an intermittent issue communicating with the NFS server… it does not occur when the NFS server is TrueNAS 23.10.2, but it does when it is 24.04.0, 24.04.1, or 24.04.1.1.

No pings are lost between the client and the host while NFS is not working.

Restarting the guest/client… gets rid of the problem temporarily.
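
A quick way to show that from the client (mount point and hostname are illustrative, assuming a hard NFS mount): while the hang is in progress, the ping succeeds but a simple operation against the mount never returns.

ping -c 3 titan
timeout 10 stat /mnt/vmdata && echo "mount responding" || echo "mount hung or erroring"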

Well that’s not ideal… especially since I just did the migration from CORE :confused:
At least my SCALE VM has been booting fine on Dragonfish-24.04.1.1 including multiple reboots since the upgrade from 24.04.1…
Hypervisor is XCP-ng (Xen) if that makes any difference… fingers crossed it keeps on working :grimacing:


Ah! That sounds like why I had trouble last weekend trying to migrate my Roon server from a standalone box that accessed an SMB share on the NAS to a VM on the same NAS using SMB shares to access the music. It constantly froze and was totally unusable. At the time I thought it was just me being a newbie numpty and decided to walk away slowly and pretend I hadn’t tried…

Interesting! Forgot to add that I’ve had no issues accessing SMB shares hosted on my TrueNAS SCALE VM either, so it doesn’t seem to be affecting everyone…

I am encountering the same issue:

  • TrueNAS installed on a virtualized platform (Proxmox)
  • NFS shares serving multiple clients
    • Network-booting Raspberry Pi 4s (2x) with the root filesystem on an NFS share
    • A Proxmox-based VM running Docker with volumes hosted on an NFS share (see the sketch below)
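
If it helps reproduce this: volumes hosted on an NFS share can be either a bind mount of an NFS-mounted directory or an NFS-backed named volume. A sketch of the latter (the server name and export path here are illustrative, not my exact config):

docker volume create \
  --driver local \
  --opt type=nfs \
  --opt o=addr=truenas.example,nfsvers=4,hard \
  --opt device=:/mnt/tank/docker/appdata \
  appdata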

The system has worked flawlessly and stably since the original TrueNAS was FreeNAS. I used to run the Docker VM on bhyve locally, then migrated everything to Proxmox, and SCALE was fine up through Cobia; it exhibits the same issues (NFS timeouts on the clients, silent hangs of the NFS service, no logging on TrueNAS) when moving to Dragonfish.
I originally suspected a clash between Proxmox settings and the TrueNAS VM settings, but reverting to Cobia returns the system to a stable condition.

There are no logs on the TrueNAS system when the locks occur, and nothing I could find in the new kernel used in Dragonfish that could trigger such behaviour and manifest itself only on the NFS side. Ping still works, no packets are dropped, and other services work as well, but NFS-mounted shares hang altogether for up to a couple of minutes, or exhibit such poor performance in general that the Docker services time out or slow down so much that things start to break (OpenHAB, Mosquitto, UniFi Controller, Node-RED, custom home automation containers, EmonCMS, MariaDB)…

I suspect something changed in the latest kernels that surfaces this behaviour on the NFS service only, but I’ve had no luck going through standard kernel bug reports.
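
If anyone hits this again while it is in progress, a few server-side things worth capturing on the TrueNAS box (these are standard Linux NFS interfaces, assuming the usual nfs-utils tooling and procfs nodes are present, nothing TrueNAS-specific):

nfsstat -s                 # server-side RPC/NFS operation counters
cat /proc/fs/nfsd/threads  # number of nfsd threads configured
cat /proc/net/rpc/nfsd     # raw nfsd stats, including the "th" thread-usage line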

The latest for me is that with Dragonfish 24.04.2, after updating the Ubuntu 22.04 kernel to the latest and disabling fs leases, I am having no hangs (so far).

I’m not sure if the fs leases are required to be disabled.

EDIT: I did not have fs leases disabled, ergo, they are not required to be disabled.

I’m slowly working to prove things, but if something doesn’t happen you have to wait weeks to be sure it won’t :wink:
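
For reference, checking and disabling leases on the NFS server side is just the standard Linux sysctl; assuming that is the knob being referred to here, it looks like this:

sysctl fs.leases-enable        # 1 = leases enabled (the default)
sysctl -w fs.leases-enable=0   # disable leases until the next reboot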


Closing the loop on this, 24.04.2 seems to have resolved the issue. I’ve experienced 25 days of continuous uptime without the issue occurring, and have now updated to 24.10.2.