PSA: VMs may stall on Dragonfish when using NFS to access the host

PSA:

A guest VM may randomly stall/lockup on Dragonfish 24.04.0 and 24.04.1 when using NFS to access the TrueNAS host.

But they work fine on Cobia. It’s not swap related. No fix is planned.

https://ixsystems.atlassian.net/browse/NAS-129154

Can you PM me a debug? I haven’t run into this and I have quite a few VMs.

check your inbox

On my other system, I think I migrated to sandboxes before this became obvious.

So, I think I found the smoking gun

Turns out the stall is due to blocking I/O. The blocking I/O is NFS.

May 29 15:49:29 ubuntu20 kernel: [10226.479017] nfs: server titan OK
May 29 15:49:30 ubuntu20 kernel: [10226.858567] eth0: renamed from vethd414972
May 29 15:49:30 ubuntu20 kernel: [10226.874745] IPv6: ADDRCONF(NETDEV_CHANGE): veth2230572: link becomes ready
May 29 15:49:30 ubuntu20 kernel: [10226.874817] br-mailcow: port 9(veth2230572) entered blocking state
May 29 15:49:30 ubuntu20 kernel: [10226.874819] br-mailcow: port 9(veth2230572) entered forwarding state
May 29 15:50:36 ubuntu20 kernel: [10293.621680] br-mailcow: port 8(veth9feea49) entered disabled state
May 29 15:50:36 ubuntu20 kernel: [10293.621906] vethcac8b5a: renamed from eth0
May 29 15:50:37 ubuntu20 kernel: [10293.678144] br-mailcow: port 8(veth9feea49) entered disabled state
May 29 15:50:37 ubuntu20 kernel: [10293.681354] device veth9feea49 left promiscuous mode
May 29 15:50:37 ubuntu20 kernel: [10293.681362] br-mailcow: port 8(veth9feea49) entered disabled state
May 29 15:50:37 ubuntu20 kernel: [10293.783207] br-mailcow: port 8(veth6743b54) entered blocking state
May 29 15:50:37 ubuntu20 kernel: [10293.783209] br-mailcow: port 8(veth6743b54) entered disabled state
May 29 15:50:37 ubuntu20 kernel: [10293.783354] device veth6743b54 entered promiscuous mode
May 29 15:50:37 ubuntu20 kernel: [10293.783510] br-mailcow: port 8(veth6743b54) entered blocking state
May 29 15:50:37 ubuntu20 kernel: [10293.783512] br-mailcow: port 8(veth6743b54) entered forwarding state
May 29 15:50:37 ubuntu20 kernel: [10294.624874] br-mailcow: port 8(veth6743b54) entered disabled state
May 29 15:51:29 ubuntu20 kernel: [10346.615706] nfs: server titan not responding, still trying
May 29 15:51:29 ubuntu20 kernel: [10346.615729] nfs: server titan not responding, still trying
May 29 15:51:29 ubuntu20 kernel: [10346.615747] nfs: server titan not responding, still trying
May 29 15:51:29 ubuntu20 kernel: [10346.615758] nfs: server titan not responding, still trying
May 29 15:51:29 ubuntu20 kernel: [10346.615762] nfs: server titan not responding, still trying
May 29 15:51:29 ubuntu20 kernel: [10346.615768] nfs: server titan not responding, still trying
May 29 15:51:29 ubuntu20 kernel: [10346.615771] nfs: server titan not responding, still trying
May 29 15:51:29 ubuntu20 kernel: [10346.615810] nfs: server titan not responding, still trying
May 29 15:51:32 ubuntu20 kernel: [10348.825303] nfs: server titan not responding, still trying
May 29 15:51:49 ubuntu20 kernel: [10365.915694] nfs: server titan not responding, still trying
May 29 15:52:37 ubuntu20 kernel: [10413.942782] nfs: server titan not responding, still trying
May 29 15:52:37 ubuntu20 kernel: [10413.942786] nfs: server titan not responding, still trying
May 29 15:52:37 ubuntu20 kernel: [10413.942797] nfs: server titan not responding, still trying
May 29 15:52:38 ubuntu20 kernel: [10414.973157] nfs: server titan not responding, still trying
May 29 15:52:38 ubuntu20 kernel: [10414.979302] call_decode: 7 callbacks suppressed
May 29 15:52:38 ubuntu20 kernel: [10414.979304] nfs: server titan OK
May 29 15:52:38 ubuntu20 kernel: [10414.980113] nfs: server titan not responding, still trying
May 29 15:52:38 ubuntu20 kernel: [10414.981143] nfs: server titan OK
May 29 15:52:38 ubuntu20 kernel: [10414.981791] nfs: server titan OK
May 29 15:52:38 ubuntu20 kernel: [10414.981837] nfs: server titan OK
May 29 15:52:38 ubuntu20 kernel: [10414.982013] nfs: server titan OK
May 29 15:52:38 ubuntu20 kernel: [10414.982082] nfs: server titan OK
May 29 15:52:38 ubuntu20 kernel: [10414.982172] nfs: server titan OK
May 29 15:52:38 ubuntu20 kernel: [10414.982368] nfs: server titan OK
May 29 15:52:38 ubuntu20 kernel: [10414.982402] nfs: server titan OK
May 29 15:52:38 ubuntu20 kernel: [10414.982417] nfs: server titan OK
May 29 15:52:39 ubuntu20 kernel: [10415.863409] nfs: server titan not responding, still trying
May 29 15:52:39 ubuntu20 kernel: [10415.987391] nfs: server titan not responding, still trying
May 29 15:52:39 ubuntu20 kernel: [10416.699607] nfs: server titan not responding, still trying
May 29 15:52:40 ubuntu20 kernel: [10416.799589] nfs: server titan not responding, still trying
May 29 15:52:40 ubuntu20 kernel: [10416.799611] nfs: server titan not responding, still trying
May 29 15:52:43 ubuntu20 kernel: [10419.900659] rpc_check_timeout: 1 callbacks suppressed
May 29 15:52:43 ubuntu20 kernel: [10419.900661] nfs: server titan not responding, still trying
May 29 15:52:43 ubuntu20 kernel: [10419.964720] nfs: server titan not responding, still trying
May 29 15:52:43 ubuntu20 kernel: [10419.964745] nfs: server titan not responding, still trying
May 29 15:52:43 ubuntu20 kernel: [10420.248755] nfs: server titan not responding, still trying
May 29 15:52:45 ubuntu20 kernel: [10421.913333] nfs: server titan not responding, still trying
May 29 15:52:45 ubuntu20 kernel: [10421.913349] nfs: server titan not responding, still trying
May 29 15:52:45 ubuntu20 kernel: [10421.913360] nfs: server titan not responding, still trying
May 29 15:52:45 ubuntu20 kernel: [10421.913364] nfs: server titan not responding, still trying
May 29 15:52:45 ubuntu20 kernel: [10421.913370] nfs: server titan not responding, still trying
May 29 15:52:47 ubuntu20 kernel: [10423.801887] nfs: server titan not responding, still trying
May 29 15:52:49 ubuntu20 kernel: [10426.074555] rpc_check_timeout: 5 callbacks suppressed
May 29 15:52:49 ubuntu20 kernel: [10426.074557] nfs: server titan not responding, still trying
May 29 15:52:50 ubuntu20 kernel: [10427.707147] nfs: server titan not responding, still trying
May 29 15:53:00 ubuntu20 kernel: [10437.498807] nfs: server titan not responding, still trying
May 29 15:53:00 ubuntu20 kernel: [10437.498834] nfs: server titan not responding, still trying
May 29 15:53:15 ubuntu20 kernel: [10452.347779] nfs: server titan not responding, still trying
May 29 15:53:17 ubuntu20 kernel: [10454.647433] nfs: server titan not responding, still trying
May 29 15:53:51 ubuntu20 kernel: [10487.737265] nfs: server titan not responding, still trying
May 29 15:54:11 ubuntu20 kernel: [10508.662046] nfs: server titan not responding, still trying
May 29 15:54:11 ubuntu20 kernel: [10508.662080] nfs: server titan not responding, still trying
May 29 15:54:11 ubuntu20 kernel: [10508.662117] nfs: server titan not responding, still trying
May 29 15:54:11 ubuntu20 kernel: [10508.662123] nfs: server titan not responding, still trying
May 29 15:54:17 ubuntu20 kernel: [10514.298323] INFO: task portainer:3365 blocked for more than 120 seconds.
May 29 15:54:17 ubuntu20 kernel: [10514.298402]       Not tainted 5.4.0-1088-aws #96-Ubuntu
May 29 15:54:17 ubuntu20 kernel: [10514.298443] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 29 15:54:17 ubuntu20 kernel: [10514.298524] portainer       D    0  3365   1980 0x00000320
May 29 15:54:17 ubuntu20 kernel: [10514.298532] Call Trace:
May 29 15:54:17 ubuntu20 kernel: [10514.298565]  __schedule+0x2e3/0x740
May 29 15:54:17 ubuntu20 kernel: [10514.298569]  schedule+0x42/0xb0
May 29 15:54:17 ubuntu20 kernel: [10514.298583]  io_schedule+0x16/0x40
May 29 15:54:17 ubuntu20 kernel: [10514.298588]  wait_on_page_bit+0x11c/0x200
May 29 15:54:17 ubuntu20 kernel: [10514.298591]  ? file_fdatawait_range+0x30/0x30
May 29 15:54:17 ubuntu20 kernel: [10514.298606]  wait_on_page_writeback+0x43/0x90
May 29 15:54:17 ubuntu20 kernel: [10514.298609]  __filemap_fdatawait_range+0x98/0x100
May 29 15:54:17 ubuntu20 kernel: [10514.298614]  file_write_and_wait_range+0xa0/0xc0
May 29 15:54:17 ubuntu20 kernel: [10514.298642]  nfs_file_fsync+0x93/0x1a0 [nfs]
May 29 15:54:17 ubuntu20 kernel: [10514.298647]  vfs_fsync_range+0x49/0x80
May 29 15:54:17 ubuntu20 kernel: [10514.298650]  do_fsync+0x3d/0x70
May 29 15:54:17 ubuntu20 kernel: [10514.298653]  __x64_sys_fdatasync+0x17/0x20
May 29 15:54:17 ubuntu20 kernel: [10514.298658]  do_syscall_64+0x57/0x190
May 29 15:54:17 ubuntu20 kernel: [10514.298661]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
May 29 15:54:17 ubuntu20 kernel: [10514.298669] RIP: 0033:0x403ace
May 29 15:54:17 ubuntu20 kernel: [10514.298676] Code: Bad RIP value.
May 29 15:54:17 ubuntu20 kernel: [10514.298678] RSP: 002b:000000c00056b9c8 EFLAGS: 00000202 ORIG_RAX: 000000000000004b
May 29 15:54:17 ubuntu20 kernel: [10514.298681] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000403ace
May 29 15:54:17 ubuntu20 kernel: [10514.298682] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
May 29 15:54:17 ubuntu20 kernel: [10514.298684] RBP: 000000c00056ba08 R08: 0000000000000000 R09: 0000000000000000
May 29 15:54:17 ubuntu20 kernel: [10514.298685] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
May 29 15:54:17 ubuntu20 kernel: [10514.298686] R13: 0000000000000000 R14: 000000c000683860 R15: 0000000000001000
root@ubuntu20:/var/log# grep 'May 25' kern.log | grep 'nfs: server titan not responding, still trying' | wc -l
0
root@ubuntu20:/var/log# grep 'May 26' kern.log | grep 'nfs: server titan not responding, still trying' | wc -l
0
root@ubuntu20:/var/log# grep 'May 27' kern.log | grep 'nfs: server titan not responding, still trying' | wc -l
1266
root@ubuntu20:/var/log# grep 'May 28' kern.log | grep 'nfs: server titan not responding, still trying' | wc -l
1046
root@ubuntu20:/var/log# grep 'May 29' kern.log | grep 'nfs: server titan not responding, still trying' | wc -l
233
root@ubuntu20:/var/log# grep 'May 30' kern.log | grep 'nfs: server titan not responding, still trying' | wc -l
0
root@ubuntu20:/var/log# grep 'May 31' kern.log | grep 'nfs: server titan not responding, still trying' | wc -l
0

Guess which days I was running Dragonfish on.

(same results when searching for just “not responding, still trying”)
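For anyone who wants the per-day counts in a single pass rather than one grep per day, something like this should do it (a rough sketch, assuming the usual syslog date prefix shown in the lines above):

# count the NFS timeout messages per "Month Day" in one pass
grep 'nfs: server titan not responding, still trying' kern.log | awk '{print $1, $2}' | sort | uniq -c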

Thanks for the detailed ticket but we do not have the time or resources to investigate the various complexities of running a virtualized guest that accesses a resource on the hypervisor for which it is being hosted on.

Apparently, attempting to mount an NFS or SMB share from TrueNAS in a guest VM is no longer a supported configuration.

Also, virtfs is not supported and the recommended solution is to use NFS or SMB. Go figure.


This is the documented way to access NAS files/directories from a VM, which is not supported.

If you want to access your TrueNAS SCALE directories from a VM, you have multiple options:

  • If you have only one physical interface, you must create a bridge interface for the VM.
  • If your system has more than one physical interface you can assign your VMs to a NIC other than the primary one your TrueNAS server uses. This method makes communication more flexible but does not offer the potential speed of a bridge.

Linux VMs can access TrueNAS storage using FTP, SMB, and NFS.
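For context, the NFS route from a Linux guest is just an ordinary mount against the TrueNAS export, roughly like the sketch below (the server name and export path are placeholders, substitute your own). Note that the default hard mount behaviour retries forever, which is why guest I/O blocks outright while the server is unresponsive, exactly the stall in the logs above.

# example only: "titan" and the export path are placeholders for your own server and dataset
sudo mount -t nfs titan:/mnt/tank/data /mnt/data
# equivalent /etc/fstab entry; "hard" is the default and keeps retrying, so writes hang until the server answers
# titan:/mnt/tank/data  /mnt/data  nfs  defaults,hard  0  0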

@Stux just confirming: is this issue related specifically to a VM running on the same TrueNAS host and accessing the host storage as the disk for the VM over NFS?

I’ve been happily running TrueNAS as a VM (with HBA passthrough of course) and using it (the TrueNAS VM) to serve up NFS shares for use by VM disks running on the same hypervisor that’s hosting the TrueNAS VM, without issues (for years on CORE, and so far no issues after migrating to Dragonfish)…

That is not confirmed.

It occurs in that scenario. It may occur in others. No further investigation was performed; the ticket was “closed with no changes”.

Bad news, apparently that’s no longer supported either:

we do not have the time or resources to investigate the various complexities of running a virtualized truenas that has hardware being passed through to it

I have had an app (HandBrake) stall while accessing an NFS share.

I hadn’t noticed any such stalling before upgrading to 24.04 Dragonfish. I had a theory it was happening due to ARC cache issues, but it has happened again since 24.04.1.

I need to do some digging in the logs to determine whether there are any NFS errors like those noted in the attached bug report. It may be that the NFS issues are impacting more than VMs?
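The same sort of grep used earlier in the thread should show it quickly on the client side (a sketch; the log path and the server name in the message will differ per setup):

# count NFS timeout messages in the client's kernel log
grep -c 'not responding, still trying' /var/log/kern.log
# or check the current kernel ring buffer
dmesg | grep -i 'nfs: server'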


I haven’t found anything on the TrueNAS host side… but I don’t really know what I’m looking for.

From what I see, on the guest side there is definitely an intermittent issue communicating with the NFS server… it does not occur when the NFS server is TrueNAS 23.10.2, but does when it is 24.04.0, 24.04.1 or 24.04.1.1.

No pings are lost between the client and host while NFS is not working.

Restarting the guest/client… gets rid of the problem temporarily.

Well that’s not ideal… especially since I just did the migration from CORE :confused:
At least my SCALE VM has been booting fine on Dragonfish-24.04.1.1 including multiple reboots since the upgrade from 24.04.1…
Hypervisor is XCP-ng (Xen) if that makes any difference… fingers crossed it keeps on working :grimacing:


Ah! That sounds like why I had trouble last weekend trying to migrate my Roon server from being a standalone box that accessed an SMB share on the NAS to being a VM on the same NAS using SMB shares to access the music. It constantly froze and was totally unusable. At the time I thought it was just me being a newbie numpty and decided to walk away slowly and pretend I hadn’t tried…

Interesting! Forgot to add that I’ve had no issues accessing SMB shares hosted on my TrueNAS SCALE VM either, so it doesn’t seem to affect everyone…