ESXi 8 U2, nfs4 - Core works fine but Scale is causing issues

Hi,

I am building a new AIO VMware host which is currently running ESXi 8 U2.

Given that Core is going to … die a slow death … I thought I would give Scale a try for that build.
Installed everything as always, set up NFS, mounted my Share (nfs4) and tried moving a VM to it.

esxcli storage nfs41 add -H 172.16.10.32,172.16.11.32 -s /mnt/tank7/nfs -v tn15_nfs

Unfortunately - no go:

2024-04-06T19:23:47.324Z In(182) vmkernel: cpu6:2098542 opID=23545234)NFS41: NFS41FileOpenFile:2911: Open of 0x430a56466360 with openFlags 0x1 failed: Stale file handle
2024-04-06T19:23:47.324Z Wa(180) vmkwarning: cpu6:2098542 opID=23545234)WARNING: NFS41: NFS41FileOpOpenFile:3627: Open of obj 0x430a56466360 (fhID 0x134877) failed: Stale file handle
2024-04-06T19:23:47.324Z In(182) vmkernel: cpu6:2098542 opID=23545234)NFS41: NFS41FileOpenFile:2911: Open of 0x430a56466360 with openFlags 0x1 failed: Stale file handle
2024-04-06T19:23:47.324Z Wa(180) vmkwarning: cpu6:2098542 opID=23545234)WARNING: NFS41: NFS41FileOpOpenFile:3627: Open of obj 0x430a56466360 (fhID 0x144877) failed: Stale file handle
2024-04-06T19:23:47.324Z In(182) vmkernel: cpu6:2098542 opID=23545234)NFS41: NFS41FileOpenFile:2911: Open of 0x430a56466360 with openFlags 0x1 failed: Stale file handle
2024-04-06T19:23:47.324Z Wa(180) vmkwarning: cpu6:2098542 opID=23545234)WARNING: NFS41: NFS41FileOpOpenFile:3627: Open of obj 0x430a56466360 (fhID 0x154877) failed: Stale file handle
2024-04-06T19:23:47.338Z Wa(180) vmkwarning: cpu6:2098542 opID=23545234)WARNING: NFS41: NFS41FileDoRemove:5286: Could not remove "testubuntu" (task process failure): Directory not empty
2024-04-06T19:23:47.338Z In(182) vmkernel: cpu6:2098542 opID=23545234)VmMemXfer: vm 2098542: 2461: Evicting VM with path:/vmfs/volumes/0a55ccd0-664dfd57-0000-000000000000/testubuntu/testubuntu.vmx
2024-04-06T19:23:47.338Z In(182) vmkernel: cpu6:2098542 opID=23545234)VmMemXfer: 209: Creating crypto hash
2024-04-06T19:23:47.338Z In(182) vmkernel: cpu6:2098542 opID=23545234)VmMemXfer: vm 2098542: 2475: Could not find MemXferFS region for /vmfs/volumes/0a55ccd0-664dfd57-0000-000000000000/testubuntu/testubuntu.vmx
2024-04-06T19:23:47.507Z In(182) vmkernel: cpu1:2098561 opID=6dbd4a2b)World: 12324: VC opID lud1c16k-60790-auto-1awn-h5:70016329-2e-01-2d-3e43 maps to vmkernel opID 6dbd4a2b
2024-04-06T19:23:47.507Z In(182) vmkernel: cpu1:2098561 opID=6dbd4a2b)VmMemXfer: vm 2098561: 2461: Evicting VM with path:/vmfs/volumes/0a55ccd0-664dfd57-0000-000000000000/testubuntu/testubuntu.vmx
2024-04-06T19:23:47.507Z In(182) vmkernel: cpu1:2098561 opID=6dbd4a2b)VmMemXfer: 209: Creating crypto hash
2024-04-06T19:23:47.507Z In(182) vmkernel: cpu1:2098561 opID=6dbd4a2b)VmMemXfer: vm 2098561: 2475: Could not find MemXferFS region for /vmfs/volumes/0a55ccd0-664dfd57-0000-000000000000/testubuntu/testubuntu.vmx

I verified that the mount was fine (it was), that permissions were fine (yes, I could create and delete a folder just fine), and so on.
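Roughly the kind of checks I mean, sketched from memory (tn15_nfs is the datastore name from the mount command above; the folder name is just a throwaway example):

esxcli storage nfs41 list
# quick create/delete test straight from the ESXi shell
mkdir /vmfs/volumes/tn15_nfs/permtest
rmdir /vmfs/volumes/tn15_nfs/permtest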

The same problem has been mentioned before, e.g. as a necro in this older thread.

There are also some threads on the VMware forums, but no solution anywhere.

I tried tweaking a few Scale settings but couldn’t find anything helpful.
I saw a reference to setting an Alias, but mounting worked just fine for me.

This was on TrueNAS-SCALE-23.10.1, I think (as that's what I had; I don't think I updated after installation).

I then gave up, quickly installed TNC (TrueNAS-13.0-U6.1) on the same (virtual) HW, set up NFSv4 identically, used the same mount command on ESXi, and it just worked.

While I am fine running Core instead of Scale, I am surprised that this has not been discussed more widely.
Or is it just working for most people on Scale?

Cheers

Not sure what to tell you, but I run ESXi 8 (with the current updates, which are also still free'ish) and both CORE and SCALE (all versions to date) without issue. I do not use NFS shares.

Hope you figure it all out.

So is there anybody at all who has managed to use ESXi 8 with nfs4 & Scale?

As a side note, why are you using NFS for block storage? Wouldn’t iSCSI be better?

On FreeBSD, definitely. On Linux, I’m not sure, but it’s at least fairly likely to be better.

When I deployed my first ESXi storage (a long while back) I ran a performance comparison of iSCSI vs NFS for my use case ([a limited number of VMs, high performance requirements for single/individual VMs], back on 10G), and in the end it didn't matter much: one was a bit faster at one block size, the other at another.

In the end, the ease of use of an NFS share (individual files/VMs accessible) versus a large blob decided my use of NFS.

I revisited this a few years back in a quest for more performance, but found that the transport didn't make much difference overall.
Again, that is for my use case, which does not scale well with more vdevs, and supposedly only holds as long as NFS over RDMA does not come into play (which iX tested and deemed not worthwhile [one reason I was looking at Scale, since it's supposed to have implemented a few of my enhancement requests]).
I should note that I run NFS with multiple (2) access paths now, so I would not expect iSCSI to have significant advantages in that regard?

But I am all ears if you tell me iSCSI is faster nowadays?

In the meantime I would just hope that things that are supposed to work do work.
I'd like to identify a potential source of the problem before creating a ticket (i.e. is it me, is it TNS, is it ESXi 8?), which is why I asked for feedback. I didn't think that using NFS would be such an … exotic … choice ;)

As far as I understand, iSCSI has always been the preferred choice for block storage because it works at the block level versus NFS working file by file; it has its limitations, however: iirc it cannot do multiple access paths.

Hm,
I am fairly sure that you can define multiple LUNs to access an iSCSI share multipathed.
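Something along these lines on the ESXi side should show the paths and the multipathing policy in use, if I remember the commands right (the naa. device ID here is just a placeholder):

# list every path ESXi sees for a given iSCSI device
esxcli storage core path list -d naa.60000000000000000000000000000001
# show the NMP multipathing policy (Fixed / Round Robin / MRU) for that device
esxcli storage nmp device list -d naa.60000000000000000000000000000001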

But be that as it may, to me it's beneficial to be able to access a VM (fileset) via TrueNAS versus having only a large blob; so the benefits of iSCSI would have to be noteworthy (10%+ performance, or really improved handling of backups) to have me switch :slight_smile:
Maybe it's more beneficial for paid customers who have access to the VMware plugins (VAAI), but since that cannot be licensed on an individual basis I can't use it.

VAAI support is free and included in CORE/SCALE - if you mean the vCenter plugin, yes that’s Enterprise-only, but that’s not necessary to get all the VAAI goodies going.

Does this work under SCALE with NFSv3 permissions/config? The stateful nature of NFSv4 might be resulting in more problems than it's worth to get the pNFS benefits (which, if you're doing an AIO setup, shouldn't be a factor).

Hm, I am not sure if ESXi uses that properly without the plugin. But I never looked into the nitty-gritty details since there was little documentation on the TNC side last time I looked (but that's been a while) [i.e. how to tell if it's set up correctly, whether it's being used, etc.].

NFSv3 was not able to run multipathed on ESXi, iirc, so v4 has tangible benefits.

I'll set up a TNS VM to test NFSv3, but from what I read in the old forum it should work.

From your VMware host, fire off esxcli storage core device vaai status get and look for your TrueNAS device(s).

In your AIO scenario specifically, the number of physical/logical paths shouldn’t matter - it’s all one virtual pipe through the internal vSwitch.

esxcli storage core device vaai status get
eui.343335304b1001010025384200000001
   VAAI Plugin Name:
   ATS Status: unsupported
   Clone Status: unsupported
   Zero Status: unsupported
   Delete Status: supported
   Ex Clone Status: unsupported

I assume that’s over NFS. iSCSI attached LUNs show as:

naa.6589cfc000000d0b9a44237f9b5007dd
   VAAI Plugin Name:
   ATS Status: supported
   Clone Status: supported
   Zero Status: supported
   Delete Status: supported

I see, a tangible benefit for iSCSI then ;)

Might need to give it another try then, but for now, nfs it is.

Edit:
@HoneyBadger
As expected…

esxcli storage nfs41 add -H 192.168.124.26  -s /mnt/test/nfs/ -v testnfs4
esxcli storage nfs41 remove  -v testnfs4
esxcli storage nfs add -H 192.168.124.26  -s /mnt/test/nfs/ -v testnfs3

Regardless of whether iSCSI is better than NFS, it should be possible for ESXi to work with NFSv4 to TrueNAS SCALE. SCALE does use the standard Linux NFS software, so it will change with the update from Cobia to Dragonfish.

If anyone can test Dragonfish and see the same issue, we’d appreciate a bug report and a system to diagnose.
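As a rough starting point (a sketch only; exact output may differ between releases), this is the kind of thing worth capturing on the SCALE side while reproducing the move:

# confirm the NFS server actually has 4.1 enabled (look for +4.1)
cat /proc/fs/nfsd/versions
# watch kernel/nfsd messages live while the VM move runs
journalctl -k -f | grep -i nfs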

Done. Same result

NAS-128296

So,

it turns out that this is a VMware-only problem.
Moving a VM to a physical TNS installation (23.10.2) works fine, so it's only the ESXi-to-virtualized-TNS combination that's having issues.

I probably should try another virtual NIC type (not vmxnet3), or a pass-through / SR-IOV one…

I asked for the ticket to be closed.

So,

a new twist occurred.
Tried using the previously working installation for another server, and now I am experiencing the same problem on that box.

Seems as if the VM move works fine when targeting the onboard Intel NIC (i210/i210AT), but not when using other NICs (vmxnet3 on the AIO, or a Mellanox CX5 on the dedicated HW).
Not sure if it makes sense to reopen, so I sent in another report.
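My rough plan for narrowing that down further (just a sketch; the interface names are examples, not my actual config):

# capture NFS traffic on the SCALE box once over the working (i210) path and
# once over the failing (CX5 / vmxnet3) path, then compare the two captures
tcpdump -i enp5s0 -s0 port 2049 -w move-i210.pcap
tcpdump -i enp65s0f0 -s0 port 2049 -w move-cx5.pcap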

Hmm,
I feel slightly confused: the ticket got closed with “Won’t help”…

When I wrote my suggestion, I was assuming that TrueNAS was not virtualized and was the physical storage.

If we could see the issue in a physical environment, we could help debug. In virtualized environments, we need the hypervisor supplier to help debug.