I’m an ex-VMware trainer (I stopped at version 8). The labs we used for the delegates always used iSCSI, with the MTUs set to jumbo. It always worked flawlessly. With NFS you lose functionality such as VMware Fault Tolerance.
The “every write is a sync write” VMware-over-NFS thing has always put me off…
Regarding the iSCSI multipath comment from 9 months ago… it does work. It’s the original tech for VMware storage over combined links. Multi-link NFS is the new kid on the block…
I’m experiencing the same behavior. I only just moved from Core to Scale (specifically, from Core to ElectricEel 24.10.2.1). I noticed my VMware NFS datastore was giving me all kinds of strange errors after the upgrade to Scale. In the meantime I’m also upgrading my primary NAS from a Supermicro X10 mobo to a Supermicro X11 mobo, so I have two systems to test this behavior on.
I was running NFS 4.1 on Core, using the multipath feature with direct connections between the ESX hosts and the X10-based TrueNAS instance. For the direct connections I was using a mix of the built-in Gigabit NICs and an Intel Pro/1000 PT Dual Port NIC.
Reverting to a single (onboard) NIC and connecting the datastore using NFS 3 restored connectivity to the VMs.
I prepared the X11 mobo with a couple of Intel X550-T2 dual 10GbE NICs and ran some tests as well.
I tried migrating a test VM (in a shut-down state) back and forth between a local ESX datastore and the TrueNAS NFS datastore.
X10
onboard NIC - nfs3 - works
onboard NIC - nfs41 - error: The virtual disk is either corrupted or not a supported format.
Intel 1000PT Dual NIC - nfs3 - [I can only test this next week, when I have physical access to my environment again]
Intel 1000PT Dual NIC - nfs41 - [I can only test this next week, when I have physical access to my environment again]
X11
onboard NIC - nfs3 - works
onboard NIC - nfs41 - error: The virtual disk is either corrupted or not a supported format.
Intel X550-T2 - nfs3 - works
Intel X550-T2 - nfs41 - error: The virtual disk is either corrupted or not a supported format.
NFS4.1 still works with my current backup TrueNAS instance (still running Core, HP DL120 G7)
I am running the latest version of ESXi 8.
I know it’s already been mentioned that iSCSI is recommended, and I don’t remember why I chose NFS when I set this up. I’ll have another look at iSCSI now and will probably move to it. Still, NFS should work: it works with a Core instance and does not work with my Scale instances. If there’s any information I can provide, or more tests I can run to help find the cause, let me know.
Bringing this back to life, as I have been experiencing this with TrueNAS Scale (physical server), ESXi and NFS 4.1. Everything works fine if you mount the volume with NFS 3. The problem appears to be with NFS 4.1 when a SLOG device is in use: once a SLOG device is added, writes going via NFS 4.1 do not show up immediately. ESXi therefore fails when Storage vMotioning, because the moved files do not appear right away. The same can happen when creating VMs: if you create a VM and immediately start it, it can fail to start.
Move all of this to v3 and it all works OK; likewise with removing the SLOG device on v4 (I need to test that more to be 100% sure).
Edit:
OS: TrueNAS Scale 25.04.1
Tested with an 8 x 11-disk raidz3 with mirrored ZeusRAM SLOG: NFSv4.1, problems.
Tested with a 3 x 12-disk raidz2 with mirrored ZeusRAM SLOG: NFS3, OK; NFSv4.1, problems.
Tested with a mirror pool, no SLOG: NFSv4.1, OK.
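For anyone wanting to repeat the SLOG A/B comparison above, the device can be detached and re-attached without recreating the pool. This is only a sketch: the pool name, vdev label, and disk paths below are placeholders for your own setup, and you should confirm the log vdev's name from `zpool status` first.

```shell
# Show the pool topology; the SLOG appears under the "logs" section
zpool status tank

# Remove the mirrored SLOG for the no-SLOG NFSv4.1 test run
# ("mirror-2" is whatever label zpool status shows for the log mirror)
zpool remove tank mirror-2

# Re-add the mirrored ZeusRAM SLOG afterwards (device paths are placeholders)
zpool add tank log mirror /dev/disk/by-id/slog0 /dev/disk/by-id/slog1
```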
You can try Report a Bug, either through the TrueNAS GUI (the smiley icon in the upper right, for Feedback / Report a Bug) or through the forum (Report a Bug, upper right, near the top of the screen). Please link the created ticket in this thread.
Opened one: NAS-136888
So, it would seem that writes are not sync.
Can you verify the sync settings for the dataset?
The writes are sync. The dataset was set to standard (the default), and I also tested with sync set to always; it behaves the same. Only when the SLOG device is removed does it behave correctly with NFS 4.1.
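For reference, this is roughly how the sync property can be checked and toggled for the test runs described above ("tank/vmware" is a placeholder for the dataset backing the datastore):

```shell
# Show the effective sync setting and where it is inherited from
zfs get sync tank/vmware

# Force every write to be synchronous for a test run
zfs set sync=always tank/vmware

# Revert to the default behaviour afterwards
zfs set sync=standard tank/vmware
```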
Even if they were not sync writes, the data should still be updated for the clients. If it’s not, then async writes cannot be used for anything, because you now have an inconsistent file system at runtime, not just on failure. The read should come from the write cache if it hasn’t been flushed yet.
It’s a mystery.
Is there any chance the client is not set up to commit writes and wait for acknowledgement?
The client is VMware, I assume?
It is behaving like the NFS data is mounted as async. Some discussion here:
If it were the mount / client, then removing the SLOG would have zero effect, wouldn’t it, since that’s server-side.
This is with VMware / ESXi 7.
This is also Storage vMotion, so if there were a client-side async problem, the data should simply fail to show up for the other clients. Here, the client that is writing is not getting back what it has written / says has been committed. That would mean a problem with the client’s NFS 4 implementation. If it were async writes, the data should still show as committed to the client via its write buffer, even if it’s not actually committed server-side, shouldn’t it, unless something is broken.
Looking to see if I can create a script to replicate it on another server mounting the NFS share.
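A minimal sketch of such a replication script, assuming two separate mounts of the same export (e.g. one v3 and one v4.1, or mounts on two different hosts); all paths are placeholders, and for a quick local sanity check both arguments can point at the same directory:

```python
import os
import sys
import time

def check_write_visibility(write_dir, check_dir, payload=b"x" * 4096, timeout=5.0):
    """Write a file under write_dir and poll for it under check_dir.

    Returns the seconds it took for the full payload to become visible,
    or None if it never appeared within the timeout.
    """
    name = f"nfs-vis-test-{os.getpid()}-{time.time_ns()}"
    src = os.path.join(write_dir, name)
    dst = os.path.join(check_dir, name)

    # Synchronous write on the writer side, mirroring what ESXi expects
    with open(src, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())

    # Poll the second mount until the file (and its full size) is visible
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        try:
            if os.path.getsize(dst) == len(payload):
                return time.monotonic() - start
        except FileNotFoundError:
            pass
        time.sleep(0.05)
    return None

if __name__ == "__main__" and len(sys.argv) >= 3:
    delay = check_write_visibility(sys.argv[1], sys.argv[2])
    print("visible after %.3fs" % delay if delay is not None else "NOT visible")
```

On a healthy export the file should be visible on the second mount within attribute-cache latency; with the SLOG + NFSv4.1 behaviour described above, I’d expect this to report "NOT visible" (or a long delay) some of the time.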
I also suspect it’s server-side… just looking for proof.
I don’t know if it’s an NFS server config issue or a software issue.
In case it helps: I’ve also just discovered problems using an NFSv4 datastore on an ESXi 8 host.
Sometimes cloning or relocating a VM onto the NFS volume works fine; more often than not, though, it fails with the error:
“The virtual disk is either corrupted or not a supported format”
As a test I switched the datastore to NFSv3, and it all works fine.
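For anyone repeating this test, switching the datastore between protocols means removing and re-adding the mount on the host. A sketch using the standard esxcli NFS namespaces (the host address, export path, and volume name below are placeholders for your own environment):

```shell
# Unmount the NFSv4.1 datastore (volume name is a placeholder)
esxcli storage nfs41 remove -v nfs-datastore

# Re-mount the same export as NFSv3
esxcli storage nfs add -H 192.168.1.10 -s /mnt/tank/vmware -v nfs-datastore

# Confirm the current NFS mounts
esxcli storage nfs list
```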
ESXi host version 8.0.3
Physical TrueNAS server version: 25.04.2.6
Hi,
Before Christmas I responded to this thread and also created a new thread discussing NFS behaviour similar to what’s talked about here (with extra info I’d received from Broadcom / VMware on the topic).
I was messaging someone who seemed to be a member of the TrueNAS staff, discussing the problem with NFS4 and attempting to find a cause / solution. Before the new year I said I might be able to set up a test TrueNAS that we could use to perform various tests.
That thread and the messages between myself and this individual appear to be gone. I’m not sure why.
Just after Christmas I had an accident and broke some ribs, resulting in a delayed update from me. I’m happy to report that I have since been able to find enough spare hardware to set up a physical test TrueNAS box that can be used for testing.
If anyone from TrueNAS is still interested in trying to diagnose the NFSv4 issue, please let me know.
Are you talking about this issue:
Looks like it didn’t get backported to 25.10.2. You can try against our 26 nightlies and see if it resolves the issue.
I’m assuming that was me you’re referring to - and I recall that conversation and the discussions back and forth re: the Broadcom support tech passing along the details of the handling. Not sure where those messages wafted off to, but that’s distressing.
Give the latest 25.10.2.1 release a shot here:
Hiya!
Really weird how the thread I created and our direct messages have all just gone; but happy to know you didn’t get run over by a truck or something.
I will try updating the test TrueNAS to that version after lunch!
The update seems fine. Diagnostics spotted one or two malfunctioning disks; I’ll replace them ASAP.
So, did it resolve the issue you / Broadcom were seeing?
No further spurious NFSv4 errors regarding file locks, failure to open, etc?