But that was exactly the scenario I tested in both my cases.
ESXi host, TNS/TNC physical installation.
When using the combination TNS, ESXi8, NFSv4, onboard NIC, things work; when using the combination TNS, ESXi8, NFSv4, MLX NIC, things didn't work.
I totally understand that you don't … fancy … MLX cards, but I offered to run the same test on a Chelsio card I have…
I am fine just running this on Core.
By the time you guys stop supporting it, ESXi will be dead for home use anyway; I just thought you might want to fix this, since it came up a couple of times before I experienced it and people simply worked around the problem (mainly by going to NFSv3).
Sorry, I misunderstood this. As with Davvo, I was confused by the comment about physical TrueNAS working fine.
Mellanox NICs should be OK
I’d suggest describing things in a different order to clarify
TrueNAS Baremetal with Mellanox NIC
Connecting via NFSv4 on (what speed)
to ESXi8 host with what NIC?
The fact that it works on onboard NIC makes me think there’s a networking issue.
Have you checked for any packet loss in the NICs or switches?
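Something like the following should show error and drop counters on both ends (interface/NIC names below are just examples, adjust to your setup):

# On the TrueNAS SCALE box (Linux): per-interface RX/TX error and drop counters
ip -s link show
# On the ESXi host: per-NIC statistics (receive errors, drops, CRC errors)
esxcli network nic stats get -n vmnic0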
Alright, I see I did not manage to make myself clear.
The initial attempt was with a virtualized setup, since that is what I want to run to save the overhead of a second box. I set up TNS in a VM, shared an NFS drive via NFSv4, and it didn't work. The same share via NFSv3 worked. I then set up a TNC VM with the same drives/virtual HW and NFSv4 worked. I also verified that even the latest RC of TNS is not working.
I then opened a ticket upon your advice.
In the ticket they asked me to verify the issue on a physical TNC implementation.
I did that with a simple test in my office using only basic Ethernet (assuming the issue was more related to the NFS process) - this worked fine. It was using the onboard NIC (Intel i210) of the X12STH-LN4F. I therefore closed the ticket, since I assumed a physical box works fine and the problem must be related to virtualization.
I went on to set up a physical TNS box with the target HW. I tried the NFSv4 connection again and it failed. This was using an MLX CX5 25G NIC.
I tested connecting via the onboard i210AT (on an Asus Z11PA-U12), and that was working fine. Therefore I opened the second ticket, stating that the issue persists and seems to be related to the chosen NIC - apparently a problem with the MLX rather than the Intel onboard NIC. That ticket got closed.
After that I reinstalled the same box with TNC, which is now working fine again with the same physical HW on NFSv4.
I don’t think this should be related to networking, as the same hardware with the same networking setup is working perfectly fine with TNC, but has problems with TNS.
The only moving part in this is the OS on the box.
There are more switches in the path on 1G than on 25/100G…
No packet loss that I have seen, and I don't think it would fit the picture anyway.
Packet loss should show up as low transfer speeds or transfers breaking off completely, not as an inability to initiate a transfer in the first place.
Also, as mentioned, with identical HW and TNC as the target there are no problems whatsoever - it has been running stable storage for ~10 VMs since day 1 (i.e. 5 days ago) over the same infrastructure.
The puzzle here is that for some hardware/NIC it worked.
For another NIC it did not work.
So my (perhaps wrong) conclusion is that it's network related… either driver, NIC firmware, or switch. No packet loss, though, so there's no evidence.
If you were a commercial customer… we’d probably swap the NIC to our preferred version and see if the problem goes away. Any chance of using another NIC?
Or if another user could see the problem on their setup… we’d have a pattern.
I already offered to use Chelsio T62100-LP-CR’s.
On ESXi? On TNS? Both?
I can also try a direct connection to rule out the switch.
Maybe it's only the multipath connection…
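For reference, the multipath (session-trunked) variant of the NFSv4.1 mount on ESXi just lists multiple server IPs; the addresses, export path, and datastore name below are purely illustrative:

# NFSv4.1 with two server IPs for the same export (illustrative values)
esxcli storage nfs41 add -H 10.0.10.201,10.0.20.201 -s /mnt/tank/nfs -v tn_nfs_mp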
Yeah, there were a few people in the old forum having issues, but they rarely provided full specs.
I can't imagine it's only me; I've tried different platforms and separate installs, so it seems repeatable.
Maybe a reader using TNS/ESXi (7, 8) and not the onboard NICs would be so kind as to quickly test this? Happy to provide a config guide if needed. Just looking for some (non-)working HW combinations to home in on potential causes…
New setup: Chelsio T62100-LP-CRs connected back to back (Fujitsu TX1320M3s this time).
I downgraded to ESXi 7.0.3 to see if it's just a particular ESXi version.
I tried with a single IP only to have another data point.
As expected, same problem with nfs41.
That same mount via NFSv3 is working fine.
Does not work:
esxcli storage nfs41 add -H 10.0.10.201 -s /mnt/testtns24_1/nfs -v tn24_1_nfs
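Works fine - the matching NFSv3 mount would be the equivalent of the following (the datastore name suffix here is just illustrative, since ESXi needs a distinct name per datastore):

# NFSv3 mount of the same export - this one mounts and works
esxcli storage nfs add -H 10.0.10.201 -s /mnt/testtns24_1/nfs -v tn24_1_nfs3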
Thanks… can you keep the setup so we can diagnose?
It would be useful if you try Dragonfish 24.04.
The NFSv4.1 code is in Linux, and Dragonfish has a Linux kernel update.
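If it helps with narrowing things down, the NFS protocol versions the kernel server actually has enabled on the SCALE box can be checked with:

# Lists enabled NFS versions on the Linux NFS server (e.g. +3 +4 +4.1 +4.2)
cat /proc/fs/nfsd/versions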