I was getting a normal 10GbE connection prior to upgrading. After going to Fangtooth, file transfers just pause after about 10 seconds, stay stalled for a very long time, and eventually time out. The same happens when browsing directories over SMB: no files load after entering a directory. I ran some iperf tests and saw that I was only getting 1 Gbit transfer speeds. I don't think the card is in InfiniBand mode, since there's an established connection, and I can't install open-ib to check due to the apt restriction on TrueNAS.
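(Side note: with the mlx4 driver the port type can apparently be read straight from sysfs, no extra packages needed. A quick check, assuming the card is at 0000:01:00.0 as in the dmesg below:)
admin@truenas[~]$ cat /sys/bus/pci/devices/0000:01:00.0/mlx4_port1   # prints "eth" or "ib" for port 1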
Restarting TrueNAS seems to resolve the issue, but after a while the connection degrades back to 1GbE again. I can't tell what's causing this. Here's my dmesg:
admin@truenas[~]$ sudo dmesg | grep mlx4
[ 1.220356] mlx4_core: Mellanox ConnectX core driver v4.0-0
[ 1.220368] mlx4_core: Initializing 0000:01:00.0
[ 1.220414] mlx4_core 0000:01:00.0: enabling device (0140 -> 0142)
[ 8.174375] mlx4_core 0000:01:00.0: DMFS high rate steer mode is: disabled performance optimized steering
[ 8.174684] mlx4_core 0000:01:00.0: 31.504 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x4 link)
[ 8.207904] mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0
[ 8.208195] mlx4_en 0000:01:00.0: Activating port:1
[ 8.220667] mlx4_en: 0000:01:00.0: Port 1: Using 8 TX rings
[ 8.226918] mlx4_en: 0000:01:00.0: Port 1: Using 8 RX rings
[ 8.233477] mlx4_en: 0000:01:00.0: Port 1: Initializing port
[ 8.239951] mlx4_en 0000:01:00.0: registered PHC clock
[ 8.243417] <mlx4_ib> mlx4_ib_probe: mlx4_ib: Mellanox ConnectX InfiniBand driver v4.0-0
[ 8.243829] <mlx4_ib> mlx4_ib_probe: counter index 1 for port 1 allocated 1
[ 8.247878] mlx4_core 0000:01:00.0 enp1s0: renamed from eth0
[ 10.763516] mlx4_en: enp1s0: Link Up
[ 65.555787] mlx4_en: enp1s0: Steering Mode 1
[ 65.567016] mlx4_en: enp1s0: Link Up
Last 100 lines, since I can't include the entire dmesg:
[ 30.653379] power_meter ACPI000D:00: Found ACPI power meter.
[ 30.653450] power_meter ACPI000D:00: Ignoring unsafe software power cap!
[ 63.335560] ioatdma: Intel(R) QuickData Technology Driver 5.00
[ 63.340167] NTB Resource Split driver, version 1
[ 63.342815] Software Queue-Pair Transport over NTB, version 4
[ 63.355772] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
[ 65.351995] bond0: (slave eno1): Enslaving as a backup interface with a down link
[ 65.417272] bond0: (slave eno2): Enslaving as a backup interface with a down link
[ 65.555787] mlx4_en: enp1s0: Steering Mode 1
[ 65.567016] mlx4_en: enp1s0: Link Up
[ 69.041092] igb 0000:09:00.0 eno1: igb: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[ 69.073084] igb 0000:0a:00.0 eno2: igb: eno2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[ 69.196702] bond0: (slave eno1): link status definitely up, 1000 Mbps full duplex
[ 69.196719] bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond
[ 69.197936] bond0: (slave eno2): link status definitely up, 1000 Mbps full duplex
[ 69.197949] bond0: active interface up!
[ 80.373872] NFSD: Using nfsdcld client tracking operations.
[ 80.373875] NFSD: no clients to reclaim, skipping NFSv4 grace period (net f0000000)
Interesting! I also use ConnectX-3 cards on my TrueNAS boxes. (They are end of life, but quite reliable on Linux in my experience, and come in compact form factors.) I was worried when I saw this post, but so far, my speeds still seem as expected:
Thanks for confirming that it's working fine for you. I went and swapped out my transceivers, and that seems to have fixed the issue. It was strange: with the iperf3 listener running on Windows I got about 7-8 Gbps, but from Windows to Linux (listener) it was under 1 Gbps. My transceiver on the switch side was running super hot though; I wonder if that was the issue.
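(For reference, both directions can be checked from the same client using iperf3's reverse mode; a rough sketch, where <truenas-ip> is a placeholder:)
iperf3 -s                          # on the TrueNAS side (listener)
iperf3 -c <truenas-ip> -t 30       # on the Windows side: client -> server
iperf3 -c <truenas-ip> -t 30 -R    # same client, reversed: server -> client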
Ah, that is strange, but glad it’s working for you now!
Indeed, I left it running for a couple of hours and didn't observe any drop in throughput between a Windows box (with a ConnectX-4 card) and a Fangtooth box (with a ConnectX-3 card):
Ah, quite possibly. I don’t remember the details, but I recall reading about driver issues with the ConnectX-3 cards on Windows 11. That’s why I got a ConnectX-4 for my Windows desktop. (It’s also an ancient card, but uses their newer WinOF-2 driver stack instead of the original WinOF driver that the X-3s use.)
FWIW, the example I gave above was 10G bidi traffic between an MCX311A-XCAT (TrueNAS 25.04) and an MCX4121A-ACAT (Windows 11). I haven't had any issues with that setup, personally.
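(If anyone wants to reproduce a bidirectional run like that, iperf3 can drive both directions at once; a sketch, assuming iperf3 3.7 or newer and a placeholder address:)
iperf3 -c <peer-ip> -t 60 --bidir   # simultaneous send and receive from one client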
At least when I bought mine, ConnectX-4 cards were pretty decently priced. The only real drawback is they're not tiny like the MCX311As.
Yeah, you're right. I'm seeing other people with ConnectX-3 cards reporting degraded speeds on Windows 11 over the past few months. Luckily I have two ConnectX-4 cards available to swap in. I was just avoiding using them because they get so hot, but that's nothing an extra intake fan won't fix. Thanks for checking your settings to validate my issue!
What I've seen with my ConnectX-3 card and Windows 11 is a drop to around half the normal speed after the Windows box wakes up from sleep. If I disable sleep, I don't see any degradation.
I've already swapped over to my ConnectX-4 on my Windows 11 box and I haven't seen that issue since. From the last iperf3 test I did, you can tell that it was the Windows machine that was having the issue.
Here is the ifconfig output from TrueNAS (enp1s0 is the ConnectX-3 card):
I've noticed this as well during the testing I did after seeing the issue in this thread. It had apparently been happening for a while without me noticing: I've been storing my files on my external HDD recently and hadn't been backing them up to my NAS, so it was only when I started catching up on missing backups that I noticed it, coincidentally right when Fangtooth released. It's definitely the Windows 11 ConnectX-3 drivers that are the issue; switching to my spare ConnectX-4 shows normal behavior.
It started happening again. This time I verified that it was not Windows causing the problem.
Rebooted Windows first when this happened (iperf3 still sub 1 Gbit).
Rebooted TrueNAS next, which solved the issue (expected 9.x Gbit).
Left everything as is, tested a few hours later, and iperf3 showed sub 1 Gbit again.
Things I’ve done:
Tried turning off ASPM by adding GRUB_CMDLINE_LINUX="pcie_aspm=off pcie_port_pm=off" to /etc/default/grub.d/truenas.cfg and running sudo update-grub. Did not fix the issue (see the command sketch after this list for verifying the ASPM state).
Looked through dmesg, /var/log/error, and /var/log/syslog: no indication of any error relating to the ConnectX-3 card.
Manually reloaded the Mellanox kernel modules (modprobe mlx4_core, modprobe mlx4_en, and modprobe mlx4_ib), and the speed issue resolved; I'm receiving the expected 9.x Gbit again (reload sequence sketched after this list).
3a. I've also tested with multiple concurrent iperf3 streams; it makes no difference when a single-stream iperf3 test already shows sub 1 Gbit.
I should mention my server specs at this point in case that has anything to do with it:
4a. Intel S1200SPLR, Xeon E3-1270 V5, Intel Integrated RAID Mezzanine Module RMS3HC080 SAS/SATA H24093-204, ConnectX-3 in PCIe 3.0 x8 Slot 4 (direct to CPU), ASM2812 M.2 NVMe bifurcation card in PCIe 3.0 x8 Slot 5 (PCH), Radian Memory Systems RMS-200 in PCIe x8 Slot 6 (direct to CPU).
4b. I've swapped the ConnectX-3 and the Radian card as well, but the issue still occurs; both slots are wired directly to the CPU. I have not bothered putting it in Slot 5 since that goes through the PCH.
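(For reference, roughly the commands behind items 1 and 3; a sketch where the PCI address 0000:01:00.0 and interface name enp1s0 come from my dmesg and may differ elsewhere, and unloading the modules drops the link briefly:)
admin@truenas[~]$ sudo lspci -s 01:00.0 -vv | grep -i aspm       # LnkCtl should report "ASPM Disabled" after the grub change
admin@truenas[~]$ sudo modprobe -r mlx4_ib mlx4_en mlx4_core     # unload the dependent modules first, then the core driver
admin@truenas[~]$ sudo modprobe mlx4_core && sudo modprobe mlx4_en && sudo modprobe mlx4_ib
admin@truenas[~]$ sudo ethtool enp1s0 | grep Speed               # should show 10000Mb/s once the link comes back up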
Next thing I will test is which particular kernel module(s) need to be reloaded to resolve the speed degradation, or whether reloading all three is necessary. I still don't understand what is causing this or the interval at which it happens (which may be related), so I'm planning to log throughput periodically, as sketched below.
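(A rough sketch of that periodic logging; it assumes iperf3 -s is running on the Windows box and <windows-ip> is a placeholder:)
# log one iperf3 result every 15 minutes to catch roughly when the slowdown starts
while true; do
    echo "$(date -Is) $(iperf3 -c <windows-ip> -t 10 | grep sender)" >> ~/iperf_log.txt
    sleep 900
done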
Lastly, to rule out the upgrade itself as the source of the error, all of this was tested on a fresh install of Fangtooth with no previous configuration loaded.
You have two IP addresses on the same network on the same server, 192.168.1.5 (enp2s0) and 192.168.1.35 (bond0). That means there are two routes to/from your server on the same network.
This setup is invalid and is liable to cause at least some of the discrepancy you are seeing, where you were limited to around 111 MBytes/s (roughly 1 Gbit/s) in one direction.
The solution is to not have two IPs on the same subnet on the same device.
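(This is easy to confirm from the shell; a quick sketch using the addresses above:)
admin@truenas[~]$ ip -br addr show                 # each interface and its IPs; two addresses on 192.168.1.x is the problem
admin@truenas[~]$ ip route | grep 192.168.1.0/24   # two "192.168.1.0/24 dev ..." entries means two competing routes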
I just thought about this as well and deleted the bond0 connection to see if it resolves the issue. That's actually good to know not to do. So is the recommended practice to always have only one connection? If I wanted failover, should I group all three NICs into a bond and set the other two as failover, sharing a single IP?