SCALE Dragonfish Kernel TCP Error (Aquantia 10 GbE NIC): "Driver has suspect GRO implementation, TCP performance may be compromised."

Hello!

I’m seeing this in my syslog:

Aug 26 10:01:05 vectorsigma kernel: TCP: bond0: Driver has suspect GRO implementation, TCP performance may be compromised.

From the research I’ve done, I found this page: Driver has suspect GRO implementation, TCP performance may be compromised – JoeLi's TechLife

If the calculated MSS (Maximum Segment Size) is higher than the advertised MSS, then the new MSS is set to the advertised MSS. In earlier kernels this was not done, which could cause performance issues while GRO (Generic Receive Offload) was in use. The TCP window size for data transfer is determined based on these MSS values.
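
For what it’s worth, a quick way to see the MSS values the kernel is actually using on live connections is iproute2’s ss. This is my own spot-check, not something from that article, and I’m assuming ss is present (it seems to be on SCALE, since ip is):

sudo ss -ti dst 10.10.10.0/24    # look at the mss:, advmss: and cwnd: fields for connections on the LAN subnet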

RHEL 7.4 apparently introduced a fix for this into its kernel back in 2019, “and stated this message can be treated as informational only, as a warning which can be safely ignored.”

However, I’m not sure if that holds true for TrueNAS SCALE.

Here’s some additional diagnostic information. Do I need to do anything to address this, or is it a harmless warning?

OS Version: TrueNAS-SCALE-24.04.2
Product: DXP8800 Plus
Model: 12th Gen Intel(R) Core(TM) i5-1235U
Memory: 31 GiB

admin@vectorsigma[~]$ uname -a
Linux vectorsigma 6.6.32-production+truenas #1 SMP PREEMPT_DYNAMIC Mon Jul  8 16:11:58 UTC 2024 x86_64 GNU/Linux

admin@vectorsigma[~]$ sudo lspci | grep -i ethernet
58:00.0 Ethernet controller: Aquantia Corp. Device 04c0 (rev 03)
59:00.0 Ethernet controller: Aquantia Corp. Device 04c0 (rev 03)

admin@vectorsigma[~]$ sudo lspci -vvvv -n -s 58:00 | grep -i kernel
        Kernel driver in use: atlantic
        Kernel modules: atlantic

admin@vectorsigma[~]$ ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute 
       valid_lft forever preferred_lft forever
2: enp88s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
    link/ether 6c:1f:f7:0c:ce:be brd ff:ff:ff:ff:ff:ff
3: enp89s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
    link/ether 6c:1f:f7:0c:ce:be brd ff:ff:ff:ff:ff:ff permaddr 6c:1f:f7:0c:ce:bd
4: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether 6c:1f:f7:0c:ce:be brd ff:ff:ff:ff:ff:ff
    inet 10.10.10.40/24 brd 10.10.10.255 scope global bond0
       valid_lft forever preferred_lft forever
5: vlan200@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether 6c:1f:f7:0c:ce:be brd ff:ff:ff:ff:ff:ff
    inet 10.10.200.2/24 brd 10.10.200.255 scope global vlan200
       valid_lft forever preferred_lft forever

Better to file a JIRA ticket on this than a forum post. And better not to use Aquantia NICs.

Thanks. I wanted to post here first in case this was a known non-issue.

The Aquantia NICs are on the board, and I’ve only got a PCIe 4.0 x4 slot, so hopefully I’ll be able to make them work.

Here’s the bug report, if anyone else would like to keep an eye on it. It’s being triaged as I write this.

https://ixsystems.atlassian.net/browse/NAS-130942


Update from the Bug Clerk:

Unfortunately we do not have resources to investigate this further, especially because there are no apparent issues and it’s hardware dependent.
We will happily integrate any provided patches and keep bringing in upstream kernel fixes. Thank you for understanding.

I’ll admit, that’s disappointing. That warning in itself seems like an issue to me, especially as this issue was supposed to have been fixed years ago according to Red Hat.

I’m not a network engineer, so I have no idea how to know if it’s causing me real problems. Speed hits? Invisible data corruption?
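
The closest thing to a concrete check I could find on my own (my guess at a sanity check, nothing official) is to watch the interface error and drop counters and make sure they aren’t climbing:

sudo ip -s link show bond0     # RX/TX errors and dropped counters for the bond
sudo ip -s link show enp88s0   # and for each member interface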

With all respect to the devs, it’s not enough to say that we shouldn’t use Aquantia NICs. Mine are integrated into my motherboard, and I’m not sure I want to burn my only PCIe slot on an Intel NIC when I’d really benefit from an NVMe mirror or a SLOG.

I expect their response would be that your poor hardware selection is not their problem. The obvious (to me) counterpoint is that they kind of talk out of both sides of their mouths in that regard–their marketing materials suggest that just about any x64 system with at least 8 GB of RAM would be suitable; it isn’t until you start digging that you hear about ECC, 16+ GB of RAM, SAS HBAs vs. cheap SATA controllers, and a pretty short list of truly suitable 10 GbE (or faster) NICs.

FWIW, my secondary NAS also has Aquantia NICs onboard, and I haven’t noticed any issues resulting from them.


Exactly this. From everything I’ve seen online, there are a lot of people just getting into self-hosted storage who watch YouTube videos that make TrueNAS seem like a great solution, and who would have a great time with it, but who aren’t trained to build servers and need a more honest and restrictive set of minimum requirements. There’s a real difference between “it will run” and “it will run well for daily use.” TrueNAS SCALE will install and boot on a ZimaBoard SBC. That doesn’t mean it should be used in production.

OPNsense does a good job of explaining minimum/good/best hardware for their platform and various use cases in a way that’s approachable for entry-level users. I’d love to see something like that for TrueNAS SCALE.

And I would have loved to buy a TrueNAS Mini with a warranty and support for a pre-tested hardware configuration, but the existing models have aged significantly without seeing a price drop, and the 8-bay, which is what I actually wanted, isn’t even in production, with no announcement of a replacement. I ended up with a great system (though the PSU is a bit crappy and noisy), but it wasn’t at all what I wanted or would have been willing to buy if iX was actually selling it.

Thanks for this. I’m going to mark your post as the solution. As I suspected, this was mostly an outdated caution message that wouldn’t matter in the real world.


The most charitable reading is that Aquantia NICs are targeted at the consumer market, not the server market; iX is not going to spend the money to bring the drivers up to server-grade quality, and the Linux developers are not going to do it either. So no solution is in sight.

If you do want to throw in a well-supported server NIC, note that second-hand Chelsio T520 or Solarflare NICs are cheap (and Intel NICs may be even cheaper… if you know how to track them down as HP/Dell/Lenovo-rebranded parts), and there are cheap x8/x4/x4 risers to put a low-profile card and two M.2 drives in a x16 slot. It doesn’t have to be a choice between a NIC and NVMe drives.


In my case, unfortunately, it’s a PCIe 4.0 x4 slot. I’m most likely to put a 2x NVMe card in it, one of the ones with a PCIe switch, so I can use it to store VMs.

I’m in a home office environment, so as long as the AQC NICs don’t cause data loss, I’m okay if they’re not the fastest things ever. Though I would like to clear the extraneous warning from my logs. :stuck_out_tongue:

I get where you’re coming from with this, but it reinforces how bonkers the 10 GbE marketing is right now. It’s not marketed as a consumer technology, but we have a class of devices (Aquantia NICs, mostly) that we accept as being of lesser quality (mostly due to driver support, from what I’ve seen): a “consumer” tier in a non-consumer product segment.

Worse, we’re in a situation where the AQC NICs show up in expensive SOHO-aimed prebuilt server systems, on entry-level server motherboards, and on higher-end production Macs: all cases where PCIe expansion for alternatives is limited.

Yes, I’d love to be able to buy a SOHO-focused NAS with a single 2.5 GbE NIC for management and room for a couple of PCIe slots to add my own fast storage and fast networking, but that product doesn’t exist for under $1,500, and you’re going to end up compromising elsewhere.

I’d really like to see Linux and BSD upstream devs meet the market where it is. AQC NICs aren’t going anywhere except into more machines in the low- and mid-range market.

I think the annoyance here with the state of Aquantia NIC support is being misdirected at TrueNAS. The Linux kernel itself is generating the message:

Aug 26 10:01:05 vectorsigma kernel: TCP: bond0: Driver has suspect GRO implementation, TCP performance may be compromised.

The Linux kernel is telling the user that the driver is the problem. Silencing the alerts would not resolve the fact that the driver is (according to the kernel) poorly written. This is Aquantia’s (Marvell’s) fault. Their customers should be upset with them, not with the Linux kernel or with TrueNAS.

The only workaround I can think of would be to disable offloading on the card.
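
If it comes to that, here’s a rough sketch of what I mean (untested on Aquantia hardware, and assuming the member interface names from the ip addr output above):

ethtool -K enp88s0 gro off lro off
ethtool -K enp89s0 gro off lro off
# these settings revert at reboot; to make them stick you'd need something like a post-init command/script

Expect a CPU and throughput trade-off, since GRO exists to cut per-packet overhead on receive.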

Can you type this command and paste the output here? You may be able to pretty easily turn the suspect feature off.

ethtool -k enp88s0 | grep gro && ethtool -k enp89s0 | grep gro && ethtool -k bond0 | grep gro


Thanks!

Dragonfish doesn’t include ethtool. How did you get it onto your system?

% apt search ethtool
Package management tools are disabled on TrueNAS appliances.

Attempting to update SCALE with apt or methods other than the SCALE web
interface can result in a nonfunctional system.

Is the warning about the Aquantia hardware or the bond0 interface? Do you get the same warning if you drop the bond and use a single interface? It would be useful to know the true source.
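
One way to watch for it while you toggle things (just the obvious kernel-log grep, nothing SCALE-specific):

sudo journalctl -k | grep -i "suspect GRO"   # the interface name in the message tells you where the kernel saw it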

bond0 and GRO… what’s the state of how these two interplay?

Re: [Question] About bonding offload - Jay Vosburgh
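
Since the bond builds its feature set from its members, one quick way to see how they line up (run as root; process substitution needs bash/zsh, which the root shell on SCALE provides) is:

diff <(ethtool -k enp88s0) <(ethtool -k bond0)   # compare member vs. bond offload features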

I did not load it; it was included in my Dragonfish system.

root@rawht[~]# cat /etc/version
24.04.2# 
root@rawht[~]# ethtool -h
ethtool version 6.1

It’s also in EE Beta 1

root@prod[~]# cat /etc/version
24.10-BETA.1#                                                                   
root@prod[~]# ethtool -h
ethtool version 6.1

I think you may have been logged into the shell with the admin account. You would have to use sudo to run ethtool. You can run the commands exactly as I sent them if you run sudo su first to make yourself root.
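
In other words, either of these should work (the exact path may differ on your build, so check with command -v ethtool as root):

sudo su                                        # become root, then run the commands as posted
sudo /usr/sbin/ethtool -k enp88s0 | grep gro   # or call it through sudo with its full path from the admin account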

It is for the bond0 interface, but I can assure you, on the many SCALE systems I’ve seen with a bond, this would be the first time I’ve seen this.

More likely the problem is manifesting lower down the stack (in the driver for the interfaces underneath) and working its way up to the bond’s abstraction layer.

I’m just troubleshooting bottom-up and you’re looking at it top-down. It’s also far more likely (in my opinion) that an Aquantia driver would be the source of the problem… rather than the Linux kernel implementation of a bond interface. :slight_smile:

Did you read the link I shared? The software “stacking” network interface drivers do interact with the various offload/acceleration strategies.

Easy enough to confirm either way.

Oops. Egg on my face there. :wink: I’m not entirely used to Linux distros that don’t include the full sbin paths even for admin accounts and force you into root for this sort of thing. TrueNAS SCALE is quite rigorous about that.

root@vectorsigma[~]# echo enp88s0 && ethtool -k enp88s0 | grep gro && echo enp89s0 && ethtool -k enp89s0 | grep gro && echo bond0 && ethtool -k bond0 | grep gro  
enp88s0
rx-gro-hw: off [fixed]
rx-gro-list: off
rx-udp-gro-forwarding: off
enp89s0
rx-gro-hw: off [fixed]
rx-gro-list: off
rx-udp-gro-forwarding: off
bond0
rx-gro-hw: off [fixed]
rx-gro-list: off
rx-udp-gro-forwarding: off

I’m not great with network driver stuff, but it looks like it’s saying all the GRO offloading is already off? And more to the point, it’s fixed off, which I’m reading as possibly not being supported in the hardware?

EDIT: On a hunch, I decided to search for “offload” by itself, and got some more interesting (and contradictory) results.

The individual rx-gro-* values are all off / fixed off, but generic-receive-offload itself (along with several other offloads) appears to be on.

root@vectorsigma[~]# echo enp88s0 && ethtool -k enp88s0 | grep offload && echo enp89s0 && ethtool -k enp89s0 | grep offload && echo bond0 && ethtool -k bond0 | grep offload 

enp88s0
tcp-segmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: on
rx-vlan-offload: on
tx-vlan-offload: on
l2-fwd-offload: off [fixed]
hw-tc-offload: on
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
macsec-hw-offload: off [fixed]
hsr-tag-ins-offload: off [fixed]
hsr-tag-rm-offload: off [fixed]
hsr-fwd-offload: off [fixed]
hsr-dup-offload: off [fixed]

enp89s0
tcp-segmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: on
rx-vlan-offload: on
tx-vlan-offload: on
l2-fwd-offload: off [fixed]
hw-tc-offload: on
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
macsec-hw-offload: off [fixed]
hsr-tag-ins-offload: off [fixed]
hsr-tag-rm-offload: off [fixed]
hsr-fwd-offload: off [fixed]
hsr-dup-offload: off [fixed]

bond0
tcp-segmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: on
rx-vlan-offload: on
tx-vlan-offload: on [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off
esp-tx-csum-hw-offload: off
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
macsec-hw-offload: off [fixed]
hsr-tag-ins-offload: off [fixed]
hsr-tag-rm-offload: off [fixed]
hsr-fwd-offload: off [fixed]
hsr-dup-offload: off [fixed]

EDIT:
There’s also a version of these drivers (that must be compiled for Linux) available directly from Marvell.

I’m not sure whether these are different from the in-tree atlantic driver or, if they are, how much newer they are.
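
For comparison, the in-tree driver that’s actually loaded should show up with something like this (standard ethtool/modinfo queries; the module may not export a separate version field at all):

sudo ethtool -i enp88s0                                   # driver, version, and firmware-version for the interface
sudo modinfo atlantic | grep -iE '^(version|vermagic)'    # version info for the atlantic module, if present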

Glad to see another ugreen device!

I checked my DXP6800, which also shows the same syslog messages. Not totally unexpected, as I believe they share the same motherboard.

A couple of questions for you @SinisterPisces:

  • Is your TrueNAS OS installed baremetal, or virtualized?
  • In the BIOS, under Advanced -> Network Stack Configuration, do you have Lan BootROM enabled or disabled?
  • In the BIOS, under Security, do you have Secure Boot set to disabled?

FWIW, I haven’t had any network performance issues on my machine, but I’ve only attempted 1Gbps and 2.5Gbps networking.
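
If you ever want to put a number on it, the usual check is an iperf3 run between the NAS and a client (assuming iperf3 is available on both ends; I haven’t verified whether it ships in SCALE out of the box):

iperf3 -s                          # on the NAS
iperf3 -c 10.10.10.40 -P 4 -t 30   # on a client: 4 parallel streams for 30 seconds, using your bond0 address as an example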

Output from my 24.04.2.1 system:

root@truenas[/home/admin]# ethtool -k enp88s0 | grep gro && ethtool -k enp89s0 | grep gro && ethtool -k bond0 | grep gro
rx-gro-hw: off [fixed]
rx-gro-list: off
rx-udp-gro-forwarding: off
rx-gro-hw: off [fixed]
rx-gro-list: off
rx-udp-gro-forwarding: off
netlink error: no device matches name (offset 24)
netlink error: No such device

Unrelated to your NIC question, but I’m curious if you’d humor me and see if you have these entries in your syslog too?

admin@truenas[~]$ sudo cat /var/log/messages | grep "igen6"
Sep 14 01:03:52 truenas kernel: caller igen6_probe+0x1a6/0x8d0 [igen6_edac] mapping multiple BARs
Sep 14 01:03:52 truenas kernel: EDAC MC0: Giving out device to module igen6_edac controller Intel_client_SoC MC#0: DEV 0000:00:00.0 (INTERRUPT)
Sep 14 01:03:52 truenas kernel: EDAC MC1: Giving out device to module igen6_edac controller Intel_client_SoC MC#1: DEV 0000:00:00.0 (INTERRUPT)
Sep 14 01:03:52 truenas kernel: EDAC igen6: v2.5.1
admin@truenas[~]$ sudo cat /var/log/messages | grep "resource sanity check"
Sep 14 01:03:52 truenas kernel: resource: resource sanity check: requesting [mem 0x00000000fedc0000-0x00000000fedcffff], which spans more than pnp 00:04 [mem 0xfedc0000-0xfedc7fff]

cat | grep is kind of inefficient when you can just grep the file directly. But yes, I have both of these on my 6800, though quite intermittently.
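
For example (same result, one process fewer; journalctl -k works too if you prefer the kernel ring buffer):

sudo grep "igen6" /var/log/messages
sudo journalctl -k | grep "resource sanity check"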


Thanks, I appreciate the spot-check.