Configuring Chelsio T5 on SCALE - how to bypass error messages?

Hi all,

I am tuning a Chelsio T540-CR (4 x 10G) on SCALE 25.10, and hitting a problem related to card configuration.

As best I can see, the cxgb4 driver defaults to basing its queue count on the logical CPU count. I moved to a 20-core CPU (40 logical cores, E5-2698 v4) to prevent the CPU starvation that affected my previous 8-core CPU. The problem seems to be that the driver initialises one RX+TX queue pair per logical CPU per port, i.e. 40 RX/TX queues on each of the four ports (160 RX + 160 TX total), which immediately exhausts the ASIC’s onboard SRAM/SGE contexts, and I can’t seem to find a way around the errors returned by the commands that should mitigate this.

I’m also hitting issues controlling TX coalescing and ring buffer sizes as a secondary mitigation.

Symptoms:

  • Ring buffers show RX: 64 / TX: 1024, and I don’t seem able to change them. Attempts to increase the RX ring via ethtool -G (even to just 128 or 256) return "netlink error: Device or resource busy", even with the interface down.

  • Attempts to reduce the number of queues (to prevent card RAM overuse) via ethtool -L return "Operation not supported", preventing me from reducing the 40 queues to a more sensible 8 or 16 to free up descriptor memory.

  • Attempts to reduce IRQ floods by increasing TX coalescing intervals via tx-usecs or tx-frames fail, and seem to be rejected by either the firmware or the driver (rx-usecs coalescing succeeds). Current parameters are:

    rx-usecs: 100
    rx-frames: 8
    rx-usecs-irq: n/a
    rx-frames-irq: n/a
    
    tx-usecs: 0
    tx-frames: n/a
    tx-usecs-irq: 0
    tx-frames-irq: n/a
    

    I don’t seem to be able to modify the TX values.
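For reference, the failing (and working) commands look like this. ens3f4 is one of the four port names on this box; substitute your own. These need the actual hardware, so treat this as a record of what was tried rather than something to copy blindly:

```shell
# Ring resize: rejected with "Device or resource busy", even with the link down
ethtool -G ens3f4 rx 256

# Queue count: rejected with "Operation not supported"
ethtool -L ens3f4 combined 8

# RX coalescing: accepted...
ethtool -C ens3f4 rx-usecs 100 rx-frames 8
# ...but the equivalent TX setting is rejected
ethtool -C ens3f4 tx-usecs 64

# Show the current coalesce settings (the output quoted above)
ethtool -c ens3f4
```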

Impact:

High packet drops (rx_nodesc_drop) and rx_runt_frames, because a 64-descriptor RX ring is insufficient at this speed. (I gather runt frames can be a flag for descriptor exhaustion on Chelsio.)

Verified via ethtool -S that write_coal_fail is high.
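For anyone wanting to watch the same counters, the relevant lines can be filtered out of the (very long) per-queue statistics dump; counter names may differ slightly between driver versions, so adjust the pattern to what your -S output actually shows:

```shell
# Pull only the drop / runt / coalesce-failure counters from the stats dump
ethtool -S ens3f4 | grep -E 'rx_nodesc_drop|rx_runt_frames|write_coal'
```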

Current Tuning Applied:

MTU 9000
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 87380 67108864
net.core.netdev_max_backlog = 250000
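(On SCALE these survive reboots if entered as tunables in the UI under System Settings → Advanced → Sysctl; the one-off equivalent from a shell, for testing before committing them, would be:)

```shell
# Apply the same values immediately (non-persistent)
sysctl -w net.core.rmem_max=67108864
sysctl -w net.core.wmem_max=67108864
sysctl -w net.ipv4.tcp_rmem="4096 87380 67108864"
sysctl -w net.ipv4.tcp_wmem="4096 87380 67108864"
sysctl -w net.core.netdev_max_backlog=250000
```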

T540-CR versions (per ethtool -i): driver 6.12.33-production+truenas, firmware-version 1.27.5.0, TP 0.1.4.9, expansion-rom-version 1.0.0.90

Questions:

  1. Queue count: Is there a persistent way in SCALE to reduce the number of queues to fewer than the number of logical CPUs, i.e. force a lower queue count at boot?

  2. Ring buffer sizes: Is there a known way to break the Device or resource busy lock on these descriptors and change the RX (and ideally also TX) ring buffer sizes from 64/1024?

  3. TX coalescing parameters: Is there a way to amend TX coalescing parameters, either by usec or frame count, in order to reduce IRQ and context switch burdens on the CPU?

cxgbtool:

This Chelsio tool might shed light on useful diagnostic info but isn’t included in 25.10. It’s the only way I know to see the SGE table directly (among other diagnostic data), confirm SGE context allocation, and identify the exact issues for Chelsio on TrueNAS. Is there a way to run it on SCALE anyway today, or do I need to put in a feature request/suggestion for a future release?

Update - part of the issue solved:

1) cxgb4 will not resize rings while any queues are active, even if the interface appears “down”. Because resources are shared across ports, the entire adapter must be quiesced.

What worked:

# Flush and shut down all interfaces
for port in ens3f4 ens3f4d1 ens3f4d2 ens3f4d3; do
  ip addr flush dev "$port"
  ip link set "$port" down
done
# unload and reload driver to ensure 100% quiesced
modprobe -r cxgb4
modprobe cxgb4
# Now ring sizes can be fully set
ethtool -G ens3f4 rx 1024 tx 1024
ethtool -g ens3f4

# Repeat for any other ports needing ring size settings
# And now bring networking up again ...
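To avoid redoing this by hand after every reboot, the whole sequence can be wrapped in a script and run as a Post Init task (System Settings → Advanced → Init/Shutdown Scripts). A sketch, assuming the same four port names as above:

```shell
#!/bin/sh
# Quiesce the adapter, reload cxgb4, then set ring sizes.
# Port names are specific to this box -- substitute your own.
PORTS="ens3f4 ens3f4d1 ens3f4d2 ens3f4d3"

for port in $PORTS; do
  ip addr flush dev "$port"
  ip link set "$port" down
done

# Unload/reload the driver so the adapter is fully quiesced
modprobe -r cxgb4
modprobe cxgb4

for port in $PORTS; do
  ethtool -G "$port" rx 1024 tx 1024
done
# Re-applying addresses / bringing links up is left out here, since
# SCALE normally handles that through its own network configuration.
```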

Making headway on the rest - tentative stuff for now, will update this as I go.

2) On cxgb4, TX coalescing isn’t configurable, so IRQ load needs to be handled in other ways:

RX coalescing is supported and settable (rx-usecs, rx-frames), but TX coalescing is effectively not exposed via ethtool (the TX fields show n/a). So if TX interrupt load becomes a problem, it looks like we will have to handle it via IRQ affinity / CPU pinning / queue steering, or similar, rather than TX coalescing.
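As a sketch of the IRQ-affinity route (untested here; how the queue interrupts are named in /proc/interrupts varies by driver version, so check your own system and adjust the match pattern):

```shell
# Spread this port's queue interrupts round-robin over the first 8 CPUs.
# Assumption: the IRQ lines in /proc/interrupts contain the interface name.
i=0
for irq in $(awk -F: '/ens3f4/ {gsub(/ /, "", $1); print $1}' /proc/interrupts); do
  echo $((i % 8)) > "/proc/irq/$irq/smp_affinity_list"
  i=$((i + 1))
done
```

(This only helps if irqbalance isn’t running and rewriting the affinities behind your back.)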

3) On cxgb4, queues per interface are set automatically at driver initialisation, based on CPU count. There’s no knob exposed to modify this directly (except perhaps the card config files??), but we may be able to limit the cores “seen” at driver init, to control how many queues get created.

In other words, temporarily disable some CPU cores, initialise (or re-initialise) the driver, then re-enable the cores. Queues would stay at the lower count because they were created at driver init.
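A sketch of that idea (untested; cpu0 typically can’t be offlined, and whether cxgb4 really keys its queue count off the CPUs online at probe time is exactly the assumption being tested here):

```shell
# Offline logical CPUs 8-39 so the driver sees only 8 CPUs at init,
# reload cxgb4, then bring the cores back online.
for cpu in $(seq 8 39); do
  echo 0 > "/sys/devices/system/cpu/cpu$cpu/online"
done

modprobe -r cxgb4
modprobe cxgb4
ethtool -l ens3f4   # confirm the lower queue count stuck

for cpu in $(seq 8 39); do
  echo 1 > "/sys/devices/system/cpu/cpu$cpu/online"
done
```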