Bug Report Summary for TrueNAS SCALE
Title: Total Ingress Packet Loss on Mellanox ConnectX-4 (40GbE) due to rx_steer_missed_packets Regression
System Information:
- Product: TrueNAS SCALE 25.04.2.4
- Kernel Version: 6.12.15-production+truenas
- Hardware: Mellanox ConnectX-4 EN (OEM, PSID: MT_2150110033)
- Firmware: 12.28.4704
- Link Speed: 40Gb/s
- Card Type: Dual-port 100GbE
Problem Description: A Mellanox ConnectX-4 EN card experiences total ingress packet loss when operating at 40GbE. The ethtool -S counters show that packets are received at the physical layer (rx_packets_phy) but are immediately dropped by the hardware, as indicated by the rx_steer_missed_packets counter incrementing in lockstep. The OS-level rx_packets counter remains at 0. The kernel log (dmesg) shows a clean driver initialization with no errors.
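The counter pattern described above can be checked mechanically. The following is a minimal sketch, not part of the original report: it parses the text printed by ethtool -S and tests for the signature of physical-layer receives that are almost entirely accounted for by rx_steer_missed_packets while rx_packets stays at zero. The function names and the 5% tolerance are illustrative assumptions.

```python
import re


def parse_ethtool_stats(text):
    """Parse the text output of `ethtool -S <iface>` into {counter: int}."""
    stats = {}
    for line in text.splitlines():
        # Counter lines look like "     rx_packets: 0"; the header line
        # ("NIC statistics:") has no number and is skipped.
        match = re.match(r"\s*([\w.\[\]-]+):\s*(-?\d+)\s*$", line)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


def looks_like_steering_drop(stats, tolerance=0.05):
    """Return True when the counters match the symptom in this report.

    The symptom: packets arrive at the PHY, nothing reaches the OS, and
    the steering-miss counter is within `tolerance` of the PHY counter.
    The tolerance value is an assumption chosen for illustration.
    """
    phy = stats.get("rx_packets_phy", 0)
    missed = stats.get("rx_steer_missed_packets", 0)
    delivered = stats.get("rx_packets", 0)
    if phy == 0:
        return False  # no ingress traffic seen at all, nothing to conclude
    return delivered == 0 and abs(phy - missed) <= tolerance * phy
```

To use it, capture the output of ethtool -S ens6f0np0 into a string (for example with subprocess.run) and pass it to parse_ethtool_stats, then pass the resulting dictionary to looks_like_steering_drop.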
Key Diagnostic Finding: This is definitively a software regression bug in the mlx5_core driver. The exact same hardware configuration (card, transceiver, cable, switch) functions perfectly in a Proxmox environment running a newer 6.14.8 kernel, ruling out any hardware or firmware fault.
Failed Mitigation and Diagnostic Attempts: All standard Linux workarounds and diagnostic methods for this type of issue have failed, indicating the kernel is compiled without the necessary features to address this at runtime.
- devlink Steering Mode Change: Attempting to switch from the default dmfs to smfs fails.
  - devlink dev param set pci/... name flow_steering_mode value smfs cmode runtime
  - Result: Error: mlx5_core: Software managed steering is not supported by current device.
- tc Hardware Offload: Attempts to manually program a hardware steering rule fail.
  - tc filter add... matchall skip_sw
  - Result: RTNETLINK answers: Operation not supported
  - tc filter add... u32 skip_sw
  - Result: The command is accepted, but tc -s filter show reveals the rule is not_in_hw, proving the offload silently failed.
- Driver Debug Logging: Attempting to enable verbose driver logging via a loader tunable fails.
  - Setting loader tunable mlx5_core.debug_mask=1
  - Result: Error on boot indicating Sysctl 'mlx5_core.debug_mask' does not exist in kernel.
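One further check may be worth making for the debug logging attempt. Linux module parameters are normally exposed under /sys/module/<module>/parameters/ rather than through sysctl, so a missing sysctl entry does not by itself show that the parameter is unavailable. The sketch below reads a module parameter from sysfs if it exists. Whether this TrueNAS kernel exposes debug_mask there has not been verified in this report, and the helper name is hypothetical.

```python
from pathlib import Path


def read_module_param(module, param, sysfs_root="/sys/module"):
    """Return a module parameter's current value as a string, or None.

    None means the file is absent or unreadable, e.g. the module is not
    loaded or does not export that parameter. `sysfs_root` is a parameter
    only so the function can be exercised against a test directory.
    """
    path = Path(sysfs_root) / module / "parameters" / param
    try:
        return path.read_text().strip()
    except OSError:
        return None
```

Calling read_module_param("mlx5_core", "debug_mask") on the affected host would show whether the parameter is present. If it is, it could be set through the usual module-parameter mechanisms instead of a sysctl-style tunable.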
Conclusion: The mlx5_core driver in this version of TrueNAS SCALE has a bug that prevents it from correctly programming the hardware receive steering table. The kernel is compiled in a way that disables all known workarounds (alternate steering modes, TC hardware offload) and prevents the collection of advanced diagnostic logs. Resolution requires a patched driver from the TrueNAS development team.
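For anyone reproducing the tc offload result cited above, the offload state can be read from the flags tc prints for each filter. This is a minimal sketch that counts the in_hw and not_in_hw tokens in the text produced by tc -s filter show. It assumes the output has been captured as a string, and the function names are illustrative.

```python
def offload_flags(tc_output):
    """Count hardware-offload flags in `tc -s filter show` text output.

    Only exact whitespace-delimited tokens are counted, so a token such
    as "in_hw_count" is not mistaken for "in_hw".
    """
    counts = {"in_hw": 0, "not_in_hw": 0}
    for token in tc_output.split():
        if token in counts:
            counts[token] += 1
    return counts


def offload_silently_failed(tc_output):
    """True when at least one filter is reported as not_in_hw."""
    return offload_flags(tc_output)["not_in_hw"] > 0
```

The exact arguments needed to list the filters (device and direction) depend on where the rule was attached and are not given in this report.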
(more detailed)
Technical Report for Advanced Kernel/Driver Analysis
Title: Analysis of rx_steer_missed_packets on Mellanox ConnectX-4 EN (OEM) at 40GbE Link Speed on TrueNAS SCALE (Kernel 6.12.15)
1. Executive Summary:
An OEM Mellanox ConnectX-4 EN 100GbE network card (PSID: MT_2150110033) is experiencing total ingress packet loss when installed in a server running TrueNAS SCALE (25.04.2.4, Kernel 6.12.15-production+truenas). The card is physically connected to a 40GbE switch port using a 40GbE transceiver, and the link successfully negotiates and establishes at 40Gb/s. Despite a clean kernel log and a link-up state, all incoming packets are dropped by the hardware before reaching the OS networking stack. The issue is identified by the rx_steer_missed_packets counter in ethtool, which increments in lockstep with the rx_packets_phy counter, while the OS-level rx_packets counter remains at zero. The exact same hardware configuration (card, transceiver, switch) functions perfectly on a Proxmox server running a newer Linux Kernel (6.14.8-2-pve), proving the hardware is not faulty. The root cause is hypothesized to be a specific, silent regression bug within the mlx5_core driver in the 6.12 kernel that is triggered by the combination of this OEM card’s firmware and the 40GbE link speed. Standard mitigation attempts, including firmware updates (to the latest OEM version) and manually forcing link speed, have failed.
2. System Configuration & Comparative Analysis:
| Component | Failing System (TrueNAS SCALE) | Working System (Proxmox) |
|---|---|---|
| Operating System | TrueNAS SCALE 25.04.2.4 | Proxmox VE 8.x |
| Linux Kernel | 6.12.15-production+truenas | 6.14.8-2-pve |
| Network Card | Mellanox ConnectX-4 EN (MCX416A-CCA) | Mellanox ConnectX-4 EN (MCX416A-CCA) |
| Card PSID | MT_2150110033 | MT_2150110033 |
| Card Firmware | 12.28.4704 (Latest OEM) | 12.25.1020 (Older OEM) |
| Interface Name | ens6f0np0 | ens2f0np0 |
| Driver | mlx5_core | mlx5_core |
| Driver Version | 6.12.15-production+truenas | 6.14.8-2-pve |
| Transceiver | 40GbE QSFP+ | 40GbE QSFP+ |
| Switch | 40GbE Capable Switch | 40GbE Capable Switch |
| Negotiated Speed | 40000Mb/s | 40000Mb/s |
3. Key Diagnostic Data & Observations:
- Primary Symptom: ping fails (no response). No ingress traffic of any kind is processed. Egress traffic (tx_packets) increments normally.
- Critical Counter Evidence (TrueNAS):
  - rx_packets: 0
  - rx_steer_missed_packets: >0 (and increments with traffic)
  - rx_packets_phy: >0 (and increments with traffic)
  - rx_steer_missed_packets value is nearly identical to rx_packets_phy.
- Kernel Log (dmesg on TrueNAS): The mlx5_core driver initializes the card without any errors or warnings. It correctly identifies the firmware, PCIe bandwidth, and reports “Link up”.
- Link State (ethtool on TrueNAS): The link is correctly detected at Speed: 40000Mb/s with Link detected: yes.
- Firmware Status: The card is an OEM model, restricted to the 12.x firmware branch. The latest available firmware (12.28.4704) has been successfully applied, but this did not resolve the issue.
- Comparative Success: The Proxmox system, with an identical hardware setup (but a newer kernel), functions perfectly. This is the definitive control case, isolating the problem to the TrueNAS kernel/driver environment.
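The "increments in lockstep" observation is easiest to confirm from deltas rather than absolute values, since counters accumulate from driver load. The sketch below is an illustration rather than a tool used in the original investigation. It takes two counter snapshots (dictionaries of values as reported by ethtool -S, captured a few seconds apart while traffic is being sent) and reports what fraction of newly received PHY packets were counted as steering misses. All names are hypothetical.

```python
def counter_deltas(before, after, names):
    """Return after-minus-before for each named counter (missing = 0)."""
    return {name: after.get(name, 0) - before.get(name, 0) for name in names}


def steer_miss_ratio(before, after):
    """Fraction of new PHY-level packets that were steering misses.

    Returns None when no new packets arrived at the PHY between the two
    snapshots, because the ratio is undefined in that case.
    """
    deltas = counter_deltas(
        before, after, ("rx_packets_phy", "rx_steer_missed_packets")
    )
    if deltas["rx_packets_phy"] <= 0:
        return None
    return deltas["rx_steer_missed_packets"] / deltas["rx_packets_phy"]
```

A ratio close to 1.0 while the rx_packets delta stays at 0 corresponds to the failure described here. On a healthy system, such as the Proxmox host, the ratio would be expected to sit near 0.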
4. Failed Mitigation Attempts:
- Firmware Update: Updated card from 12.25.1020 to 12.28.4704. No change in behavior.
- Manual Link Speed Configuration: Using ethtool -s ens6f0np0 autoneg off speed 40000 successfully sets the link but does not resolve the packet drop issue. This suggests the bug occurs during the driver’s initial resource allocation at probe time, not during the negotiation phase.
- MFT Tools on Host: mlxconfig confirms the card is an Ethernet-native “EN” model and does not have a LINK_TYPE parameter.