Devices Using PCIe Tunneling Having Issues (i.e. over USB4/TB4)

I have a NAS that has USB4/TB4 ports.

I have an NVIDIA GPU, a Broadcom (bnx2x) 10 Gb network card, and a Mellanox ConnectX-4 Lx, all connected via TB4/USB4.

None of these devices work.

All of these devices work fine on the right mix of Linux kernels. For example, the Broadcom and NVIDIA cards work fine this way on ZimaOS.

The common root for all of them is “Unable to change power state from D3cold to D0, device inaccessible”. I had this issue on Ubuntu 24.04, where it was resolved by upgrading to kernel 6.8.4-060804: Ubuntu 24.04 - Unable to change power state from D3cold to D0, device inaccessible - Graphics / Linux / Linux - NVIDIA Developer Forums
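Before chasing kernels, it can help to confirm that runtime power management is what is wedging the device. A hedged sketch (the BDF 0000:25:00.2 is the NVIDIA function from my logs below; substitute your own):

# Show the device's PCI power state and runtime-PM status
cat /sys/bus/pci/devices/0000:25:00.2/power_state           # e.g. D0 / D3cold
cat /sys/bus/pci/devices/0000:25:00.2/power/runtime_status  # active / suspended
# Pin the device to D0 by forbidding runtime suspend
echo on > /sys/bus/pci/devices/0000:25:00.2/power/control

Note the last line is only preventive; once the device is already inaccessible it usually has no effect.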

With the rise of 40 Gbps-capable hardware (i.e. USB4 / TB4), the PCIe tunneling enabled by this technology is interesting for multiple NAS scenarios in the coming year. I don't think it is urgent to fix this; I would advocate targeting the first 2025 release, as the first round of USB4/TB4 motherboards supporting the software connection manager will be released in Sept / Oct 2024.

For the NVIDIA 2080 Ti, dmesg only shows this (device 25 is the NVIDIA card):

[  471.106763] pci 0000:25:00.2: Unable to change power state from D3cold to D0, device inaccessible
[  583.781488] pci 0000:25:00.2: Unable to change power state from D3cold to D0, device inaccessible

For the Mellanox, dmesg shows:

root@truenas[~]# dmesg | grep mlx
[    1.536529] mlx5_core 0000:39:00.0: firmware version: 14.32.1010
[    1.536568] mlx5_core 0000:39:00.0: 8.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x4 link at 0000:00:07.2 (capable of 63.008 Gb/s with 8.0 GT/s PCIe x8 link)
[    3.546198] mlx5_core 0000:39:00.0: poll_health:819:(pid 0): Fatal error 1 detected
[    3.546234] mlx5_core 0000:39:00.0: print_health_info:423:(pid 0): PCI slot is unavailable
[   62.582296] mlx5_core 0000:39:00.0: wait_func:1172:(pid 257): INIT_HCA(0x102) timeout. Will cause a leak of a command resource
[   62.582308] mlx5_core 0000:39:00.0: mlx5_function_open:1242:(pid 257): init hca failed
[   62.597225] mlx5_core 0000:39:00.0: probe_one:1952:(pid 257): mlx5_init_one failed with error code -110
[   62.597250] mlx5_core 0000:39:00.0: mlx5_fw_fatal_reporter_err_work:679:(pid 98): health works are not permitted at this stage
[   62.598666] mlx5_core: probe of 0000:39:00.0 failed with error -110
[   62.600153] mlx5_core 0000:39:00.1: Unable to change power state from D3cold to D0, device inaccessible
[   62.600263] mlx5_core 0000:39:00.1: mlx5_pci_vsc_init:61:(pid 257): Failed to get valid vendor specific ID
[   62.600271] mlx5_core 0000:39:00.1: firmware version: 65535.65535.65535
[   62.600277] mlx5_core 0000:39:00.1: 8.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x4 link at 0000:00:07.2 (capable of 4032.000 Gb/s with 64.0 GT/s PCIe x63 link)
[   82.602352] mlx5_core 0000:39:00.1: wait_fw_init:206:(pid 257): Waiting for FW initialization, timeout abort in 100s (0xffffffff)
[  102.606349] mlx5_core 0000:39:00.1: wait_fw_init:206:(pid 257): Waiting for FW initialization, timeout abort in 79s (0xffffffff)
[  122.610351] mlx5_core 0000:39:00.1: wait_fw_init:206:(pid 257): Waiting for FW initialization, timeout abort in 59s (0xffffffff)
[  142.614353] mlx5_core 0000:39:00.1: wait_fw_init:206:(pid 257): Waiting for FW initialization, timeout abort in 39s (0xffffffff)
[  162.618353] mlx5_core 0000:39:00.1: wait_fw_init:206:(pid 257): Waiting for FW initialization, timeout abort in 19s (0xffffffff)
[  182.610350] mlx5_core 0000:39:00.1: mlx5_function_enable:1145:(pid 257): Firmware over 120000 MS in pre-initializing state, aborting
[  182.610408] mlx5_core 0000:39:00.1: probe_one:1952:(pid 257): mlx5_init_one failed with error code -16
[  182.614454] mlx5_core: probe of 0000:39:00.1 failed with error -16
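As a hedged aside, the “limited by 2.5 GT/s PCIe x4 link at 0000:00:07.2” lines name the tunnelled upstream port; it can be inspected directly using the BDF from the output above:

lspci -vv -s 0000:00:07.2 | grep -Ei 'lnkcap|lnksta'   # capable vs. negotiated link speed/width

And for the Broadcom bnx2x card: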
root@truenas[~]# dmesg | grep bnx2x  
[    1.479752] bnx2x 0000:0f:00.0: msix capability found
[    1.480196] bnx2x 0000:0f:00.0: part number 0-0-0-0
[   11.510310] bnx2x: [bnx2x_fw_command:3054(eth%d)]FW failed to respond!
[   11.510317] bnx2x 0000:0f:00.0 (unnamed net_device) (uninitialized): bc 7.13.75
[   11.510322] bnx2x: [bnx2x_fw_dump_lvl:794(eth%d)]\x013MCP PC at 0xffffffff
[   11.510324] bnx2x: [bnx2x_fw_dump_lvl:815(eth%d)]Trace buffer signature is missing.
[   11.510326] bnx2x: [bnx2x_prev_unload:10893(eth%d)]MCP response failure, aborting
[   11.510474] bnx2x 0000:0f:00.1: msix capability found
[   11.510485] bnx2x 0000:0f:00.0: msix capability found
[   11.510780] bnx2x 0000:0f:00.0: Unable to change power state from D3cold to D0, device inaccessible
[   11.510789] bnx2x 0000:0f:00.1: Unable to change power state from D3cold to D0, device inaccessible
[   11.510945] bnx2x: PCI device error, probably due to fan failure, aborting
[   11.511035] bnx2x: PCI device error, probably due to fan failure, aborting

(Funny, as the card has no fan, and it works perfectly with ZimaOS, a no-name startup NAS OS.)

This is the bleeding edge. You yourself realize that the hardware for doing this is brand new. If a newer kernel fixes this issue, then I suspect it will come to TrueNAS whenever it ends up in an LTS.

TrueNAS is a stable enterprise product. ZimaOS can afford to run on less conservative software trains.

Yup, that's why I said this would be good for the 2025 release; this isn't my first rodeo.

For sure. I’m just saying, I don’t know that any TrueNAS changes are required other than a future kernel 🙂 They are just PCIe devices over a different fabric.

USB4 incorporates Thunderbolt 3; Thunderbolt 4 is a step above USB4.

Anyway, the official position so far is that Thunderbolt is not officially supported by TrueNAS and therefore not tested. I doubt that any enterprise customer of TrueNAS cares about Thunderbolt, and that defines the amount of money and developer time that iX is willing to spend on it.

@NickF1227 Agreed, that's likely, but I make no assumptions about what TrueNAS does and doesn't choose in the kernel make menuconfig options, so I'm just logging it here for posterity on the off chance it might influence them.

To refine that further, Thunderbolt 4 is just a certification and a specific USB4 controller. It is the superset of all USB4 mandatory and optional features, including 40 Gbps operation, inter-domain channel bonding, tunneled PCIe, tunneled USB (as in 2.0 and 3.x), DP alt mode, mandatory PD support I think, and lastly a logo and, in theory, a quality certification.

But as I am currently fighting with a mobo manufacturer who says they are TB4 yet can't do some of the things I listed, I am unclear the certification is worth much (not to mention a cable supplier with certified cables that don't do inter-domain channel bonding).

I will be quite excited once we have broad penetration and understanding of TB4 / USB4 at 40 Gbps (and the upcoming 80 Gbps). This was my first dabble in all of this two years ago, and it turned out quite well: proxmox cluster proof of concept (github.com)

Wrong Storage OS 😛
In general, TB would “work” if the devices are just exposed as PCIe devices. The problem you have right now is a power-state problem, which is not surprising to see (very early days!).
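As a hedged illustration on a general Linux box, tunnelled devices show up in the ordinary PCI tree, so the usual tooling sees them:

lspci -tv       # topology view; tunnelled cards hang off the USB4/TB root ports
boltctl list    # authorized TB devices, if the bolt daemon is installed (desktop distros; likely not TrueNAS)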

But as @etorix said, this would not be an officially supported feature used in Enterprise, and iXsystems probably wouldn’t be testing it.

If there are bugs or things that don’t work once that kernel is here, you would have to make a “Feature Request” and iX would have to choose whether to work on it. Additional packages would likely not be included, so it’s really up to the Linux kernel here. Latest Feature Requests topics - TrueNAS Community Forums

Oops, woke up 15 min ago; still on the top third of my giant mug-o-tea™.

The power problem is a persistent on-again, off-again bug; in fact, there are earlier mainline kernels than the one in TrueNAS SCALE 24.10 that work just fine with this.

It is about finding the right combo of kernel version, PCIe power-management settings, and kernel flags. There are 5.x series kernels where this works fine.
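For example (hedged: these are standard kernel parameters, but whether they help is device- and kernel-dependent), PCIe power management can be steered from the kernel command line:

# /etc/default/grub on a stock Debian/Ubuntu-family system
GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_port_pm=off pcie_aspm=off"
# then: sudo update-grub && sudo reboot

pcie_port_pm=off stops the kernel from power-managing PCIe ports (the mechanism that lands devices in D3cold), and pcie_aspm=off disables active-state power management on the links.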

So it is a testing and prioritization issue, i.e. is the scenario worth supporting and worrying about, not a question of ‘waiting for the upstream kernel to be fixed’ per se.

Thanks for pointing me to the feature request thread; I will go log something. I just started evaluating TrueNAS as a candidate for my future Synology replacement at the end of the year. eGPU over TB is core, as that’s a great way to add a GPU for AI and other VFs.

Right, but the LTS kernel that’s in EE is in the 6.6 branch. The link you shared for Ubuntu 24.04 was regarding moving to the 6.8 branch, which is not LTS.

root@prod[~]# uname -r
6.6.44-production+truenas

See: The Linux Kernel Archives - Releases

If a patch comes into the 6.6 branch, it may wind up in a TrueNAS 24.10.x release or potentially you’d have to wait for TrueNAS 25.04 (assuming the cadence is the same).

I assume you mean the 6.6 LTS. I admit I have only had this working on stock kernel.org 6.1, 6.5, etc.

What else can I try with a kernel.org 6.6 LTS kernel? (I also want to see if it has my 6.5 connection and IPv6 fixes for thunderbolt-net.)

Seems like a useful link

https://www.reddit.com/r/debian/comments/18ko759/does_debian_12_have_kernel_66/

It has suggested distros with 6.6, and also suggests compiling your own 😉
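A rough sketch of what “compile your own” looks like on a Debian/Ubuntu-family box (hedged: package names and the exact 6.6.x tarball are assumptions, adjust to taste):

sudo apt install build-essential flex bison bc dwarves libssl-dev libelf-dev libncurses-dev
wget https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-6.6.44.tar.xz
tar xf linux-6.6.44.tar.xz && cd linux-6.6.44
cp /boot/config-$(uname -r) .config   # start from the running kernel's config
scripts/config --disable SYSTEM_TRUSTED_KEYS --disable SYSTEM_REVOCATION_KEYS  # drop distro signing keys
make olddefconfig                     # accept defaults for new options
make -j"$(nproc)" bindeb-pkg          # builds installable .deb packages

The resulting ../linux-image-*.deb installs with dpkg -i.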

TNS generally only adopts a Linux kernel once it is declared LTS. The current most recent LTS kernel is 6.6.

Maybe 6.11 will be declared LTS… who knows. We’re about due.

Thanks, I can roll my own if/when I have time. I’ve done it before, so it’s not a huge PITA, but I only do it once every year or two, so I have to put a dev image back together with all the right tools (luckily I saved the instructions, lol).

You are probably not interested; just updating the post in case anyone searches and arrives here.

Turns out this is a regression from Dragonfish; I only tried Dragonfish for the first time today.

Of course, this doesn’t mean iX will class it as a regression to be fixed 🙂

[NAS-132394] NVIDIA Drivers Don’t Work with USB4 connected card [regression from 24.04.00] - iXsystems TrueNAS Jira