[Not Accepted] PCIE Tunneling Support [Regression between 24.04 and 24.10]

—revised post—
It seems I was mistaken about TrueNAS not having a kernel that supports this scenario. Apologies for not thinking to test Dragonfish first.

I tried installing 24.04.0 (24.10 was my first time installing TrueNAS) and found this scenario works perfectly in the first Dragonfish release.

  • 6.6.20-production+truenas where it works
  • 6.6.44-production+truenas where it doesn’t work

tl;dr: on Dragonfish my USB4-connected GPU works 100% fine (I can query it with nvidia-smi); on Electric Eel it does not.

On 24.10.1 the NVIDIA driver fails to install because the kernel module does not load:

[  294.025596] VFIO - User Level meta-driver version: 0.3
[  294.280088] nvidia-nvlink: Nvlink Core is being initialized, major device number 235

[  294.285531] nvidia 0000:0f:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  294.285628] nvidia 0000:0f:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
[  294.285652] NVRM: The NVIDIA GPU 0000:0f:00.0
               NVRM: (PCI ID: 10de:1e07) installed in this system has
               NVRM: fallen off the bus and is not responding to commands.
[  294.285706] nvidia: probe of 0000:0f:00.0 failed with error -1
[  294.285719] NVRM: The NVIDIA probe routine failed for 1 device(s).
[  294.285720] NVRM: None of the NVIDIA devices were initialized.
[  294.285928] nvidia-nvlink: Unregistered Nvlink Core, major device number 235
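
For anyone hitting the same thing, here is a quick way to confirm the symptom from the shell - a minimal Python sketch of my own (not an iX tool) that reads the generic PCI runtime-power attributes from sysfs; the device address is just the one from the dmesg output above:

from pathlib import Path

bdf = "0000:0f:00.0"  # GPU address from the log above; adjust for your system
dev = Path("/sys/bus/pci/devices") / bdf

# Print the device's reported power state and runtime-PM status.
for attr in ("power_state", "power/runtime_status", "power/control"):
    p = dev / attr
    print(f"{attr}: {p.read_text().strip() if p.exists() else '<not present on this kernel>'}")

# If power_state shows D3cold while runtime_status says "suspended", disabling
# runtime PM (as root) *may* keep the device reachable, but that is only a guess:
#   (dev / "power" / "control").write_text("on")

The power/control workaround is speculative on my part; the real fix is the kernel regression itself.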

Issue filed [NAS-132394] NVIDIA Drivers Don’t Work with USB4 connected card [regression from 24.04.00] - iXsystems TrueNAS Jira

I moved this post to feature requests, as it is still a feature request to make this a supported test scenario; that would have caught the regression.

—original post—

Adding full USB4 support would be a useful inclusion, but it is highly dependent on kernel versions. I am willing to help figure out which, if any, LTS kernel versions this works on, and I am willing to compile my own kernels if required to assist.

Scenarios where this is useful:

  • Connecting USB4-attached GPUs to assist VMs running AI tasks
  • Connecting additional high-speed network interfaces (PCIe tunneling)
  • Use of the software connection manager for 40Gbps interconnects between servers (USB4 P2P connections)

I understand this is niche; I am not asking anyone to break a leg getting this in, just to include support for these scenarios as upstream kernel features and fixes land in LTS kernels.

A sub-feature request would be to include Intel's tbtools package.

What is the use-case… home lab? or ???

If I understand correctly, this is about Thunderbolt support, i.e. the Thunderbolt 3 compatibility layer in USB4. This would require substantial hardware testing, and should then be backed up by a substantial use case.

1 Like

Tunneling is very interesting for laptop folk who want to add an eGPU to their laptop, for example. T3 vs. T4 doesn't change the maximum bandwidth (40 Gbps), but the minimum is doubled from 16 to 32 Gbps. In real-life use, that will allow more monitors, or bigger ones, to be attached, for example.

For TrueNAS, the use case is a bit murkier since SAS expander boxes exist, they are supported, and they can work reliably. Tunneling via T3/T4 seems perhaps more oriented towards NVMe and eGPU applications, both of which are likely better accommodated with a bigger case, a bigger motherboard, and a CPU with more spare PCIe lanes?

2 Likes

See, this is what most people get wrong: USB4 is Thunderbolt. It is a superset of all TB3 features plus some more; Intel donated the spec to the USB Promoter Group.

USB4 is a routed protocol, so what most people think of as USB (i.e. 3.2 and lower) is, in your words, 'a compatibility layer' to USB4.

So for example, a USB4 mass storage device can operate in one of two modes:

  • USB mode when plugged into a USB3.x port

  • SCSI device mode when plugged into a USB4 port (this is not tunneling; I am just using it to reinforce that one should not be confused by it being called USB4 - it is no longer a serial bus). A rough way to check which path a given disk takes is sketched below.
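
To make that concrete, here is a small Python sketch (my own illustration, with "sda" as a hypothetical device name) that resolves a block device's sysfs path and reports whether it reached the host through the USB stack or as a PCIe device, which is how a tunneled USB4 disk shows up:

from pathlib import Path

disk = "sda"  # hypothetical example; substitute the block device you are checking
real = Path(f"/sys/block/{disk}").resolve()
print(real)

# A USB 3.x attached disk has a usbN segment in its device path; a disk reached
# over a USB4/Thunderbolt PCIe tunnel enumerates as an ordinary PCIe controller.
if any(part.startswith("usb") for part in real.parts):
    print("attached via the USB stack (USB 3.x mode)")
else:
    print("attached as a PCIe device (native slot or USB4/TB tunnel)")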

What you described as a TB3 layer is not 'a TB compatibility layer'; it is a fundamental and required feature of the USB4 spec - supporting it is part of being USB4 compliant. It is not something that is optional for USB-40, USB-80 or USB-120.

So yes, this is about tunneled PCIe; no, this is not about TB3 - this is about USB4 tunneled PCIe.

So the question is whether USB4 is a substantial enough use case to be tested; if not, that argument would imply all USB4 support should be removed from the kernel. Remember, most of the heavy lifting here for USB4 is being done upstream in the kernel.

My post was maybe premature and was based on testing I had done on 24.10 before release, where some of my USB4 mass storage devices caused kernel panics; these seem fixed in the latest TrueNAS kernel.

In fact, PCIe tunneling seems to be working fine, with the exception of the PCIe D3cold bug. This is a regression from kernels earlier than 6.6, where the bug was not present; it seems to have reared its head in 6.6 LTS (I have yet to track down when the regression in the Linux kernel happened). I know it is fixed in 6.8 and then broken again after that…

I am willing to do the hard work of finding the regression point and patch. I think this is an important scenario as it opens up many possibilities on the newest class of low-end NASs that have USB4 on them (e.g. ZimaCube Pro, the latest Terramaster, etc.) and for anyone building a NAS on the latest workstation chipsets.

It is less clear when/if that matters for enterprise-class hardware.

Yes, this is niche, but it's darn useful on a U2 server or SFF NAS to suddenly be able to plug in a variety of PCIe devices (network cards, NVMe expanders, U.2 PCIe cards, and yes, GPUs) via a USB4 connection. These will be far more prevalent in 12 months than external OCuLink or MCIO connections.

PS: if you were referring to just tbtools, you also don't seem to understand that they are fundamental USB4 tools - they show all the USB4 routing information. They just happen to be called tbtools for legacy reasons (see my first two paragraphs).
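
Until tbtools ships in TrueNAS, some of that routing information is already visible straight from sysfs. A small Python sketch (the attribute names follow the kernel's documented thunderbolt sysfs ABI; output obviously varies by machine and kernel):

from pathlib import Path

def read(dev, name):
    p = dev / name
    return p.read_text().strip() if p.exists() else "-"

base = Path("/sys/bus/thunderbolt/devices")
if not base.exists():
    print("no USB4/Thunderbolt domain registered (is the thunderbolt driver loaded?)")
else:
    # Each entry is a domain, router, or retimer known to the connection manager.
    for dev in sorted(base.iterdir()):
        print(f"{dev.name:12s} vendor={read(dev, 'vendor_name'):24s} device={read(dev, 'device_name')}")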

2 Likes

Interesting - I wasn't contemplating passing PCI devices through to containers, but it is interesting if that is coming.

I was really thinking about host functions - for example, if you add a host-connected GPU, NIC, or NVMe expander, then the host, a container, or a VM can use it in any way the host supports for other PCIe devices.

For example, a GPU with vGPU functions can be used by the host, a container, or a VM simultaneously.

For me this is less about PCIe passthrough to a VM or container, but I can see why there would be folks who want that too.

My request is about making tunneling a fundamentally supported scenario where regressions are checked for. The code is already in the current TrueNAS kernel; there is just a critical regression.
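
By "checked for" I mean something as simple as a post-boot smoke test. A hypothetical sketch of what that could look like (not an iX tool; the PCI address is just an example, and it only uses generic sysfs attributes):

import sys
from pathlib import Path

# Address of the tunneled device to verify, e.g. the USB4-attached GPU.
bdf = sys.argv[1] if len(sys.argv) > 1 else "0000:0f:00.0"
dev = Path("/sys/bus/pci/devices") / bdf

if not dev.exists():
    sys.exit(f"FAIL: {bdf} did not enumerate over the tunnel")

driver = (dev / "driver").resolve().name if (dev / "driver").exists() else None
state = (dev / "power_state").read_text().strip() if (dev / "power_state").exists() else "unknown"

print(f"{bdf}: driver={driver or 'none'} power_state={state}")
if driver is None or state not in ("D0", "unknown"):
    sys.exit("FAIL: tunneled device has no bound driver or is stuck in a low-power state")
print("PASS")

Something along those lines, run against a reference USB4 device in the test lab, would have trapped this regression before release.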

I won't say this is the fault of the upstream kernel maintainers; it is more a product of the amount of change in PCIe power management coupled with the USB4 code.

One can find a plethora of similar D3cold issues also affecting other classes of devices (certain NVMe drives connected to certain NVMe controllers). It is obviously a hard issue, as it comes and goes across kernel versions.

This applies to any device that has a USB4 controller, to which you want to add arbitrary PCIe hardware, but which has run out of slots or cannot fit the size of thing you want to fit.

Currently this is only possible on:

  • SOHO-class NAS based on 12th-gen+ Intel mobile parts
  • Workstation-class NAS based on the latest AMD or Intel workstation chipsets

This is why I tagged this as a feature request for sometime next year.
To be clear, all the code is already in the kernel; we are just missing some patches. Yes, I understand the burden of including new patches and testing them for a niche scenario.

That doesn't invalidate the request.

Oh, interesting - this is 100% a regression between TrueNAS versions.

I just installed 24.04 (24.10 was my first ever version) to see if it ever worked; apologies, folks, for not doing that before.

6.6.20-production+truenas where it works
6.6.44-production+truenas where it doesn’t work

This could be either or both of these issues:

  • recent kernel power management issues in general (not sure how much the TrueNAS folks backport)
  • a problematic NVIDIA driver

What is your use case? Is it for one laptop or thousands of machines?

My use case is multiple NAS boxes, thanks for asking.
I don't use laptops for my infrastructure.

If it's a commercial offering, we'd prefer you DM me or contact iX.

If you are keen to contribute/test code… that would be appreciated.

1 Like

Yes, and if you have something like a ZimaCube Pro and you have already filled your PCIe slots, or the GPU doesn't fit, what you are proposing amounts to "oops, sell your NAS and buy a new one".

Long term, as we see USB4 evolve beyond USB-40 to USB-80, USB-120, etc., it will become a convenient and ubiquitous way to add devices, and support for PCIe tunneling is non-optional for USB4. In other words, if one asserts that it should not work in TrueNAS, then all USB4 and Thunderbolt functionality should be pulled from the TrueNAS kernel; there is no half-pregnant here, IMHO.

1 Like

For fun, let's contrast how different vendors approach this regression in the Linux kernel, which affects LTS kernels (some work, some don't) and mainline kernels (some work and some don't).

Truenas:
“Bug Clerk
3 hours ago
Thanks for the but ticket but this is outside the scope of what we support. I’d recommend reaching out to our forums to see if someone can help you.
This issue has now been closed. Comments made after this point may not be viewed by the TrueNAS Teams. Please open a new issue if you have found a problem or need to re-engage with the TrueNAS Engineering Teams.”

Proxmox:

“Please test other kernels to find the point of regression and we can investigate” (doesn’t mean they will fix it of course)

FWIW, this isn't really a "TrueNAS" thing. It's a kernel issue. As you yourself admit, this is new technology limited to the bleeding edge of hardware.

From TrueNAS's perspective, an NVMe drive is just a PCIe device. If you can tunnel it through USB4 as a "fabric", then it's really not inherently different from plugging it into your motherboard (performance limitations of bus speed notwithstanding). There's nothing new "to support" in the NAS.

I'd suspect that the Linux kernel will be in a better place in the next release cycle (TrueNAS 25.04). TrueNAS is really at the mercy of the LTS kernel. If PCIe tunneling isn't stable in 6.6 (and given that it broke, it certainly seems to be unstable), then much like our brethren who bought into Intel Arc early, you'll just have to wait patiently.

With Canonical betting on 6.11 in Ubuntu 24.10, I'd suspect we'll see 6.11.x become LTS… and land in TrueNAS in the spring. But expecting engineering man-hours to be invested NOW in something brand new that's broken upstream in the LTS kernel… really isn't a fair criticism of an appliance operating system (even if it was "working" before).

Do you have any experience on the newer kernel trains?

2 Likes

I got it working.

Thanks all.

I won't bother sharing the fix, given all people want to do is give me grief.

1 Like

I have more experience with kernels and bug fixes than I, as a non-coder, have any right to, but hey, thanks for asking.

This was a REGRESSION from earlier kernels; it worked perfectly in EARLIER kernels.

This is NOT A BLEEDING-EDGE ISSUE.

To be quite clear, I'm not trying to give you any grief. I am quite interested in this myself. Low-cost, low-power, scalable/modular DIY NAS solutions are like the holy grail.

This was more of a question like, “Have you tried Ubuntu 24.10 or similar, does it work there?” not “have you compiled the kernel from source and tried to debug?”

That's also understood. But this is a regression in the LTS, i.e. "conservative", Linux kernel. We can probably agree it's not a kernel branch where you'd EXPECT this kind of regression. But this is an upstream problem.

Where I think we'd disagree is that you're asking TrueNAS to roll back or investigate, forgoing other improvements with limited man-hours, for a niche use case. It's more realistic to expect upstream to fix an upstream issue, in my opinion. TrueNAS already uses the newest LTS kernel branch available.

It really is, though. Thunderbolt is/was a niche use case, basically only used in Apple land for its entire existence. The rest of the industry is playing catch-up here. USB4 kind of flips USB on its head.

1 Like

I never asked for any such thing; I asked for this to be a supported and tested scenario in a 2025 release. But hey, keep creating strawmen for you to attack.

This is quite a reasonable ask. I never had an expectation as to whether they would or wouldn't; that shouldn't stop anyone around here from making the ask for their scenario, and it is my right as a user to say why I think something is important. You may not agree - I am sure there are scenarios you think are important that I don't care about, or that only matter for certain segments addressed by the Scale or Enterprise editions. That doesn't mean they are invalid scenarios just because I don't want them.

As for niche scenarios, TrueNAS has tons of those. You seem to be accepting the niche scenarios you care about and rejecting the ones you don't, probably from some mistaken zero-sum-game analysis. Nice gate-keeping.

Lastly, on the technicality of this: I am still not sure that you understand that this is the same feature we had in TB3 and before (same code, minor fixes). This scenario works well back into 5.x-series kernels, including LTS. This is not a new feature in Linux - we have had code for PCIe tunnelling since 2014, hardly "bleeding edge or new". I have seen this work on 5.x LTS-series kernels. I only asked for it on USB4 as that's the hardware I have.

For anyone else (probably no one this far down in the thread), here is the fun TrueNAS box that I have and that this thread is about. It lets me test multiple different ZFS configurations, performance, use cases, etc.

Next up: AI model testing.

  • 6 x 24TB spinning drives
  • 4 x 4GB NVMe
  • 2 x 960GB NVMe
  • 2 x 960GB Optane SSD (external and hacky…)
  • 3 x 256GB NVMe
  • 1 x NVIDIA 2080 Ti
  • 2 x 2.5GbE
  • 1 x 10GbE
  • 64GB RAM

I think we're talking past each other. As I've said, I'm also interested in seeing this work.

Best of luck. I hope you change your mind and decide to share your findings with the community, rather than whatever this thread has become.