Fangtooth Nvidia GPU passthrough for Apps

Hi,
When trying to run Apps from the store [like Immich, OWU, Llama, etc.] I am only getting “Passthrough available (non-Nvidia) GPU’s”.
The GPU [a Quadro P2000; I tried a 3080 before, which did the same thing] is recognized in SCALE and by lspci, but seemingly has no drivers and cannot be used by Apps.
There also used to be an “Install Nvidia Drivers” option on EE. Where is it now? Is it just gone?

I just got mine working on SCALE 25.04 (Plex app, GTX 950):

Enable IOMMU in BIOS/UEFI:

Reboot your TrueNAS SCALE server.
Enter the BIOS/UEFI settings during boot (usually by pressing DEL, F2, F11, or F12).
Locate the IOMMU settings. This is often under sections like:
Advanced
System Agent (for Intel)
Northbridge/Chipset (for AMD)
Virtualization Technology
Enable IOMMU (or the equivalent setting).
Save the changes and exit the BIOS/UEFI.
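
Once the system comes back up, a quick way to confirm the firmware is actually exposing an IOMMU is to look for its ACPI table in the kernel log (a sketch, run from the TrueNAS shell; DMAR is the Intel table, IVRS the AMD one):
sudo dmesg | grep -i -e DMAR -e IVRS

If nothing comes back, the BIOS setting most likely didn’t take.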

Enable IOMMU in TrueNAS SCALE:

Open the TrueNAS SCALE Shell: You can do this from the web interface or via SSH.

Execute the midclt command to set the kernel extra options. Remember to use the correct IOMMU parameters for your CPU:

Intel:
midclt call system.advanced.update '{ "kernel_extra_options": "intel_iommu=on iommu=pt" }'

AMD:
midclt call system.advanced.update '{ "kernel_extra_options": "amd_iommu=on iommu=pt" }'

If you encounter issues with AMD, you might also need to include pci=assign-busses:
midclt call system.advanced.update '{ "kernel_extra_options": "amd_iommu=on iommu=pt pci=assign-busses" }'

Verify the setting: You can verify that the option has been set by retrieving the advanced system settings:
midclt call system.advanced.get
Look for the “kernel_extra_options” field in the output and confirm it contains the parameters you set.
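
The full output is one big blob of JSON, so if you only want that one field, something like this works (a sketch; python3 ships with SCALE, so no extra tools are needed):
midclt call system.advanced.get | python3 -c "import json,sys; print(json.load(sys.stdin)['kernel_extra_options'])"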

Reboot your TrueNAS SCALE server: The kernel extra options are applied during the boot process, so a reboot is necessary.
sudo reboot
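
Once it’s back up, two quick sanity checks (a sketch) to confirm the options took effect:
cat /proc/cmdline
ls /sys/kernel/iommu_groups/

The first should show your intel_iommu/amd_iommu parameters on the kernel command line; the second should list numbered IOMMU groups (an empty directory means the IOMMU still isn’t active).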

Once IOMMU is enabled, do the following:

1. Verify GPU Detection

Open the TrueNAS SCALE shell and run:
lspci | grep NVIDIA

You should see output similar to:
09:00.0 VGA compatible controller: NVIDIA Corporation GM206 [GeForce GTX 950] (rev a1)
09:00.1 Audio device: NVIDIA Corporation GM206 High Definition Audio Controller (rev a1)

Note the PCI Bus ID (e.g., 09:00.0).
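
If you also want to see which kernel driver currently owns the card, lspci can show that (substitute your own bus ID for the 09:00.0 example):
lspci -nnk -s 09:00.0

After the steps below you’d want “Kernel driver in use: nvidia” here; vfio-pci instead would mean the card is isolated/reserved for VM passthrough.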


2. Enable NVIDIA for Docker

Run:
midclt call -j docker.update '{"nvidia": true}'
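
If you want to confirm the flag stuck before rebooting, the Docker service config can be read back (a sketch, assuming docker.config is exposed like the other *.config methods):
midclt call docker.config

Look for "nvidia": true in the JSON it prints.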


3. Reboot TrueNAS

Run:
sudo reboot


4. Verify Docker NVIDIA Runtime

After rebooting, run:
docker info | grep -i nvidia

You should see nvidia listed under Runtimes: and Default Runtime:.
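
As an optional extra check (a sketch, not required for the Apps flow, and it needs internet access to pull the image), you can ask Docker for a throwaway container with the GPU attached; the NVIDIA runtime mounts nvidia-smi into it, so you should see the same GPU table as on the host:
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi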


5. Identify GPU UUID

Run:
nvidia-smi -L

Example output:
GPU 0: NVIDIA GeForce GTX 950 (UUID: GPU-7ae8def0-6839-1bbb-4b26-272e7560256f)

Note the UUID.


6. Update Application Configuration with UUID

Replace the placeholders (APP_NAME, PCI_SLOT, GPU_UUID) with your values.

Example command:
midclt call -j app.update plex '{"values": {"resources": {"gpus": {"use_all_gpus": false, "nvidia_gpu_selection": {"0000:09:00.0": {"use_gpu": true, "uuid": "GPU-7ae8def0-6839-1bbb-4b26-272e7560256f"}}}}}}'
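
To double-check that the values were stored, recent releases can echo an app’s user config back (treat the method name as an assumption if you’re on an older version):
midclt call app.config plex

The gpus section should now show your PCI slot and UUID.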


7. Restart the Application

Through the TrueNAS SCALE web interface:
Apps → Installed Applications → Plex → Restart


8. Verify GPU Usage

In Plex Web UI:
Settings → Transcoder → Ensure Hardware Acceleration is enabled

Play a transcoded video and:
Check Plex Activity Dashboard → Status → Now Playing → Look for (hw) next to the stream

(Optional) Run on TrueNAS shell during transcoding:
nvidia-smi
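
You can also confirm the card is visible from inside the app’s container rather than just on the host (a sketch; the container name below is a placeholder, grab the real one from docker ps first):
sudo docker ps --format '{{.Names}}' | grep -i plex
sudo docker exec <plex-container-name> nvidia-smi   # replace with the name printed above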

2 Likes

Hi there. Alas, this method does not work for me, to the point that if I start the Jellyfin app, it completely crashes the entire TrueNAS system.
I’ve tried to unbind the GPU with the command

midclt call -j app.update jellyfin '{"values": {"resources": {"gpus": {}}}}'

To no avail.
I also tried resetting to the default behaviour through

midclt call -j app.update jellyfin '{"values": {"resources": {"gpus": {"use_all_gpus": true}}}}'

It still crashes TrueNAS if I start the app.

Just for confirmation: Yes, IOMMU is on, yes the variables are correct in the command
midclt call -j app.update jellyfin '{"values": {"resources": {"gpus": {"use_all_gpus": false, "nvidia_gpu_selection": {"0000:21:00.0": {"use_gpu": true, "uuid": "GPU-8d457232-05dc-e284-d0d5-3d69018befec"}}}}}}'

Two points to note, though, which may or may not have an effect:
1/ It’s a dual-CPU server. One graphics card (the P40) is linked to CPU 1 for my ComfyUI instance, the other (a GTX 1060) to CPU 2. Could it be that the hardware crashes if CPU 1 tries to access GPU 2? If so, can I ‘link’ CPU 2 to the Jellyfin app?
2/ Not all commands gave me the correct reply as you described. (But nvidia-smi does list the 1060 as properly installed.)

Somewhat at a loss here and unsure how to get Jellyfin back to its original state before I started tinkering to get the transcoding working…

Help much appreciated, and thanks in advance!

Also - for anyone stumbling across this post - the forum changes straight quotes to curly ones, so you need to either type the command in manually or copy it to a text editor and find/replace the quotes…
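
If you paste the command into a file first, something like this (a rough sketch using GNU sed on the TrueNAS shell; cmd.sh is just a placeholder filename) swaps the curly quotes back to straight ones:
sed -i 's/“/"/g; s/”/"/g' cmd.sh
sed -i "s/‘/'/g; s/’/'/g" cmd.sh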

1 Like

Not sure if you tried this, but I asked AI (I know, that’s cheating, right? :joy:)

Likely Causes of the Crash

1. Dual CPU Architecture (NUMA Split)

  • The server uses two CPUs, and the GPUs are attached to different CPUs:
    • P40 → CPU 1
    • GTX 1060 → CPU 2
  • If Jellyfin (or its Kubernetes pod) runs on a core tied to CPU 1 and tries to access the GPU on CPU 2, cross-CPU memory access (especially for IOMMU and DMA) may lead to instability or total system crashes.
  • This is a known issue in NUMA systems without proper isolation or affinity control.

→ Fix Suggestion:
Set CPU affinity or pin Jellyfin’s pod to a core on CPU 2 to match the GPU topology. Unfortunately, TrueNAS SCALE doesn’t currently offer direct CPU core pinning for apps in its GUI or midclt, but manually editing the Kubernetes pod deployment (if you drop to the k8s layer) could help.


2. Incorrect PCI Binding or VFIO Conflict

  • It’s possible that the NVIDIA driver or TrueNAS’s k8s pod runtime is trying to use the GPU before it’s fully bound or available, or is trying to unbind it while it’s in use.
  • A misconfiguration here (like conflicting VFIO bindings, or not releasing a GPU from another container or service) can also hard crash the system.

→ Fix Suggestion:
Ensure that no other service is using the GPU and confirm that VFIO-pci isn’t trying to claim it. Check with:

lspci -nnk | grep -A 3 -i nvidia

Make sure the NVIDIA driver is bound, not vfio-pci.
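
A lower-level way to check the same thing per device (a sketch, using the 0000:21:00.0 address from this thread; substitute your own):
readlink /sys/bus/pci/devices/0000:21:00.0/driver   # the last path component is the bound driver, e.g. nvidia or vfio-pci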


3. NVIDIA P40 + GTX 1060 Driver Incompatibility

  • Mixing data center (P40) and consumer GPUs (GTX 1060) can cause instability if the driver doesn’t fully support both in the same instance, especially in containerized environments.
  • The TrueNAS SCALE host uses a unified NVIDIA runtime. One driver version must work for both cards, and sometimes issues arise if one GPU uses features unsupported by the other.

→ Fix Suggestion:
Check that both GPUs are supported in the same NVIDIA driver version (nvidia-smi can confirm driver version).


4. Container Runtime GPU Assignment Issues

  • When configuring the GPU UUID in the app config, if the pod is misassigned or misconfigured (e.g., GPU not visible in the container), it may attempt a low-level GPU call and crash if access is denied or corrupt.

→ Fix Suggestion:
Try setting use_all_gpus: true temporarily and test if Jellyfin works without any GPU assignment. If it does, it confirms that the crash is GPU-passthrough related.


:hammer_and_wrench: How to Restore Jellyfin to a Working State

To fully reset the Jellyfin app’s GPU config:

midclt call -j app.update jellyfin '{"values": {"resources": {"gpus": {}}}}'

Then restart the app:

midclt call app.restart jellyfin

If that doesn’t work, uninstall and reinstall the app via the TrueNAS SCALE Web UI to wipe the config clean.


:mag: Additional Suggestions

  • Use journalctl -xe right after the crash (if accessible via SSH) to see kernel panic logs or GPU driver errors (see the example after this list).
  • Run Jellyfin without GPU passthrough first. If it works fine, then it’s confirmed to be the GPU passthrough causing the issue.
  • Consider using only one GPU at a time per workload to simplify GPU passthrough setup.
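
For the crash specifically, a narrower filter than journalctl -xe is to search the kernel messages from the boot where it happened for NVIDIA errors (a sketch; -b -1 means the previous boot):
sudo journalctl -k -b -1 | grep -i -e xid -e nvrm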

My setup involves only one CPU, so it’s hard to replicate/test these suggestions.

1 Like

Out of curiosity - which one is this? (I’ve been troubleshooting with the help of DeepSeek.) The differences are… interesting. (I’ll troubleshoot with that one then.)
As to feedback -
1/ Both CPUs should (according to HP) be able to access all GPUs; the difference is apparently in bandwidth.

2/ My P40 is bound to vfio-pci, not the 1060.
Results from that command:
admin@truenas[~]$ lspci -nnk | grep -A 3 -i nvidia
0a:00.0 3D controller [0302]: NVIDIA Corporation GP102GL [Tesla P40] [10de:1b38] (rev a1)
Subsystem: NVIDIA Corporation GP102GL [10de:11d9]
Kernel driver in use: vfio-pci
Kernel modules: nouveau, nvidia_drm, nvidia
1f:08.0 System peripheral [0880]: Intel Corporation Xeon E7 v2/Xeon E5 v2/Core i7 QPI Link 0 [8086:0e80] (rev 04)
Subsystem: Hewlett-Packard Company Xeon E7 v2/Xeon E5 v2/Core i7 QPI Link 0 [103c:18a8]
1f:09.0 System peripheral [0880]: Intel Corporation Xeon E7 v2/Xeon E5 v2/Core i7 QPI Link 1 [8086:0e90] (rev 04)

21:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP106 [GeForce GTX 1060 6GB] [10de:1c03] (rev a1)
Subsystem: PC Partner Limited / Sapphire Technology GP106 [GeForce GTX 1060 6GB] [174b:2438]
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_drm, nvidia
21:00.1 Audio device [0403]: NVIDIA Corporation GP106 High Definition Audio Controller [10de:10f1] (rev a1)
Subsystem: PC Partner Limited / Sapphire Technology GP106 High Definition Audio Controller [174b:2438]
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel

So I gotta check that
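
For what it’s worth, a card showing “Kernel driver in use: vfio-pci” on SCALE is usually one that sits in the Isolated GPU Device(s) list under System Settings → Advanced, which reserves it for VM passthrough - expected here for the P40/ComfyUI setup. Assuming the field name used in recent releases, that list can be read back with:
midclt call system.advanced.get | python3 -c "import json,sys; print(json.load(sys.stdin).get('isolated_gpu_pci_ids'))"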

3/ The driver is 550.142, and yes, it supports both the P40 and the 1060.
nvidia-smi only lists the 1060 (probably because the P40 is bound to vfio-pci for some reason…).

4/ Jellyfin worked fine without the GPU. Once connected - death. Even if removed with that command (so it should be back to default) - death. So yeah, I agree the issue is passthrough related.

This said (as extra info):
I’ve tried making a Linux instance and connecting the P40 to that one. Trying to create the instance with the GPU in it - death.

I did not know the journal command though. Running it now (the system has run for a day now) gives me this interesting feedback:
May 11 02:11:20 truenas kernel: NVRM: GPU at PCI:0000:21:00: GPU-8d457232-05dc-e284-d0d5-3d69018befec
May 11 02:11:20 truenas kernel: NVRM: Xid (PCI:0000:21:00): 62, pid='', name=, 0001c02e 0001c01a 0001c074 0001bec5 0001bebe 0001bd88 00000011 00000000

That’s the 1060. (The P40 is at 0a:00.0.)

In fairness, I do have to mention I’m doing things this server was never designed for… It was never designed for USB 3.2, nor 2.5 Gbps networking, nor the eclectic combination of supported and unsupported disks (some SSDs, one NVMe) spread over an LFF drive bay and an SFF one (not allowed by HP), plus a PCIe card holding the NVMe drive…
Every step of this build has been a pain I would not dare to inflict upon my worst enemies…

I’ll check the logs tomorrow and connect to the management console (which for some reason refuses to connect to the network, but that’s not TrueNAS’s fault) and report back. If meanwhile you (or the AI) have further suggestions (or at least its name), I’m all ears!

Thanks for the help!

Gemini premium :slight_smile:

Analysis of Your Situation

You’ve got a dual-CPU server with a P40 on CPU 1 and a GTX 1060 on CPU 2.

  • P40 (0a:00.0): vfio-pci bound. This is expected for passthrough to VMs (like your ComfyUI instance or a Linux instance). nvidia-smi won’t list it because the NVIDIA driver on the host isn’t managing it.
  • GTX 1060 (21:00.0): nvidia driver bound. This means the host OS (TrueNAS SCALE) is actively using the NVIDIA driver for this card. This is good for applications running directly on the host or in containers that access the host’s NVIDIA runtime.
  • Jellyfin Crash: The critical point is that Jellyfin crashes the entire TrueNAS system when you try to start it with GPU passthrough, even after trying to unbind or reset. The Xid (PCI:0000:21:00): 62 error points directly to the 1060. This is a common NVIDIA driver error indicating a serious issue with the GPU or its interaction with the system, often a hard reset or unrecoverable error.

Addressing the AI’s Suggestions and Your Findings

  1. Dual CPU Architecture (NUMA Split):
  • AI’s Suggestion: Set CPU affinity or pin Jellyfin’s pod to a core on CPU 2.
  • Your Observation: HP says both CPUs can access all GPUs, but bandwidth differs.
  • Relevance: Even with shared access, NUMA imbalances can cause issues. If Jellyfin’s process (or its Kubernetes pod) is scheduled on CPU 1 and is trying to access the 1060 on CPU 2, the latency and cross-QPI traffic for large data transfers (like transcoding) could theoretically contribute to instability, especially if there are underlying driver or IOMMU issues. However, the direct crash points more to a driver/hardware interaction than just NUMA latency.
  • TrueNAS SCALE Limitation: As the AI noted, direct CPU core pinning for apps in TrueNAS SCALE isn’t readily available in the GUI or midclt. This would require advanced Kubernetes manipulation, which is beyond the scope of typical TrueNAS app management and could be overwritten by updates.
  2. Incorrect PCI Binding or VFIO Conflict:
  • AI’s Suggestion: Ensure no other service is using the GPU and confirm vfio-pci isn’t trying to claim it.
  • Your Observation: P40 is vfio-pci (correct). 1060 is nvidia (correct for its intended use with Jellyfin).
  • Relevance: Your lspci output confirms the correct binding. The 1060 is managed by the nvidia driver, so vfio-pci isn’t interfering with it directly. The issue is more likely with how the nvidia driver or the container runtime handles the 1060 when Jellyfin tries to use it.
  3. NVIDIA P40 + GTX 1060 Driver Incompatibility:
  • AI’s Suggestion: Check driver support for both.
  • Your Observation: Driver 550.142 supports both. nvidia-smi only lists 1060.
  • Relevance: This is less likely to be the root cause since the driver version is recent and generally supports both architectures. The fact that nvidia-smi only shows the 1060 is because the P40 is effectively “invisible” to the host’s NVIDIA driver due to vfio-pci.
  4. Container Runtime GPU Assignment Issues:
  • AI’s Suggestion: Try use_all_gpus: true or running without GPU passthrough first.
  • Your Observation: Jellyfin works without GPU. Once connected (even after trying to remove it), it crashes. This strongly confirms the issue is GPU-passthrough related, specifically with the 1060 and how Jellyfin tries to interact with it.

Your Critical Xid 62 Error

The NVRM: Xid (PCI:0000:21:00): 62 is highly significant. An Xid 62 error means the NVIDIA driver detected an unrecoverable error or an internal error that requires the GPU to be reset. This often happens due to:

  • Hardware Issues: While less likely if the card works otherwise, a subtle instability could be revealed under load.
  • Driver Bugs: A bug in the NVIDIA driver itself, especially when interacting with specific hardware or kernel versions, or under specific memory access patterns (like those from a container).
  • Power Supply Issues: The GPU might not be getting enough stable power, leading to errors under load. This is especially relevant in systems pushed beyond their original design, as you mentioned.
  • Memory Corruption/IOMMU Issues: Data corruption during transfer to/from the GPU, potentially related to IOMMU, NUMA, or kernel memory management, can trigger Xid errors.

Next Steps and Further Troubleshooting

Given the Xid 62 and the full system crash, this points to a low-level interaction problem.

  1. Isolate the GTX 1060:
  • Uninstall ComfyUI (Temporarily): While the P40 is vfio-pci bound, it’s worth temporarily removing the ComfyUI app and any associated GPU passthrough settings for the P40. This is to absolutely ensure there’s no subtle interaction or resource conflict between the two GPUs or their respective drivers/passthrough mechanisms at the system level. You want to simplify the environment as much as possible.
  • Reboot: After any changes to GPU binding or app uninstallation, always reboot TrueNAS SCALE.
  2. Clean Jellyfin GPU Configuration:
  • You’ve tried: midclt call -j app.update jellyfin '{"values": {"resources": {"gpus": {}}}}'
  • And: midclt call app.restart jellyfin
  • If the crash persists even after this, and before trying to re-enable GPU passthrough for Jellyfin: This suggests the previous attempts at GPU binding have left a persistent state that’s causing the crash even without explicit GPU selection.
  • The Nuclear Option (as suggested by the AI):
    • Uninstall Jellyfin completely via the TrueNAS SCALE Web UI. This is crucial as it should purge all associated app data and Kubernetes configurations.
    • Reboot TrueNAS SCALE.
    • Reinstall Jellyfin from scratch.
    • DO NOT enable GPU passthrough immediately. Test if Jellyfin starts and functions normally without any GPU configured. If it does, then the base app installation is sound.
  3. Check Power and Physical Connections:
  • This might sound basic, but in a custom build, especially with dual CPUs and multiple GPUs, ensuring robust and sufficient power delivery is paramount.
  • Double-check all PCIe power cables to the GTX 1060 and ensure they are seated firmly and coming from a sufficiently powerful power supply.
  • Ensure the card itself is seated correctly in its PCIe slot.
  4. Advanced Debugging (if the above doesn’t work):
  • journalctl -xe immediately after crash: You’ve already started this. Look for anything beyond the Xid 62 that indicates a kernel panic, memory allocation failure, or other hardware-level errors. You might need to pipe this to a file if it’s long: journalctl -xe > /var/log/jellyfin_crash.log.
  • TrueNAS SCALE Logs: TrueNAS SCALE has its own logging system. While journalctl shows kernel-level events, the TrueNAS UI or cli might offer more specific app-related logs if the system doesn’t immediately crash. Look in System Settings -> Advanced -> System Dataset -> Export Debug or explore /var/log for Kubernetes logs if you’re comfortable.
  • Temporary GPU removal: If all else fails, can you physically remove the P40 (temporarily) and try to get Jellyfin working with only the GTX 1060 in the system? This would further isolate if there’s any interaction between the two cards.
  5. Consider an Older NVIDIA Driver (If possible):
  • TrueNAS SCALE’s NVIDIA driver is part of its base OS. Downgrading isn’t straightforward. However, sometimes newer drivers introduce regressions. If you find others with similar hardware having issues with 550.142, it might be a consideration if and when TrueNAS offers a different driver version. (This is usually a last resort).

The “HP Says Both CPUs can Access” Caveat

While HP states both CPUs can access all GPUs, the manner of access can be critical. Truly NUMA-aware applications are designed to optimize memory placement. Containerized environments (Kubernetes) add another layer of complexity. If the system is trying to perform DMA (Direct Memory Access) from CPU 1 through the QPI link to a device on CPU 2’s PCIe bus, and there’s any instability in the QPI link, IOMMU, or the driver’s handling of these cross-socket accesses, it can lead to hard crashes.

It’s a very difficult problem without direct CPU affinity controls.

Summary of Actions

  1. Temporarily remove ComfyUI app and P40 passthrough settings.
  2. Reboot TrueNAS.
  3. Completely uninstall Jellyfin via the TrueNAS SCALE UI.
  4. Reboot TrueNAS.
  5. Reinstall Jellyfin.
  6. Test Jellyfin without any GPU passthrough.
  7. If it works, carefully attempt to re-enable GPU passthrough for the 1060, paying attention to the exact GPU UUID and steps.
  8. If it crashes again, collect detailed journalctl -xe logs immediately after the crash.
  9. Physically inspect power connections to the 1060 and its seating in the slot.

Sorry if these posts sound lazy on my part lol

1 Like