No GPUs in isolation list

Hello, I am new here.
I did a fresh install of ElectricEel-24.10.0 a couple of days ago. Everything seems to work normally. I have installed the NVIDIA drivers from Apps->Configuration->Settings.
nvidia-smi can see my two GPUs: an RTX 3070 and a Quadro P2000.
Driver Version: 550.127.05 CUDA Version: 12.4

I had the exact same hardware on TrueNAS SCALE 24.04 and could isolate the 3070 for a Win11 VM. It worked great.
But now neither GPU appears in the isolation list. They also don't show up in the GPU list when creating a VM. I can see them under the PCI passthrough devices for a VM, but adding them that way fails.
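
For anyone wanting to compare, this is roughly how I checked the state from the shell. I'm assuming the midclt call below is the same query the UI uses to build the isolation list, but I haven't verified that:

# List the NVIDIA cards and their PCI addresses
lspci -nn | grep -i nvidia
# Confirm the driver still sees both cards
nvidia-smi -L
# Ask the middleware which GPUs it offers for isolation (empty for me)
midclt call system.advanced.get_gpu_pci_choices | jq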

Any tips on why this happens and what I can do here?
Let me know if you need more info.

Thanks


I have the same issue and no solution right now. I don't know why it stopped working after the update. Maybe a fresh reinstall can help? Someone on Discord had the same issue, but after they rebooted the device it worked fine.
The NVIDIA GPU does show up on the apps edit page, so TrueNAS sees the GPU but can't isolate it.
I have an AMD iGPU and a Quadro PCIe GPU, and I can't isolate either of them. I also need an isolated GPU for a VM.
Now I'm wondering whether I should just install a regular Linux distro and start using my NAS like a PC.

I don't think a fresh install will help; I tried that.
I could also see my GPUs on the apps edit page (Plex/Jellyfin).
But today, after 5 hours of uptime, nvidia-smi suddenly can’t see them at all:
“Failed to initialize NVML: No supported GPUs were found
Unable to determine the number of GPUs”
What's going on?
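
If anyone wants to compare, this is a rough way to check whether the cards got grabbed by vfio-pci or whether the NVIDIA modules simply never loaded; just a sketch, not an official procedure:

# Which kernel driver currently owns each NVIDIA device?
lspci -nnk | grep -A 3 -i nvidia
# Are the nvidia/vfio modules loaded at all?
lsmod | grep -e nvidia -e vfio
# Any driver or vfio messages since boot?
sudo dmesg | grep -i -e nvrm -e vfio | tail -n 20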

Joining the discussion as I have the same problem. What was working fine in TrueNAS-SCALE-24.04.2.5 is broken in Electric Eel 24.10.2:

The "Isolated GPUs" list is empty.

  • UI option to install NVIDIA driver in Apps->Configure->Settings is missing
  • Manual NVIDIA driver install via CLI works (see the verification sketch after this list):
midclt call -job docker.update '{"nvidia": true}'
  • nvidia-smi suddenly starts working after unsetting and re-setting the Apps pool.
  • My fresh unencrypted Apps dataset is on an encrypted pool, so the Apps service only starts after I unlock the encrypted pool via password. That might explain why the unset/set Apps pool workaround works.
  • GPUs are not available as PCI-e passthrough devices when editing/creating a VM.
  • Force-refreshing the browser tab doesn't do anything. I also tried a different browser.
  • When installing a new app, the NVIDIA GPUs are shown and the app deploys fine. So it's just the GPUs not being listed in the "Isolated GPUs" menu.
  • GPUs are listed and can be isolated on the same hardware when I boot the 24.04.2.5 image. I confirmed this by trying fresh installs of 24.10.0.2, 24.10.0.0 and 24.04.2.5. I will gather the state of 24.04.2.5 where it's working as intended.
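
The verification sketch referenced above, for completeness. This assumes docker.config mirrors the docker.update payload and exposes the same "nvidia" key, which I haven't confirmed:

# Install the NVIDIA driver via the middleware (same call as above)
midclt call -job docker.update '{"nvidia": true}'
# Check that the flag stuck (assumption: docker.config exposes the same "nvidia" key)
midclt call docker.config | jq '.nvidia'
# Confirm the driver can enumerate the cards
nvidia-smi -L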

:information_source: Created a bug ticket: Jira (Click on the :eye: symbol to start watching for updates.)

@Jon Your mileage may vary and I don't know what mainboard and CPU you have, but I was able to get it working on 24.10 by enabling native ACS support in the BIOS. Maybe 24.10 uses a newer Linux kernel that needs native ACS support for GPU isolation to work in some cases. Maybe the 'system.advanced.config.kernel_extra_options' were changed in 24.10. Let's wait for the devs to clarify; meanwhile, here are the steps I took:

Fixing the IOMMU groups for TrueNAS SCALE 24.10.x

:information_source: In my understanding, the GPUs will not be available for apps after doing this, because they will be controlled by the NVIDIA driver installed within the VM. However, I found no information regarding this (admittedly logical) limitation.
First, configure the virtualization settings in the BIOS:
As documented by SuperMicro for H11/H12 MoBos: https://www.supermicro.com/support/faqs/faq.cfm?faq=31883
:information_source: The available BIOS options will vary by mainboard brand and model

"For H12/H11 series motherboard with Rome CPU (EPYC 7xx2) installed, please use latest BIOS and enable below items under BIOS
BIOS >> Advanced >> CPU configuration >> SVM Mode >> Enabled
BIOS >> Advanced >> PCIe/PCI/PnP Configuration >> SR-IOV Support >> Enabled
BIOS >> Advanced >> NB Configuration >> ACS Enable >> Enabled
BIOS >> Advanced >> NB Configuration >> IOMMU >> Enabled
BIOS >> Advanced >> ACPI settings >> PCI AER Support >> Enabled

In newer BIOS revision with Rome processor, please enable AER first then ACS will be exposed"

# Boot into TrueNAS, go to System -> Shell in the UI and switch user to root:
sudo su root

# List VGA devices' IOMMU group(s) and other devices in same group(s):
for vga in $(lspci -D | awk '/VGA/ {print $1}'); do
    group=$(readlink /sys/bus/pci/devices/$vga/iommu_group | awk -F'/' '{print $NF}')
    echo "IOMMU Group $group:"
    ls /sys/kernel/iommu_groups/$group/devices/ | xargs -I {} lspci -s {}
    echo
done
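
# Optional sanity checks (my own additions, not from any official docs): confirm that the
# AMD IOMMU actually initialized at boot and that the ACS capability shows up on the PCIe
# devices once it is enabled in the BIOS.
sudo dmesg | grep -i -e 'AMD-Vi' -e 'IOMMU' | head -n 10
sudo lspci -vvv | grep -i 'Access Control Services'
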
# List advanced config and pipe to jq for readability:
midclt call system.advanced.config | jq
# List cmdline options
cat /proc/cmdline             
# Reset extra kernel options:
midclt call system.advanced.update '{"kernel_extra_options":""}'
# Check advanced config change:
midclt call system.advanced.config | jq -r '.kernel_extra_options'
reboot now

# This should now output "softdep nvidia pre: vfio-pci"
cat /etc/modprobe.d/vfio.conf
# Confirm the GPUs are available for isolation, should be visible in UI also
midclt call system.advanced.get_gpu_pci_choices | jq
# Confirm GPUs are isolated, should be visible in UI also
midclt call system.advanced.config | jq -r '.isolated_gpu_pci_ids'
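
# If the UI list stays empty, it may be possible to set isolation directly via the middleware.
# Assumption on my part: 'isolated_gpu_pci_ids' is writable through system.advanced.update in
# 24.10 - I have not verified this, so treat it as a sketch. The PCI address below is only a
# placeholder; use the IDs reported by get_gpu_pci_choices above.
midclt call system.advanced.update '{"isolated_gpu_pci_ids": ["0000:0a:00.0"]}'
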
# Check IOMMU groups of the GPUs
for vga in $(lspci -D | awk '/VGA/ {print $1}'); do
    group=$(readlink /sys/bus/pci/devices/$vga/iommu_group | awk -F'/' '{print $NF}')
    echo "IOMMU Group $group:"
    ls /sys/kernel/iommu_groups/$group/devices/ | xargs -I {} lspci -s {}
    echo
done

# Confirm Binding: Check if kernel driver in use is vfio-pci
lspci -nnk | grep -A 4 -i vga
# Another way to confirm successful binding to vfio:
sudo dmesg | grep -i vfio

# Try to create a VM through the UI and attach isolated GPU(s) to it. It should start with no errors.

Last resort: if native ACS is not supported, consider applying the ACS override patch, but this can compromise security and integrity!
ACS override patch references:
https://lkml.org/lkml/2013/5/30/513
https://wiki.archlinux.org/title/PCI_passthrough_via_OVMF#Bypassing_the_IOMMU_groups_(ACS_override_patch)

I wish you the best of luck; hopefully you won't have to use the ACS override patch.

Hello. Thanks for the answers.
I actually have an H12SSL-i and an EPYC 7282 as well.
I tried to pass through the NVIDIA 3070 VGA and audio controller to the Win11 VM again after a couple of reboots, and it just worked without isolation.
And after updating to 24.10.0.2 and installing the NVIDIA drivers again from Apps->Configuration->Settings, it could also see my other Quadro P2000 card, and it's working with the Docker apps.
So for now everything works as it should. The GPU isolation list is still empty, but I don't mind.
I did not change any settings in the BIOS, by the way. And nvidia-smi shows only the Quadro P2000 card.
Thanks again for helping out.

Dang, that's a nice coincidence. It would be interesting to know whether you have AER and ACS enabled in your BIOS. At the very least this is a UI bug. The fact that it works on your end sporadically after some reboots suggests that the vfio driver gets through eventually, but most of the time the NVIDIA kernel modules (and possibly the driver) are loaded first, which prevents isolation. I wouldn't recommend installing the NVIDIA drivers on the TrueNAS host via the UI option under Apps, so they can't interfere with vfio trying to grab the cards for isolation. That is, unless you also want to be able to use the GPUs for apps. I don't know the exact mechanism that TrueNAS uses to load and unload drivers on demand.
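
If you want to see which driver won that race on a given boot, something along these lines should show it. Nothing authoritative, just the same kind of checks as in the steps above:

# Which driver currently owns each NVIDIA function (VGA and audio)?
lspci -nnk | grep -A 3 -i nvidia
# Rough load order of vfio-pci vs. the NVIDIA module (NVRM) in the kernel log
sudo dmesg | grep -i -e 'vfio' -e 'nvrm' | head -n 20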