NVIDIA GPUs not working and nvidia-smi fails to communicate with NVIDIA drivers

When isolated (Settings > Advanced Settings > Isolated GPU Device(s), 3090):
lspci -k shows the driver in use becomes vfio-pci

After unisolating, the driver in use is still vfio-pci

After a reboot, it's back to using the nouveau driver:

At no point did nvidia-smi output anything other than the same error as previously reported, and at no point was the GPU available for selection by a TrueNAS App. I did notice that when rebooting after changing the isolated GPU setting (turning isolation off and then rebooting), the system does not reboot properly and needs to be manually powered off and powered back on.

Capabilities state “<access denied>”

NVIDIA 3090:
lspci -v -d 10de:2204 outputs the following (Video):

lspci -v -d 10de:1aef outputs the following (Audio):


NVIDIA 1060:
lspci -v -d 10de:1c02 outputs the following (Video):

lspci -v -d 10de:10f1 outputs the following (Audio):


I assume the following would probably be the fix:
$ sudo nano /etc/modules-load.d/vfio.conf

Type the following lines into the /etc/modules-load.d/vfio.conf file:
vfio
vfio_iommu_type1
vfio_pci

Then run sudo update-initramfs -u -k all, but that cannot be done on this read-only system.
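For reference, a minimal sketch of what that would look like on a system with a writable root (standard Debian tooling; purely illustrative here, since the SCALE root filesystem is read-only):

# Illustrative only - assumes a writable root, which TrueNAS SCALE does not provide
$ printf '%s\n' vfio vfio_iommu_type1 vfio_pci | sudo tee /etc/modules-load.d/vfio.conf
$ sudo update-initramfs -u -k all   # rebuild the initramfs for all installed kernels
$ sudo reboot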

In case this link is useful for a solution: https://thelinuxforum.com/articles/895-how-to-configure-proxmox-ve-8-for-pci-pcie-and-nvidia-gpu-passthrough

Prepending your lspci commands with sudo should resolve the <access denied> output, but the key here is that even after isolating and un-isolating, the nouveau driver seems to be refusing to stay blacklisted.
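For example, re-running the earlier queries with sudo should show the full Capabilities instead of <access denied>:

$ sudo lspci -v -d 10de:2204   # 3090 video function
$ sudo lspci -v -d 10de:1aef   # 3090 audio function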

Try midclt call boot.update_initramfs with the card in an un-isolated state. Going through the middleware should allow TrueNAS to update itself.
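From a shell session that's just the following; as I understand it, the boolean it prints indicates whether the middleware actually regenerated the initramfs:

$ sudo midclt call boot.update_initramfs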

OK, yeah, lspci gave more details after adding sudo.

But the output of midclt call boot.update_initramfs was just False.

In a previous case with nouveau issues, the resolution was to remove it.
Given the current software, I assume adding the NVIDIA driver is not needed? @HoneyBadger

Manually unloading shouldn't be necessary as we've blacklisted the module in /etc/modprobe.d/blacklist-nouveau.conf - and yet it still loaded by default for me on a clean install of 24.10.2 :thinking:
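(For reference, a nouveau blacklist file typically looks something like this - I'm quoting the common Debian pattern, not necessarily the exact contents SCALE ships:)

# /etc/modprobe.d/blacklist-nouveau.conf - typical contents; SCALE's actual file may differ
blacklist nouveau
options nouveau modeset=0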

truenas_admin@truenas[~]$ sudo lspci -k | grep NVIDIA -A2
81:00.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
        Subsystem: NVIDIA Corporation GP104GL [Tesla P4]
        Kernel modules: nouveau

In my case though, ticking the box under Apps → Config → Settings → Install NVIDIA Drivers resolved it almost immediately and switched it to the correct driver:

truenas_admin@truenas[~]$ sudo lspci -k | grep NVIDIA -A2
81:00.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
        Subsystem: NVIDIA Corporation GP104GL [Tesla P4]
        Kernel driver in use: nvidia
        Kernel modules: nouveau, nvidia_drm, nvidia
truenas_admin@truenas[~]$ nvidia-smi
Wed Feb  5 06:51:33 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla P4                       Off |   00000000:81:00.0 Off |                  Off |
| N/A   29C    P0             22W /   75W |       0MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

@Raikyu Please submit a bug report through the webUI of your current install state and include a debug file so that our Engineering team can investigate this in more detail

Did you try rebooting and running nvidia-smi again after the reboot?

Yep, timeline below.

Clean installed 24.10.2 → nouveau
Hit the checkbox under Apps → nvidia
Reboot → nvidia
Isolate without reboot → nvidia
Reboot → vfio-pci - running nvidia-smi fails as expected
Un-isolate without reboot → vfio-pci
Reboot → nvidia

At no time did I manually rebuild the initramfs or do any modprobe-ing at the CLI.

Hmm… :thinking: maybe because I have normal consumer-grade GPUs?
Guess consumer cards just aren't tested or are unsupported with TrueNAS?

Unlikely; in fact the Tesla P4 isn’t listed in the supported cards (which would have made it more likely to fail) but I can redo this with a 750Ti or 3070 if that’s a concern.

Have you submitted a ticket/debug yet? If not please do and include the ticket number here :slight_smile:

Tried to, but since it requires Jira I needed to make an account, and I wasn't receiving the confirmation email. I'll try again later.

I’ll DM you :slight_smile:

Circling back, the issue here appears to be that the debug kernel was enabled (System → Advanced → Kernel) - hopefully @Raikyu will confirm shortly after a reboot here that the nvidia driver is properly loaded and claiming all of their cards.

Yeah, not sure how the debug kernel was enabled, but after disabling it in System > Advanced Settings > Kernel > Configure and rebooting, the drivers started working again (though I had to recreate the Apps that need a GPU in order for them to use the GPUs).
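(For anyone else who wants to check the debug kernel state from a shell rather than the UI, something like the following should work - note that the exact field name in the middleware's advanced config is my assumption:)

$ sudo midclt call system.advanced.config   # look for a debug-kernel field in the JSON output (field name assumed)
$ uname -r                                  # confirm which kernel build is actually running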

Looks like it's working.


Great to hear. The debug kernel seems to be the cause of the issue in another 25.04 case as well.

It explains why most users haven't seen the issue, but you have. It's not a common problem. Were you running a nightly image at any stage?