When the GPU is isolated (Settings > Advanced Settings > Isolated (3090)), lspci -k shows the kernel driver in use as vfio-pci
After un-isolating, the driver in use is still vfio-pci
After a reboot, it's back to using the nouveau driver:
At no point did nvidia-smi output anything other than the same error previously reported, and at no point was the GPU available for selection by a TrueNAS App. I did notice that when changing the isolated GPU setting, turning isolation off and rebooting causes the system to not reboot properly; it needs to be manually powered off and powered back on.
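For anyone following along, the driver state at each step can be checked with lspci -k. The snippet below is a minimal sketch that extracts the active driver from that output; the sample text and PCI address are illustrative, not captured from the affected system:

```shell
# Illustrative `lspci -k` output for an isolated GPU (sample text, not live output).
sample='01:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090]
	Kernel driver in use: vfio-pci
	Kernel modules: nouveau, nvidia_drm, nvidia'

# On a live system, replace the here-string with: sudo lspci -k -s 01:00.0
driver=$(printf '%s\n' "$sample" | awk -F': ' '/Kernel driver in use/ {print $2}')
echo "Active driver: $driver"
```

After isolating, un-isolating, and rebooting, the same one-liner should report vfio-pci, vfio-pci, and nouveau respectively, matching the sequence described above.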
Prepending your lspci with sudo should resolve this, but the key here is that even after isolating and un-isolating, the nouveau driver seems to resist being blacklisted.
Try `midclt call boot.update_initramfs` with the card in an un-isolated state. Going through the middleware should allow TrueNAS to update itself.
In a previous case with Nouveau issues, the resolution was to remove it.
Given the current software, I assume manually adding the Nvidia driver is not needed? @HoneyBadger
Manually unloading shouldn’t be necessary, as we’ve blacklisted the module in /etc/modprobe.d/blacklist-nouveau.conf - yet it still loaded by default for me on a clean install of 24.10.2
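For reference, a nouveau blacklist file conventionally contains lines like the following; I haven't diffed TrueNAS's shipped /etc/modprobe.d/blacklist-nouveau.conf, so treat this as an illustrative sketch rather than its exact contents:

```
# /etc/modprobe.d/blacklist-nouveau.conf (illustrative contents)
blacklist nouveau
options nouveau modeset=0
```

Note that because nouveau is loaded early from the initramfs, a blacklist in /etc/modprobe.d only takes full effect once the initramfs has been regenerated, which is why the `boot.update_initramfs` step above matters and a reboot alone may not be enough.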
In my case though, ticking the box under Apps → Config → Settings → Install NVIDIA Drivers resolved it almost immediately and switched to the correct driver:
truenas_admin@truenas[~]$ sudo lspci -k | grep NVIDIA -A2
81:00.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
Subsystem: NVIDIA Corporation GP104GL [Tesla P4]
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_drm, nvidia
truenas_admin@truenas[~]$ nvidia-smi
Wed Feb  5 06:51:33 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf           Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla P4                       Off |   00000000:81:00.0 Off |                  Off |
| N/A   29C    P0               22W / 75W |         0MiB / 8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
@Raikyu Please submit a bug report through the webUI of your current install state and include a debug file so that our Engineering team can investigate this in more detail
Unlikely; in fact the Tesla P4 isn’t listed in the supported cards (which would have made it more likely to fail) but I can redo this with a 750Ti or 3070 if that’s a concern.
Have you submitted a ticket/debug yet? If not, please do and include the ticket number here.
Circling back, the issue here appears to be that the debug kernel was enabled (System → Advanced → Kernel) - hopefully @Raikyu will confirm shortly after a reboot here that the nvidia driver is properly loaded and claiming all of their cards.
Yeah, not sure how the debug kernel was enabled, but after disabling it under System > Advanced Settings > Kernel and rebooting, the drivers started working again (I had to recreate the Apps that need a GPU in order for them to use the GPUs, though).
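As a post-mortem check, the debug-kernel flag can also be read from the shell via the middleware. The sketch below parses a sample of the JSON that `midclt call system.advanced.config` returns; the `debugkernel` field name is my assumption about the API shape, so verify it against your own system's output before relying on it:

```shell
# Sample JSON (illustrative); on a live TrueNAS box, capture the real thing with:
#   sample=$(midclt call system.advanced.config)
sample='{"id": 1, "debugkernel": true, "serialconsole": false}'

# Extract the debug-kernel flag using python3 from the base system (no jq needed):
flag=$(printf '%s' "$sample" | python3 -c 'import json,sys; print(json.load(sys.stdin)["debugkernel"])')
echo "debug kernel enabled: $flag"
```

If that prints a true value while your GPU drivers are misbehaving, it is worth disabling the debug kernel and rebooting, as described above.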