NVIDIA GPUs not working and nvidia-smi fails to communicate with NVIDIA drivers

After upgrading from TrueNAS 24.10.1 to 24.10.2 I’m still getting the following from nvidia-smi:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

and I have still been unable to use the NVIDIA drivers. I tried uninstalling and reinstalling the drivers via the web interface, as well as rebooting multiple times, but still no success.

It was not working on v24.10.1 before the upgrade either.
It worked for a brief period immediately after installing/re-installing the boot-pool; however, after a reboot the NVIDIA drivers stopped working and never came back to a working state, and nvidia-smi has been outputting that error ever since. I tried the following NVIDIA GPUs: 1060, 3080, and 3090. They work perfectly fine on Windows, and they worked immediately after installing the boot-pool, but after the reboot they stopped working.

Originally posted in update to 24.10.2 update thread but was informed to post here

My current system:

  • CPU: Threadripper 3995WX 16-Core
  • Motherboard: WS WRX80E-SAGE SE WI-FI
  • GPU: NVIDIA 1060 & NVIDIA 3090

Go browse some other threads and do the tutorial by the Bot, if you haven’t done so already. It will get your user trust level up so you can post images.

TrueNAS-Bot
Type this in a new reply and send to bring up the tutorial, if you haven’t done it already.

@TrueNAS-Bot start tutorial

It’s unusual that someone has three GPUs to test in a system.

Can you go through the test sequence you used?
Did each GPU work for a while?
Were they all installed at the same time?

The new NVIDIA driver is 550.142. It does seem to support each of those GPUs, but I doubt we tested with that combination.

In the driver info:

"Note that the list of supported GPU products is provided to indicate which GPUs are supported by a particular driver version. Some designs incorporating supported GPUs may not be compatible with the NVIDIA Linux driver: in particular, notebook and all-in-one desktop designs with switchable (hybrid) or Optimus graphics will not work if means to disable the integrated graphics in hardware are not available. Hardware designs will vary from manufacturer to manufacturer, so please consult with a system’s manufacturer to determine whether that particular system is compatible."

Is it possible your GPUs are integrated into a motherboard expecting Windows?

No, they are not integrated graphics; all three are discrete graphics cards. The sequence I used was the following:
1 - Plugged the NVIDIA 1060 and NVIDIA 3090 into PCIe slots.

2 - Installed the drivers from the GUI for both the 1060 and 3090; both were working for Plex transcoding and were selectable from other apps as well.

3 - Rebooted the system and the NVIDIA drivers stopped working.

4 - Uninstalled the drivers via the GUI and reinstalled them.

5 - Tried rebooting a few times with no change, reinstalling the drivers and rebooting again in between.

6 - Tried unplugging the power from one of the GPUs and turning the system on to see if it would work; I alternated which GPU was given power, but it still did not work.

7 - Reinstalled the boot-pool (I think with both GPUs powered on? Or maybe only one?) and the drivers were working again and selectable in the Docker apps (although to use the GPU in an app I had to remove and reinstall the app).

8 - To check whether rebooting was what was causing the issue, I rebooted the system, immediately ran nvidia-smi, and got the error again: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

9 - At this point I had practically given up, but I had one more GPU I could try, in case the problem had something to do with the GPUs themselves (unlikely, but worth a try). I powered off the TrueNAS system, removed the 3090 and replaced it with a 3080 from my gaming desktop, turned the system on, and ran nvidia-smi again, but was still greeted with the same error.

10 - Powered off the machine and unplugged all GPUs (leaving them in their PCIe slots) to save power, since they weren’t doing anything anyway, and just used the system without a GPU.

11 - Then this week I saw the changelog note about v24.10.2 potentially fixing the issue, so after powering off TrueNAS I replaced the 3080 with the original 3090, plugged in both the 3090 and the 1060, ran the upgrade to v24.10.2, then ran nvidia-smi and was disappointed again.

12 - Tried uninstalling and reinstalling the drivers again, which didn’t make a difference, and tried rebooting a few times, which also made no difference.

Note: When running Install NVIDIA Drivers, the check boxes selected are the “stable” and “community” ones.
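
For anyone trying to reproduce this, a rough sketch of the checks I would run right after each reboot (nothing TrueNAS-specific, just standard commands) is:

nvidia-smi                    # fails with the "couldn't communicate with the NVIDIA driver" error
lsmod | grep -i nvidia        # shows whether any nvidia kernel modules are loaded at all
lspci -k | grep -iA3 nvidia   # shows which kernel driver has claimed each NVIDIA device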

I am having the same issue; however, I am running a Tesla P4 passed through from a Proxmox host.

Is it worth trying with just 1 GPU at a time?

That way we have a simple reproduction case.
Does the issue only occur with 2 GPUs of different models?

Part of the challenge we face is differentiating between Linux driver issues and TrueNAS issues.

  1. Went to Apps > Settings and unchecked “Install NVIDIA Drivers” to uninstall nvidia drivers.

  2. Unplugged my NVIDIA 1060 leaving only the NVIDIA 3090 plugged in.

  3. Reinstalled the boot-pool, but this time I installed 24.10.2 directly instead of 24.10.1.

  4. Restored my pool1, which I had from before reinstalling the boot-pool.

  5. Ran nvidia-smi to see if anything was installed, and got the following output:
    [screenshot of nvidia-smi output]

  6. I assumed the output would be something like “command not found”, so to verify whether the drivers are installed I went to Apps > Settings, and the “Install NVIDIA Drivers” check box is checked. Maybe I uninstalled the drivers after saving the copy of my configuration?

  7. To confirm whether the apps now have the ability to use the 3090 GPU, I opened the Plex configuration, and it looks like the 3090 can be selected for the Plex app.

    [screenshot of the Plex GPU configuration]

  8. To verify whether it would persist after a reboot, I rebooted the system and immediately ran nvidia-smi, only to be greeted with the usual error: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

So it doesn’t seem to matter whether it’s only one GPU at a time, or whether it’s v24.10.1 or v24.10.2; neither persists after a reboot.


Thanks for a great diagnostic sequence. Can you capture a debug and report a bug?

In the meantime, I did find a “private” bug report that was similar.

https://ixsystems.atlassian.net/browse/NAS-133898

The user diagnosed this… do you have a similar issue?

  However, lspci -k shows that the GPU is being used by the nouveau driver, despite being blacklisted in /etc/modprobe.d/blacklist-nouveau.conf.
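
A quick way to check whether the same thing is happening on your system (standard Linux commands, nothing TrueNAS-specific) would be something like:

lspci -nnk -d 10de: | grep -E 'NVIDIA|Kernel driver in use'   # which driver owns each NVIDIA device
cat /etc/modprobe.d/blacklist-nouveau.conf                    # confirm the blacklist entry is present
lsmod | grep nouveau                                          # see whether nouveau is loaded anyway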

Yes, here is the NVIDIA section of the lspci -k output (with only the 3090 plugged in):
[screenshot of lspci -k output]

Output of cat /etc/modprobe.d/blacklist-nouveau.conf:
[screenshot of the blacklist file contents]

Here is the output of grep -i nvidia /var/log/messages

I cannot sign up to submit the Jira ticket because I never received the confirmation email code, so this is the best I can do. Also, if the private bug report is the same, couldn’t this just be added to it?

I’ve linked this thread/post to the Jira ticket… it looks to be the same problem.

Nouveau is taking over as the driver… I have no idea why.

But these seem like the same issues described in this post from 2023, with no solution; everyone just gave up hope or purchased a new GPU that happened to work for them (got lucky): Nvidia GPU not appearing for use with SCALE | Page 3 | TrueNAS Community

The output of cat /etc/modprobe.d/nvidia.conf:
cat: /etc/modprobe.d/nvidia.conf: No such file or directory

Output of find /lib/modules/6.6.44-debug+truenas/ -type f -name '*.ko' | grep nvidia:
/lib/modules/6.6.44-debug+truenas/kernel/drivers/usb/typec/altmodes/typec_nvidia.ko
/lib/modules/6.6.44-debug+truenas/kernel/drivers/platform/x86/nvidia-wmi-ec-backlight.ko
/lib/modules/6.6.44-debug+truenas/kernel/drivers/net/ethernet/nvidia/forcedeth.ko

Output of find /lib/modules/6.6.44-production+truenas/ -type f -name '*.ko' | grep nvidia:
/lib/modules/6.6.44-production+truenas/kernel/drivers/platform/x86/nvidia-wmi-ec-backlight.ko
/lib/modules/6.6.44-production+truenas/kernel/drivers/net/ethernet/nvidia/forcedeth.ko
/lib/modules/6.6.44-production+truenas/kernel/drivers/video/nvidia-drm.ko
/lib/modules/6.6.44-production+truenas/kernel/drivers/video/nvidia-modeset.ko
/lib/modules/6.6.44-production+truenas/kernel/drivers/video/nvidia-peermem.ko
/lib/modules/6.6.44-production+truenas/kernel/drivers/video/nvidia-uvm.ko
/lib/modules/6.6.44-production+truenas/kernel/drivers/video/nvidia.ko
/lib/modules/6.6.44-production+truenas/kernel/drivers/usb/typec/altmodes/typec_nvidia.ko

Output of ls -l /etc/modprobe.d/:
total 15
-rw-r--r-- 1 root root 154 Sep 2 2023 amd64-microcode-blacklist.conf
-rw-r--r-- 1 root root 44 Jan 27 20:56 blacklist-nouveau.conf
-rw-r--r-- 1 root root 154 May 29 2024 intel-microcode-blacklist.conf
-rw-r--r-- 1 root root 379 Feb 24 2023 mdadm.conf
-rw-r--r-- 1 root root 101 Feb 24 2023 nvdimm-security.conf

Note: The first time I ever installed TrueNAS on this system was about two months ago.
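
One thing that might be worth checking (just a guess on my part, not something from the ticket): the nvidia*.ko files only appear under the production kernel tree above, so it seems useful to confirm which kernel is actually booted and what is loaded:

uname -r                                          # kernel actually booted (e.g. 6.6.44-production+truenas)
find /lib/modules/$(uname -r) -name 'nvidia*.ko'  # nvidia modules available for the running kernel
lsmod | grep -E 'nouveau|nvidia'                  # what is currently loaded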

Were any of the GPUs working with a specific TrueNAS version?

While the issues appear to be similar, they may not be the same. The Linux kernel, NVIDIA drivers, and the Nouveau driver are constantly changing. At this stage, it’s not clear whether it’s a problem for all NVIDIA GPU users, or only for some users because there is something unusual about your set-up (history or hardware).

This week, I’d expect we’ll know how many users are impacted. Feel free to report a bug.

The previous bug in 24.10.1: NAS-133250 / 24.10.2 / fix loading nvidia kernel modules (by yocalebo) by bugclerk · Pull Request #15343 · truenas/middleware · GitHub

The workaround was to add a post-init script that runs modprobe nvidia_drm.
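
If anyone wants to try that workaround, a minimal sketch (assuming the driver itself is installed and only the module fails to load at boot) is a post-init command along the lines of:

modprobe nvidia_drm   # load the NVIDIA DRM module after boot
nvidia-smi            # confirm the driver now responds

added as a post-init entry (System Settings > Advanced > Init/Shutdown Scripts in the UI, if I remember the path correctly).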

The other issue that might be relevant is this one:

@HoneyBadger Does this problem sound like the same one or am I barking up the wrong tree?

Ran midclt call -job docker.update '{"nvidia": true}'
and the output was:
Status: Requested configuration applied
Total Progress: [########################################] 100.00%
{"id": 1, "pool": "pool1", "enable_image_updates": true, "nvidia": true, "address_pools": [{"base": "172.17.0.0/12", "size": 24}], "dataset": "pool1/ix-apps"}

nvidia-smi was still reporting: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

And cat /var/log/app_lifecycle.log shows nothing.

Doesn’t seem to be the solution for me.
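
For what it’s worth, the other places I plan to look for clues (log paths assumed from a standard SCALE install, so treat them as guesses) are:

grep -i nvidia /var/log/middlewared.log | tail -n 50   # middleware messages around driver install / module load
dmesg | grep -iE 'nvidia|nouveau'                      # kernel messages showing which driver bound the GPU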


This one is different - the UUID issue crops up when we get the correct nvidia driver bound, but the GPU UUID changes between installs/versions.

For some reason the open-source nouveau driver is snagging the GPU. I was troubleshooting a similar scenario a few days ago.

@Raikyu has your GPU for some reason decided to try to isolate itself under the System → Advanced menus?

No it has not tried to isolate itself; however, my GPU is in the dropdown list of GPUs that can be isolated.

If you can stomach a couple of reboots, try isolating it and rebooting - check lspci to ensure it got picked up by vfio - then un-isolate, reboot again, and see if nouveau got the message to leave it alone.
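
After the first reboot (with the card isolated), something along these lines should confirm whether vfio-pci actually grabbed it; after un-isolating and rebooting again, the same check should show nvidia (or nouveau, if the problem persists):

lspci -nnk -d 10de: | grep -i 'kernel driver in use'   # should read vfio-pci while isolated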

Cannot isolate when only one GPU is plugged in.
[screenshot of the error in the UI]

Going to reboot and try again with the 1060 GPU connected as well.