Nvidia Drivers in Electric Eel

Hello everyone. I'm a bit confused about how an Nvidia GPU is supposed to work in TrueNAS Electric Eel. I can see that the system recognises the GPU in the advanced settings, but Emby doesn't use it for some reason. I tried doing what was recommended in the release notes, but got this:

admin@truenas[~]$ midclt call -job docker.update ‘{“nvidia”: true}’
Status: Removing apt packages
Total Progress: [######################################__] 96.00%
[EFAULT] Command /root/tmphyu2fm7b/NVIDIA-Linux-x86_64-550.120-no-compat32.run --tmpdir /root/tmphyu2fm7b -s failed (code 1):
Verifying archive integrity… OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 550.120…

ERROR: Unable to load the kernel module ‘nvidia.ko’. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.

Please see the log entries ‘Kernel module load error’ and ‘Kernel messages’ at the end of the file ‘/var/log/nvidia-installer.log’ for more information.

ERROR: Installation has failed. Please see the file ‘/var/log/nvidia-installer.log’ for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 469, in run
    await self.future
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 513, in __run_body
    rv = await self.middleware.run_in_thread(self.method, *args)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1356, in run_in_thread
    return await self.run_in_executor(io_thread_pool_executor, method, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1353, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/nvidia.py", line 65, in install
    self._install_driver(job, td, path)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/nvidia.py", line 133, in _install_driver
    subprocess.run([path, "--tmpdir", td, "-s"], capture_output=True, check=True, text=True)
  File "/usr/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/root/tmphyu2fm7b/NVIDIA-Linux-x86_64-550.120-no-compat32.run', '--tmpdir', '/root/tmphyu2fm7b', '-s']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 469, in run
    await self.future
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 511, in __run_body
    rv = await self.method(*args)
         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 49, in nf
    res = await f(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 179, in nf
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/docker/update.py", line 84, in do_update
    await (
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 417, in wait
    raise self.exc_info[1]
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 473, in run
    raise handled
middlewared.service_exception.CallError: [EFAULT] Command /root/tmphyu2fm7b/NVIDIA-Linux-x86_64-550.120-no-compat32.run --tmpdir /root/tmphyu2fm7b -s failed (code 1):
Verifying archive integrity… OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 550.120…

ERROR: Unable to load the kernel module ‘nvidia.ko’. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.

Please see the log entries ‘Kernel module load error’ and ‘Kernel messages’ at the end of the file ‘/var/log/nvidia-installer.log’ for more information.

ERROR: Installation has failed. Please see the file ‘/var/log/nvidia-installer.log’ for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

I tried blacklisting Nouveau but it didn’t work, and I don’t know what else I can do. Any tips?
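In case it's useful to anyone hitting the same thing, a quick way to see whether nouveau is actually out of the way after blacklisting is with standard tools (nothing TrueNAS-specific):

lsmod | grep -E 'nouveau|nvidia'   # is nouveau still loaded, and did the nvidia module ever load?
lspci -nnk -d 10de:                # 10de: is the NVIDIA vendor ID; the "Kernel driver in use:" line shows which module owns the card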

I don't suppose you worked this one out? I've got the exact same issue: click "Install NVIDIA drivers", wait a bit, and then get a wall of text saying it failed. The server has a Tesla P4 in it, which should be just fine.

I have exactly the same error. Anyone got any suggestions?

Did you install the Nvidia driver?

Yes, I have "Install NVIDIA Drivers" selected in the settings.

@Truen and @naspin, what GPU devices do you have installed?

Edit: I guess primarily @naspin, didn’t realize how old this thread was.

An Nvidia T400 4GB. Despite the error the OP listed, after rebooting it looks like the Nvidia devices show up inside the container, and using Portainer I can also see the card as being available. I can also see the card if I run nvidia-smi.

I'll have to set up Plex to see if it actually transcodes a file during playback, or whether any errors persist.
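If it helps anyone, the plan for checking that is just to watch the card while forcing a hardware transcode; a rough sketch with plain nvidia-smi (nothing TrueNAS-specific):

watch -n 2 nvidia-smi    # the Plex transcoder process should appear in the process list during a hardware transcode
nvidia-smi dmon -s u     # or sample utilisation; the enc/dec columns should climb while a transcode runs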


I believe Plex may still have an issue in the RC pending a bug fix that is included in 24.10.0, but the T400 should be supported by the driver version that 24.10 installs.

The current driver is good back to Maxwell series cards (Quadro M-series, GTX 900+, GTX 750) - if it’s detected by nvidia-smi then the bump in the road is with the Docker subsystem or something in the app itself.
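A quick way to separate those two layers: if nvidia-smi works on the host, ask Docker for the GPU directly. A rough check (any image that ships nvidia-smi will do; the CUDA base image is just one example):

docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
# If this prints the same table as on the host, the Docker/NVIDIA runtime layer is fine and the
# problem is in the app itself; if it errors out, the problem is below the app.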


After some prompting, it looks like it might be related to a (known and hopefully fixed in 24.10.0) bug regarding Plex specifically. Jellyfin transcodes just fine on my Tesla P4 card despite it showing as an “Unknown” NVIDIA GPU.

I was googling all over trying to figure out why nvidia-smi was not working on Electric Eel. I couldn't get Docker to accept the Compose config either.

When I added:

version: "3.3"
services:
  plex:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities:
                - gpu

I would get the error:
Error response from daemon: could not select device driver "nvidia" with capabilities: [[gpu]]

And it would not deploy.
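That error generally means Docker can't find the NVIDIA container runtime/toolkit on the host. Purely as a diagnostic (not the fix), you can ask the daemon which runtimes it knows about; if no nvidia entry appears, the driver/toolkit side isn't set up yet:

docker info --format '{{json .Runtimes}}'   # look for an "nvidia" runtime alongside runc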

I found your post @truen

Ran midclt call -job docker.update '{"nvidia": true}' (making sure to convert the curly ‘ and ” quotes the forum put in your post back to straight ' and ").

It finished, and BAM. I have nvidia-smi and I can start the container.

Thanks!

I'm not sure how you got to this in the settings. I don't see it in my Electric Eel version, TrueNAS-24.10-BETA.1.

The setting is (somewhat unintuitively) under the "Apps" settings.

Hello Krovin,
What is the path of the file you edited?

Best!

Hey @Agent_Special

This thread relates to a setting and a command line that were needed back in the 24.10-BETA software. In the 24.10 RELEASE version (and even earlier, in the RC), the checkbox to install the drivers can be found under Apps → Settings → "Install NVIDIA Drivers"; tick it and then click Save.
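Once that install job finishes, a quick sanity check from the TrueNAS shell (the same checks others used earlier in the thread) is:

nvidia-smi            # should list the card and the driver version
ls -l /dev/nvidia*    # the device nodes should exist once the kernel module is loaded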


BTW, the OP's error message can still happen on release versions; another cause is the kernel module failing to load for any reason after compilation. dmesg can be used to see if there are any Nvidia failures, for example D3COLD error events. If dmesg has these errors, the card will be unreliable at best and non-functional at worst, even if it appears in the dropdown for, say, a VM.
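Something along these lines will pull the relevant messages out of the kernel log (just a filter; adjust the pattern as needed):

dmesg | grep -iE 'nvrm|nvidia|d3cold'   # look for NVRM probe failures, Xid errors, or D3COLD power-state events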

I have the same issue with the latest Electric Eel, v24.10.0.2. I am running an NVIDIA P4000.

I have the exact same issue. I upgraded to Electric Eel when it went GA but only noticed it now when Plex stopped hardware transcoding.

I’ve yet to find a solution.

This is what I see in dmesg

[725005.604205] NVRM: The NVIDIA probe routine was not called for 1 device(s).
[725005.607470] NVRM: This can occur when another driver was loaded and 
                NVRM: obtained ownership of the NVIDIA device(s).
[725005.607471] NVRM: Try unloading the conflicting kernel module (and/or
                NVRM: reconfigure your kernel without the conflicting
                NVRM: driver(s)), then try loading the NVIDIA kernel module
                NVRM: again.
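The hint at the end translates to roughly the following, assuming nouveau (or another driver) is what grabbed the card; it only works if nothing, such as the console framebuffer, is actively using it:

lsmod | grep -E 'nouveau|nvidia'   # check which of the two modules are currently loaded
modprobe -r nouveau                # unload the conflicting module (fails with "in use" if the console is bound to it)
modprobe nvidia                    # retry the NVIDIA module
nvidia-smi                         # should list the card if the module loaded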

I rolled back to 24.10.0 and ticked the "Install NVIDIA drivers" option in Apps. The driver installed successfully and I was able to run nvidia-smi from the TrueNAS shell. I then upgraded to 24.10.0.2. When I logged back in to TrueNAS post-upgrade, I noticed a new job running called "Installing NVIDIA drivers", and none of my installed apps appeared in the Apps section of the UI. After the job completed, my apps reappeared.

My Plex instance still doesn't use the GPU for transcoding and the dropdown in Plex does not display an available GPU, but one problem at a time. 🙂

(I'm sure this is all intended behavior, but it may help someone who has the same issue!)

EDIT: I ran the commands outlined in this post (I forgot to do this when I rolled back and re-upgraded) and it's all working now.