Unable to use NVIDIA GPU with Jellyfin after 24.10 upgrade

Hi,

Upgraded from 24.04.2.3 to 24.10 and everything works fine, except for ticking “Use this GPU” in the Jellyfin configuration.

I enabled the NVIDIA driver using the following command:
midclt call -job docker.update '{"nvidia": true}'
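
(For anyone checking their own box: after the job completes, the setting should read back as "nvidia": true with the matching read call — I’m assuming here that docker.config is exposed alongside docker.update:)

midclt call docker.config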

Error when trying to update the app:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 488, in run
    await self.future
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 535, in __run_body
    rv = await self.middleware.run_in_thread(self.method, *args)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1364, in run_in_thread
    return await self.run_in_executor(io_thread_pool_executor, method, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1361, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/service/crud_service.py", line 268, in nf
    rv = func(*args, **kwargs)
         ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 55, in nf
    res = f(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 183, in nf
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/apps/crud.py", line 287, in do_update
    app = self.update_internal(job, app, data, trigger_compose=app['state'] != 'STOPPED')
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/apps/crud.py", line 317, in update_internal
    update_app_config(app_name, app['version'], new_values, custom_app=app['custom_app'])
  File "/usr/lib/python3/dist-packages/middlewared/plugins/apps/ix_apps/lifecycle.py", line 59, in update_app_config
    render_compose_templates(
  File "/usr/lib/python3/dist-packages/middlewared/plugins/apps/ix_apps/lifecycle.py", line 50, in render_compose_templates
    raise CallError(f'Failed to render compose templates: {cp.stderr}')
middlewared.service_exception.CallError: [EFAULT] Failed to render compose templates: base_v1_1_4.utils.TemplateException: Expected [uuid] to be set for GPU inslot [0000:01:00.0] in [nvidia_gpu_selection]

lspci -v output:

01:00.0 VGA compatible controller: NVIDIA Corporation TU116 [GeForce GTX 1660 Ti] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: Hewlett-Packard Company TU116 [GeForce GTX 1660 Ti]
        Flags: bus master, fast devsel, latency 0, IRQ 16, IOMMU group 1
        Memory at ee000000 (32-bit, non-prefetchable) [size=16M]
        Memory at d0000000 (64-bit, prefetchable) [size=256M]
        Memory at e0000000 (64-bit, prefetchable) [size=32M]
        I/O ports at e000 [size=128]
        Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
        Capabilities: <access denied>
        Kernel driver in use: nvidia
        Kernel modules: nouveau, nvidia_drm, nvidia

nvidia-smi output:

Tue Oct 29 20:55:36 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1660 Ti     Off |   00000000:01:00.0 Off |                  N/A |
| 40%   37C    P0             N/A /  120W |       1MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Any insights?

1 Like

No insights - just here to report the identical issue (and the identical error message referencing “Expected [uuid] to be set for GPU inslot” as posted).

The only thing I have to add is that this existed for me on RC2 as well. I updated today to the official release, and it didn’t alter the behavior. I’ve done two app updates on Jellyfin since then, and no change.

Edit: I enabled the NVIDIA drivers in the GUI rather than from shell; that’s one other minor difference between our setups.

3 Likes

Good to know. I feared that doing it through the CLI had messed it up.

+1 on the same error here. I thought this was going to get fixed with the official release :frowning:

3 Likes

Does nvidia-smi -L return the UUID for the GPU?
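
(On a working setup it prints one line per card, something like the following — the UUID shown here is a placeholder:)

GPU 0: NVIDIA GeForce GTX 1660 Ti (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)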

https://ixsystems.atlassian.net/issues/NAS-132086

The ticket is open; see the note on it from our development team:

Hello, can you uncheck the GPU, save,
open the page in a new, preferably private tab (so it’s not cached),
then try to edit there and enable the GPU again.

Yes, the UUID is returned for me when I run that command.

1 Like

I assume there’s no way to tag yourself on an open issue? I also have this problem, and unfortunately the private tab didn’t change anything.

And DjP-iX, that command returns the UUID for my GPU as well.

Create an account if you don’t already have one, add a comment to the issue, and if you’d like, click the bugclerk private upload link and upload a debug to assist with the investigation!

1 Like

I’ve had the same issue since I started testing version 24.10, and I have a theory about what might be happening.

When I first installed version 24.10-Beta1 on a new NAS, I set up the NVIDIA drivers from the console without any issues. However, after updating to 24.10-RC1, the NVIDIA drivers were uninstalled, so I had to reinstall them, this time using the graphical user interface (GUI). At that point, containers using the GPU started experiencing issues with the UUID. If I disabled the GPU option, the containers would run fine; but when I re-enabled it, they stopped working. After investigating a bit, I found that the only way to fix this was to delete the application and reconfigure it from scratch.

From 24.10-RC1 to 24.10-RC2, same problem.

Now, with the final release of 24.10, I’m facing the same issue related to the GPU UUID. The NVIDIA drivers were not installed, so I reinstalled them from the application settings. The issue lies in the fact that each time the NVIDIA driver is reinstalled, the GPU UUID changes, causing applications that use the GPU to attempt to access the previous UUID.

My theory is that when TrueNAS is updated, the NVIDIA drivers are uninstalled. When they are reinstalled, whether through the CLI or the GUI, the GPU UUID changes, but the containers continue pointing to the old UUID, which leads to issues when starting applications.

Sorry for my English.

Your English is excellent - please don’t apologize. :slight_smile:

I think your theory is the same one the development team was pursuing there. It makes sense, except the error being shown says the UUID isn’t set at all when creating the container, as opposed to the UUID being invalid. I tried deleting the app and re-creating it to see if that helped, and it did not. My theory is that an argument isn’t being passed when creating the app from the Apps section of the GUI. I’m out of town for a couple of days, but when I get back, I’ll try to create the app manually and see if it works that way (assuming we haven’t got a resolution by then).

I have the same issue: Plex won’t boot if I check the GPU box. I installed the drivers from the config page, then I unchecked the box and the option disappeared. I cleared the cache and tried from a new browser and a private window, and the option is still missing. nvidia-smi is fine.

1 Like

For me it works without any problems when creating a new instance of an app; I never deleted the old one, just made another instance.

Thank you very much @DaveInCanada; I’m not a native speaker and don’t use English regularly.

I’ve found the issue, at least in my case.

To fix it, it’s necessary to modify a file from the command line interface (CLI), so each person should proceed at their own risk.

The file to modify is user_config.yaml, located in the ixVolume (a dataset automatically created by the system); in my case it’s found at /mnt/.ix-apps/user_config.yaml (in your case it may be a different path).

This file contains the configuration for all installed applications. Upon reviewing it, I found a specific section related to the GPU (it appears in every app that can use a GPU):

resources:
  gpus:
    nvidia_gpu_selection:
      '0000:07:00.0':
        use_gpu: true
        uuid: ''  # <-- the problem
        use_all_gpus: false

As you can see, the uuid field is empty ('') when it should contain the UUID obtained by running nvidia-smi -L. I replaced every empty uuid field with the correct UUID (without quotes, directly GPU-xxxxx…), saved the file, and restarted the application from the graphical user interface (GUI). Everything worked smoothly.
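
For illustration, the corrected section would look like this (the UUID is a placeholder; substitute the value that nvidia-smi -L reports for your card):

resources:
  gpus:
    nvidia_gpu_selection:
      '0000:07:00.0':
        use_gpu: true
        uuid: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  # placeholder, use your own
        use_all_gpus: false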

In addition, you need the PCI slot address for your GPU (the key under nvidia_gpu_selection; strictly speaking it’s the PCI address, not the IOMMU group). In my case it is 0000:07:00.0, which can be found using the command lspci -Dnn | grep -i NVIDIA. Thanks to @Mortorojo for the helpful addition.
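
As an example of what that command prints, the slot address is the first field (the bracketed vendor/device IDs here are placeholders):

0000:01:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU116 [GeForce GTX 1660 Ti] [10de:xxxx] (rev a1)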

It seems that the issue occurs because, for some reason, Docker leaves the uuid field empty after an update, possibly because the driver isn’t installed at that moment, or for some other unknown reason.

At least now I don’t have to reconfigure all applications to make them work with the GPU.

9 Likes

Migrated from 23.04 to 24.10. I do not have /tank/.ix-apps/user_config.yaml, and no hidden .ix-apps directory with a user_config.yaml. I only see /poolname/ix-applications.

In my case, since it’s a fresh installation on a new NAS, Docker was already configured from the start.

If you’re migrating from the previous version, where Kubernetes was used instead of Docker, the folder might be different. I suggest checking within your directory ix-application for a file named user_config.yaml or any other file with a .yaml extension.

To find the right file, run the following command in the terminal:

grep uuid filename.yaml

If you see something like uuid: '', that’s the file you’ll need to modify.
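
If you’d rather search everything at once, something like this should turn up any app config with an empty UUID (a sketch; it assumes your pools are mounted under /mnt and may take a while on large pools):

grep -rn "uuid: ''" /mnt --include='*.yaml' 2>/dev/null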

EDIT: Sorry, I put the path wrong, it is /mnt/.ix-apps not /tank/.ix-apps. I have corrected the original post. :sweat_smile:

ix-applications only contains JSONs, no YAMLs; they must be storing that somewhere else for migrated apps.

1 Like

For me the file was at /mnt/.ix-apps/user_config.yaml

Used nano to edit it:
sudo nano /mnt/.ix-apps/user_config.yaml
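
(Alternatively, for anyone comfortable with sed, a one-liner along these lines does the same edit — a sketch; it writes a .bak backup first, and the UUID is a placeholder to replace with your own from nvidia-smi -L:)

sudo sed -i.bak "s/uuid: ''/uuid: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/" /mnt/.ix-apps/user_config.yaml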

Can confirm that after adding in the correct UUID, I was able to enable the GPU for Jellyfin, and hardware acceleration works!

3 Likes

It’s possible. This is my first NAS using TrueNAS, and I started directly with version 24.10; before, I was using OpenMediaVault.

I don’t have experience migrating from version 23.04 (which used Kubernetes) to 24.10 (which uses Docker), but the .yaml file should be somewhere, as Docker uses it to configure all the containers.

I’m sorry I can’t be of more help; my experience with TrueNAS begins with 24.10 :blush:.

Hm, for me the UUID is populated, but hardware transcoding doesn’t work in Plex. Any other ideas?