Docker Apps and UUID issue with NVIDIA GPU after upgrade to 24.10

Morning all,

We’re tracking an issue with Apps that impacts NVIDIA users, which we’ve now added to the Known Issues page of our release notes.

Some users who have upgraded to 24.10.0 from a previous version, and who have applications with NVIDIA GPU allocations, report the error Expected [uuid] to be set for GPU in slot [<some pci slot>] in [nvidia_gpu_selection] (see NAS-132086).

Users experiencing this error should follow the steps below for a one-time fix that should not need to be repeated.

Connect to a shell session and retrieve the UUID for each GPU with the command midclt call app.gpu_choices | jq.

For each application that experiences the error, run midclt call -job app.update APP_NAME '{"values": {"resources": {"gpus": {"use_all_gpus": false, "nvidia_gpu_selection": {"PCI_SLOT": {"use_gpu": true, "uuid": "GPU_UUID"}}}}}}'

Where:

  • APP_NAME is the name you entered for the application, for example "plex".
  • PCI_SLOT is the PCI slot identified in the error, for example "0000:2d:00.0".
  • GPU_UUID is the UUID matching that PCI slot, retrieved with the command above.
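For illustration, the retrieve-and-update steps above can be sketched in Python. This is a minimal sketch assuming the app.gpu_choices output shape shown elsewhere in this thread; the PCI slot and UUID below are placeholders, not real values:

```python
import json

# Hypothetical output of `midclt call app.gpu_choices`, keyed by PCI slot.
# The slot and UUID here are placeholders, not real values.
gpu_choices = {
    "0000:2d:00.0": {
        "vendor": "NVIDIA",
        "description": "NVIDIA GeForce RTX 3070",
        "vendor_specific_config": {"uuid": "GPU-00000000-0000-0000-0000-000000000000"},
        "pci_slot": "0000:2d:00.0",
    }
}

def build_update_payload(pci_slot: str) -> str:
    """Build the JSON argument for `midclt call -job app.update APP_NAME`."""
    uuid = gpu_choices[pci_slot]["vendor_specific_config"]["uuid"]
    return json.dumps({
        "values": {
            "resources": {
                "gpus": {
                    "use_all_gpus": False,
                    "nvidia_gpu_selection": {
                        pci_slot: {"use_gpu": True, "uuid": uuid},
                    },
                }
            }
        }
    })

print(build_update_payload("0000:2d:00.0"))
```

The printed string is exactly what goes inside the single quotes of the app.update command.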

Engineering is digging into the root cause of this - it may be related to the NVIDIA drivers being installed at first boot, with the apps system not refreshing the UUIDs correctly after the installation.

If you’re having an issue with NVIDIA that isn’t related to the missing UUIDs, please start a separate thread, and ideally include the exact text of the error message. If you have an issue with driver installation, please also include the /var/log/nvidia-installer.log file as an attachment.

17 Likes

Hi there,

Thanks for the quick update.

When I run the command I get the following output:

root@truenas[~]# midclt call -j app.update plex-new-new '{"values": {"resources": {"gpus": {"use_all_gpus": false, "nvidia_gpu_selection": {"0000:2b:00.0": {"use_gpu": true, "uuid": "GPU-f6edf301-01ce-f9db-a641-81b03b714f62"}}}}}}'  
[EBADMSG] Invalid method name
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 365, in on_message
    serviceobj, methodobj = self.middleware._method_lookup(message['method'])
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/utils/service/call.py", line 20, in _method_lookup
    raise CallError('Invalid method name', errno.EBADMSG)
middlewared.service_exception.CallError: [EBADMSG] Invalid method name

I used the output from the command in this post. Did I do something wrong?

No, that would be a typo on our side - it should be midclt call -job instead of just -j

Can you give it another shot?

That did the trick. Thanks!

1 Like

Thanks! Worked for me too!

@HoneyBadger Thanks for this! Fixed my problems…

Fixed my problem after an update from Dragonfish to Eel.

Thanks!

Your post fixed my issue. Thank you so much!

It’s a lot easier than when I had to edit the file manually through TrueNAS’s shell.

Hello,

I migrated from "TrueNAS Scale Bluefin" to "Scale ElectricEel 24.10.1" and updated "Frigate". As a result I lost GPU usage and had to rework the Frigate configuration to use the CPU instead of GPU hardware acceleration.

I then tackled the problem:

I first enabled the Intel integrated graphics in the BIOS, then installed an NVIDIA RTX 3050 LP card as a replacement for an old Gigabyte GTX 1060 card that used to work for Frigate under TrueNAS Scale Bluefin, before I ran the successive TrueNAS updates.

(I had kept Bluefin so long because of the change in rsync support.)

I then applied the advised shell commands:

root@TrueNAS-Asus-i5[~]# midclt call app.gpu_choices

{"0000:00:02.0": {"vendor": "INTEL", "description": "Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor Integrated Graphics Controller", "vendor_specific_config": {}, "pci_slot": "0000:00:02.0"},

"0000:01:00.0": {"vendor": "NVIDIA", "description": "NVIDIA GeForce RTX 3050", "vendor_specific_config": {"uuid": "GPU-5579b8ac-5f36-0a45-e8e8-2a3e995dcd5d"}, "pci_slot": "0000:01:00.0"}}

root@TrueNAS-Asus-i5[~]# midclt call -job app.update frigate '{"values": {"resources": {"gpus": {"use_all_gpus": false, "nvidia_gpu_selection": {"0000:01:00.0": {"use_gpu": true, "uuid": "GPU-5579b8ac-5f36-0a45-e8e8-2a3e995dcd5d"}}}}}}'

It worked for me. Thanks for that piece of information !

Got Frigate to see and make use of nVidia “RTX 3050 LP”.

However, TrueNAS Scale ElectricEel 24.10.1 / "Advanced Settings" / "Isolated GPU device" / Configure / Isolated GPU PCI Ids / GPUs still shows "No options".

I hope your teams will find the bug, thanks for the efforts.

Thanks, it worked for me too.

Instructions in the original post fixed the issue for me on v24.10.0.2.
However, I just noticed on v24.10.1 that my GPU is gone again, and running midclt call app.gpu_choices | jq returns an empty {}. If I go to the GPU isolation advanced settings, though, I do see my NVIDIA GPU there.

Any advice?

P.S. Reopened the related issue

Update: So it seems "Install NVIDIA drivers" was unchecked for whatever reason. I checked it again, and the Apps service stopped - I had to unset and re-set the pool. After the service restarted I was still experiencing the same thing. I decided to leave it alone for now, but almost an hour later, while adding another app, I noticed the NVIDIA GPU checkbox was back. I went to the app that required it, but of course it gave an error about an unassigned slot. So I had to run the command for each app again :man_facepalming: It seems to be working again now, but is it only until the next update…?

Having the same issue currently and it's super frustrating, even more so because the above fix is not resolving my issue. Some of my apps are totally borked because of it. I'll probably do a fresh install and revert to the previous version, tbh.

Just upgraded to 24.10.2 - so far so good :crossed_fingers:

Hi all. I wanted to share my experience with this issue. I am getting help on Discord for it right now and we have made progress. I was asked to share everything we have done in the hope it helps more people; I am running Ollama in Docker.

I upgraded from Dragonfish to ElectricEel-24.10.2.

This did not go well, to say the least. Like all of you, my NVIDIA drivers were missing. I could not install them because apt was disabled, so I reached out on Discord, GitHub, and the forums. nvidia-smi was not installed, but I could see my GPU in the settings under Isolate GPU.

I gotta say, Kryssa and HoneyBadger on Discord have been gems. Thank you all so much for all your help!

So this is what we did.

Kryssa asked me to run this from the Host OS shell:
midclt call -job docker.update '{"nvidia": true}'

Then all my apps disappeared. I was worried…
The worry was in vain, though: after a restart, everything came back!

nvidia-smi was now working and showing my Tesla P40.

However, the option in Apps → Configuration → Settings still did not show a check for NVIDIA. Weird!

I also cannot assign the GPU to any containers, as the GPU does not appear in the list.

Based on something recommended here, I ran the suggested command on the Host OS (screenshot of the output omitted). Only my iGPU was showing up. Eek.

HoneyBadger said they were going to look more into this. I have offered to send my card to them for testing, and offered access to the system as well.

I will keep everyone up to date on the progress we make.

Update: I noticed my GPU was listed as isolated for some reason! This was new, as it was never set to be isolated! To remove it, I had to make sure NOTHING was selected in the dropdown in the config menu and save.

The nvidia drivers checkbox now shows up!

Update 2: The saga continues… :smile:

I did another restart and stopped/started the container, and the GPU started to show under the Docker config!!! I was so excited, but alas my hopes were dashed again with this error.

Update 3: Success? Kinda?
So I tried unchecking and rechecking the NVIDIA drivers option a few times, and tried restarting a few times. No success. Then I tried a new instance of Ollama and the GPU worked!!! What!!!

So this part seems to be some kind of bug with Docker apps migrated from the old train, maybe? I dunno. It just means I need to remake my containers now.

1 Like

I had a slightly different issue, as outlined in this forum post, but this fix worked for me as well.

For anyone else who comes upon this, the error I saw in /var/log/app_lifecycle.log was as follows:

[2025/01/29 12:03:48] (ERROR) app_lifecycle.compose_action():56 - Failed 'up' action for 'jellyfin' app:  Network ix-jellyfin_default  Creating
Network ix-jellyfin_default  Created
Container ix-jellyfin-jellyfin-1  Creating
Container ix-jellyfin-jellyfin-1  Created
Container ix-jellyfin-jellyfin-1  Starting
Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: device error: GPU-de46ebef-5d4b-b3c4-dc71-e559845dafd7: unknown device: unknown

So, for my GPU:

{"0000:01:00.0": {"vendor": "NVIDIA", "description": "NVIDIA GeForce RTX 3070", "vendor_specific_config": {"uuid": "GPU-69ac717c-ad02-0d74-de0c-750af03095dc"}, "pci_slot": "0000:01:00.0"}}

I input this:

'{"values": {"resources": {"gpus": {"use_all_gpus": false, "nvidia_gpu_selection": {"0000:01:00.0": {"use_gpu": true, "uuid": "GPU-69ac717c-ad02-0d74-de0c-750af03095dc"}}}}}}'

and receive this error:

zsh: command not found: {"values": {"resources": {"gpus": {"use_all_gpus": false, "nvidia_gpu_selection": {"0000:01:00.0": {"use_gpu": true, "uuid": "GPU-69ac717c-ad02-0d74-de0c-750af03095dc"}}}}}}

Can someone explain what I've done wrong? I've re-read it at least 10 times and I'm scratching my head.

Did you use

midclt call app.gpu_choices '{"values": {"resources": {"gpus": {"use_all_gpus": false, "nvidia_gpu_selection": {"0000:01:00.0": {"use_gpu": true, "uuid": "GPU-69ac717c-ad02-0d74-de0c-750af03095dc"}}}}}}'

or just

'{"values": {"resources": {"gpus": {"use_all_gpus": false, "nvidia_gpu_selection": {"0000:01:00.0": {"use_gpu": true, "uuid": "GPU-69ac717c-ad02-0d74-de0c-750af03095dc"}}}}}}'

The second one is invalid because it's just the arguments for the command, not the command itself - that would be midclt call app.gpu_choices.
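For anyone else puzzled by the quoting, here is a small Python sketch using shlex (which tokenizes the same way a POSIX shell does) showing that the single-quoted JSON reaches midclt as one argument. The app name "plex" and the trimmed-down payload are just examples:

```python
import shlex

# A trimmed-down version of the fix command. The JSON payload is wrapped
# in single quotes so the shell passes it through as a single word.
cmd = """midclt call -job app.update plex '{"values": {"resources": {"gpus": {"use_all_gpus": false}}}}'"""

args = shlex.split(cmd)
print(args[0])   # the command itself: midclt
print(args[-1])  # the JSON payload, delivered whole as one argument
```

Pasting only the quoted JSON into the shell, without the leading midclt call …, makes the shell try to execute the JSON as a command, which is exactly the "command not found" error above.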

1 Like

Both produced errors.

First:

truenas_admin@Blake-Plex[~]$ midclt call app.gpu_choices '{"values": {"resources": {"gpus": {"use_all_gpus": false, "nvidia_gpu_selection": {"0000:01:00.0": {"use_gpu": true, "uuid": "GPU-69ac717c-ad02-0d74-de0c-750af03095dc"}}}}}}'
[EFAULT] Too many arguments (expected 0, found 11)
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 211, in call_method
    result = await self.middleware.call_with_audit(message['method'], serviceobj, methodobj, params, self)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1529, in call_with_audit
    result = await self._call(method, serviceobj, methodobj, params, app=app,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1460, in _call
    return await methodobj(*prepared_call.args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 178, in nf
    args, kwargs = clean_and_validate_args(args, kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 148, in clean_and_validate_args
    raise CallError(f'Too many arguments (expected {len(nf.accepts)}, found {len(args[args_index:])})')
middlewared.service_exception.CallError: [EFAULT] Too many arguments (expected 0, found 11)

I'm an idiot, I missed putting in the app name. Thank you, it fixed my issue.

1 Like

I have the same problem - I can't get through the second command.
I punched in the first:
midclt call app.gpu_choices
Output:
{"0000:01:00.0": {"vendor": "NVIDIA", "description": "Quadro P620", "vendor_specific_config": {"uuid": "GPU-912e473a-fe29-b26d-9777-6e170fbf8c78"}, "pci_slot": "0000:01:00.0"}}

Good so far. But whenever I try to input the second command
(midclt call -job app.update jellyfin '{"values": {"resources": {"gpus": {"use_all_gpus": false, "nvidia_gpu_selection": {"0000:01:00.0" {"use_gpu": true, "uuid": "GPU-912e473a-fe29-b26d-9777-6e170fbf8c78"}}}}}}') I get the error "a dict was expected".

If I copy just the first part
(midclt call -job app.update jellyfin '{"values": {"resources": {"gpus": {"use_all_gpus": false}}}}')
it runs fine and removes the GPU assigned to jellyfin, which I can then add back via the GUI.


It completes and everything looks fine, but transcoding is still not working; nvidia-smi shows 0% utilization.