What's up with the weird, random GPU behaviour?

Karmalakas · August 12, 2024, 9:01am

So I’m on Dragonfish-24.04.2 with a GTX 1070Ti and have been playing around for a week or two. Probably the most annoying issue I have is random GPU detection. I’ve searched a lot, but couldn’t find any reasonable explanation.

When I want to install any app that allows GPU selection, there are 3 dropdowns. I have the nVidia GPU enabled. Usually the nVidia dropdown shows only the 0 option to select:

[Screenshot 2024-08-11 153843]

But if you keep refreshing and if you’re lucky, occasionally you get even 5 of them:

[Screenshot 2024-08-11 172939]

You get that literally by just refreshing the page countless times…

But that also doesn’t make sense, because there’s only one GPU in my machine. If I select anything more than 1, I get an error. I’ve seen some people on the net say it’s the core count you assign to a specific app, and others say it’s the actual GPU count… I have 2 apps using the GPU, Plex and Jellyfin. Both have 1 selected in the GPU selector and both use the GPU as expected when transcoding media.

If you finally manage to get the GPU assigned to the apps, deployment is a nightmare…

After app service or system restart, when all apps are deploying, some of them usually get:

Allocate failed due to no healthy devices present; cannot allocate unhealthy devices nvidia.com/gpu, which is unexpected

But this happens for apps that don’t even have any GPU assigned. The ones that do have a GPU assigned sometimes get:

0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod…

Sometimes (very rarely) stopping and restarting deployment helps. What usually (still not always) helps - in Apps advanced settings disabling and re-enabling GPU support… Then apps with assigned GPU almost always deploy successfully. Ones without the GPU still might get the Allocate failed error. In that case either re-deployment helps or just leaving in the deployment state for ~10 minutes until it manages to deploy…
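
Both messages refer to nvidia.com/gpu, which is the Kubernetes resource name for the GPU. A rough way to see what the node is advertising at a given moment is to dump the node objects as JSON and read the capacity and allocatable values for that resource. The sketch below assumes the JSON comes from something like `k3s kubectl get nodes -o json` (an assumption about this setup, not something confirmed in this thread). The function name, node name and sample values are made up for illustration.

```python
import json

# Resource name taken from the error messages quoted above.
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_counts(nodes_json: str, resource: str = GPU_RESOURCE) -> dict:
    """Map node name -> capacity/allocatable counts for `resource`."""
    doc = json.loads(nodes_json)
    # A node list has an "items" array; a single node object does not.
    nodes = doc.get("items", [doc])
    counts = {}
    for node in nodes:
        name = node.get("metadata", {}).get("name", "<unknown>")
        status = node.get("status", {})
        counts[name] = {
            # Kubernetes reports these quantities as strings, e.g. "1".
            "capacity": int(status.get("capacity", {}).get(resource, 0)),
            "allocatable": int(status.get("allocatable", {}).get(resource, 0)),
        }
    return counts


# Made-up sample, NOT real output from the system discussed here.
SAMPLE = json.dumps({
    "items": [{
        "metadata": {"name": "example-node"},
        "status": {
            "capacity": {"nvidia.com/gpu": "5"},
            "allocatable": {"nvidia.com/gpu": "0"},
        },
    }]
})

print(gpu_counts(SAMPLE))
```

If allocatable comes back lower than what an app requests, that would at least be consistent with the "Insufficient nvidia.com/gpu" message, though it does not explain why the value changes between refreshes.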

Would really appreciate any insights about this behaviour. I find myself spending much more time disabling/enabling services and constantly re-deploying apps than actually setting up my environment…

Farout · August 12, 2024, 9:18am

This and similar problems have been reported basically since SCALE was released.
And it hasn’t gotten any better.

I suggest:

running the app in a sandbox/VM
waiting for Electric Eel, which will switch to Docker and hopefully resolve such problems

Karmalakas · August 12, 2024, 9:25am

I’m still very new to Linux and TrueNAS (just a couple of weeks playing), so VMs might be a bit too much for now. Also, if I understand correctly, each VM would require some RAM allocation, which I don’t have much of: only 32GB (old MoBo, which used to be my PC, and that’s the max it supports). I guess I’ll wait.

Farout · August 12, 2024, 9:30am

Sandboxes are more resource friendly.

WereCatf · August 12, 2024, 9:32am

To be quite frank, you don’t need that much RAM for a VM that’s just running Plex and Jellyfin. 4GiB would be fine.

LarsR · August 12, 2024, 9:41am

TLDR: there’s a problem with the Kubernetes NVIDIA pod getting stuck in a crash loop. Since Kubernetes will be removed in about 2 months and there’s apparently no fix yet in Kubernetes, iX won’t fix it (and can’t until upstream fixes it), because it’ll get removed anyway.

Had the same problem with the built-in apps system, but never again since I moved to a jail with plain Docker.