GPU not working

TrueNAS SCALE, with an NVIDIA 3050. This was working with the Plex app. After updating the Plex app, which was really glitchy, I had to unassign the NVIDIA GPU to let it upgrade, otherwise it just said there was no GPU. Then I rebooted the system, and now nvidia-smi no longer sees the 3050. lspci shows it as normal. Everything I read online keeps telling me not to do anything in the CLI and that the driver should just work on its own.

What do I do to get the GPU working again? Nothing has changed.
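
For reference, this is what I'm checking from a shell (these are just the generic NVIDIA/Linux commands, not necessarily the sanctioned TrueNAS way):

  lspci | grep -i nvidia     # the 3050 does show up here
  lsmod | grep nvidia        # is the driver module even loaded?
  nvidia-smi                 # this is what no longer sees the card
  dmesg | grep -i nvrm       # kernel messages from the NVIDIA driver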

Missing basic info like hardware and software versions.

Which version of Plex App?

Found a similar ticket:

Description

This is for getting the NVIDIA GPU working in the Plex container on 24.04.1.1.

On the “GPU Resource” dropdown, the only option is “Allocate 0 nvidia.com/gpu GPU”. I know the GPU is active via lspci and nvidia-smi. According to the TrueNAS website, I should ensure my GPU is unallocated (in the settings), which I have done, and I have rebooted the server.

SOLUTION

It appears that on a fresh install of Plex the dropdown options are there.

However, if the application is already installed and you try to allocate an additional GPU, the dropdown doesn't work.

The dropdown was working. Now all of a sudden it's not, and my issue is also different because now nvidia-smi isn't working when it was before. Sometimes the dropdown does work, and then it fails to schedule forever because it says no nodes have a GPU.
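
When it gets stuck saying no nodes have a GPU, one sanity check is to ask the built-in k3s what the node actually advertises (this assumes the bundled k3s binary; it may not be the officially supported way to poke at it):

  sudo k3s kubectl describe node | grep -i "nvidia.com/gpu"
  # should show nvidia.com/gpu: 1 under both Capacity and Allocatable when the device plugin is healthy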

IDK why, but TrueNAS is becoming entirely unresponsive; I cannot access it from the GUI and I have to keep rebooting it. Looks like I am going to have to do that again before I can get the Plex version, but it should be the latest Plex Pass image.

Dragonfish-24.04.1.1
Supermicro SYS-4028GR-TRT 4U
2X Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz
440.8GiB total available (ECC)
LSI 9200-8E
Supermicro 45 Bay JBOD Expansion Server Shelf 847E16-RJBOD1
8× eBay WL 22TB OEM Enterprise SATA 7200 RPM HDDs (comparable to ST22000NM001E, so essentially 22TB Seagate Exos drives) and 9× 8TB drives, all configured in one ZFS pool.

App Version: 1.40.3.8555

Chart Version: 2.0.7

As fiddly as this is, rebooting one more time has gotten the GPU redetected and working. Something is really glitchy here. Even nvidia-smi is working. I must have rebooted this thing 3 times in the last day, so I cannot tell you why it worked this time…

Also, I will point out I only have 1 GPU, but it will let me select up to 5 for some reason.

Clearly there is an instability somewhere.

There's a stack of software involved in making these Apps work:
Hardware & BIOS
Linux Kernel
NVIDIA Driver
Kubernetes
TrueNAS Middleware
App

They all gradually improve with bug-fixes and updates. The fixes are easier and faster if the problem reproduces easily on multiple systems.

Glad it’s sorted, but I hope the next updates make it more robust.

I had some issues with my 3050 - for me the fix was a clean install of Scale & re-uploading the config afterwards. Been stable since.

That's great, but the official solution cannot be to keep reinstalling and rebooting it.

I can only reinstall and reboot so many times. What is this, Windows?

I mean - community forums, man. I'm not officially related to iX in any way short of using their products.

For an official response I'd recommend filling out a bug report; though you did get Captain responding, saying that they hope the next update is better.

Mhm, may the force be with us. I wouldn't hold my breath until I see a Jira ticket. I'll file a bug report, but I doubt "this thing has issues" is going to get much traction in the next sprint planning.


Any advice? My system keeps crapping out and going into zombie mode, and I need to reboot it. The logs are FULL of GPU errors; the whole time the system is running, k3s is complaining about the GPU. In the meantime I am going to try disabling the GPU to verify whether this solves the crashing.
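
In case it helps anyone reproduce this, here is roughly how I'm pulling those errors out (just the commands I'm using, not an official procedure):

  dmesg | grep -i nvrm | tail -n 20                          # kernel-side NVIDIA errors
  journalctl -u k3s --since "1 hour ago" | grep -i gpu       # k3s complaints about the GPU
  sudo k3s kubectl -n kube-system get pods | grep -i nvidia  # state of the NVIDIA device plugin pod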

Jun 24 18:37:08 truenas k3s[11257]: I0624 18:37:08.545040 11257 scope.go:115] "RemoveContainer" containerID="58d183ff47f78fbe4231de08adca9b4eee27c7dfd8c811304546ac1a2b739240"
Jun 24 18:37:13 truenas kernel: NVRM: GPU 0000:08:00.0: RmInitAdapter failed! (0x23:0x65:1426)
Jun 24 18:37:13 truenas kernel: NVRM: GPU 0000:08:00.0: rm_init_adapter failed, device minor number 0
Jun 24 18:37:15 truenas k3s[11257]: {"level":"warn","ts":"2024-06-24T18:37:15.753-0400","logger":"etcd-client","caller":"v3@v3.5.7-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed",>
Jun 24 18:37:17 truenas kernel: NVRM: GPU 0000:08:00.0: RmInitAdapter failed! (0x23:0x65:1426)
Jun 24 18:37:17 truenas kernel: NVRM: GPU 0000:08:00.0: rm_init_adapter failed, device minor number 0
Jun 24 18:37:18 truenas k3s[11257]: I0624 18:37:18.246498 11257 scope.go:115] "RemoveContainer" containerID="58d183ff47f78fbe4231de08adca9b4eee27c7dfd8c811304546ac1a2b739240"
Jun 24 18:37:21 truenas kernel: NVRM: GPU 0000:08:00.0: RmInitAdapter failed! (0x23:0x65:1426)
Jun 24 18:37:21 truenas kernel: NVRM: GPU 0000:08:00.0: rm_init_adapter failed, device minor number 0
Jun 24 18:37:25 truenas kernel: NVRM: GPU 0000:08:00.0: RmInitAdapter failed! (0x23:0x65:1426)
Jun 24 18:37:25 truenas kernel: NVRM: GPU 0000:08:00.0: rm_init_adapter failed, device minor number 0
Jun 24 18:37:29 truenas kernel: NVRM: GPU 0000:08:00.0: RmInitAdapter failed! (0x23:0x65:1426)
Jun 24 18:37:29 truenas kernel: NVRM: GPU 0000:08:00.0: rm_init_adapter failed, device minor number 0
Jun 24 18:37:32 truenas k3s[11257]: W0624 18:37:32.593994 11257 manager.go:1174] Failed to process watch event {EventType:0 Name:/system.slice/kubepods-besteffort-pod9aac1ba3_4f42_4cbd_a54b_dfa2f5a66a>
Jun 24 18:37:33 truenas kernel: NVRM: GPU 0000:08:00.0: RmInitAdapter failed! (0x23:0x65:1426)
Jun 24 18:37:33 truenas kernel: NVRM: GPU 0000:08:00.0: rm_init_adapter failed, device minor number 0
Jun 24 18:37:46 truenas k3s[11257]: {"level":"warn","ts":"2024-06-24T18:37:46.914-0400","logger":"etcd-cli
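
For what it's worth, the usual suggestion for those RmInitAdapter failures, short of a full reboot, is reloading the NVIDIA kernel modules; this is a sketch only, and it assumes nothing is currently using the GPU (apps/pods holding it stopped first):

  sudo modprobe -r nvidia_uvm nvidia_drm nvidia_modeset nvidia   # unload; fails if anything still holds the device
  sudo modprobe nvidia nvidia_modeset nvidia_drm nvidia_uvm      # reload
  nvidia-smi                                                     # see whether the card comes back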

Jul 16 09:06:31 truenas k3s[11254]: time="2024-07-16T09:06:31-04:00" level=info msg="COMPACT deleted 332 rows from 339 revisions in 25.241735ms - compacted to 3273169/3274169"
Jul 16 09:06:38 truenas k3s[11254]: I0716 09:06:38.161433 11254 scope.go:115] "RemoveContainer" containerID="c49c4e5addf1388633df311e3349642163dc86a4572f54de37c8ef03bd40735e"
Jul 16 09:06:38 truenas k3s[11254]: E0716 09:06:38.161965 11254 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"nvidia-device-plugin-ctr\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=nvidia-device-plugin-ctr pod=nvidia-device-plugin-daemonset-msvff_kube-system(c3cecb55-b447-4bf1-a00b-a5bfe8fb0318)\"" pod="kube-system/nvidia-device-plugin-daemonset-msvff" podUID=c3cecb55-b447-4bf1-a00b-a5bfe8fb0318
Jul 16 09:06:42 truenas k3s[11254]: {"level":"warn","ts":"2024-07-16T09:06:42.260-0400","logger":"etcd-client","caller":"v3@v3.5.7-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000a4e380/kine.sock","attempt":0,"error":"rpc error: code = Unknown desc = no such table: dbstat"}
Jul 16 09:06:52 truenas k3s[11254]: I0716 09:06:52.170424 11254 scope.go:115] "RemoveContainer" containerID="c49c4e5addf1388633df311e3349642163dc86a4572f54de37c8ef03bd40735e"
Jul 16 09:06:52 truenas k3s[11254]: E0716 09:06:52.170961 11254 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"nvidia-device-plugin-ctr\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=nvidia-device-plugin-ctr pod=nvidia-device-plugin-daemonset-msvff_kube-system(c3cecb55-b447-4bf1-a00b-a5bfe8fb0318)\"" pod="kube-system/nvidia-device-plugin-daemonset-msvff" podUID=c3cecb55-b447-4bf1-a00b-a5bfe8fb0318
Jul 16 09:07:05 truenas k3s[11254]: I0716 09:07:05.160538 11254 scope.go:115] "RemoveContainer" containerID="c49c4e5addf1388633df311e3349642163dc86a4572f54de37c8ef03bd40735e"
Jul 16 09:07:05 truenas k3s[11254]: E0716 09:07:05.161089 11254 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"nvidia-device-plugin-ctr\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=nvidia-device-plugin-ctr pod=nvidia-device-plugin-daemonset-msvff_kube-system(c3cecb55-b447-4bf1-a00b-a5bfe8fb0318)\"" pod="kube-system/nvidia-device-plugin-daemonset-msvff" podUID=c3cecb55-b447-4bf1-a00b-a5bfe8fb0318
Jul 16 09:07:17 truenas k3s[11254]: {"level":"warn","ts":"2024-07-16T09:07:17.742-0400","logger":"etcd-client","caller":"v3@v3.5.7-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000a4e380/kine.sock","attempt":0,"error":"rpc error: code = Unknown desc = no such table: dbstat"}
Jul 16 09:07:20 truenas k3s[11254]: I0716 09:07:20.161284 11254 scope.go:115] "RemoveContainer" containerID="c49c4e5addf1388633df311e3349642163dc86a4572f54de37c8ef03bd40735e"
Jul 16 09:07:20 truenas k3s[11254]: E0716 09:07:20.161798 11254 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"nvidia-device-plugin-ctr\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=nvidia-device-plugin-ctr pod=nvidia-device-plugin-daemonset-msvff_kube-system(c3cecb55-b447-4bf1-a00b-a5bfe8fb0318)\"" pod="kube-system/nvidia-device-plugin-daemonset-msvff" podUID=c3cecb55-b447-4bf1-a00b-a5bfe8fb0318
Jul 16 09:07:31 truenas k3s[11254]: I0716 09:07:31.161281 11254 scope.go:115] "RemoveContainer" containerID="c49c4e5addf1388633df311e3349642163dc86a4572f54de37c8ef03bd40735e"
Jul 16 09:07:31 truenas k3s[11254]: E0716 09:07:31.161780 11254 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"nvidia-device-plugin-ctr\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=nvidia-device-plugin-ctr pod=nvidia-device-plugin-daemonset-msvff_kube-system(c3cecb55-b447-4bf1-a00b-a5bfe8fb0318)\"" pod="kube-system/nvidia-device-plugin-daemonset-msvff" podUID=c3cecb55-b447-4bf1-a00b-a5bfe8fb0318
Jul 16 09:07:45 truenas k3s[11254]: I0716 09:07:45.169555 11254 scope.go:115] "RemoveContainer" containerID="c49c4e5addf1388633df311e3349642163dc86a4572f54de37c8ef03bd40735e"
Jul 16 09:07:45 truenas k3s[11254]: E0716 09:07:45.170032 11254 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"nvidia-device-plugin-ctr\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=nvidia-device-plugin-ctr pod=nvidia-device-plugin-daemonset-msvff_kube-system(c3cecb55-b447-4bf1-a00b-a5bfe8fb0318)\"" pod="kube-system/nvidia-device-plugin-daemonset-msvff" podUID=c3cecb55-b447-4bf1-a00b-a5bfe8fb0318
Jul 16 09:16:22 truenas k3s[11254]: {"level":"warn","ts":"2024-07-16T09:16:22.948-0400","logger":"etcd-client","caller":"v3@v3.5.7-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000a4e380/kine.sock","attempt":0,"error":"rpc error: code = Unknown desc = no such table: dbstat"}
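
The part that looks most telling in there is the nvidia-device-plugin-ctr container crash-looping. To see why it keeps dying (the daemonset and pod names below are lifted from the log lines above, so adjust if yours differ):

  sudo k3s kubectl -n kube-system logs daemonset/nvidia-device-plugin-daemonset --tail=50
  sudo k3s kubectl -n kube-system describe pod nvidia-device-plugin-daemonset-msvff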

I know this is not a direct solution, but if you want a hassle-free life where YOU have control over app upgrades etc.:

  • run Jellyfin/Plex etc. in a VM or in a sandbox
  • wait for TrueNAS Electric Eel and run it with Docker (rough sketch below)
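
For the Docker route, the usual shape is roughly this (a sketch only; it assumes the NVIDIA container runtime is configured, and the image name, port, and paths are just examples):

  docker run -d --name plex \
    --gpus all \
    -e NVIDIA_DRIVER_CAPABILITIES=all \
    -p 32400:32400 \
    -v /mnt/pool/plex/config:/config \
    -v /mnt/pool/media:/data \
    plexinc/pms-docker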

Sigh, this is the solution I was trying to avoid, but if the GPU support is gimped I will do it…

The issue is that even without assigning the GPU to the container, the system is still throwing GPU errors in the logs…