mynkow
July 17, 2024, 6:01am
1
Hello,
I have been reading and trying for several days already, and it has been a real pain to get my GPU working and passed through to an Emby app. Still failing…
TrueNAS SCALE - Dragonfish-24.04.2
GPU:
admin@truenas[~]$ nvidia-smi
Tue Jul 16 22:56:53 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 Ti Off | 00000000:04:00.0 Off | N/A |
| 30% 47C P0 44W / 200W | 1MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
The nvidia-smi output above is from the System Shell.
However, when I go to the Apps section and try to add the GPU, there is no option available.
I have tried many things suggested here and there. No luck, and I really do not know how to troubleshoot this.
Any advice is appreciated.
Thank you
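In case it is useful for troubleshooting, a way to check whether Kubernetes itself sees the GPU as an allocatable resource (a sketch; as far as I understand, the Apps GPU option only appears once the NVIDIA device plugin has registered the card with the node):
# shows nvidia.com/gpu under Capacity/Allocatable if the device plugin is healthy
sudo k3s kubectl describe node | grep -i "nvidia.com/gpu"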
LarsR
July 17, 2024, 6:09am
2
Check with sudo k3s kubectl get pods -A whether the NVIDIA device plugin pod is stuck in a crash loop.
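To look at just that pod and see why it is failing, something along these lines should work (a sketch; the pod name suffix differs per system, so a label selector is easier):
sudo k3s kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds
sudo k3s kubectl describe pod -n kube-system -l name=nvidia-device-plugin-ds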
mynkow
July 17, 2024, 6:26am
3
Yes, I think that is the case. However, I do not understand what this error means:
admin@truenas[~]$ sudo k3s kubectl describe pod nvidia-device-plugin-daemonset-j9z8l -n kube-system
Name: nvidia-device-plugin-daemonset-j9z8l
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Runtime Class Name: nvidia
Service Account: default
Node: ix-truenas/10.11.12.115
Start Time: Tue, 16 Jul 2024 10:11:46 -0700
Labels: controller-revision-hash=959889769
name=nvidia-device-plugin-ds
pod-template-generation=1
Annotations: k8s.v1.cni.cncf.io/network-status:
[{
"name": "ix-net",
"interface": "eth0",
"ips": [
"172.16.0.242"
],
"mac": "2a:11:98:97:be:4c",
"default": true,
"dns": {},
"gateway": [
"172.16.0.1"
]
}]
scheduler.alpha.kubernetes.io/critical-pod:
Status: Running
IP: 172.16.0.242
IPs:
IP: 172.16.0.242
Controlled By: DaemonSet/nvidia-device-plugin-daemonset
Containers:
nvidia-device-plugin-ctr:
Container ID: containerd://42fc19bbf8a4bbe821ca3bfd2c16def1815911f1c1e9e204ae9a900203682d8e
Image: nvcr.io/nvidia/k8s-device-plugin:v0.13.0
Image ID: nvcr.io/nvidia/k8s-device-plugin@sha256:e8343db286ac349f213d7b84e65c0d559d6310e74446986a09b66b21913eef12
Port: <none>
Host Port: <none>
Command:
nvidia-device-plugin
--config-file
/etc/config/nvdefault.yaml
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: StartError
Message: failed to create containerd task: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to inject CDI devices: failed to inject devices: failed to stat CDI host device "/dev/nvidia-modeset": no such file or directory: unknown
Exit Code: 128
Started: Wed, 31 Dec 1969 16:00:00 -0800
Finished: Tue, 16 Jul 2024 23:22:38 -0700
Ready: False
Restart Count: 158
Environment: <none>
Mounts:
/etc/config from plugin-config (rw)
/var/lib/kubelet/device-plugins from device-plugin (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7shdj (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
device-plugin:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/device-plugins
HostPathType:
plugin-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: nvidia-device-plugin-config
Optional: false
kube-api-access-7shdj:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: CriticalAddonsOnly op=Exists
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Created 33m (x153 over 13h) kubelet Created container nvidia-device-plugin-ctr
Normal Pulled 28m (x154 over 13h) kubelet Container image "nvcr.io/nvidia/k8s-device-plugin:v0.13.0" already present on machine
Warning BackOff 3m15s (x3603 over 13h) kubelet Back-off restarting failed container nvidia-device-plugin-ctr in pod nvidia-device-plugin-daemonset-j9z8l_kube-system(16e7eef9-6938-436b-8b1f-ed18240c386c)
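If I read the StartError correctly, the runtime cannot find /dev/nvidia-modeset on the host, so the CDI device injection fails before the container even starts. A way to check for the device node and, assuming nvidia-modprobe is present on SCALE, to create it (a sketch):
ls -l /dev/nvidia*
# --modeset (-m) loads the nvidia-modeset kernel module and creates /dev/nvidia-modeset
sudo nvidia-modprobe -m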
mynkow
July 17, 2024, 7:48am
4
After 13 restarts it decided that the 14th was the lucky try, and it booted:
admin@truenas[~]$ sudo k3s kubectl get pods -A
[sudo] password for admin:
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system snapshot-controller-546868dfb4-7cvs2 0/1 TaintToleration 0 15h
ix-filebrowser filebrowser-5c5dcd4d7c-g8xxn 0/1 TaintToleration 0 15h
kube-system csi-nfs-node-z42cx 3/3 Running 24 (14h ago) 23h
kube-system csi-smb-node-js4zv 3/3 Running 24 (14h ago) 23h
kube-system snapshot-controller-546868dfb4-fl8dd 0/1 TaintToleration 0 15h
kube-system snapshot-controller-546868dfb4-tmwmr 1/1 Running 0 14h
kube-system snapshot-controller-546868dfb4-nkqbr 1/1 Running 0 14h
kube-system csi-smb-controller-7fbbb8fb6f-rlbvb 3/3 Running 0 14h
kube-system coredns-59b4f5bbd5-h5s9k 1/1 Running 0 14h
kube-system csi-nfs-controller-7b74694749-hkhkh 4/4 Running 0 14h
kube-system metrics-server-68cf49699b-96d46 1/1 Running 0 14h
ix-filebrowser filebrowser-5c5dcd4d7c-6hm2w 1/1 Running 0 14h
ix-netdata netdata-7dbc6d5c8f-wjs6p 1/1 Running 0 14h
cattle-system cattle-cluster-agent-54479c649c-v9fz8 1/1 Running 0 10m
cattle-fleet-system fleet-agent-59bc645c4c-6tsl8 1/1 Running 0 10m
cattle-system rancher-webhook-75f85cd586-mk9nc 1/1 Running 0 9m20s
cattle-system helm-operation-q8ks7 0/2 Completed 0 9m45s
cattle-system helm-operation-wgwzr 0/2 Completed 0 8m51s
ix-emby emby-576f59949f-cfbmq 1/1 Running 0 4m32s
kube-system nvidia-device-plugin-daemonset-24x8n 1/1 Running 0 49s
When I select the GPU, the Emby pod throws a similar error.
Waiting
emby emby/embyserver:4.8.8.0 4 -
CrashLoopBackOff (back-off 1m20s restarting failed container=emby pod=emby-7cd59bf49c-4pkb9_ix-emby(1bafaa0b-3a10-4381-a35c-70471b54a66c)) | Last state: Terminated with 128: StartError (failed to create containerd task: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to inject CDI devices: unresolvable CDI devices nvidia.com/gpu=, nvidia.com/gpu=, nvidia.com/gpu=, nvidia.com/gpu=, nvidia.com/gpu=GPU-bee22552-bb20-c65a-2d38-1fd43f5abc6c: unknown), started: Thu, Jan 1 1970 3:00:00 am, finished: Wed, Jul 17 2024 10:47:43 am
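The unresolvable devices in that message are four empty nvidia.com/gpu= entries plus one UUID. A quick sanity check that the UUID in the error matches the card the driver actually reports (a sketch):
# lists each GPU with its UUID; compare against the GPU-... UUID in the error above
nvidia-smi -L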
LarsR
July 17, 2024, 8:11am
5
If I remember correctly, you can only allocate 1 GPU per pod, and to at most 5 pods in total.
Try setting “Allocate nvidia.com/gpu GPU” to 1 instead of 5.
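To confirm what the Emby pod actually ends up requesting, the GPU limit should be visible in the rendered pod spec (a sketch; namespace taken from the get pods output above):
sudo k3s kubectl get pod -n ix-emby -o yaml | grep -B 2 -A 2 "nvidia.com/gpu"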
mynkow
July 17, 2024, 9:41am
6
I get the same error no matter which value I choose from the dropdown. I think something is messed up in the TrueNAS k8s setup.
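In case it helps narrow things down: as far as I understand, the runtime resolves those nvidia.com/gpu=... names against CDI spec files, which normally live in /etc/cdi and /var/run/cdi. A sketch of how to see what is actually registered (nvidia-ctk ships with the NVIDIA Container Toolkit and may or may not be present on the SCALE image):
ls -l /etc/cdi /var/run/cdi
# if the toolkit CLI is available, this prints every resolvable CDI device name
nvidia-ctk cdi list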