Yet another GPU passthrough - TrueNAS SCALE

Hello,

I have been reading and trying for several days already, and it has been a real pain to get my GPU working and passed through to an Emby app. Still failing…

TrueNAS SCALE - Dragonfish-24.04.2
GPU:

admin@truenas[~]$ nvidia-smi
Tue Jul 16 22:56:53 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060 Ti     Off | 00000000:04:00.0 Off |                  N/A |
| 30%   47C    P0              44W / 200W |      1MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

The nvidia-smi output above is from the System Shell.

However, when I go to the apps and try to add the GPU, there is no option available.


I have tried many of the things suggested here and there. No luck, and I really do not know how to troubleshoot this.

Any advice is appreciated.

Thank you

Check with sudo k3s kubectl get pods -A whether the NVIDIA device plugin pod is stuck in a crash loop.
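
For example, something along these lines should show whether it keeps restarting and why (the daemonset pod name suffix will be different on your system):

sudo k3s kubectl get pods -n kube-system | grep nvidia
sudo k3s kubectl logs -n kube-system nvidia-device-plugin-daemonset-xxxxx --previous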

Yes, I think that is the case. However, I do not understand what this error means:

admin@truenas[~]$ sudo k3s kubectl describe pod  nvidia-device-plugin-daemonset-j9z8l -n kube-system
Name:                 nvidia-device-plugin-daemonset-j9z8l
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Runtime Class Name:   nvidia
Service Account:      default
Node:                 ix-truenas/10.11.12.115
Start Time:           Tue, 16 Jul 2024 10:11:46 -0700
Labels:               controller-revision-hash=959889769
                      name=nvidia-device-plugin-ds
                      pod-template-generation=1
Annotations:          k8s.v1.cni.cncf.io/network-status:
                        [{
                            "name": "ix-net",
                            "interface": "eth0",
                            "ips": [
                                "172.16.0.242"
                            ],
                            "mac": "2a:11:98:97:be:4c",
                            "default": true,
                            "dns": {},
                            "gateway": [
                                "172.16.0.1"
                            ]
                        }]
                      scheduler.alpha.kubernetes.io/critical-pod: 
Status:               Running
IP:                   172.16.0.242
IPs:
  IP:           172.16.0.242
Controlled By:  DaemonSet/nvidia-device-plugin-daemonset
Containers:
  nvidia-device-plugin-ctr:
    Container ID:  containerd://42fc19bbf8a4bbe821ca3bfd2c16def1815911f1c1e9e204ae9a900203682d8e
    Image:         nvcr.io/nvidia/k8s-device-plugin:v0.13.0
    Image ID:      nvcr.io/nvidia/k8s-device-plugin@sha256:e8343db286ac349f213d7b84e65c0d559d6310e74446986a09b66b21913eef12
    Port:          <none>
    Host Port:     <none>
    Command:
      nvidia-device-plugin
      --config-file
      /etc/config/nvdefault.yaml
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       StartError
      Message:      failed to create containerd task: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to inject CDI devices: failed to inject devices: failed to stat CDI host device "/dev/nvidia-modeset": no such file or directory: unknown
      Exit Code:    128
      Started:      Wed, 31 Dec 1969 16:00:00 -0800
      Finished:     Tue, 16 Jul 2024 23:22:38 -0700
    Ready:          False
    Restart Count:  158
    Environment:    <none>
    Mounts:
      /etc/config from plugin-config (rw)
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7shdj (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:  
  plugin-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      nvidia-device-plugin-config
    Optional:  false
  kube-api-access-7shdj:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 CriticalAddonsOnly op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Normal   Created  33m (x153 over 13h)     kubelet  Created container nvidia-device-plugin-ctr
  Normal   Pulled   28m (x154 over 13h)     kubelet  Container image "nvcr.io/nvidia/k8s-device-plugin:v0.13.0" already present on machine
  Warning  BackOff  3m15s (x3603 over 13h)  kubelet  Back-off restarting failed container nvidia-device-plugin-ctr in pod nvidia-device-plugin-daemonset-j9z8l_kube-system(16e7eef9-6938-436b-8b1f-ed18240c386c)
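
If I read that StartError correctly, it is complaining that /dev/nvidia-modeset does not exist on the host. I am only guessing here, but checking the device nodes and loading the modeset module by hand would look something like this:

ls -l /dev/nvidia*
sudo nvidia-modprobe -m   # -m should load nvidia-modeset and create /dev/nvidia-modeset, if I read the man page right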

After 13 restarts it decided that the 14th was the lucky try, and it booted:

admin@truenas[~]$ sudo k3s kubectl get pods -A                                                                                                                       
[sudo] password for admin: 
NAMESPACE             NAME                                    READY   STATUS            RESTARTS       AGE
kube-system           snapshot-controller-546868dfb4-7cvs2    0/1     TaintToleration   0              15h
ix-filebrowser        filebrowser-5c5dcd4d7c-g8xxn            0/1     TaintToleration   0              15h
kube-system           csi-nfs-node-z42cx                      3/3     Running           24 (14h ago)   23h
kube-system           csi-smb-node-js4zv                      3/3     Running           24 (14h ago)   23h
kube-system           snapshot-controller-546868dfb4-fl8dd    0/1     TaintToleration   0              15h
kube-system           snapshot-controller-546868dfb4-tmwmr    1/1     Running           0              14h
kube-system           snapshot-controller-546868dfb4-nkqbr    1/1     Running           0              14h
kube-system           csi-smb-controller-7fbbb8fb6f-rlbvb     3/3     Running           0              14h
kube-system           coredns-59b4f5bbd5-h5s9k                1/1     Running           0              14h
kube-system           csi-nfs-controller-7b74694749-hkhkh     4/4     Running           0              14h
kube-system           metrics-server-68cf49699b-96d46         1/1     Running           0              14h
ix-filebrowser        filebrowser-5c5dcd4d7c-6hm2w            1/1     Running           0              14h
ix-netdata            netdata-7dbc6d5c8f-wjs6p                1/1     Running           0              14h
cattle-system         cattle-cluster-agent-54479c649c-v9fz8   1/1     Running           0              10m
cattle-fleet-system   fleet-agent-59bc645c4c-6tsl8            1/1     Running           0              10m
cattle-system         rancher-webhook-75f85cd586-mk9nc        1/1     Running           0              9m20s
cattle-system         helm-operation-q8ks7                    0/2     Completed         0              9m45s
cattle-system         helm-operation-wgwzr                    0/2     Completed         0              8m51s
ix-emby               emby-576f59949f-cfbmq                   1/1     Running           0              4m32s
kube-system           nvidia-device-plugin-daemonset-24x8n    1/1     Running           0              49s

When I select the GPU, the emby pod throws a similar error.

Status:   Waiting
Workload: emby (emby/embyserver:4.8.8.0), 4 restarts
Error:    CrashLoopBackOff (back-off 1m20s restarting failed container=emby pod=emby-7cd59bf49c-4pkb9_ix-emby(1bafaa0b-3a10-4381-a35c-70471b54a66c)) | Last state: Terminated with 128: StartError (failed to create containerd task: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to inject CDI devices: unresolvable CDI devices nvidia.com/gpu=, nvidia.com/gpu=, nvidia.com/gpu=, nvidia.com/gpu=, nvidia.com/gpu=GPU-bee22552-bb20-c65a-2d38-1fd43f5abc6c: unknown), started: Thu, Jan 1 1970 3:00:00 am, finished: Wed, Jul 17 2024 10:47:43 am
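
Is looking at the node's resources the right way to confirm whether the plugin actually registered the GPU? Something like this, I assume (node name taken from the describe output above):

sudo k3s kubectl describe node ix-truenas | grep -A 7 -i allocatable

I would expect nvidia.com/gpu to show up under Allocatable if everything were working.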

If I remember correctly, you can only allocate 1 GPU per pod, to at most 5 pods in total.
Try setting “Allocate nvidia.com/gpu GPU” to 1 instead of 5.
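
To double-check what the GUI actually applied, you could also dump the rendered deployment and look for the GPU limit (I am guessing the deployment is simply called emby in the ix-emby namespace, based on your pod list):

sudo k3s kubectl get deploy emby -n ix-emby -o yaml | grep -B 2 -A 2 nvidia.com/gpu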

I get the same error no matter which value I choose from the dropdown. I think something is messed up in the TrueNAS k8s setup.
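
I suppose I could also try forcing the device plugin pod to be recreated and see whether the GPU registers cleanly afterwards (the DaemonSet should spin up a replacement on its own):

sudo k3s kubectl delete pod -n kube-system nvidia-device-plugin-daemonset-24x8n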