Hi. I’ve just built a NAS with Dragonfish-24.04.1.1 and most things work. The bit I’m having trouble with is my GPU (NVIDIA P2000) - it is not available as a resource to pick for apps like Plex.
It sounds awfully like this post - I have no VMs, just the TrueNAS official Plex app installed.
I ran through the commands in the above post. Interestingly, I can’t run k3s kubectl show pods -n kube-system even as sudo - I get told that show isn’t an available command. However, the hardware definitely shows up and is available for isolation if I wanted that.
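(Presumably the working equivalent is the get form used elsewhere in this thread, since kubectl has no show subcommand:

sudo k3s kubectl get pods -n kube-system
)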
The k3s thing - I personally do Docker and use Amazon ECS for enterprise workloads; I don’t do K8s as a rule (far too overcomplicated, in my personal view).
Anyway - what am I missing here? Or is this a known bug at present?
Have you checked the output of nvidia-smi from the TrueNAS shell? If it says the card could not communicate with the driver, then your card is not supported by the NVIDIA driver SCALE uses.
Other than that, you could check with k3s kubectl get pods -A whether the nvidia system pod is stuck in a crash loop.
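Something along these lines covers both checks (the grep is only there to cut down the output):

nvidia-smi
sudo k3s kubectl get pods -A | grep -i nvidia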
Clearly a container handling the nvidia driver (?) is repeatedly crashing, as others are reporting.
Log file:
truenas% sudo cat 43.log
2024-07-04T14:10:24.134548872+01:00 stderr F 2024/07/04 13:10:24 Starting FS watcher.
2024-07-04T14:10:24.134588677+01:00 stderr F 2024/07/04 13:10:24 Starting OS watcher.
2024-07-04T14:10:24.134790434+01:00 stderr F 2024/07/04 13:10:24 Starting Plugins.
2024-07-04T14:10:24.134800283+01:00 stderr F 2024/07/04 13:10:24 Loading configuration.
2024-07-04T14:10:24.135017028+01:00 stderr F 2024/07/04 13:10:24 Updating config with default resource matching patterns.
2024-07-04T14:10:24.135037987+01:00 stderr F 2024/07/04 13:10:24
2024-07-04T14:10:24.135056232+01:00 stderr F Running with config:
2024-07-04T14:10:24.135066391+01:00 stderr F {
2024-07-04T14:10:24.135076219+01:00 stderr F "version": "v1",
2024-07-04T14:10:24.135087029+01:00 stderr F "flags": {
2024-07-04T14:10:24.135096196+01:00 stderr F "migStrategy": "none",
2024-07-04T14:10:24.135105043+01:00 stderr F "failOnInitError": true,
2024-07-04T14:10:24.135113879+01:00 stderr F "nvidiaDriverRoot": "/",
2024-07-04T14:10:24.135123808+01:00 stderr F "gdsEnabled": false,
2024-07-04T14:10:24.135138876+01:00 stderr F "mofedEnabled": false,
2024-07-04T14:10:24.135149125+01:00 stderr F "plugin": {
2024-07-04T14:10:24.135158112+01:00 stderr F "passDeviceSpecs": false,
2024-07-04T14:10:24.135167009+01:00 stderr F "deviceListStrategy": "envvar",
2024-07-04T14:10:24.135175845+01:00 stderr F "deviceIDStrategy": "uuid"
2024-07-04T14:10:24.135184622+01:00 stderr F }
2024-07-04T14:10:24.135193508+01:00 stderr F },
2024-07-04T14:10:24.135202285+01:00 stderr F "resources": {
2024-07-04T14:10:24.135217884+01:00 stderr F "gpus": [
2024-07-04T14:10:24.135227141+01:00 stderr F {
2024-07-04T14:10:24.135235968+01:00 stderr F "pattern": "*",
2024-07-04T14:10:24.135244774+01:00 stderr F "name": "nvidia.com/gpu"
2024-07-04T14:10:24.135253881+01:00 stderr F }
2024-07-04T14:10:24.135262708+01:00 stderr F ]
2024-07-04T14:10:24.135271574+01:00 stderr F },
2024-07-04T14:10:24.135280361+01:00 stderr F "sharing": {
2024-07-04T14:10:24.13529585+01:00 stderr F "timeSlicing": {
2024-07-04T14:10:24.135304877+01:00 stderr F "resources": [
2024-07-04T14:10:24.135313713+01:00 stderr F {
2024-07-04T14:10:24.13532255+01:00 stderr F "name": "nvidia.com/gpu",
2024-07-04T14:10:24.135331366+01:00 stderr F "devices": "all",
2024-07-04T14:10:24.135340163+01:00 stderr F "replicas": 5
2024-07-04T14:10:24.135348959+01:00 stderr F }
2024-07-04T14:10:24.135357746+01:00 stderr F ]
2024-07-04T14:10:24.135374357+01:00 stderr F }
2024-07-04T14:10:24.135383364+01:00 stderr F }
2024-07-04T14:10:24.1353922+01:00 stderr F }
2024-07-04T14:10:24.135401107+01:00 stderr F 2024/07/04 13:10:24 Retreiving plugins.
2024-07-04T14:10:24.135415985+01:00 stderr F 2024/07/04 13:10:24 Detected NVML platform: found NVML library
2024-07-04T14:10:24.135431534+01:00 stderr F 2024/07/04 13:10:24 Detected non-Tegra platform: /sys/devices/soc0/family file not found
2024-07-04T14:10:24.672101541+01:00 stderr F 2024/07/04 13:10:24 Starting GRPC server for 'nvidia.com/gpu'
2024-07-04T14:10:24.672216446+01:00 stderr F 2024/07/04 13:10:24 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2024-07-04T14:10:24.673629289+01:00 stderr F 2024/07/04 13:10:24 Registered device plugin for 'nvidia.com/gpu' with Kubelet
2024-07-04T14:11:40.828161212+01:00 stderr F 2024/07/04 13:11:40 Received signal "terminated", shutting down.
2024-07-04T14:11:40.82820824+01:00 stderr F 2024/07/04 13:11:40 Stopping plugins.
2024-07-04T14:11:40.828235941+01:00 stderr F 2024/07/04 13:11:40 Stopping to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
When I was still using the built-in apps system I also had the nvidia pod stuck in a crash loop. If I remember correctly, my fix was to disable GPU support for k3s in advanced options, reboot, wait a couple of minutes, re-enable GPU support for k3s, check with k3s kubectl get pods -A whether the pod started correctly, and if so assign the GPU to my apps.
What’s the open/active Jira ticket for this? I see a few that seem to indicate it may be an upstream Kubernetes issue, but I’m not sure which one is being referenced or tracked here.
@Airekris tagging you here as I recall you from the other thread - if you have one as well please let me know.
where you have to replace the XXXXX with your own unique ID obtained from the sudo k3s kubectl get pods -A command (so @James_Green you would use k58cb there)
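i.e. something along the lines of:

sudo k3s kubectl logs -p pod/nvidia-device-plugin-daemonset-XXXXX --namespace=kube-system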
Can you please attach them to the ticket? System debugs (System Settings → Advanced → Save Debug) can also be uploaded through the Private File Upload link in the comments of the ticket.
FWIW I’ve added a ticket, but I can equally see others having filed what appear to be matching tickets. I’d love to figure out now whether this is a TrueNAS issue, or an upstream matter.
@HoneyBadger I can, it appears to be the same as the log I already posted:
truenas% sudo k3s kubectl logs -p pod/nvidia-device-plugin-daemonset-k58cb --namespace=kube-system
2024/07/04 20:22:39 Starting FS watcher.
2024/07/04 20:22:39 Starting OS watcher.
2024/07/04 20:22:39 Starting Plugins.
2024/07/04 20:22:39 Loading configuration.
2024/07/04 20:22:39 Updating config with default resource matching patterns.
2024/07/04 20:22:39
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": true,
"nvidiaDriverRoot": "/",
"gdsEnabled": false,
"mofedEnabled": false,
"plugin": {
"passDeviceSpecs": false,
"deviceListStrategy": "envvar",
"deviceIDStrategy": "uuid"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {
"resources": [
{
"name": "nvidia.com/gpu",
"devices": "all",
"replicas": 5
}
]
}
}
}
2024/07/04 20:22:39 Retreiving plugins.
2024/07/04 20:22:39 Detected NVML platform: found NVML library
2024/07/04 20:22:39 Detected non-Tegra platform: /sys/devices/soc0/family file not found
2024/07/04 20:22:39 Starting GRPC server for 'nvidia.com/gpu'
2024/07/04 20:22:39 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2024/07/04 20:22:39 Registered device plugin for 'nvidia.com/gpu' with Kubelet
2024/07/04 20:23:39 Received signal "terminated", shutting down.
2024/07/04 20:23:39 Stopping plugins.
2024/07/04 20:23:39 Stopping to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
truenas%
My system restarted (seemingly by itself) this morning, which gave me an opportunity to check the logs from startup. I’m unclear whether these all relate to the problem:
Jul 05 10:22:34 truenas k3s[5268]: I0705 10:22:34.395214 5268 server.go:144] "Got registration request from device plugin with resource" resourceName="nvidia.com/gpu"
Jul 05 10:22:34 truenas k3s[5268]: E0705 10:22:34.410783 5268 client.go:88] "ListAndWatch ended unexpectedly for device plugin" err="rpc error: code = Unavailable desc = error reading from server: EOF" resource="nvidia.com/gpu"
Jul 05 10:22:35 truenas k3s[5268]: I0705 10:22:35.874063 5268 kubelet_volumes.go:160] "Cleaned up orphaned pod volumes dir" podUID=17910588-9606-49ba-8d91-0aa91afd0022 path="/var/lib/kubelet/pods/17910588-9606-49ba-8d91-0aa91afd0022/volumes"
Jul 05 10:22:38 truenas k3s[5268]: I0705 10:22:38.180621 5268 pod_container_deletor.go:80] "Container not found in pod's containers" containerID="7f3cdecd3c4fafcfd2193365411d76cc2f47fd828b07c0805668693d1a4c7a6a"
Jul 05 10:22:41 truenas k3s[5268]: I0705 10:22:41.906378 5268 server.go:144] "Got registration request from device plugin with resource" resourceName="nvidia.com/gpu"
Jul 05 10:22:42 truenas k3s[5268]: E0705 10:22:42.021335 5268 remote_runtime.go:479] "ExecSync cmd from runtime service failed" err="rpc error: code = NotFound desc = failed to exec in container: failed to load task: no running task found: not found" containerID="264>
Jul 05 10:22:42 truenas k3s[5268]: E0705 10:22:42.022290 5268 remote_runtime.go:479] "ExecSync cmd from runtime service failed" err="rpc error: code = NotFound desc = failed to exec in container: failed to load task: no running task found: task 26477967f237bf60eefbcd>
Jul 05 10:22:42 truenas k3s[5268]: E0705 10:22:42.022834 5268 remote_runtime.go:479] "ExecSync cmd from runtime service failed" err="rpc error: code = NotFound desc = failed to exec in container: failed to load task: no running task found: task 26477967f237bf60eefbcd>
Jul 05 10:22:42 truenas k3s[5268]: I0705 10:22:42.193659 5268 scope.go:115] "RemoveContainer" containerID="b0ab4a72cc61abcc041f826c960294764045801b1966b43755728448cd3d7cf9"
Jul 05 10:22:42 truenas k3s[5268]: I0705 10:22:42.193927 5268 scope.go:115] "RemoveContainer" containerID="26477967f237bf60eefbcd24a70ded3c78aa334464dcd4e2495cf66b1c701e84"
Jul 05 10:22:42 truenas k3s[5268]: E0705 10:22:42.194366 5268 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"tailscale\" with CrashLoopBackOff: \"back-off 10s restarting failed container=tailscale pod=tailscale-78fcb6bc64-pl>
Jul 05 10:22:45 truenas k3s[5268]: I0705 10:22:45.524474 5268 scope.go:115] "RemoveContainer" containerID="26477967f237bf60eefbcd24a70ded3c78aa334464dcd4e2495cf66b1c701e84"
Jul 05 10:22:45 truenas k3s[5268]: E0705 10:22:45.524961 5268 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"tailscale\" with CrashLoopBackOff: \"back-off 10s restarting failed container=tailscale pod=tailscale-78fcb6bc64-pl>
Jul 05 10:22:46 truenas k3s[5268]: I0705 10:22:46.988786 5268 scope.go:115] "RemoveContainer" containerID="26477967f237bf60eefbcd24a70ded3c78aa334464dcd4e2495cf66b1c701e84"
Jul 05 10:22:46 truenas k3s[5268]: E0705 10:22:46.989225 5268 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"tailscale\" with CrashLoopBackOff: \"back-off 10s restarting failed container=tailscale pod=tailscale-78fcb6bc64-pl>
Jul 05 10:22:55 truenas k3s[5268]: {"level":"warn","ts":"2024-07-05T10:22:55.063+0100","logger":"etcd-client","caller":"v3@v3.5.7-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000a20c40/kine.sock","attempt":0,"error>
Jul 05 10:22:57 truenas k3s[5268]: I0705 10:22:57.868834 5268 scope.go:115] "RemoveContainer" containerID="26477967f237bf60eefbcd24a70ded3c78aa334464dcd4e2495cf66b1c701e84"
Jul 05 10:23:18 truenas k3s[5268]: I0705 10:23:18.779081 5268 scope.go:115] "RemoveContainer" containerID="b8c1ea6b3918c6c9c599abdd9a422009c277731fe2a46893e7b033e244c77bdd"
Jul 05 10:23:19 truenas k3s[5268]: I0705 10:23:19.270719 5268 pod_container_deletor.go:80] "Container not found in pod's containers" containerID="b8c1ea6b3918c6c9c599abdd9a422009c277731fe2a46893e7b033e244c77bdd"
Jul 05 10:23:19 truenas k3s[5268]: E0705 10:23:19.278882 5268 client.go:88] "ListAndWatch ended unexpectedly for device plugin" err="rpc error: code = Unavailable desc = error reading from server: EOF" resource="nvidia.com/gpu"
Jul 05 10:23:23 truenas k3s[5268]: I0705 10:23:23.286862 5268 scope.go:115] "RemoveContainer" containerID="906da8bc2ca0516526e33a15c83a0b1332ffd183586c793200a30e1b19197dd6"
Jul 05 10:23:28 truenas k3s[5268]: I0705 10:23:28.299271 5268 pod_container_deletor.go:80] "Container not found in pod's containers" containerID="304c48b1636421a8b473cdd18b01763b868e0ee77c749931706fe3e941a1acad"
Jul 05 10:23:29 truenas k3s[5268]: E0705 10:23:29.217156 5268 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"nvidia-device-plugin-ctr\" with CrashLoopBackOff: \"back-off 10s restarting failed container=nvidia-device-plugin-c>
Jul 05 10:23:29 truenas k3s[5268]: I0705 10:23:29.302552 5268 scope.go:115] "RemoveContainer" containerID="0c6adab672b507b23b7c65a7a752a86f19c860a9123ed0f91e5036fbf0db1953"
Jul 05 10:23:29 truenas k3s[5268]: E0705 10:23:29.302801 5268 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"nvidia-device-plugin-ctr\" with CrashLoopBackOff: \"back-off 10s restarting failed container=nvidia-device-plugin-c>
Jul 05 10:23:30 truenas k3s[5268]: I0705 10:23:30.304843 5268 scope.go:115] "RemoveContainer" containerID="0c6adab672b507b23b7c65a7a752a86f19c860a9123ed0f91e5036fbf0db1953"
Jul 05 10:23:33 truenas k3s[5268]: I0705 10:23:33.149925 5268 server.go:144] "Got registration request from device plugin with resource" resourceName="nvidia.com/gpu"
Jul 05 10:23:35 truenas k3s[5268]: {"level":"warn","ts":"2024-07-05T10:23:35.656+0100","logger":"etcd-client","caller":"v3@v3.5.7-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000a20c40/kine.sock","attempt":0,"error>
Jul 05 10:24:18 truenas k3s[5268]: {"level":"warn","ts":"2024-07-05T10:24:18.613+0100","logger":"etcd-client","caller":"v3@v3.5.7-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000a20c40/kine.sock","attempt":0,"error>
Jul 05 10:24:40 truenas k3s[5268]: E0705 10:24:40.880043 5268 client.go:88] "ListAndWatch ended unexpectedly for device plugin" err="rpc error: code = Unavailable desc = error reading from server: EOF" resource="nvidia.com/gpu"
Jul 05 10:24:45 truenas k3s[5268]: I0705 10:24:45.465778 5268 scope.go:115] "RemoveContainer" containerID="0c6adab672b507b23b7c65a7a752a86f19c860a9123ed0f91e5036fbf0db1953"
jgreen@truenas ~ % sudo k3s kubectl get events --all-namespaces
[sudo] password for jgreen:
NAMESPACE LAST SEEN TYPE REASON OBJECT MESSAGE
kube-system 58m Normal SandboxChanged pod/nvidia-device-plugin-daemonset-k58cb Pod sandbox changed, it will be killed and re-created.
kube-system 58m Normal AddedInterface pod/nvidia-device-plugin-daemonset-k58cb Add eth0 [172.16.2.34/16] from ix-net
kube-system 52m Normal AddedInterface pod/nvidia-device-plugin-daemonset-k58cb Add eth0 [172.16.2.35/16] from ix-net
kube-system 45m Normal AddedInterface pod/nvidia-device-plugin-daemonset-k58cb Add eth0 [172.16.2.36/16] from ix-net
kube-system 39m Normal AddedInterface pod/nvidia-device-plugin-daemonset-k58cb Add eth0 [172.16.2.37/16] from ix-net
kube-system 34m Normal AddedInterface pod/nvidia-device-plugin-daemonset-k58cb Add eth0 [172.16.2.38/16] from ix-net
kube-system 27m Normal AddedInterface pod/nvidia-device-plugin-daemonset-k58cb Add eth0 [172.16.2.39/16] from ix-net
kube-system 21m Normal AddedInterface pod/nvidia-device-plugin-daemonset-k58cb Add eth0 [172.16.2.40/16] from ix-net
kube-system 16m Normal AddedInterface pod/nvidia-device-plugin-daemonset-k58cb Add eth0 [172.16.2.41/16] from ix-net
kube-system 9m54s Normal AddedInterface pod/nvidia-device-plugin-daemonset-k58cb Add eth0 [172.16.2.42/16] from ix-net
kube-system 8m38s Warning BackOff pod/nvidia-device-plugin-daemonset-k58cb Back-off restarting failed container nvidia-device-plugin-ctr in pod nvidia-device-plugin-daemonset-k58cb_kube-system(cbd2f844-0058-43c1-b019-126ae1906411)
kube-system 3m38s Normal Killing pod/nvidia-device-plugin-daemonset-k58cb Stopping container nvidia-device-plugin-ctr
kube-system 3m26s Normal AddedInterface pod/nvidia-device-plugin-daemonset-k58cb Add eth0 [172.16.2.43/16] from ix-net
@LarsR you said you got this working. I’ve been into Apps > Settings and disabled the GPU, saved, and watched the pod (?) disappear. I then went back in and switched it back on. The pod started and appeared fine - I even had my NVIDIA GPUs available to allocate to the app. Then the pod died again. Did yours stay running for more than a few minutes?
A quick follow-up. I followed the advice from @LarsR and disabled GPU, then rebooted, then re-enabled and rebooted.
The container lasted a few minutes and the GPUs were selectable for Plex. Then the container died and that was that again. I filed a bug (frankly more of a support request) here against the NVIDIA driver (I doubt it’s their fault), so if anyone with more expertise would like to help make sense of this I’d be grateful.
This is not going to be investigated further by the engineers at iXsystems. It appears the problem may lie within the K8s subsystem, but that’s a complex area and one that is being removed in the next major release of TrueNAS SCALE.
I have installed Jailmaker and used it to build a container named docker. I basically followed the procedure in the YouTube video posted above, except I used Plex rather than Jellyfin. Plex can see my NVIDIA P2000!
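For reference, the Jailmaker part is only a handful of commands. A sketch - jlmkr.py is the script from the Jailmaker repo, and the exact subcommands may differ between versions, so check ./jlmkr.py --help first:

./jlmkr.py create docker
./jlmkr.py start docker
./jlmkr.py shell docker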
Use the shell to access TrueNAS SCALE and confirm you have a /dev/dri/renderD128 (that’s mine; yours may be different). That’s your graphics card being recognised by the OS.
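For example:

ls -l /dev/dri
ls -l /dev/nvidia*

The first should show renderD128 (or similar); the second lists the device nodes the NVIDIA driver creates.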
Ensure with Jailmaker that /dev/dri is mounted into your jail.
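A sketch of what that can look like in the jail’s config - the key names here assume a recent Jailmaker and may differ in your version, so check the config file Jailmaker generates:

gpu_passthrough_nvidia=1
systemd_nspawn_user_args=--bind=/dev/dri

--bind= is a standard systemd-nspawn option, so if your Jailmaker version lacks the gpu_passthrough key you can still pass the device through with extra nspawn args.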
Finally, with fingers crossed, select your server in Plex and visit the Transcoder page. Hopefully your device will be listed.
The above is clearly something that can be followed by those comfortable with shell access and editors like vim and nano. Hopefully with the next major TrueNAS SCALE release the apps infrastructure will make this much simpler for folk who need purely web-interface controls.
Another follow-up. It turns out the instructions work for Jellyfin but are not sufficient for Plex: Plex will show your hardware but not use it. Switch on debug logging in Plex and you’ll find you’re missing the driver.
The solution is to install the nvidia-container-runtime for Docker to use when launching containers. See this guide.
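The gist, inside the jail, is roughly this - a sketch, assuming a Debian/Ubuntu userland with Docker already installed and NVIDIA’s apt repository configured as per the linked guide (the runtime now ships as part of the nvidia-container-toolkit package):

sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

As a sanity check you can run a throwaway container with GPU access, e.g. docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi (adjust the image tag to whatever is current), and then make sure the Plex container itself is launched with GPU access (--gpus all or the compose equivalent).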
I finally see the magical “(hw)” when transcoding a stream in Plex and my CPU is no longer used!
How are you running nvidia-smi? Did you install it somehow? Or is it not available in Electric Eel?
I just did a fresh install and wanted to test with the latest App infrastructure, so I updated to Electric Eel. Should I downgrade and then my NVIDIA RTX 4060 will potentially work?