Nvidia-device-plugin-daemonset CrashLoopBackOff

Travis · May 9, 2024, 1:20am

I am trying to get NVIDIA acceleration working with Frigate, and when I enable NVIDIA GPU support, the app fails to start because of this scheduling error:

0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..

Digging further, the logs from nvidia-device-plugin-daemonset-gbdcp show that it is dying:

2024/05/09 01:12:13 Starting FS watcher.
2024/05/09 01:12:13 Starting OS watcher.
2024/05/09 01:12:13 Starting Plugins.
2024/05/09 01:12:13 Loading configuration.
2024/05/09 01:12:13 Updating config with default resource matching patterns.
2024/05/09 01:12:13
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": "envvar",
      "deviceIDStrategy": "uuid"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {
      "resources": [
        {
          "name": "nvidia.com/gpu",
          "devices": "all",
          "replicas": 5
        }
      ]
    }
  }
}
2024/05/09 01:12:13 Retreiving plugins.
2024/05/09 01:12:13 Detected NVML platform: found NVML library
2024/05/09 01:12:13 Detected non-Tegra platform: /sys/devices/soc0/family file not found
2024/05/09 01:12:14 Starting GRPC server for 'nvidia.com/gpu'
2024/05/09 01:12:14 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2024/05/09 01:12:14 Registered device plugin for 'nvidia.com/gpu' with Kubelet
2024/05/09 01:13:22 Received signal "terminated", shutting down.
2024/05/09 01:13:22 Stopping plugins.
2024/05/09 01:13:22 Stopping to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
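
A note on reading the config dump above: as far as I understand the device plugin's time-slicing feature, each physical GPU is advertised to the scheduler as `replicas` separate nvidia.com/gpu resources, so the "Insufficient nvidia.com/gpu" message would mean the node is advertising fewer than expected (possibly zero while the plugin is down). The sketch below only does that arithmetic on a trimmed copy of the logged config. The function name `advertised_gpu_count` and the `physical_gpus=1` default are assumptions for illustration, not anything from TrueNAS or the plugin itself.

```python
import json

# Trimmed copy of the config printed in the device plugin log above.
PLUGIN_CONFIG = """
{
  "version": "v1",
  "sharing": {
    "timeSlicing": {
      "resources": [
        {"name": "nvidia.com/gpu", "devices": "all", "replicas": 5}
      ]
    }
  }
}
"""


def advertised_gpu_count(config_text, resource="nvidia.com/gpu", physical_gpus=1):
    """Estimate how many schedulable units of `resource` a node would expose.

    Hypothetical helper: assumes every physical GPU is shared ("devices": "all")
    and that the number of physical GPUs is known to the caller.
    """
    config = json.loads(config_text)
    entries = config.get("sharing", {}).get("timeSlicing", {}).get("resources", [])
    for entry in entries:
        if entry.get("name") == resource:
            # Each physical GPU is exposed `replicas` times.
            return physical_gpus * int(entry.get("replicas", 1))
    # No time-slicing entry for this resource: one unit per physical GPU.
    return physical_gpus


if __name__ == "__main__":
    print(advertised_gpu_count(PLUGIN_CONFIG))
```

With one card and "replicas": 5, this gives 5, so in principle up to five pods could each request one nvidia.com/gpu once the plugin stays up and registered.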

I am new to TrueNAS. I have NVIDIA support working with Plex, but I can’t get the GPU working in two different apps at the same time. I know my card can handle 4 streams at once because I had it working in Unraid, so I am a bit stumped.

Travis · May 9, 2024, 11:20am

I was able to make SOME headway (at least for one container) by running

midclt call boot.update_initramfs

and then rebooting. It still looks like I cannot access nvidia-smi from multiple containers, but maybe that is expected behavior?