FIX: NVIDIA Pod in Crash-Loop on TrueNAS Scale k3s Setup

I recently encountered a persistent issue where my NVIDIA pod was stuck in a CrashLoopBackOff state within my k3s deployment on a TrueNAS Scale system. Below are the relevant system specifications:

  • k3s version: v1.26.6+k3s-6a894050-dirty (go version go1.19.8)
  • GPU: NVIDIA Tesla P4 (GP104GL, 83:00.0 3D controller)
  • Network Card: Intel E1G44ET2 Quad Port Server Card (Gigabit) / 4x RJ45
  • Motherboard: Supermicro X9DRi-F, dual Intel Xeon E5-2630 v2, 128GB LRDIMM
  • Storage: 4x Seagate Enterprise ST6000NM0054 drives in RAIDZ2

Problem Overview :cry:
The NVIDIA device plugin daemonset was perpetually stuck in a CrashLoopBackOff state, as shown below:

kube-system nvidia-device-plugin-daemonset-hlx5j 0/1 CrashLoopBackOff 10 (10s ago) 39m 172.16.0.17 ix-truenas
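A status line like this can be pulled with the wide pod listing, filtered down to the device plugin (the pod name suffix is unique to each deployment):

k3s kubectl get pods -n kube-system -o wide | grep nvidia-device-plugin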

Upon investigating the logs using the command k3s kubectl logs -n kube-system nvidia-device-plugin-daemonset-hlx5j, I observed the following log output:

2024/09/25 13:34:49 Starting FS watcher.
2024/09/25 13:34:49 Starting OS watcher.
2024/09/25 13:34:49 Starting Plugins.
2024/09/25 13:34:49 Loading configuration.
2024/09/25 13:34:49 Updating config with default resource matching patterns.
2024/09/25 13:34:49 Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": "envvar",
      "deviceIDStrategy": "uuid"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {
      "resources": [
        {
          "name": "nvidia.com/gpu",
          "devices": "all",
          "replicas": 5
        }
      ]
    }
  }
}
2024/09/25 13:34:49 Detected NVML platform: found NVML library
2024/09/25 13:34:49 Detected non-Tegra platform: /sys/devices/soc0/family file not found
2024/09/25 13:34:50 Starting GRPC server for 'nvidia.com/gpu'
2024/09/25 13:34:50 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2024/09/25 13:34:50 Registered device plugin for 'nvidia.com/gpu' with Kubelet
2024/09/25 13:36:16 Received signal "terminated", shutting down.
2024/09/25 13:36:16 Stopping plugins.
2024/09/25 13:36:16 Stopping to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock

The pod was starting successfully, registering with the kubelet, and then shutting down shortly afterwards. The "Received signal terminated" line shows the plugin was being stopped from outside (SIGTERM) rather than failing on its own, which pointed toward the container runtime configuration rather than the plugin itself. Below are the steps I took to troubleshoot and ultimately resolve the issue.

Troubleshooting Process and Solution:
Reinstallation of TrueNAS Scale:
As an initial troubleshooting step, I reinstalled TrueNAS Scale multiple times using various USB sticks (five in total) created via BalenaEtcher. Unfortunately, this yielded the same result, with the NVIDIA device plugin continuing to crash.

Modifying config.toml.tmpl:
After further investigation, I found a forum post suggesting a modification to the container runtime configuration. Specifically, it was recommended to add the following line to a config.toml.tmpl file located at /mnt/REDUNDANT_POOL/ix-applications/k3s/agent/etc/containerd:

SystemdCgroup = {{ .SystemdCgroup }}

I had to manually create the config.toml.tmpl file. For its contents, I copied the existing configuration from the config.toml file in the same directory (/mnt/REDUNDANT_POOL/ix-applications/k3s/agent/etc/containerd). The complete configuration of the config.toml.tmpl file is as follows:

version = 2

[plugins.ā€œio.containerd.internal.v1.optā€]
path = ā€œ/mnt/REDUNDANT_POOL/ix-applications/k3s/agent/containerdā€

[plugins.ā€œio.containerd.grpc.v1.criā€]
stream_server_address = ā€œ127.0.0.1ā€
stream_server_port = ā€œ10010ā€
enable_selinux = false
enable_unprivileged_ports = true
enable_unprivileged_icmp = true
sandbox_image = ā€œrancher/mirrored-pause:3.6ā€

[plugins.ā€œio.containerd.grpc.v1.criā€.containerd]
snapshotter = ā€œoverlayfsā€
disable_snapshot_annotations = true

[plugins.ā€œio.containerd.grpc.v1.criā€.containerd.runtimes.runc]
runtime_type = ā€œio.containerd.runc.v2ā€

[plugins.ā€œio.containerd.grpc.v1.criā€.containerd.runtimes.runc.options]
SystemdCgroup = true

[plugins.ā€œio.containerd.grpc.v1.criā€.containerd.runtimes.ā€œnvidiaā€]
runtime_type = ā€œio.containerd.runc.v2ā€

[plugins.ā€œio.containerd.grpc.v1.criā€.containerd.runtimes.ā€œnvidiaā€.options]
BinaryName = ā€œ/usr/bin/nvidia-container-runtimeā€
SystemdCgroup = true
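The copy step itself, creating config.toml.tmpl from the generated config.toml in the same directory, was roughly the following; the pool name REDUNDANT_POOL is specific to my system, and you will likely need to be root to reach this path:

cd /mnt/REDUNDANT_POOL/ix-applications/k3s/agent/etc/containerd
cp config.toml config.toml.tmpl
# then edit config.toml.tmpl so the nvidia runtime options use the SystemdCgroup template line shown below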

Restarting k3s Service:
After saving the finished config.toml.tmpl, I deleted the existing NVIDIA device plugin pod and restarted the k3s service. This resolved the CrashLoopBackOff, and since making this change the NVIDIA pod has remained stable with no further crashes.
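The delete-and-restart step looked roughly like the commands below; the pod name is the one from my listing above (yours will have a different suffix), and the k3s service unit name is an assumption that may differ depending on how your TrueNAS Scale release manages k3s:

# delete the crashing pod so the daemonset recreates it under the new runtime config
k3s kubectl delete pod -n kube-system nvidia-device-plugin-daemonset-hlx5j

# restart k3s so containerd regenerates config.toml from config.toml.tmpl
systemctl restart k3s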

Updated Configuration:
Below is the configuration after applying the changes to the config.toml.tmpl file:

version = 2

[plugins.ā€œio.containerd.internal.v1.optā€]
path = ā€œ/mnt/REDUNDANT_POOL/ix-applications/k3s/agent/containerdā€

[plugins.ā€œio.containerd.grpc.v1.criā€]
stream_server_address = ā€œ127.0.0.1ā€
stream_server_port = ā€œ10010ā€
enable_selinux = false
enable_unprivileged_ports = true
enable_unprivileged_icmp = true
sandbox_image = ā€œrancher/mirrored-pause:3.6ā€

[plugins.ā€œio.containerd.grpc.v1.criā€.containerd]
snapshotter = ā€œoverlayfsā€
disable_snapshot_annotations = true

[plugins.ā€œio.containerd.grpc.v1.criā€.containerd.runtimes.runc]
runtime_type = ā€œio.containerd.runc.v2ā€

[plugins.ā€œio.containerd.grpc.v1.criā€.containerd.runtimes.runc.options]
SystemdCgroup = true

[plugins.ā€œio.containerd.grpc.v1.criā€.containerd.runtimes.ā€œnvidiaā€]
runtime_type = ā€œio.containerd.runc.v2ā€

[plugins.ā€œio.containerd.grpc.v1.criā€.containerd.runtimes.ā€œnvidiaā€.options]
BinaryName = ā€œ/usr/bin/nvidia-container-runtimeā€
SystemdCgroup = {{ .SystemdCgroup }}

The key change is under the NVIDIA runtime options, where the hard-coded SystemdCgroup = true from the copied config is replaced with the template variable:

SystemdCgroup = {{ .SystemdCgroup }}
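When k3s starts, it renders config.toml.tmpl back into config.toml in the same directory, so a quick way to confirm the template was picked up after the restart (the rendered value of the variable depends on your system's cgroup setup) is something like:

grep -A 3 'runtimes."nvidia"' /mnt/REDUNDANT_POOL/ix-applications/k3s/agent/etc/containerd/config.toml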

This change, combined with restarting the k3s service, has effectively resolved the issue with the NVIDIA device plugin crashing.
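As a final sanity check, not part of the fix itself, you can confirm the plugin pod stays up and the node advertises the GPU resource; with the time-slicing configuration shown in the plugin logs above (replicas: 5), the node should report a nvidia.com/gpu capacity of 5:

k3s kubectl get pods -n kube-system | grep nvidia-device-plugin
k3s kubectl describe node ix-truenas | grep -i 'nvidia.com/gpu'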


Is this the sort of issue/bug that is generally fixed in future releases/patches? I just set up my TrueNAS Scale server and am noticing the same behavior you described. I am tempted to follow your instructions to get my GPU running consistently, but I'm a little green and not sure I should go through the trouble if this will be patched/fixed.

For what it's worth, I did try navigating to the directory you mentioned but keep getting access/permission denied when trying to get to the agent directory under k3s. I'm logged in as admin.

The issue should no longer be relevant in Electric Eel (available as RC2 right now, with the release due at the end of the month), because the apps backend switches from Kubernetes to Docker, and this was a Kubernetes-specific problem.

You are a superhero. This has bugged me for several months, with iX just pointing to an upstream error.