I recently encountered a persistent issue where the NVIDIA device plugin pod was stuck in a CrashLoopBackOff state in my k3s deployment on a TrueNAS Scale system. Below are the relevant system specifications:
- k3s version: v1.26.6+k3s-6a894050-dirty (go version go1.19.8)
- GPU: NVIDIA Tesla P4 (GP104GL, 83:00.0 3D controller)
- Network Card: Intel E1G44ET2 Quad Port Server Card (Gigabit) / 4x RJ45
- Motherboard: Supermicro X9DRi-F, dual Intel Xeon E5-2630 v2, 128GB LRDIMM
- Storage: 4x Seagate Enterprise ST6000NM0054 drives in RAIDZ2
Problem Overview
The NVIDIA device plugin daemonset was perpetually stuck in a CrashLoopBackOff state, as shown below:
kube-system nvidia-device-plugin-daemonset-hlx5j 0/1 CrashLoopBackOff 10 (10s ago) 39m 172.16.0.17 ix-truenas
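For reference, the listing above comes from querying the pods with wide output, roughly as follows (the pod name suffix will differ on other systems):

# List pods across all namespaces, including node placement and pod IP
k3s kubectl get pods -A -o wide | grep nvidia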
Upon investigating the logs using the command k3s kubectl logs -n kube-system nvidia-device-plugin-daemonset-hlx5j, I observed the following log output:
2024/09/25 13:34:49 Starting FS watcher.
2024/09/25 13:34:49 Starting OS watcher.
2024/09/25 13:34:49 Starting Plugins.
2024/09/25 13:34:49 Loading configuration.
2024/09/25 13:34:49 Updating config with default resource matching patterns.
2024/09/25 13:34:49 Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": "envvar",
      "deviceIDStrategy": "uuid"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {
      "resources": [
        {
          "name": "nvidia.com/gpu",
          "devices": "all",
          "replicas": 5
        }
      ]
    }
  }
}
2024/09/25 13:34:49 Detected NVML platform: found NVML library
2024/09/25 13:34:49 Detected non-Tegra platform: /sys/devices/soc0/family file not found
2024/09/25 13:34:50 Starting GRPC server for 'nvidia.com/gpu'
2024/09/25 13:34:50 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2024/09/25 13:34:50 Registered device plugin for 'nvidia.com/gpu' with Kubelet
2024/09/25 13:36:16 Received signal "terminated", shutting down.
2024/09/25 13:36:16 Stopping plugins.
2024/09/25 13:36:16 Stopping to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
The pod was starting successfully, registering with the kubelet, and then receiving a termination signal shortly afterwards. Below are the steps I took to troubleshoot and ultimately resolve the issue.
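For restarts like this, the pod's events and last terminated state are also worth checking; a typical inspection (using the pod name from above) looks something like this:

# The Events section and the "Last State: Terminated" block show the exit reason and restart count
k3s kubectl describe pod -n kube-system nvidia-device-plugin-daemonset-hlx5j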
Troubleshooting Process and Solution:
Reinstallation of TrueNAS Scale:
As an initial troubleshooting step, I reinstalled TrueNAS Scale multiple times using various USB sticks (five in total) created via BalenaEtcher. Unfortunately, this yielded the same result, with the NVIDIA device plugin continuing to crash.
Modifying config.toml.tmpl:
After further investigation, I found a forum post suggesting a modification to the container runtime configuration. Specifically, it was recommended to add the following line to a config.toml.tmpl file located at /mnt/REDUNDANT_POOL/ix-applications/k3s/agent/etc/containerd:
SystemdCgroup = {{ .SystemdCgroup }}
I had to create the config.toml.tmpl file manually. For its contents, I copied the existing configuration from the config.toml file in the same directory. The complete contents of the resulting config.toml.tmpl file were as follows:
version = 2

[plugins."io.containerd.internal.v1.opt"]
  path = "/mnt/REDUNDANT_POOL/ix-applications/k3s/agent/containerd"

[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  enable_unprivileged_ports = true
  enable_unprivileged_icmp = true
  sandbox_image = "rancher/mirrored-pause:3.6"

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
  SystemdCgroup = true
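In practice, the simplest way to seed the template is to copy the generated config.toml and then adjust the NVIDIA runtime options in the copy, roughly:

cd /mnt/REDUNDANT_POOL/ix-applications/k3s/agent/etc/containerd
# Seed the template from the config that k3s generated, then edit the nvidia runtime options
cp config.toml config.toml.tmpl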
Restarting k3s Service:
After finalizing the config.toml.tmpl file and applying the modifications, I deleted the previous instance of the NVIDIA device plugin pod and restarted the k3s service. This resolved the CrashLoopBackOff, and the pod has remained stable ever since, with no further crashes.
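Concretely, this boiled down to something along these lines (the daemonset recreates the pod automatically, and TrueNAS Scale runs k3s as a systemd service):

# Delete the crashing pod; the daemonset spawns a fresh one
k3s kubectl delete pod -n kube-system nvidia-device-plugin-daemonset-hlx5j
# Restart k3s so containerd regenerates config.toml from config.toml.tmpl
systemctl restart k3s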
Updated Configuration:
Below is the configuration after applying the changes to the config.toml.tmpl file:
version = 2

[plugins."io.containerd.internal.v1.opt"]
  path = "/mnt/REDUNDANT_POOL/ix-applications/k3s/agent/containerd"

[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  enable_unprivileged_ports = true
  enable_unprivileged_icmp = true
  sandbox_image = "rancher/mirrored-pause:3.6"

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
  SystemdCgroup = {{ .SystemdCgroup }}
The key change in this configuration is under the NVIDIA runtime options, where the hard-coded SystemdCgroup = true is replaced with the templated value:
SystemdCgroup = {{ .SystemdCgroup }}
This change, combined with restarting the k3s service, has effectively resolved the issue with the NVIDIA device plugin crashing.
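As a quick sanity check after the restart, the pod should stay in Running and the node should advertise the time-sliced GPU resource; for example:

# Pod should show Running with a stable restart count
k3s kubectl get pods -n kube-system -o wide | grep nvidia-device-plugin
# With time-slicing replicas set to 5, allocatable should report nvidia.com/gpu: 5
k3s kubectl describe node ix-truenas | grep -A 10 Allocatable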