This guide will help you change the replicas setting in the NVIDIA device plugin configuration for Kubernetes, specifically on a TrueNAS Scale system using k3s. We’ll walk you through locating and editing the configuration step by step.
Step 1: Locate the NVIDIA Device Plugin ConfigMap
The NVIDIA device plugin uses a configuration stored in a Kubernetes ConfigMap. You need to retrieve the contents of this ConfigMap first.
- Open a terminal session and run the following command to fetch the configuration of the NVIDIA device plugin:
    k3s kubectl get configmap -n kube-system nvidia-device-plugin-config -o yaml
- The output should show you the nvdefault.yaml file embedded inside the ConfigMap. Look for this section in the output:
    data:
      nvdefault.yaml: |
        version: v1
        sharing:
          timeSlicing:
            renameByDefault: false
            failRequestsGreaterThanOne: false
            resources:
            - name: nvidia.com/gpu
              replicas: 5
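If the full ConfigMap output is too noisy, a JSONPath query can print just the embedded file. The following is a minimal sketch, assuming the same ConfigMap name and namespace as above; the helper name show_plugin_config is made up for this guide.

```shell
# Print only the embedded nvdefault.yaml payload of the ConfigMap.
# The backslash escapes the dot in the key name so that kubectl's JSONPath
# parser treats "nvdefault.yaml" as a single key.
show_plugin_config() {
    k3s kubectl get configmap -n kube-system nvidia-device-plugin-config \
        -o jsonpath='{.data.nvdefault\.yaml}'
}

# Usage (on the TrueNAS Scale host):
#   show_plugin_config
```

Redirecting the output to a file (for example show_plugin_config > nvdefault.yaml.bak) also gives you a copy to fall back on before you start editing.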
Step 2: Edit the ConfigMap to Modify the replicas Setting
Now that you’ve found the configuration, you need to change the replicas value from 5 to 1.
- To edit the ConfigMap, use the following command:
    k3s kubectl edit configmap -n kube-system nvidia-device-plugin-config
- This will open the ConfigMap in the default text editor (usually vim). Inside the editor, find this section:
    replicas: 5
- Change the value of replicas from 5 to 1:
    replicas: 1
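If you would rather not use an interactive editor, the same change can be scripted by piping the ConfigMap through sed and back into the cluster. This is only a sketch: set_replicas is a hypothetical helper, and it assumes the ConfigMap is printed as a block scalar with each replicas entry on its own line, as in the output from Step 1.

```shell
# Rewrite every "replicas: <number>" line read from stdin to the count given
# as the first argument, leaving indentation and all other lines untouched.
set_replicas() {
    # Refuse anything that is not a plain non-negative integer.
    case "$1" in
        ''|*[!0-9]*) echo "usage: set_replicas <count>" >&2; return 1 ;;
    esac
    sed -E "s/^([[:space:]]*replicas:)[[:space:]]*[0-9]+[[:space:]]*\$/\1 $1/"
}

# Usage (on the TrueNAS Scale host):
#   k3s kubectl get configmap -n kube-system nvidia-device-plugin-config -o yaml \
#       | set_replicas 1 \
#       | k3s kubectl replace -f -
```

Run the first two stages of the pipeline on their own before adding the replace step, so you can confirm that the output contains replicas: 1 and nothing else has changed.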
Step 3: Save and Exit the Editor
After you’ve modified the replicas value, save and close the file.
- If you’re using vim, do the following:
  - Press Esc to return to normal mode.
  - Type :wq and hit Enter to write the changes and quit the editor.
Step 4: Restart the NVIDIA Device Plugin DaemonSet
To apply the new configuration, you need to restart the NVIDIA device plugin. The easiest way to do this is by deleting the existing pod(s) associated with the DaemonSet. Kubernetes will automatically recreate them with the updated configuration.
- Run this command to delete the existing NVIDIA device plugin pods:
    k3s kubectl delete pod -n kube-system -l name=nvidia-device-plugin-ds
- The DaemonSet will automatically recreate the pod(s) with the updated configuration, which will now use replicas: 1.
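The replacement pod can take a few seconds to appear. As a rough sketch, the helper below polls until every pod matching the label reports the Running phase. The name wait_for_plugin_pods, the 5-second interval, and the 24-attempt limit are arbitrary choices for illustration, and a Running phase alone does not prove that the plugin accepted the configuration.

```shell
# Poll until at least one device plugin pod exists and every matching pod
# reports the Running phase, or give up after roughly two minutes.
wait_for_plugin_pods() {
    attempts=0
    while [ "$attempts" -lt 24 ]; do
        phases=$(k3s kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds \
            -o jsonpath='{.items[*].status.phase}')
        # Strip every "Running" and all whitespace; nothing left means all pods are Running.
        rest=$(printf '%s' "$phases" | sed 's/Running//g' | tr -d '[:space:]')
        if [ -n "$phases" ] && [ -z "$rest" ]; then
            echo "device plugin pods are Running"
            return 0
        fi
        attempts=$((attempts + 1))
        sleep 5
    done
    echo "timed out; pod phases: ${phases:-none}" >&2
    return 1
}

# Usage (on the TrueNAS Scale host), right after deleting the pods:
#   wait_for_plugin_pods
```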
Step 5: Verify the Change
Once the new pod is running, you can check the logs to ensure that the new configuration with replicas: 1 is being applied.
- Get the name of the newly created pod:
    k3s kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds
- Check the logs of the new pod:
    k3s kubectl logs -n kube-system <new-nvidia-pod-name>
- In the logs, you should see the updated configuration reflecting replicas: 1.
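Besides reading the logs, you can check what the node actually advertises. With time-slicing, the number of nvidia.com/gpu resources on a node is the number of physical GPUs multiplied by replicas, so a single-GPU system that showed 5 before should now show 1. The helpers below are a sketch with made-up names; they assume the same label and namespace as the commands above.

```shell
# List each node with the number of nvidia.com/gpu resources it advertises.
# The backslash escapes the dot inside the resource name.
show_gpu_capacity() {
    k3s kubectl get nodes \
        -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}

# Print only the log lines from the device plugin pods that mention replicas.
# --tail=-1 requests the full log instead of the last few lines.
show_replica_log_lines() {
    k3s kubectl logs -n kube-system -l name=nvidia-device-plugin-ds --tail=-1 \
        | grep -i 'replicas'
}

# Usage (on the TrueNAS Scale host):
#   show_gpu_capacity
#   show_replica_log_lines
```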
If Changes Don’t Work: Disable Time-Slicing
If the device plugin pod crashes after you change replicas, try disabling time-slicing instead. With time-slicing disabled, the plugin advertises each physical GPU exactly once, so a single GPU can be allocated without conflicts.
Steps to Disable Time-Slicing
- Edit the ConfigMap:
  - Open the ConfigMap for editing again:
      k3s kubectl edit configmap -n kube-system nvidia-device-plugin-config
- Locate the Time-Slicing Section:
  - Find the timeSlicing section within the nvdefault.yaml data. It should look similar to this:
      sharing:
        timeSlicing:
          renameByDefault: false
          failRequestsGreaterThanOne: false
          resources:
          - name: nvidia.com/gpu
            replicas: 1
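- Modify the Time-Slicing Configuration:
  - Remove the timeSlicing section together with everything nested under it (renameByDefault, failRequestsGreaterThanOne, and resources). These keys belong to timeSlicing, so they must not be left behind directly under sharing. If nothing else remains under sharing, remove the sharing key as well. An empty timeSlicing: {} section is an alternative, but it may not be accepted by every plugin version. After the change, the embedded file can be as small as this:
      version: v1
      # sharing and the timeSlicing section under it have been removed entirely.
      # Possible alternative, if your plugin version accepts it:
      # sharing:
      #   timeSlicing: {}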
- Save and Exit:
  - Save your changes and exit the editor.
- Restart the NVIDIA Device Plugin Pods:
  - Delete the current pod(s) again to ensure they pick up the new configuration:
      k3s kubectl delete pod -n kube-system -l name=nvidia-device-plugin-ds
- Verify the Changes:
  - After the new pod is up, check the logs to ensure it’s running properly:
      k3s kubectl logs -n kube-system <new-nvidia-pod-name>