Step-by-Step Guide: Modifying `replicas` in the NVIDIA Device Plugin Configuration

This guide will help you change the replicas setting in the NVIDIA device plugin configuration for Kubernetes, specifically on a TrueNAS Scale system using k3s. We’ll walk you through locating and editing the configuration step by step.

Step 1: Locate the NVIDIA Device Plugin ConfigMap

The NVIDIA device plugin uses a configuration stored in a Kubernetes ConfigMap. You need to retrieve the contents of this ConfigMap first.

  1. Open a terminal session and run the following command to fetch the configuration of the NVIDIA device plugin:

    k3s kubectl get configmap -n kube-system nvidia-device-plugin-config -o yaml
    
  2. The output should show you the nvdefault.yaml file embedded inside the ConfigMap. Look for this section in the output:

    data:
      nvdefault.yaml: |
        version: v1
        sharing:
          timeSlicing:
            renameByDefault: false
            failRequestsGreaterThanOne: false
            resources:
            - name: nvidia.com/gpu
              replicas: 5
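
If the full ConfigMap output is hard to scan, you can print only the embedded file. The following is a minimal sketch, assuming the ConfigMap name and the nvdefault.yaml key shown above. The function name show_plugin_config is a hypothetical helper for this guide, not part of k3s or the plugin.

```shell
# Hypothetical helper: print only the embedded nvdefault.yaml document.
# The backslash escapes the dot inside the key name for jsonpath.
show_plugin_config() {
  k3s kubectl get configmap -n kube-system nvidia-device-plugin-config \
    -o jsonpath='{.data.nvdefault\.yaml}'
}

# Usage: show_plugin_config
```

Nothing runs until you call the function, so it is safe to paste into a shell session first.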
    

Step 2: Edit the ConfigMap to Modify the replicas Setting

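Now that you’ve found the configuration, change the replicas value from 5 to 1. With time-slicing, replicas is the number of shares each physical GPU is advertised as, so replicas: 5 exposes every GPU as five nvidia.com/gpu resources and replicas: 1 exposes it as one.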

  1. To edit the ConfigMap, use the following command:

    k3s kubectl edit configmap -n kube-system nvidia-device-plugin-config
    
  2. This opens the ConfigMap in the default text editor (usually vim). Inside the editor, find this line:

    replicas: 5
    
  3. Change the value of replicas from 5 to 1:

    replicas: 1
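
If you would rather not use an interactive editor, the same change can be scripted. This is a rough sketch, not the only way to do it: set_replicas is a hypothetical helper, and it assumes that replicas sits on its own line as in the output from Step 1. It rewrites every such line, which is fine here because the configuration shown contains only one.

```shell
# Hypothetical helper: read a manifest on stdin and set every
# "replicas: <number>" line to the value given as the first argument.
set_replicas() {
  sed -E "s/^([[:space:]]*replicas:[[:space:]]*)[0-9]+/\1$1/"
}

# Usage (commented out because it modifies the cluster):
# k3s kubectl get configmap -n kube-system nvidia-device-plugin-config -o yaml \
#   | set_replicas 1 \
#   | k3s kubectl replace -f -
```

Run the first two stages of the pipeline on their own first to review the result before you send it to kubectl replace.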
    

Step 3: Save and Exit the Editor

After you’ve modified the replicas value, save and close the file.

  • If you’re using vim, do the following:
    1. Press Esc to return to normal mode.
    2. Type :wq and hit Enter to write the changes and quit the editor.
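
If you are not comfortable with vim, kubectl honors the KUBE_EDITOR environment variable. The sketch below wraps the edit command from Step 2. The function name edit_plugin_config is hypothetical, and the default of nano is an assumption: check that it is installed on your system before relying on it.

```shell
# Hypothetical helper: open the ConfigMap in a chosen editor.
# $1 = editor command; defaults to nano, which is assumed to be installed.
edit_plugin_config() {
  KUBE_EDITOR="${1:-nano}" k3s kubectl edit configmap -n kube-system nvidia-device-plugin-config
}

# Usage: edit_plugin_config        (uses nano)
#        edit_plugin_config vi     (uses vi)
```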

Step 4: Restart the NVIDIA Device Plugin DaemonSet

To apply the new configuration, you need to restart the NVIDIA device plugin. The easiest way to do this is by deleting the existing pod(s) associated with the DaemonSet. Kubernetes will automatically recreate them with the updated configuration.

  1. Run this command to delete the existing NVIDIA device plugin pods:

    k3s kubectl delete pod -n kube-system -l name=nvidia-device-plugin-ds
    
  2. The DaemonSet will automatically recreate the pod(s) with the updated configuration, which will now use replicas: 1.
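
As an alternative to deleting pods by label, you can ask Kubernetes to restart the DaemonSet and wait for it to finish. This is a sketch under one assumption: you pass in the DaemonSet name yourself, because it is not shown in this guide. You can list candidates with k3s kubectl get daemonset -n kube-system. The function name restart_plugin is hypothetical.

```shell
# Hypothetical helper: restart a DaemonSet in kube-system and wait
# until the replacement pods report ready.
# $1 = DaemonSet name (look it up with: k3s kubectl get daemonset -n kube-system)
restart_plugin() {
  k3s kubectl rollout restart daemonset -n kube-system "$1" &&
    k3s kubectl rollout status daemonset -n kube-system "$1"
}

# Usage: restart_plugin <daemonset-name>
```

The rollout status call blocks until the new pods are ready, so you know when it is safe to move on to verification.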


Step 5: Verify the Change

Once the new pod is running, you can check the logs to ensure that the new configuration with replicas: 1 is being applied.

  1. Get the name of the newly created pod:

    k3s kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds
    
  2. Check the logs of the new pod:

    k3s kubectl logs -n kube-system <new-nvidia-pod-name>
    
  3. In the logs, you should see the updated configuration reflecting replicas: 1.
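
Besides reading the logs by hand, you can cross-check the result from the node side: the number of nvidia.com/gpu resources a node advertises should equal its physical GPU count multiplied by replicas. The two functions below are a hedged sketch. Their names are hypothetical, and the label selector is the one used in Step 4.

```shell
# Hypothetical helper: show how many nvidia.com/gpu resources each node advertises.
gpu_allocatable() {
  k3s kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# Hypothetical helper: search the plugin logs for the replicas setting.
# --tail=-1 requests the full log; with a label selector kubectl otherwise
# returns only the last few lines.
plugin_replicas_log() {
  k3s kubectl logs -n kube-system -l name=nvidia-device-plugin-ds --tail=-1 \
    | grep -i 'replicas'
}

# Usage: gpu_allocatable
#        plugin_replicas_log
```

On a system with one physical GPU, the GPU column should drop from 5 to 1 once the new configuration is active.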

If Changes Don’t Work: Disable Time-Slicing

If the pod crashes after you change replicas, try disabling time-slicing altogether. Without time-slicing, the plugin advertises each physical GPU once, so a GPU is allocated to only one pod at a time.

Steps to Disable Time-Slicing

  1. Edit the ConfigMap:
    • Open the ConfigMap for editing again:

    k3s kubectl edit configmap -n kube-system nvidia-device-plugin-config

  2. Locate the Time-Slicing Section:
    • Find the sharing section within the nvdefault.yaml data. It should look similar to this:

    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
        - name: nvidia.com/gpu
          replicas: 1

  3. Remove the Time-Slicing Configuration:
    • Delete the sharing block together with everything nested under it (timeSlicing, renameByDefault, failRequestsGreaterThanOne, and resources). These keys belong to timeSlicing, so do not leave them directly under sharing. Assuming the file contains nothing beyond what is shown in Step 1, only the version line should remain:

    nvdefault.yaml: |
      version: v1

  4. Save and Exit:
    • Save your changes and exit the editor.

  5. Restart the NVIDIA Device Plugin Pods:
    • Delete the current pod(s) again so that they pick up the new configuration:

    k3s kubectl delete pod -n kube-system -l name=nvidia-device-plugin-ds

  6. Verify the Changes:
    • After the new pod is up, check the logs to confirm that it is running properly:

    k3s kubectl logs -n kube-system <new-nvidia-pod-name>