Help with Nvidia GPU

Hi everyone!

I need some help with my GPU on TrueNAS, as it’s my first time trying to set it up.
For some reason unknown to me, my nvidia-device-plugin-daemonset-s8hs8 pod keeps crashing. I am totally perplexed, and troubleshooting this is beyond my knowledge. Hopefully someone here can enlighten me and point me down the right path.

My system configuration at the moment is as follows:
CPU: AMD Ryzen9 3900X
Memory: 64GB
GPU: EVGA RTX3060 12GB
Pool size: 1 x Mirrored 1TB SSD (For now)

root@truenas[~]# k3s kubectl get pods -A
NAMESPACE     NAME                                   READY   STATUS                     RESTARTS        AGE
kube-system   csi-nfs-controller-7b74694749-qz4n4    0/4     TaintToleration            0               5h16m
ix-plex       plex-5456778fb6-84dtm                  0/1     UnexpectedAdmissionError   0               3h54m
kube-system   csi-nfs-node-qw9d9                     3/3     Running                    3 (3h44m ago)   5h16m
kube-system   csi-smb-node-h54dr                     3/3     Running                    3 (3h44m ago)   5h16m
kube-system   snapshot-controller-546868dfb4-hcqsv   0/1     TaintToleration            0               5h16m
kube-system   csi-smb-controller-7fbbb8fb6f-ktfv7    0/3     TaintToleration            0               5h16m
kube-system   snapshot-controller-546868dfb4-xjgz6   0/1     TaintToleration            0               5h16m
kube-system   csi-nfs-controller-7b74694749-kd2kr    4/4     Running                    0               3h42m
kube-system   coredns-59b4f5bbd5-nxbj4               1/1     Running                    0               3h42m
kube-system   snapshot-controller-546868dfb4-cpj7f   1/1     Running                    0               3h42m
kube-system   csi-smb-controller-7fbbb8fb6f-4k986    3/3     Running                    0               3h42m
kube-system   snapshot-controller-546868dfb4-jstjs   1/1     Running                    0               3h42m
ix-plex       plex-5984d9cb8b-vrbqf                  1/1     Running                    0               3h36m
kube-system   nvidia-device-plugin-daemonset-s8hs8   0/1     CrashLoopBackOff           39 (13s ago)    3h42m
root@truenas[~]# k3s kubectl describe nodes ix-truenas
Name:               ix-truenas
Roles:              control-plane,master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ix-truenas
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=true
                    node-role.kubernetes.io/master=true
Annotations:        csi.volume.kubernetes.io/nodeid: {"nfs.csi.k8s.io":"ix-truenas","smb.csi.k8s.io":"ix-truenas"}
                    k3s.io/node-args:
                      ["server","--cluster-cidr","172.16.0.0/16","--cluster-dns","172.17.0.10","--data-dir","/mnt/VMs and Apps/ix-applications/k3s","--disable",...
                    k3s.io/node-config-hash: FKSKZLLDDIEFXERCQSDV7UX226TSV4KMNQ5JAHCDEFWSGCFGUNLQ====
                    k3s.io/node-env:
                      {"K3S_DATA_DIR":"/mnt/VMs and Apps/ix-applications/k3s/data/203b9c5ec6ef066e14ed69ff770f7ac5023555505d8fc914c3e028bd9ce8b112"}
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Fri, 21 Jun 2024 09:11:45 -0700
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  ix-truenas
  AcquireTime:     <unset>
  RenewTime:       Fri, 21 Jun 2024 14:27:31 -0700
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Fri, 21 Jun 2024 14:26:45 -0700   Fri, 21 Jun 2024 09:11:44 -0700   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Fri, 21 Jun 2024 14:26:45 -0700   Fri, 21 Jun 2024 09:11:44 -0700   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Fri, 21 Jun 2024 14:26:45 -0700   Fri, 21 Jun 2024 09:11:44 -0700   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Fri, 21 Jun 2024 14:26:45 -0700   Fri, 21 Jun 2024 10:45:24 -0700   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  192.168.1.96
  Hostname:    ix-truenas
Capacity:
  cpu:                24
  ephemeral-storage:  919508Mi
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             65755996Ki
  nvidia.com/gpu:     5
  pods:               250
Allocatable:
  cpu:                24
  ephemeral-storage:  915965318860
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             65755996Ki
  nvidia.com/gpu:     5
  pods:               250
System Info:
  Machine ID:                 b39acfd315b340329cd5428a9015dd99
  System UUID:                42dbae95-d904-5ef5-c7f7-04d9f55ec7f6
  Boot ID:                    5006d688-0515-492d-92ea-9e3277947553
  Kernel Version:             6.6.29-production+truenas
  OS Image:                   Debian GNU/Linux 12 (bookworm)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://Unknown
  Kubelet Version:            v1.26.6+k3s-6a894050-dirty
  Kube-Proxy Version:         v1.26.6+k3s-6a894050-dirty
PodCIDR:                      172.16.0.0/16
PodCIDRs:                     172.16.0.0/16
Non-terminated Pods:          (9 in total)
  Namespace                   Name                                    CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                    ------------  ----------  ---------------  -------------  ---
  kube-system                 csi-nfs-node-qw9d9                      30m (0%)      0 (0%)      60Mi (0%)        500Mi (0%)     5h15m
  kube-system                 csi-smb-node-h54dr                      30m (0%)      0 (0%)      60Mi (0%)        400Mi (0%)     5h15m
  kube-system                 csi-nfs-controller-7b74694749-kd2kr     40m (0%)      0 (0%)      80Mi (0%)        900Mi (1%)     3h42m
  kube-system                 coredns-59b4f5bbd5-nxbj4                100m (0%)     0 (0%)      70Mi (0%)        170Mi (0%)     3h42m
  kube-system                 snapshot-controller-546868dfb4-cpj7f    10m (0%)      0 (0%)      20Mi (0%)        300Mi (0%)     3h42m
  kube-system                 csi-smb-controller-7fbbb8fb6f-4k986     30m (0%)      2 (8%)      60Mi (0%)        600Mi (0%)     3h42m
  kube-system                 snapshot-controller-546868dfb4-jstjs    10m (0%)      0 (0%)      20Mi (0%)        300Mi (0%)     3h42m
  ix-plex                     plex-5984d9cb8b-vrbqf                   10m (0%)      4 (16%)     50Mi (0%)        8Gi (12%)      3h35m
  kube-system                 nvidia-device-plugin-daemonset-s8hs8    0 (0%)        0 (0%)      0 (0%)           0 (0%)         3h42m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                260m (1%)   6 (25%)
  memory             420Mi (0%)  11362Mi (17%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
  nvidia.com/gpu     1           1
Events:              <none>
root@truenas[~]# nvidia-smi
Fri Jun 21 14:29:44 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        Off | 00000000:09:00.0 Off |                  N/A |
|  0%   55C    P0              N/A / 170W |      1MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Much appreciated.

It always seems to go from Running > Completed > CrashLoopBackOff after about 3 minutes:

kube-system nvidia-device-plugin-daemonset-kfp7g 1/1 Running 5 (2m55s ago) 8m26s
kube-system nvidia-device-plugin-daemonset-kfp7g 0/1 CrashLoopBackOff 5 (8s ago) 8m28s
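
To find out why the container keeps exiting, the usual first step is to look at the pod's events, plus the logs and exit code of the previous (crashed) container. A minimal sketch, assuming the pod name is copied from the current `get pods` output (it changes whenever the daemonset recreates the pod); `kc` and `diagnose_pod` are hypothetical helper names, not something TrueNAS ships:

```shell
# Wrapper around the cluster CLI; on TrueNAS SCALE with k3s this is "k3s kubectl".
kc() { k3s kubectl "$@"; }

# diagnose_pod NAMESPACE POD  (hypothetical helper)
# Prints the pod's events, the crashed container's logs, and its last exit code.
diagnose_pod() {
    ns="$1"
    pod="$2"
    # The Events section at the bottom of the output often names the failure reason.
    kc describe pod "$pod" -n "$ns"
    # --previous shows logs from the last terminated container, not the fresh one.
    kc logs "$pod" -n "$ns" --previous
    # Exit code of the last terminated container (0 would match a "Completed" status).
    kc get pod "$pod" -n "$ns" \
        -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
    echo
}

# Example call:
#   diagnose_pod kube-system nvidia-device-plugin-daemonset-kfp7g
```

The log output from the `--previous` call is probably the most useful thing to paste into a thread or a bug report.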

I tested the system with a known-good graphics card. The same issue occurs, although the pod recovers from the CrashLoopBackOff error faster.

Should I just do a reinstall of Scale?

I’d try unsetting and resetting the Apps service first - but I spotted something else in the logs here:

/mnt/VMs and Apps

Your pool has a space in the name (two, actually) - that might be throwing a script for a loop somewhere. If you’ve got nothing set up so far, you could try unsetting the Apps service (Apps > Settings > Unset Pool) then renaming the pool using hyphens or underscores instead of spaces, then choose your newly renamed pool (VMs-and-Apps) and try again.
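
A quick way to see which pools are affected is to list the pool names and flag any that contain a space. This is only a sketch: `has_space` and `check_pool_names` are hypothetical helper names, and whether the space is really what trips up the device plugin is still a guess at this point.

```shell
# has_space NAME  (hypothetical helper): succeeds if NAME contains a space.
has_space() {
    case "$1" in
        *" "*) return 0 ;;
        *) return 1 ;;
    esac
}

# check_pool_names  (hypothetical helper)
# Reads one pool name per line on stdin and reports each one.
check_pool_names() {
    while IFS= read -r name; do
        if has_space "$name"; then
            echo "contains a space: $name"
        else
            echo "ok: $name"
        fi
    done
}

# Example call (zpool list -H -o name prints one pool name per line, no header):
#   zpool list -H -o name | check_pool_names
```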

Could you guide me through what I should be doing apart from renaming the pool? I have no idea what you mean by unsetting and resetting the Apps service.

NVM, I’ve been slow to understand things recently… I finally got it.

No problem. The best time to learn and ask questions is before loading critical data onto the system, as it gives you a lot more latitude to clean-slate things and begin anew without any loss.

Totally agree… the system is still sitting on my desk in parts for testing and learning hahaha…

For convenience’s sake, and since there was nothing important on the pool, I simply destroyed the whole pool after unsetting the Apps pool.

Now, after creating the pool again and setting it for Apps, I’m getting the following error. Any insights?

Error: Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/plugins/catalogs_linux/utils.py", line 37, in pull_clone_repository
    clone_repository(repository_uri, destination, branch, depth)
  File "/usr/lib/python3/dist-packages/middlewared/utils/git.py", line 25, in clone_repository
    raise CallError(
middlewared.service_exception.CallError: [EFAULT] Failed to clone 'https://github.com/truenas/charts.git' repository at '/mnt/Apps-and-VMs/ix-applications/catalogs/github_com_truenas_charts_git_master' destination: Cloning into '/mnt/Apps-and-VMs/ix-...

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 469, in run
    await self.future
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 511, in __run_body
    rv = await self.method(*args)
         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 187, in nf
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 47, in nf
    res = await f(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/catalogs_linux/sync_catalogs.py", line 61, in sync
    await self.middleware.call('catalog.update_git_repository', catalog)
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1564, in call
    return await self._call(
           ^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1428, in _call
    return await self.run_in_executor(prepared_call.executor, methodobj, *prepared_call.args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1321, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/catalogs_linux/sync_catalogs.py", line 89, in update_git_repository
    return pull_clone_repository(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/catalogs_linux/utils.py", line 39, in pull_clone_repository
    raise CallError(f'Failed to clone {repository_uri!r} repository at {destination!r} destination: {e}')
middlewared.service_exception.CallError: [EFAULT] Failed to clone 'https://github.com/truenas/charts.git' repository at '/mnt/Apps-and-VMs/ix-applications/catalogs/github_com_truenas_charts_git_master' destination: [EFAULT] Failed to clone 'https://github.com/truenas/charts.git' repository at '/mnt/Apps-and-VMs/ix-applications/catalogs/github_com_truenas_charts_git_master' destination: Cloning into '/mnt/Apps-and-VMs/ix-...

Patience is definitely not my virtue hahaha… I did a full configuration reset. The git repo is now synchronizing. I will monitor the nvidia daemonset now.

I just want to check: multiple restarts of the pod are not normal behavior, right?

root@truenas[~]# k3s kubectl get pods -A
NAMESPACE     NAME                                   READY   STATUS            RESTARTS        AGE
kube-system   snapshot-controller-546868dfb4-g6t54   0/1     TaintToleration   0               19m
kube-system   snapshot-controller-546868dfb4-8bz9s   0/1     TaintToleration   0               19m
kube-system   coredns-59b4f5bbd5-s44qz               0/1     TaintToleration   0               19m
kube-system   csi-nfs-controller-7b74694749-b8m8b    0/4     TaintToleration   0               19m
kube-system   csi-smb-controller-7fbbb8fb6f-tdrkm    0/3     Error             0               19m
kube-system   csi-nfs-node-wzpmn                     3/3     Running           0               15m
kube-system   csi-smb-node-xqj4x                     3/3     Running           0               15m
kube-system   snapshot-controller-546868dfb4-csj8g   1/1     Running           0               15m
kube-system   snapshot-controller-546868dfb4-d4ln5   1/1     Running           0               15m
kube-system   coredns-59b4f5bbd5-qlthd               1/1     Running           0               15m
kube-system   csi-nfs-controller-7b74694749-mssdq    4/4     Running           0               15m
kube-system   csi-smb-controller-7fbbb8fb6f-fjfzv    3/3     Running           0               12m
kube-system   nvidia-device-plugin-daemonset-lv5vg   1/1     Running           6 (3m50s ago)   12m
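
To keep an eye on this without reading the whole table each time, the restart count can be pulled out of the same output. A rough sketch, assuming the column layout shown above (the count is the fifth whitespace-separated field); `restarting_pods` is a hypothetical helper name:

```shell
# restarting_pods  (hypothetical helper)
# Reads `kubectl get pods -A` output on stdin and prints namespace, pod name,
# and restart count for every pod that has restarted at least once.
restarting_pods() {
    # NR > 1 skips the header row. $5 is the RESTARTS count even when it is
    # followed by a "(3m50s ago)" suffix, because the suffix lands in later fields.
    awk 'NR > 1 && $5 + 0 > 0 { print $1, $2, $5 }'
}

# Example calls:
#   k3s kubectl get pods -A | restarting_pods
#   k3s kubectl get pods -n kube-system -w    # -w streams status changes as they happen
```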

I have the exact same problem with the Nvidia pod crashing, never got it fixed.

I see that I’m not the only one experiencing this… I was wondering if it’s an issue with Dragonfish, so I installed Cobia to test my suspicion and found that I was right… the pod doesn’t seem to work properly on 24.04.1.1.

On Cobia the pod has been running for 8 minutes without a single restart.

root@truenas[~]# k3s kubectl get pods -A
NAMESPACE     NAME                                   READY   STATUS    RESTARTS   AGE
kube-system   snapshot-controller-546868dfb4-khhb5   1/1     Running   0          8m41s
kube-system   snapshot-controller-546868dfb4-tzmsx   1/1     Running   0          8m41s
kube-system   coredns-59b4f5bbd5-w7nkw               1/1     Running   0          8m41s
kube-system   nvidia-device-plugin-daemonset-6sc6v   1/1     Running   0          8m41s
kube-system   openebs-zfs-controller-0               5/5     Running   0          8m41s
kube-system   csi-nfs-controller-7b74694749-q4c7h    4/4     Running   0          8m41s
kube-system   csi-smb-node-8rfzv                     3/3     Running   0          8m42s
kube-system   openebs-zfs-node-xt88d                 2/2     Running   0          8m42s
kube-system   csi-nfs-node-pf7tq                     3/3     Running   0          8m42s
kube-system   csi-smb-controller-7fbbb8fb6f-bsnp6    3/3     Running   0          8m41s
ix-plex       plex-6867bd9b54-x24fw                  1/1     Running   0          2m59s

Whereas on Dragonfish 24.04.1.1 the pod has restarted 3 times in just 4 minutes and 6 times in 13 minutes.
These checks were all done on a freshly downloaded copy of the ISO, using 2 different installation media.

root@truenas[~]# k3s kubectl get pods -A
NAMESPACE     NAME                                   READY   STATUS    RESTARTS      AGE
kube-system   snapshot-controller-546868dfb4-rddwl   1/1     Running   0             4m31s
kube-system   snapshot-controller-546868dfb4-x9w22   1/1     Running   0             4m31s
kube-system   coredns-59b4f5bbd5-kz8pq               1/1     Running   0             4m31s
kube-system   csi-smb-node-s2krq                     3/3     Running   0             4m31s
kube-system   csi-smb-controller-7fbbb8fb6f-49cwh    3/3     Running   0             4m31s
kube-system   csi-nfs-node-ggkdt                     3/3     Running   0             4m31s
kube-system   csi-nfs-controller-7b74694749-v6t9s    4/4     Running   0             4m31s
kube-system   nvidia-device-plugin-daemonset-gc54r   1/1     Running   3 (76s ago)   4m30s

root@truenas[~]# k3s kubectl get pods -A
NAMESPACE     NAME                                   READY   STATUS             RESTARTS      AGE
kube-system   snapshot-controller-546868dfb4-rddwl   1/1     Running            0             13m
kube-system   snapshot-controller-546868dfb4-x9w22   1/1     Running            0             13m
kube-system   coredns-59b4f5bbd5-kz8pq               1/1     Running            0             13m
kube-system   csi-smb-node-s2krq                     3/3     Running            0             13m
kube-system   csi-smb-controller-7fbbb8fb6f-49cwh    3/3     Running            0             13m
kube-system   csi-nfs-node-ggkdt                     3/3     Running            0             13m
kube-system   csi-nfs-controller-7b74694749-v6t9s    4/4     Running            0             13m
kube-system   nvidia-device-plugin-daemonset-gc54r   0/1     CrashLoopBackOff   6 (68s ago)   13m

Now, should I wait for 24.10 Electric Eel, or just install Cobia? I’m really afraid of the generational gap between Cobia and Electric Eel when it comes time for me to upgrade… And would I be missing out on the 2x larger ARC?

I would submit a ticket to iX stating your findings. As I was in a position where I could not roll back, I couldn’t test Cobia again myself.

That’s a great idea! I will send my findings to them now. Hopefully I get some sort of resolution to this.