Hi everyone!
I need some help with my GPU on TrueNAS, as it's my first time trying to set it up.
For some reason unknown to me, my nvidia-device-plugin-daemonset-s8hs8 pod keeps crashing. I am totally perplexed, and troubleshooting this is beyond my knowledge. Hopefully someone here can enlighten me and point me down the right path.
My system configuration at the moment is:
CPU: AMD Ryzen 9 3900X
Memory: 64GB
GPU: EVGA RTX 3060 12GB
Pool size: 1 x Mirrored 1TB SSD (For now)
root@truenas[~]# k3s kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system csi-nfs-controller-7b74694749-qz4n4 0/4 TaintToleration 0 5h16m
ix-plex plex-5456778fb6-84dtm 0/1 UnexpectedAdmissionError 0 3h54m
kube-system csi-nfs-node-qw9d9 3/3 Running 3 (3h44m ago) 5h16m
kube-system csi-smb-node-h54dr 3/3 Running 3 (3h44m ago) 5h16m
kube-system snapshot-controller-546868dfb4-hcqsv 0/1 TaintToleration 0 5h16m
kube-system csi-smb-controller-7fbbb8fb6f-ktfv7 0/3 TaintToleration 0 5h16m
kube-system snapshot-controller-546868dfb4-xjgz6 0/1 TaintToleration 0 5h16m
kube-system csi-nfs-controller-7b74694749-kd2kr 4/4 Running 0 3h42m
kube-system coredns-59b4f5bbd5-nxbj4 1/1 Running 0 3h42m
kube-system snapshot-controller-546868dfb4-cpj7f 1/1 Running 0 3h42m
kube-system csi-smb-controller-7fbbb8fb6f-4k986 3/3 Running 0 3h42m
kube-system snapshot-controller-546868dfb4-jstjs 1/1 Running 0 3h42m
ix-plex plex-5984d9cb8b-vrbqf 1/1 Running 0 3h36m
kube-system nvidia-device-plugin-daemonset-s8hs8 0/1 CrashLoopBackOff 39 (13s ago) 3h42m
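To make the listing above easier to scan, here is a minimal sketch of a filter that keeps only the pods that are not healthy. The function name `unhealthy_pods` is hypothetical (it is not part of k3s or TrueNAS), and it assumes the default column order shown above, with STATUS as the fourth column.

```shell
# Hypothetical helper: reads `kubectl get pods -A` output on stdin and prints
# namespace, pod name, and status for every pod whose STATUS column is
# neither Running nor Completed. The header row (line 1) is skipped.
unhealthy_pods() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1, $2, $4 }'
}
```

On the host this would be fed with `k3s kubectl get pods -A | unhealthy_pods`. For the crashing pod itself, the output of the last crashed container should be readable with `k3s kubectl logs -n kube-system nvidia-device-plugin-daemonset-s8hs8 --previous`, which is probably the most useful thing to post next.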
root@truenas[~]# k3s kubectl describe nodes ix-truenas
Name: ix-truenas
Roles: control-plane,master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=ix-truenas
kubernetes.io/os=linux
node-role.kubernetes.io/control-plane=true
node-role.kubernetes.io/master=true
Annotations: csi.volume.kubernetes.io/nodeid: {"nfs.csi.k8s.io":"ix-truenas","smb.csi.k8s.io":"ix-truenas"}
k3s.io/node-args:
["server","--cluster-cidr","172.16.0.0/16","--cluster-dns","172.17.0.10","--data-dir","/mnt/VMs and Apps/ix-applications/k3s","--disable",...
k3s.io/node-config-hash: FKSKZLLDDIEFXERCQSDV7UX226TSV4KMNQ5JAHCDEFWSGCFGUNLQ====
k3s.io/node-env:
{"K3S_DATA_DIR":"/mnt/VMs and Apps/ix-applications/k3s/data/203b9c5ec6ef066e14ed69ff770f7ac5023555505d8fc914c3e028bd9ce8b112"}
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Fri, 21 Jun 2024 09:11:45 -0700
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: ix-truenas
AcquireTime: <unset>
RenewTime: Fri, 21 Jun 2024 14:27:31 -0700
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Fri, 21 Jun 2024 14:26:45 -0700 Fri, 21 Jun 2024 09:11:44 -0700 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 21 Jun 2024 14:26:45 -0700 Fri, 21 Jun 2024 09:11:44 -0700 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Fri, 21 Jun 2024 14:26:45 -0700 Fri, 21 Jun 2024 09:11:44 -0700 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Fri, 21 Jun 2024 14:26:45 -0700 Fri, 21 Jun 2024 10:45:24 -0700 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 192.168.1.96
Hostname: ix-truenas
Capacity:
cpu: 24
ephemeral-storage: 919508Mi
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 65755996Ki
nvidia.com/gpu: 5
pods: 250
Allocatable:
cpu: 24
ephemeral-storage: 915965318860
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 65755996Ki
nvidia.com/gpu: 5
pods: 250
System Info:
Machine ID: b39acfd315b340329cd5428a9015dd99
System UUID: 42dbae95-d904-5ef5-c7f7-04d9f55ec7f6
Boot ID: 5006d688-0515-492d-92ea-9e3277947553
Kernel Version: 6.6.29-production+truenas
OS Image: Debian GNU/Linux 12 (bookworm)
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://Unknown
Kubelet Version: v1.26.6+k3s-6a894050-dirty
Kube-Proxy Version: v1.26.6+k3s-6a894050-dirty
PodCIDR: 172.16.0.0/16
PodCIDRs: 172.16.0.0/16
Non-terminated Pods: (9 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system csi-nfs-node-qw9d9 30m (0%) 0 (0%) 60Mi (0%) 500Mi (0%) 5h15m
kube-system csi-smb-node-h54dr 30m (0%) 0 (0%) 60Mi (0%) 400Mi (0%) 5h15m
kube-system csi-nfs-controller-7b74694749-kd2kr 40m (0%) 0 (0%) 80Mi (0%) 900Mi (1%) 3h42m
kube-system coredns-59b4f5bbd5-nxbj4 100m (0%) 0 (0%) 70Mi (0%) 170Mi (0%) 3h42m
kube-system snapshot-controller-546868dfb4-cpj7f 10m (0%) 0 (0%) 20Mi (0%) 300Mi (0%) 3h42m
kube-system csi-smb-controller-7fbbb8fb6f-4k986 30m (0%) 2 (8%) 60Mi (0%) 600Mi (0%) 3h42m
kube-system snapshot-controller-546868dfb4-jstjs 10m (0%) 0 (0%) 20Mi (0%) 300Mi (0%) 3h42m
ix-plex plex-5984d9cb8b-vrbqf 10m (0%) 4 (16%) 50Mi (0%) 8Gi (12%) 3h35m
kube-system nvidia-device-plugin-daemonset-s8hs8 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3h42m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 260m (1%) 6 (25%)
memory 420Mi (0%) 11362Mi (17%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
nvidia.com/gpu 1 1
Events: <none>
root@truenas[~]# nvidia-smi
Fri Jun 21 14:29:44 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 Off | 00000000:09:00.0 Off | N/A |
| 0% 55C P0 N/A / 170W | 1MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Much appreciated.