Hi. I’ve just built a NAS with Dragonfish-24.04.1.1 and most things work. The bit I’m having trouble with is my GPU (NVIDIA P2000) - it is not available as a resource to pick for apps like Plex.
It sounds awfully like this post - I have no VMs, just the TrueNAS official Plex app installed.
I ran through the commands in the above post. Interestingly, I can’t run k3s kubectl show pods -n kube-system even as sudo - I get told that show isn’t an available command. However, the hardware definitely shows up and is available for isolation if I wanted that.
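(Presumably the working equivalent is the get form used elsewhere in this thread, since kubectl has no show subcommand:

sudo k3s kubectl get pods -n kube-system
)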
The k3s thing - I personally do Docker and use Amazon ECS for enterprise workloads; I don’t do K8s as a rule (far too overcomplicated, in my personal view).
Anyway - what am I missing here? Or is this a known bug at present?
Have you checked the output of nvidia-smi from the TrueNAS shell? If it says the card could not communicate with the driver, then your card is not supported by the NVIDIA driver SCALE uses.
Other than that, you could check with k3s kubectl get pods -A whether the nvidia system pod is stuck in a crash loop.
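Something along these lines covers both checks (the grep is only there to cut down the output):

nvidia-smi
sudo k3s kubectl get pods -A | grep -i nvidia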
Clearly a container handling the nvidia driver (?) is repeatedly crashing, as others are reporting.
Log file:
truenas% sudo cat 43.log
2024-07-04T14:10:24.134548872+01:00 stderr F 2024/07/04 13:10:24 Starting FS watcher.
2024-07-04T14:10:24.134588677+01:00 stderr F 2024/07/04 13:10:24 Starting OS watcher.
2024-07-04T14:10:24.134790434+01:00 stderr F 2024/07/04 13:10:24 Starting Plugins.
2024-07-04T14:10:24.134800283+01:00 stderr F 2024/07/04 13:10:24 Loading configuration.
2024-07-04T14:10:24.135017028+01:00 stderr F 2024/07/04 13:10:24 Updating config with default resource matching patterns.
2024-07-04T14:10:24.135037987+01:00 stderr F 2024/07/04 13:10:24
2024-07-04T14:10:24.135056232+01:00 stderr F Running with config:
2024-07-04T14:10:24.135066391+01:00 stderr F {
2024-07-04T14:10:24.135076219+01:00 stderr F "version": "v1",
2024-07-04T14:10:24.135087029+01:00 stderr F "flags": {
2024-07-04T14:10:24.135096196+01:00 stderr F "migStrategy": "none",
2024-07-04T14:10:24.135105043+01:00 stderr F "failOnInitError": true,
2024-07-04T14:10:24.135113879+01:00 stderr F "nvidiaDriverRoot": "/",
2024-07-04T14:10:24.135123808+01:00 stderr F "gdsEnabled": false,
2024-07-04T14:10:24.135138876+01:00 stderr F "mofedEnabled": false,
2024-07-04T14:10:24.135149125+01:00 stderr F "plugin": {
2024-07-04T14:10:24.135158112+01:00 stderr F "passDeviceSpecs": false,
2024-07-04T14:10:24.135167009+01:00 stderr F "deviceListStrategy": "envvar",
2024-07-04T14:10:24.135175845+01:00 stderr F "deviceIDStrategy": "uuid"
2024-07-04T14:10:24.135184622+01:00 stderr F }
2024-07-04T14:10:24.135193508+01:00 stderr F },
2024-07-04T14:10:24.135202285+01:00 stderr F "resources": {
2024-07-04T14:10:24.135217884+01:00 stderr F "gpus": [
2024-07-04T14:10:24.135227141+01:00 stderr F {
2024-07-04T14:10:24.135235968+01:00 stderr F "pattern": "*",
2024-07-04T14:10:24.135244774+01:00 stderr F "name": "nvidia.com/gpu"
2024-07-04T14:10:24.135253881+01:00 stderr F }
2024-07-04T14:10:24.135262708+01:00 stderr F ]
2024-07-04T14:10:24.135271574+01:00 stderr F },
2024-07-04T14:10:24.135280361+01:00 stderr F "sharing": {
2024-07-04T14:10:24.13529585+01:00 stderr F "timeSlicing": {
2024-07-04T14:10:24.135304877+01:00 stderr F "resources": [
2024-07-04T14:10:24.135313713+01:00 stderr F {
2024-07-04T14:10:24.13532255+01:00 stderr F "name": "nvidia.com/gpu",
2024-07-04T14:10:24.135331366+01:00 stderr F "devices": "all",
2024-07-04T14:10:24.135340163+01:00 stderr F "replicas": 5
2024-07-04T14:10:24.135348959+01:00 stderr F }
2024-07-04T14:10:24.135357746+01:00 stderr F ]
2024-07-04T14:10:24.135374357+01:00 stderr F }
2024-07-04T14:10:24.135383364+01:00 stderr F }
2024-07-04T14:10:24.1353922+01:00 stderr F }
2024-07-04T14:10:24.135401107+01:00 stderr F 2024/07/04 13:10:24 Retreiving plugins.
2024-07-04T14:10:24.135415985+01:00 stderr F 2024/07/04 13:10:24 Detected NVML platform: found NVML library
2024-07-04T14:10:24.135431534+01:00 stderr F 2024/07/04 13:10:24 Detected non-Tegra platform: /sys/devices/soc0/family file not found
2024-07-04T14:10:24.672101541+01:00 stderr F 2024/07/04 13:10:24 Starting GRPC server for 'nvidia.com/gpu'
2024-07-04T14:10:24.672216446+01:00 stderr F 2024/07/04 13:10:24 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2024-07-04T14:10:24.673629289+01:00 stderr F 2024/07/04 13:10:24 Registered device plugin for 'nvidia.com/gpu' with Kubelet
2024-07-04T14:11:40.828161212+01:00 stderr F 2024/07/04 13:11:40 Received signal "terminated", shutting down.
2024-07-04T14:11:40.82820824+01:00 stderr F 2024/07/04 13:11:40 Stopping plugins.
2024-07-04T14:11:40.828235941+01:00 stderr F 2024/07/04 13:11:40 Stopping to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
When I was still using the built-in apps system I also had the nvidia pod stuck in a crash loop. If I remember correctly, my fix was to disable GPU support for k3s in advanced options, reboot, wait a couple of minutes, re-enable GPU support for k3s, check with k3s kubectl get pods -A whether the pod started correctly, and if so assign the GPU to my apps.
What’s the open/active Jira ticket for this? I see a few that seem to indicate it may be an upstream Kubernetes issue, but I’m not sure which one is being referenced or tracked here.
@Airekris tagging you here as I recall you from the other thread - if you have one as well please let me know.
where you have to replace the XXXXX with your own unique ID obtained from the sudo k3s kubectl get pods -A command (so @James_Green you would use k58cb there)
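i.e. something along the lines of:

sudo k3s kubectl logs -p pod/nvidia-device-plugin-daemonset-XXXXX --namespace=kube-system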
Can you please attach them to the ticket? System debugs (System Settings → Advanced → Save Debug) can also be uploaded through the Private File Upload link in the comments of the ticket.
FWIW I’ve added a ticket, but I can equally see others having filed what appear to be matching tickets. I’d love to figure out now whether this is a TrueNAS issue, or an upstream matter.
@HoneyBadger I can, it appears to be the same as the log I already posted:
truenas% sudo k3s kubectl logs -p pod/nvidia-device-plugin-daemonset-k58cb --namespace=kube-system
2024/07/04 20:22:39 Starting FS watcher.
2024/07/04 20:22:39 Starting OS watcher.
2024/07/04 20:22:39 Starting Plugins.
2024/07/04 20:22:39 Loading configuration.
2024/07/04 20:22:39 Updating config with default resource matching patterns.
2024/07/04 20:22:39
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": true,
"nvidiaDriverRoot": "/",
"gdsEnabled": false,
"mofedEnabled": false,
"plugin": {
"passDeviceSpecs": false,
"deviceListStrategy": "envvar",
"deviceIDStrategy": "uuid"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {
"resources": [
{
"name": "nvidia.com/gpu",
"devices": "all",
"replicas": 5
}
]
}
}
}
2024/07/04 20:22:39 Retreiving plugins.
2024/07/04 20:22:39 Detected NVML platform: found NVML library
2024/07/04 20:22:39 Detected non-Tegra platform: /sys/devices/soc0/family file not found
2024/07/04 20:22:39 Starting GRPC server for 'nvidia.com/gpu'
2024/07/04 20:22:39 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2024/07/04 20:22:39 Registered device plugin for 'nvidia.com/gpu' with Kubelet
2024/07/04 20:23:39 Received signal "terminated", shutting down.
2024/07/04 20:23:39 Stopping plugins.
2024/07/04 20:23:39 Stopping to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
truenas%
My system restarted (seemingly by itself) this morning, which gave me an opportunity to check the logs from startup. I’m unclear whether these all relate to the problem:
Jul 05 10:22:34 truenas k3s[5268]: I0705 10:22:34.395214 5268 server.go:144] "Got registration request from device plugin with resource" resourceName="nvidia.com/gpu"
Jul 05 10:22:34 truenas k3s[5268]: E0705 10:22:34.410783 5268 client.go:88] "ListAndWatch ended unexpectedly for device plugin" err="rpc error: code = Unavailable desc = error reading from server: EOF" resource="nvidia.com/gpu"
Jul 05 10:22:35 truenas k3s[5268]: I0705 10:22:35.874063 5268 kubelet_volumes.go:160] "Cleaned up orphaned pod volumes dir" podUID=17910588-9606-49ba-8d91-0aa91afd0022 path="/var/lib/kubelet/pods/17910588-9606-49ba-8d91-0aa91afd0022/volumes"
Jul 05 10:22:38 truenas k3s[5268]: I0705 10:22:38.180621 5268 pod_container_deletor.go:80] "Container not found in pod's containers" containerID="7f3cdecd3c4fafcfd2193365411d76cc2f47fd828b07c0805668693d1a4c7a6a"
Jul 05 10:22:41 truenas k3s[5268]: I0705 10:22:41.906378 5268 server.go:144] "Got registration request from device plugin with resource" resourceName="nvidia.com/gpu"
Jul 05 10:22:42 truenas k3s[5268]: E0705 10:22:42.021335 5268 remote_runtime.go:479] "ExecSync cmd from runtime service failed" err="rpc error: code = NotFound desc = failed to exec in container: failed to load task: no running task found: not found" containerID="264>
Jul 05 10:22:42 truenas k3s[5268]: E0705 10:22:42.022290 5268 remote_runtime.go:479] "ExecSync cmd from runtime service failed" err="rpc error: code = NotFound desc = failed to exec in container: failed to load task: no running task found: task 26477967f237bf60eefbcd>
Jul 05 10:22:42 truenas k3s[5268]: E0705 10:22:42.022834 5268 remote_runtime.go:479] "ExecSync cmd from runtime service failed" err="rpc error: code = NotFound desc = failed to exec in container: failed to load task: no running task found: task 26477967f237bf60eefbcd>
Jul 05 10:22:42 truenas k3s[5268]: I0705 10:22:42.193659 5268 scope.go:115] "RemoveContainer" containerID="b0ab4a72cc61abcc041f826c960294764045801b1966b43755728448cd3d7cf9"
Jul 05 10:22:42 truenas k3s[5268]: I0705 10:22:42.193927 5268 scope.go:115] "RemoveContainer" containerID="26477967f237bf60eefbcd24a70ded3c78aa334464dcd4e2495cf66b1c701e84"
Jul 05 10:22:42 truenas k3s[5268]: E0705 10:22:42.194366 5268 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"tailscale\" with CrashLoopBackOff: \"back-off 10s restarting failed container=tailscale pod=tailscale-78fcb6bc64-pl>
Jul 05 10:22:45 truenas k3s[5268]: I0705 10:22:45.524474 5268 scope.go:115] "RemoveContainer" containerID="26477967f237bf60eefbcd24a70ded3c78aa334464dcd4e2495cf66b1c701e84"
Jul 05 10:22:45 truenas k3s[5268]: E0705 10:22:45.524961 5268 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"tailscale\" with CrashLoopBackOff: \"back-off 10s restarting failed container=tailscale pod=tailscale-78fcb6bc64-pl>
Jul 05 10:22:46 truenas k3s[5268]: I0705 10:22:46.988786 5268 scope.go:115] "RemoveContainer" containerID="26477967f237bf60eefbcd24a70ded3c78aa334464dcd4e2495cf66b1c701e84"
Jul 05 10:22:46 truenas k3s[5268]: E0705 10:22:46.989225 5268 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"tailscale\" with CrashLoopBackOff: \"back-off 10s restarting failed container=tailscale pod=tailscale-78fcb6bc64-pl>
Jul 05 10:22:55 truenas k3s[5268]: {"level":"warn","ts":"2024-07-05T10:22:55.063+0100","logger":"etcd-client","caller":"v3@v3.5.7-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000a20c40/kine.sock","attempt":0,"error>
Jul 05 10:22:57 truenas k3s[5268]: I0705 10:22:57.868834 5268 scope.go:115] "RemoveContainer" containerID="26477967f237bf60eefbcd24a70ded3c78aa334464dcd4e2495cf66b1c701e84"
Jul 05 10:23:18 truenas k3s[5268]: I0705 10:23:18.779081 5268 scope.go:115] "RemoveContainer" containerID="b8c1ea6b3918c6c9c599abdd9a422009c277731fe2a46893e7b033e244c77bdd"
Jul 05 10:23:19 truenas k3s[5268]: I0705 10:23:19.270719 5268 pod_container_deletor.go:80] "Container not found in pod's containers" containerID="b8c1ea6b3918c6c9c599abdd9a422009c277731fe2a46893e7b033e244c77bdd"
Jul 05 10:23:19 truenas k3s[5268]: E0705 10:23:19.278882 5268 client.go:88] "ListAndWatch ended unexpectedly for device plugin" err="rpc error: code = Unavailable desc = error reading from server: EOF" resource="nvidia.com/gpu"
Jul 05 10:23:23 truenas k3s[5268]: I0705 10:23:23.286862 5268 scope.go:115] "RemoveContainer" containerID="906da8bc2ca0516526e33a15c83a0b1332ffd183586c793200a30e1b19197dd6"
Jul 05 10:23:28 truenas k3s[5268]: I0705 10:23:28.299271 5268 pod_container_deletor.go:80] "Container not found in pod's containers" containerID="304c48b1636421a8b473cdd18b01763b868e0ee77c749931706fe3e941a1acad"
Jul 05 10:23:29 truenas k3s[5268]: E0705 10:23:29.217156 5268 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"nvidia-device-plugin-ctr\" with CrashLoopBackOff: \"back-off 10s restarting failed container=nvidia-device-plugin-c>
Jul 05 10:23:29 truenas k3s[5268]: I0705 10:23:29.302552 5268 scope.go:115] "RemoveContainer" containerID="0c6adab672b507b23b7c65a7a752a86f19c860a9123ed0f91e5036fbf0db1953"
Jul 05 10:23:29 truenas k3s[5268]: E0705 10:23:29.302801 5268 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"nvidia-device-plugin-ctr\" with CrashLoopBackOff: \"back-off 10s restarting failed container=nvidia-device-plugin-c>
Jul 05 10:23:30 truenas k3s[5268]: I0705 10:23:30.304843 5268 scope.go:115] "RemoveContainer" containerID="0c6adab672b507b23b7c65a7a752a86f19c860a9123ed0f91e5036fbf0db1953"
Jul 05 10:23:33 truenas k3s[5268]: I0705 10:23:33.149925 5268 server.go:144] "Got registration request from device plugin with resource" resourceName="nvidia.com/gpu"
Jul 05 10:23:35 truenas k3s[5268]: {"level":"warn","ts":"2024-07-05T10:23:35.656+0100","logger":"etcd-client","caller":"v3@v3.5.7-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000a20c40/kine.sock","attempt":0,"error>
Jul 05 10:24:18 truenas k3s[5268]: {"level":"warn","ts":"2024-07-05T10:24:18.613+0100","logger":"etcd-client","caller":"v3@v3.5.7-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000a20c40/kine.sock","attempt":0,"error>
Jul 05 10:24:40 truenas k3s[5268]: E0705 10:24:40.880043 5268 client.go:88] "ListAndWatch ended unexpectedly for device plugin" err="rpc error: code = Unavailable desc = error reading from server: EOF" resource="nvidia.com/gpu"
Jul 05 10:24:45 truenas k3s[5268]: I0705 10:24:45.465778 5268 scope.go:115] "RemoveContainer" containerID="0c6adab672b507b23b7c65a7a752a86f19c860a9123ed0f91e5036fbf0db1953"
jgreen@truenas ~ % sudo k3s kubectl get events --all-namespaces
[sudo] password for jgreen:
NAMESPACE LAST SEEN TYPE REASON OBJECT MESSAGE
kube-system 58m Normal SandboxChanged pod/nvidia-device-plugin-daemonset-k58cb Pod sandbox changed, it will be killed and re-created.
kube-system 58m Normal AddedInterface pod/nvidia-device-plugin-daemonset-k58cb Add eth0 [172.16.2.34/16] from ix-net
kube-system 52m Normal AddedInterface pod/nvidia-device-plugin-daemonset-k58cb Add eth0 [172.16.2.35/16] from ix-net
kube-system 45m Normal AddedInterface pod/nvidia-device-plugin-daemonset-k58cb Add eth0 [172.16.2.36/16] from ix-net
kube-system 39m Normal AddedInterface pod/nvidia-device-plugin-daemonset-k58cb Add eth0 [172.16.2.37/16] from ix-net
kube-system 34m Normal AddedInterface pod/nvidia-device-plugin-daemonset-k58cb Add eth0 [172.16.2.38/16] from ix-net
kube-system 27m Normal AddedInterface pod/nvidia-device-plugin-daemonset-k58cb Add eth0 [172.16.2.39/16] from ix-net
kube-system 21m Normal AddedInterface pod/nvidia-device-plugin-daemonset-k58cb Add eth0 [172.16.2.40/16] from ix-net
kube-system 16m Normal AddedInterface pod/nvidia-device-plugin-daemonset-k58cb Add eth0 [172.16.2.41/16] from ix-net
kube-system 9m54s Normal AddedInterface pod/nvidia-device-plugin-daemonset-k58cb Add eth0 [172.16.2.42/16] from ix-net
kube-system 8m38s Warning BackOff pod/nvidia-device-plugin-daemonset-k58cb Back-off restarting failed container nvidia-device-plugin-ctr in pod nvidia-device-plugin-daemonset-k58cb_kube-system(cbd2f844-0058-43c1-b019-126ae1906411)
kube-system 3m38s Normal Killing pod/nvidia-device-plugin-daemonset-k58cb Stopping container nvidia-device-plugin-ctr
kube-system 3m26s Normal AddedInterface pod/nvidia-device-plugin-daemonset-k58cb Add eth0 [172.16.2.43/16] from ix-net
@LarsR you said you got this working. I’ve been into Apps > Settings and disabled the GPU, saved, and watched the pod (?) disappear. I then went back in and switched it back on. The pod started and appeared fine - I even had my NVIDIA GPUs available to allocate to the app. Then the pod died again. Did yours stay running for more than a few minutes?
A quick follow-up. I followed the advice from @LarsR and disabled GPU, then rebooted, then re-enabled and rebooted.
The container lasted a few minutes and the GPUs were selectable for Plex. Then the container died and that was that again. I filed a bug (frankly more of a support request) here against the NVIDIA driver (I doubt it’s their fault), so if anyone with more expertise would like to help make sense of this I’d be grateful.
This is not going to be investigated further by the engineers at iXsystems. It appears the problem may lie within the K8s subsystem, but that’s a complex area and one that is being removed in the next major release of TrueNAS SCALE.
I have installed Jailmaker and used it to build a container named docker. I basically followed the procedure in the YouTube video posted above, except I used Plex rather than Jellyfin. Plex can see my NVIDIA P2000!
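For reference, the Jailmaker part is only a handful of commands. A sketch - jlmkr.py is the script from the Jailmaker repo, and the exact subcommands may differ between versions, so check ./jlmkr.py --help first:

./jlmkr.py create docker
./jlmkr.py start docker
./jlmkr.py shell docker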
Use the shell to access TrueNAS SCALE and confirm you have a /dev/dri/renderD128 (that’s mine; yours may be different). That’s your graphics card being recognised by the OS.
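For example:

ls -l /dev/dri
ls -l /dev/nvidia*

The first should show renderD128 (or similar); the second lists the device nodes the NVIDIA driver creates.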
Ensure with Jailmaker that /dev/dri is mounted into your jail.
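A sketch of what that can look like in the jail’s config - the key names here assume a recent Jailmaker and may differ in your version, so check the config file Jailmaker generates:

gpu_passthrough_nvidia=1
systemd_nspawn_user_args=--bind=/dev/dri

--bind= is a standard systemd-nspawn option, so if your Jailmaker version lacks the gpu_passthrough key you can still pass the device through with extra nspawn args.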
Finally, with fingers crossed, select your server in Plex and visit the Transcoder page. Hopefully your device will be listed.
The above is clearly something that can be followed by those comfortable with shell access and editors like vim and nano. Hopefully with the next major TrueNAS SCALE release the apps infrastructure will make this much simpler for folk who need purely web-interface controls.
Another follow-up. It turns out the instructions work for Jellyfin but are not sufficient for Plex: Plex will show your hardware but not use it. Switch on debug logging in Plex and you’ll find you’re missing the driver.
The solution is to install the nvidia-container-runtime for Docker to use when launching containers. See this guide.
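The gist, inside the jail, is roughly this - a sketch, assuming a Debian/Ubuntu userland with Docker already installed and NVIDIA’s apt repository configured as per the linked guide (the runtime now ships as part of the nvidia-container-toolkit package):

sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

As a sanity check you can run a throwaway container with GPU access, e.g. docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi (adjust the image tag to whatever is current), and then make sure the Plex container itself is launched with GPU access (--gpus all or the compose equivalent).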
I finally see the magical “(hw)” when transcoding a stream in Plex and my CPU is no longer used!
How are you running nvidia-smi? Did you install it somehow? Or is it not available in Electric Eel?
I just did a fresh install and wanted to test with the latest App infrastructure, so I updated to Electric Eel. Should I downgrade and then my NVIDIA RTX 4060 will potentially work?