Need help or YAML script to get Nvidia Tesla P4 GPU working with Docker container running Ollama

I’m trying to pass through Nvidia Tesla P4 GPU resources to an Ollama Docker container on TrueNAS EE, since I only have one card and two cards would be required to isolate one and pass it through to a Linux VM running Ollama.

I’m using the compose script below, with the intention of leveraging the GPU:

However, when I test GPU utilization with sudo watch -n 0.5 nvidia-smi while using Ollama, the number comes up 0%, and token throughput is poor:

If folks have gotten this working with a YAML file and an Nvidia Tesla P4, sharing it would be much appreciated. I’m even open to loading drivers if that helps, but there appear to be TrueNAS-imposed restrictions that I have run into.

Thanks!
-Rodney

I’ve never tried using a GPU via YAML (I run a Tesla M40), but looking at the docs I noticed you’re missing the GPU UUID. I’m not sure how important it is, but I assume it matters. An example can be found at “Installing Custom Applications | TrueNAS Documentation Hub”; just head to the Nvidia GPU example. Also check your version: before 24.10.1 there was an Nvidia bug which caused issues while using apps. The forum thread is at “NVIDIA GPU Not Being Used By Apps - ElectricEel”. The problem in that one was the UUID, which is why I think it might matter :slight_smile:
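If it helps, one quick way to grab the GPU UUID is nvidia-smi itself, run from the TrueNAS shell (a generic command, with a placeholder UUID in the sample output):
"""
# List every NVIDIA GPU the driver can see, along with its UUID
nvidia-smi -L
# Example output (UUID is a placeholder):
# GPU 0: Tesla P4 (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
"""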


Thank you Thenish17!

I’ll give it a shot. I remember seeing related posts, but not in the detail you have shared. So are you using your GPU for transcoding, or something else? Is this working well for you?

Thanks again!
-Rodney

Thenish17,

So, given your guidance, I was intentional about putting the installed GPU UUID in the YAML:
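For anyone following along, pinning a specific GPU by UUID in a compose file looks roughly like the sketch below (standard Docker Compose device_ids syntax; the UUID is just a placeholder, not my actual card):
"""
services:
  ollama:
    image: ollama/ollama:latest
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities:
                - gpu
              # device_ids replaces "count" and takes the UUID from nvidia-smi -L
              device_ids:
                - 'GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'
"""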

Things still didn’t work: all 12 system cores pegged at over 100% utilization and the temperature spiked to over 80 degrees C.

I’ll note that the use case I’m running is a little different from what a lot of folks do. Typically I see users wanting GPU acceleration for video transcoding, and that use case requires a different set of installed drivers, as I’ve learned recently.

And Plex doesn’t seem to support the use of Nvidia GPU cards, so I’ve installed Jellyfin instead for serving up movies. But it’s also worth noting that getting Jellyfin to transcode with an Nvidia GPU from a container doesn’t mean that acceleration in an Ollama LLM container will work with the same GPU.

It turns out CUDA drivers (the toolkit) are not required to transcode, while Ollama LLM acceleration is a CUDA workload that requires the CUDA toolkit to be installed.

While it’s possible to control the GPU card via the installed driver and get transcoding to work, it appears the current installation of TrueNAS EE does not include the Nvidia CUDA toolkit, which is essential to running CUDA workloads like Ollama LLMs. This can be seen with the following command: nvcc --version. Base-level driver control can also be demonstrated by changing persistence mode with: sudo nvidia-smi -pm 1.

The first command shows whether the CUDA toolkit is installed, while the second enables or disables persistence mode (the driver’s idle behavior).
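The two quick checks, for reference (generic commands, nothing TrueNAS-specific):
"""
# Shows the CUDA toolkit version if nvcc is installed;
# on my TrueNAS EE box this comes back "command not found"
nvcc --version

# Driver-level control still works, e.g. turning persistence mode on
sudo nvidia-smi -pm 1
"""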

It seems the way to fix this would be the following command; however, apt is not supported on TrueNAS Scale:

sudo apt-get install nvidia-cuda-toolkit

I guess the big question is: can users lobby for these drivers to be added to a build, or are they already there and a defect should be logged?

Thanks for your help!
-Rodney

Hey, apologies for the late reply. I got my M40 working via a YAML deploy:
"""
services:
  ollama:
    container_name: ollama-local
    deploy:
      resources:
        reservations:
          devices:
            - capabilities:
                - gpu
              count: 1
              driver: nvidia
    environment:
      NVIDIA_DRIVER_CAPABILITIES: compute,utility
      NVIDIA_VISIBLE_DEVICES: all
    image: ollama/ollama:latest
    ports:
      - '11434:11434'
    restart: unless-stopped
    runtime: nvidia
"""
Check the container log; you should see your GPU populate.
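A quick way to check is something like this (container name taken from the compose above; the grep just filters the log for GPU-related lines):
"""
# Filter the Ollama container log for GPU/CUDA detection lines
docker logs ollama-local 2>&1 | grep -iE 'cuda|gpu'
"""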


It should look like that. As for Jellyfin, it should work as well, although the logs don’t show it’s detected. The Tesla M40 I have doesn’t support AV1, so it locks up on that, but everything else works.

I use my M40 for whatever I can: ComfyUI, Jellyfin, Ollama, Open WebUI, Immich. It just depends on what I can use it for.


No worries, and thank you kindly! I’m curious: when you run llama2 and prompt with, say, “why is the sky blue?”, do CPU utilization and core temperatures all spike? And if they don’t, does the workload appear on the GPU for the prompt, as seen by sudo watch -n 0.5 nvidia-smi going from zero to some percentage…

While I’m able to get the script to run, I don’t see a throughput improvement when I ask llama2 the same prompt. I’m wondering what your experience has been.
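For reference, my test boils down to something like this (container name assumed to match your compose), with nvidia-smi watched in a second shell:
"""
# In one shell: run a prompt against llama2 inside the container
docker exec -it ollama-local ollama run llama2 "why is the sky blue?"

# In another shell: watch whether GPU utilization moves off 0%
sudo watch -n 0.5 nvidia-smi
"""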

Thanks again!
-Rodney

In nvidia-smi, if you’re running Ollama or any service, it should look like this:


Otherwise your GPU is not being utilized.
Can you inspect your Ollama container? Type "docker inspect container_name"
and check if the Nvidia device is visible to the container. It should look like:
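Something like this pulls out just the GPU request portion of the inspect output (adjust the container name to yours):
"""
# Show only the GPU device requests the container was started with
docker inspect ollama-local --format '{{json .HostConfig.DeviceRequests}}'
"""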

If it doesn’t look like that, can you nano the compose file of the Ollama deployment?
It’s going to be found in the directory listed in your inspect results, and it should look like this:

The command is going to look like "nano /mnt/.ix-apps/app_configs/ollama/versions/1.0.23/templates/rendered/docker-compose.yaml", and your results are going to look like:

If you don’t have the GPU listed, that’s why it’s not working. Since I’m slow at replying: cd into "/proc/driver/nvidia/gpus", then into the GPU directory (it’s going to be named something like "0000:01:00.0"), and type "cat information". The results should look like:
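End to end, the commands are roughly (example PCI address; yours will differ):
"""
# Each GPU the driver knows about gets a directory named after its PCI address
cd /proc/driver/nvidia/gpus
ls                              # e.g. 0000:01:00.0
cat 0000:01:00.0/information    # the "GPU UUID" line is the value you need
"""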

Copy the GPU UUID and place it in the correct section, along with your Ollama app name:

"""
midclt call -job app.update YOUR_APP_NAME '{"values": {"resources": {"gpus": {"use_all_gpus": false, "nvidia_gpu_selection": {"0000:30:00.0": {"use_gpu": true, "uuid": "YOUR-GPU-UUID"}}}}}}'
"""

The finished command should look like:
"""
midclt call -job app.update jellyfin '{"values": {"resources": {"gpus": {"use_all_gpus": false, "nvidia_gpu_selection": {"0000:10:00.0": {"use_gpu": true, "uuid": "GPU-008d5ebe-9cb2-5984-7ccf-b64b43616e26"}}}}}}'
"""

And that should work. Let me know if it works :slight_smile:


Hey, I was stalking your page. Make sure you don’t have the GPU isolated: you cannot isolate a GPU for a VM and also use it for applications. It’s one or the other, to my knowledge.
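One way to double-check from the TrueNAS shell (the isolated_gpu_pci_ids field name is from memory, so treat it as an assumption):
"""
# Pretty-print the advanced system config and look for isolated GPU PCI IDs;
# an empty list means no GPU is reserved for VM passthrough
midclt call system.advanced.config | python3 -m json.tool | grep -i isolated
"""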


Thanks! Not isolated. But given your help, I’m thinking I have much bigger problems: you’ve pointed me to logs that have been very helpful but that, unfortunately, I had not been looking at. Will update later this morning. Much appreciate the stalking and you staying tuned…

Thanks!
-Rodney


Actually, you can modify the source code to package the NVIDIA GRID driver module used by SCALE EE, and you don’t need install-dev-tool. Just replace the original module, then check the app setting to load the driver, and it will work like this.


For what it’s worth, I tested your YAML script with a straight copy-paste, and when chatting with Ollama:

(Don’t mind the temperature, there’s a new fan coming any day now :grimacing:)

What does your /dev/dri folder look like?

If this is not what you see, then run:
modprobe nvidia_drm and then do ldconfig

this will also cure your problem with Plex
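Roughly, as root on the TrueNAS host:
"""
# Load the NVIDIA DRM kernel module and refresh the shared-library cache
modprobe nvidia_drm
ldconfig

# /dev/dri should now contain card* and renderD* device nodes
ls -l /dev/dri
"""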


Thanks to everybody that contributed and helped. The key to solving this problem was in the logs, like many have been pointing out. The really important detail in the log that I was missing was this:

level=INFO source=common.go:131 msg="GPU runner incompatible with host system, CPU does not have AVX" runner=cuda_v11_avx

Digging a bit deeper revealed the CPU on my 11th generation Dell server was not compatible with the CUDA-based GPU runner. To be compatible the CPU has to support AVX (Advanced Vector Extensions).
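For anyone hitting the same wall, a quick way to check whether a CPU advertises AVX (this just reads the kernel’s reported CPU flags, nothing Ollama-specific):
"""
# Prints the AVX-family flags the CPU reports; empty output means no AVX
grep -o 'avx[0-9a-z_]*' /proc/cpuinfo | sort -u
"""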

The 11th generation server I was trying to get all of this to work on had a Xeon X5670 CPU: it makes for a great storage server, but not so great a platform for running LLMs like Ollama.

The issue at hand was that the CPU architecture of my 11th generation Dell server was Intel’s Westmere-EP microarchitecture from 2010, while AVX support was introduced with the Sandy Bridge architecture in 2011.

It turns out any 12th generation Dell server and up will work. I’ve confirmed this by recently adding a 13th generation Dell server to my homelab, and Ollama works great with the Nvidia Tesla P4: snappy fast!

Stay tuned for my next post where I’m asking: how the devil do I get more than one Nvidia Tesla P4 to work on an LLM workload running in an Ollama container?
