Trying to get multiple Nvidia Tesla P4 GPU cards working on LLM workloads in an Ollama container

Using either of the following Compose files, I am able to get better performance from an LLM running in an Ollama container:

version: '3.7'

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama-local
    runtime: nvidia
    devices:
      - /dev/nvidia0:/dev/nvidia0
      - /dev/nvidia1:/dev/nvidia1   # Add more lines if more GPUs are present
      - /dev/nvidiactl:/dev/nvidiactl
      - /dev/nvidia-modeset:/dev/nvidia-modeset
      - /dev/nvidia-uvm:/dev/nvidia-uvm
      - /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools
    environment:
      - NVIDIA_VISIBLE_DEVICES=none   # Using explicit device mapping
    volumes:
      - /mnt/window_share/Apps/Ollama/data:/data
      - /mnt/window_share/Apps/Ollama/config:/config
    ports:
      - "11434:11434"
    restart: unless-stopped

version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest   # Replace with the specific Ollama image if needed
    container_name: ollama-local
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
    volumes:
      - /mnt/window_share/Apps/Ollama/data:/data
      - /mnt/window_share/Apps/Ollama/config:/config
    ports:
      - "11434:11434"   # Default Ollama API port
    restart: unless-stopped

Both cases work. However, when I watch how the GPUs are utilized with sudo watch -n 0.5 nvidia-smi, one GPU rails at 98% utilization while the other GPU stays at 0%.
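For reference, a more compact way to watch both cards at once is the query form of nvidia-smi, refreshing once per second:

  # Per-GPU utilization and memory, refreshed every second
  nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv -l 1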

Has anybody else had this experience?

Thanks!
-Rodney

I have seen similar behavior with multi-GPU Docker containers: only one GPU gets loaded heavily while the others stay idle. Even when the drivers and devices are mapped correctly, it often comes down to how the containerized application distributes work internally. It might help to double-check whether Ollama supports true multi-GPU parallelism out of the box, or whether some manual balancing or additional tuning is needed, like running a separate model replica per GPU (rough sketch below).
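As a rough sketch of the replica-per-GPU idea, not a drop-in config: the service names, host paths, and second host port below are placeholders I made up, and the pinning uses Compose's device_ids field (you could use NVIDIA_VISIBLE_DEVICES per service instead):

  services:
    ollama-gpu0:
      image: ollama/ollama:latest
      runtime: nvidia
      deploy:
        resources:
          reservations:
            devices:
              - driver: nvidia
                device_ids: ['0']      # pin this replica to GPU 0
                capabilities: [gpu]
      volumes:
        - /mnt/window_share/Apps/Ollama/gpu0:/root/.ollama   # hypothetical per-replica model dir
      ports:
        - "11434:11434"
      restart: unless-stopped

    ollama-gpu1:
      image: ollama/ollama:latest
      runtime: nvidia
      deploy:
        resources:
          reservations:
            devices:
              - driver: nvidia
                device_ids: ['1']      # pin this replica to GPU 1
                capabilities: [gpu]
      volumes:
        - /mnt/window_share/Apps/Ollama/gpu1:/root/.ollama   # hypothetical per-replica model dir
      ports:
        - "11435:11434"                # second replica exposed on a different host port
      restart: unless-stopped

You would then point clients (or a load balancer) at the two ports yourself, since each replica runs its own copy of the model.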


Hi Steve!

Thanks for the response. What I discovered is that the GPUs pool their memory for a single model: Ollama splits the model's layers across however many cards are available. Three Tesla P4s give just shy of 24 GB of VRAM in total (8 GB each), so a model up to roughly that size, but no larger, gets sharded and processed across all three GPUs. It isn't any faster, but it lets you take on larger LLM models than a single card could hold.
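For anyone wanting to confirm the same thing, the split is visible while a model is loaded; each card should show a share of the memory in use (ollama-local is just the container name from my compose file):

  # Memory in use on each card while the model is loaded
  nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv

  # Inside the container: list loaded models and how they are placed (CPU vs GPU)
  docker exec -it ollama-local ollama ps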

Thanks again!
-Rodney