Trying to get multiple Nvidia Tesla P4 GPU cards working on LLM workloads in an Ollama container

Using either of the following Compose files, I am able to get better performance from an LLM running in an Ollama container:

version: '3.7'

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama-local
    runtime: nvidia
    devices:
      - /dev/nvidia0:/dev/nvidia0
      - /dev/nvidia1:/dev/nvidia1   # Add more lines if more GPUs are present
      - /dev/nvidiactl:/dev/nvidiactl
      - /dev/nvidia-modeset:/dev/nvidia-modeset
      - /dev/nvidia-uvm:/dev/nvidia-uvm
      - /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools
    environment:
      - NVIDIA_VISIBLE_DEVICES=none   # Using explicit device mapping
    volumes:
      - /mnt/window_share/Apps/Ollama/data:/data
      - /mnt/window_share/Apps/Ollama/config:/config
    ports:
      - "11434:11434"
    restart: unless-stopped

version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest   # Replace with the specific Ollama image if needed
    container_name: ollama-local
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
    volumes:
      - /mnt/window_share/Apps/Ollama/data:/data
      - /mnt/window_share/Apps/Ollama/config:/config
    ports:
      - "11434:11434"   # Default Ollama API port
    restart: unless-stopped

Both cases work. However, when I watch how the GPUs are utilized with sudo watch -n 0.5 nvidia-smi, one GPU rails at 98% utilization while the other GPU stays at 0%.
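For reference, a more compact way to watch both cards at once is the query form of nvidia-smi, refreshing once per second:

  # Per-GPU utilization and memory, refreshed every second
  nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv -l 1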

Has anybody else had this experience?

Thanks!
-Rodney

I have seen similar behavior with multi-GPU Docker containers: only one GPU gets loaded heavily while the others stay idle. Even when the drivers and devices are mapped correctly, it often comes down to how the containerized application distributes work internally. It might help to double-check whether Ollama supports true multi-GPU parallelism out of the box, or whether some manual balancing or additional tuning is needed, like running a separate model replica per GPU (rough sketch below).
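As a rough sketch of the replica-per-GPU idea, not a drop-in config: the service names, host paths, and second host port below are placeholders I made up, and the pinning uses Compose's device_ids field (you could use NVIDIA_VISIBLE_DEVICES per service instead):

  services:
    ollama-gpu0:
      image: ollama/ollama:latest
      runtime: nvidia
      deploy:
        resources:
          reservations:
            devices:
              - driver: nvidia
                device_ids: ['0']      # pin this replica to GPU 0
                capabilities: [gpu]
      volumes:
        - /mnt/window_share/Apps/Ollama/gpu0:/root/.ollama   # hypothetical per-replica model dir
      ports:
        - "11434:11434"
      restart: unless-stopped

    ollama-gpu1:
      image: ollama/ollama:latest
      runtime: nvidia
      deploy:
        resources:
          reservations:
            devices:
              - driver: nvidia
                device_ids: ['1']      # pin this replica to GPU 1
                capabilities: [gpu]
      volumes:
        - /mnt/window_share/Apps/Ollama/gpu1:/root/.ollama   # hypothetical per-replica model dir
      ports:
        - "11435:11434"                # second replica exposed on a different host port
      restart: unless-stopped

You would then point clients (or a load balancer) at the two ports yourself, since each replica runs its own copy of the model.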


Hi Steve!

Thanks for the response. What I discovered is that the GPUs pool their memory for a single model: Ollama splits the model's layers across however many cards are available. Three Tesla P4s give just shy of 24 GB of VRAM in total (8 GB each), so a model up to roughly that size, but no larger, gets sharded and processed across all three GPUs. It isn't any faster, but it lets you take on larger LLM models than a single card could hold.
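For anyone wanting to confirm the same thing, the split is visible while a model is loaded; each card should show a share of the memory in use (ollama-local is just the container name from my compose file):

  # Memory in use on each card while the model is loaded
  nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv

  # Inside the container: list loaded models and how they are placed (CPU vs GPU)
  docker exec -it ollama-local ollama ps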

Thanks again!
-Rodney