Issues running even small models using vLLM and CPU image

Noki · May 15, 2026, 7:18am

I am trying to replace Ollama with vLLM using the vLLM CPU image, but I cannot get even small models to run, while 4B models ran just fine with Ollama. I gave the Docker container 8 GB of my 32 GB of memory. The vLLM config is like this, with no environment variables set.

Startup of vLLM always results in the following:

2026-05-15 07:13:30.817212+00:00(EngineCore pid=140) INFO 05-15 07:13:30 [multiproc_executor.py:139] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=172.16.3.2 (local), world_size=1, local_world_size=1
2026-05-15 07:13:30.830766+00:00(EngineCore pid=140) INFO 05-15 07:13:30 [ompmultiprocessing.py:180] OpenMP thread binding info: 
2026-05-15 07:13:30.830815+00:00(EngineCore pid=140) INFO 05-15 07:13:30 [ompmultiprocessing.py:180] 	local_rank=0, core ids=[6, 7, 8, 9, 10]
2026-05-15 07:13:30.830829+00:00(EngineCore pid=140) INFO 05-15 07:13:30 [ompmultiprocessing.py:180] 	reserved_cpus=[11]
2026-05-15 07:13:40.820400+00:00INFO 05-15 07:13:40 [importing.py:44] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
2026-05-15 07:13:40.820487+00:00INFO 05-15 07:13:40 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
2026-05-15 07:13:42.294958+00:00WARNING 05-15 07:13:42 [nixl_utils.py:34] NIXL is not available
2026-05-15 07:13:42.295042+00:00WARNING 05-15 07:13:42 [nixl_utils.py:44] NIXL agent config is not available
2026-05-15 07:13:43.030441+00:00[transformers] `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
2026-05-15 07:13:45.154204+00:00get_mempolicy: Operation not permitted
2026-05-15 07:13:45.154250+00:00[W515 07:13:45.484541851 utils.cpp:41] Warning: numa_migrate_pages failed. errno: 1 (function init_cpu_memory_env)
2026-05-15 07:13:45.154276+00:00set_mempolicy: Operation not permitted
2026-05-15 07:13:45.154288+00:00[W515 07:13:45.484557821 utils.cpp:65] Warning: numa_set_membind failed. errno: 1 (function init_cpu_memory_env)
2026-05-15 07:13:45.155282+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870] WorkerProc failed to start.
2026-05-15 07:13:45.155306+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870] Traceback (most recent call last):
2026-05-15 07:13:45.155326+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 837, in worker_main
2026-05-15 07:13:45.155337+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870]     worker = WorkerProc(*args, **kwargs)
2026-05-15 07:13:45.155364+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-05-15 07:13:45.155375+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870]   File "/opt/venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
2026-05-15 07:13:45.155387+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870]     return func(*args, **kwargs)
2026-05-15 07:13:45.155406+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^
2026-05-15 07:13:45.155418+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 603, in __init__
2026-05-15 07:13:45.155429+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870]     wrapper.init_worker(all_kwargs)
2026-05-15 07:13:45.155447+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870]   File "/opt/venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
2026-05-15 07:13:45.155457+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870]     return func(*args, **kwargs)
2026-05-15 07:13:45.155468+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^
2026-05-15 07:13:45.155485+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/worker_base.py", line 305, in init_worker
2026-05-15 07:13:45.155497+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870]     self.worker = worker_class(**kwargs)
2026-05-15 07:13:45.155505+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870]                   ^^^^^^^^^^^^^^^^^^^^^^
2026-05-15 07:13:45.155519+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/cpu_worker.py", line 67, in __init__
2026-05-15 07:13:45.155527+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870]     raise ValueError(
2026-05-15 07:13:45.155535+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870] ValueError: Available memory on node 0 (10.42/30.73 GiB) on startup is less than desired CPU memory utilization (0.92, 28.27 GiB). Decrease --gpu-memory-utilization or reduce CPU memory used by other processes.
2026-05-15 07:13:46.358796+00:00(EngineCore pid=140) ERROR 05-15 07:13:46 [core.py:1136] EngineCore failed to start.
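
As a rough check of the numbers in the ValueError above, here is a small calculation. The 30.73 GiB, 10.42 GiB and 0.92 values are copied from the log; treating the 8 GB container limit as 8 GiB is my own simplification.

```python
# Values copied from the ValueError in the log above.
TOTAL_GIB = 30.73        # total memory reported for node 0
AVAILABLE_GIB = 10.42    # available memory reported for node 0
UTILIZATION = 0.92       # memory utilization fraction reported by vLLM

# Assumption: the 8 GB container limit is treated as 8 GiB for simplicity.
CONTAINER_LIMIT_GIB = 8.0

# What vLLM asks for if the fraction is applied to the total reported for node 0.
requested_from_total = UTILIZATION * TOTAL_GIB

# What it would ask for if the fraction were applied to the container limit.
requested_from_container = UTILIZATION * CONTAINER_LIMIT_GIB

# Largest fraction of the reported total that still fits into available memory.
max_fraction_that_fits = AVAILABLE_GIB / TOTAL_GIB

print(f"0.92 of reported total:  {requested_from_total:.2f} GiB")
print(f"0.92 of container limit: {requested_from_container:.2f} GiB")
print(f"largest fraction fitting in available memory: {max_fraction_that_fits:.3f}")
```

The first value matches the 28.27 GiB in the error message, so the fraction appears to be applied to the 30.73 GiB total and not to the container limit.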

From what I understand, the vLLM CPU image wants to allocate 92% of the host's total memory (0.92 of 30.73 GiB, i.e. 28.27 GiB) instead of 92% of the 8 GB assigned to the container.
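
One way to compare the two views of memory from inside the container is a short script like the sketch below. It assumes a cgroup v2 setup where the container limit is exposed at /sys/fs/cgroup/memory.max, and that /proc/meminfo shows the host total, which is typical for Docker but not guaranteed. The helper names are my own and not part of vLLM.

```python
from pathlib import Path
from typing import Optional


def parse_meminfo_total_gib(meminfo_text: str) -> float:
    """Return MemTotal from /proc/meminfo content, converted from kB to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])
            return kib / (1024 ** 2)
    raise ValueError("MemTotal not found in meminfo text")


def parse_cgroup_limit_gib(memory_max_text: str) -> Optional[float]:
    """Return the cgroup v2 memory.max value in GiB, or None if unlimited."""
    value = memory_max_text.strip()
    if value == "max":
        return None
    return int(value) / (1024 ** 3)


def read_text_or_none(path: str) -> Optional[str]:
    """Read a file, returning None if it is missing or not readable."""
    try:
        return Path(path).read_text()
    except OSError:
        return None


def main() -> None:
    meminfo = read_text_or_none("/proc/meminfo")
    # Assumption: cgroup v2 path; cgroup v1 uses a different file layout.
    cgroup = read_text_or_none("/sys/fs/cgroup/memory.max")

    if meminfo is not None:
        print(f"MemTotal seen by the process: {parse_meminfo_total_gib(meminfo):.2f} GiB")
    if cgroup is not None:
        limit = parse_cgroup_limit_gib(cgroup)
        print("cgroup limit:", "unlimited" if limit is None else f"{limit:.2f} GiB")


if __name__ == "__main__":
    main()
```

If the first number is close to 30.73 GiB while the second is around 8 GiB, that would support the reading that the memory check uses the host total and not the container limit.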

Can I override this setting with an environment variable? Maybe someone can share a working setup.
