I am trying to replace Ollama with vLLM using the vLLM CPU image, but I am having trouble getting even small models to run, whereas 4B models ran just fine with Ollama. I gave the Docker container 8 GB of my 32 GB of memory. The vLLM config is like this, no environment variables set.
Startup of vLLM always results in the following:
2026-05-15 07:13:30.817212+00:00(EngineCore pid=140) INFO 05-15 07:13:30 [multiproc_executor.py:139] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=172.16.3.2 (local), world_size=1, local_world_size=1
2026-05-15 07:13:30.830766+00:00(EngineCore pid=140) INFO 05-15 07:13:30 [ompmultiprocessing.py:180] OpenMP thread binding info:
2026-05-15 07:13:30.830815+00:00(EngineCore pid=140) INFO 05-15 07:13:30 [ompmultiprocessing.py:180] local_rank=0, core ids=[6, 7, 8, 9, 10]
2026-05-15 07:13:30.830829+00:00(EngineCore pid=140) INFO 05-15 07:13:30 [ompmultiprocessing.py:180] reserved_cpus=[11]
2026-05-15 07:13:40.820400+00:00INFO 05-15 07:13:40 [importing.py:44] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
2026-05-15 07:13:40.820487+00:00INFO 05-15 07:13:40 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
2026-05-15 07:13:42.294958+00:00WARNING 05-15 07:13:42 [nixl_utils.py:34] NIXL is not available
2026-05-15 07:13:42.295042+00:00WARNING 05-15 07:13:42 [nixl_utils.py:44] NIXL agent config is not available
2026-05-15 07:13:43.030441+00:00[transformers] `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
2026-05-15 07:13:45.154204+00:00get_mempolicy: Operation not permitted
2026-05-15 07:13:45.154250+00:00[W515 07:13:45.484541851 utils.cpp:41] Warning: numa_migrate_pages failed. errno: 1 (function init_cpu_memory_env)
2026-05-15 07:13:45.154276+00:00set_mempolicy: Operation not permitted
2026-05-15 07:13:45.154288+00:00[W515 07:13:45.484557821 utils.cpp:65] Warning: numa_set_membind failed. errno: 1 (function init_cpu_memory_env)
2026-05-15 07:13:45.155282+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870] WorkerProc failed to start.
2026-05-15 07:13:45.155306+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870] Traceback (most recent call last):
2026-05-15 07:13:45.155326+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 837, in worker_main
2026-05-15 07:13:45.155337+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870] worker = WorkerProc(*args, **kwargs)
2026-05-15 07:13:45.155364+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-05-15 07:13:45.155375+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870] File "/opt/venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
2026-05-15 07:13:45.155387+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870] return func(*args, **kwargs)
2026-05-15 07:13:45.155406+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870] ^^^^^^^^^^^^^^^^^^^^^
2026-05-15 07:13:45.155418+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 603, in __init__
2026-05-15 07:13:45.155429+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870] wrapper.init_worker(all_kwargs)
2026-05-15 07:13:45.155447+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870] File "/opt/venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
2026-05-15 07:13:45.155457+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870] return func(*args, **kwargs)
2026-05-15 07:13:45.155468+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870] ^^^^^^^^^^^^^^^^^^^^^
2026-05-15 07:13:45.155485+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/worker_base.py", line 305, in init_worker
2026-05-15 07:13:45.155497+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870] self.worker = worker_class(**kwargs)
2026-05-15 07:13:45.155505+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870] ^^^^^^^^^^^^^^^^^^^^^^
2026-05-15 07:13:45.155519+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/cpu_worker.py", line 67, in __init__
2026-05-15 07:13:45.155527+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870] raise ValueError(
2026-05-15 07:13:45.155535+00:00ERROR 05-15 07:13:45 [multiproc_executor.py:870] ValueError: Available memory on node 0 (10.42/30.73 GiB) on startup is less than desired CPU memory utilization (0.92, 28.27 GiB). Decrease --gpu-memory-utilization or reduce CPU memory used by other processes.
2026-05-15 07:13:46.358796+00:00(EngineCore pid=140) ERROR 05-15 07:13:46 [core.py:1136] EngineCore failed to start.
From what I understand, it looks as if the vLLM CPU image wants to allocate 92% of the host's total memory instead of 92% of the 8 GB assigned to the container.
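A quick sanity check of that reading, as a small Python sketch that uses only the figures from the ValueError plus the 8 GB container limit (the variable names are my own, nothing here comes from vLLM itself):

```python
# Figures copied from the ValueError in the log above.
total_gib = 30.73         # total memory reported for node 0
available_gib = 10.42     # memory reported as available at startup
utilization = 0.92        # "desired CPU memory utilization" in the error
container_limit_gib = 8   # memory I assigned to the container

# What 92% amounts to against the host total vs. the container limit.
requested_of_total = round(utilization * total_gib, 2)
requested_of_container = round(utilization * container_limit_gib, 2)

print(f"92% of host total:      {requested_of_total} GiB")
print(f"92% of container limit: {requested_of_container} GiB")
print(f"exceeds available:      {requested_of_total > available_gib}")
```

The first value matches the 28.27 GiB in the error message, which is why I assume the calculation is based on the total memory rather than the container limit.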
Can I override this setting with an environment variable? Maybe someone can share a working setup.
