TrueNAS Scale 25.10 running Ollama local AI models on older PowerEdge systems, e.g. an R720 with 128 GB RAM and an RTX 5070 with 12 GB VRAM

Woohoo! TrueNAS Scale 25.10 running Ollama local AI models on my old PowerEdge R720 (128 GB RAM) with an RTX 5070 (12 GB VRAM).

Stats:

total duration: 13.47505831s
load duration: 391.896386ms
prompt eval count: 51 token(s)
prompt eval duration: 27.646026ms
prompt eval rate: 1844.75 tokens/s
eval count: 709 token(s)
eval duration: 11.208734619s
eval rate: 63.25 tokens/s
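The reported rates are simply token count divided by duration; a quick sanity check in Python (numbers copied from the stats above):

```python
# Sanity-check Ollama's --verbose rates: rate = token count / duration (seconds).
prompt_rate = 51 / 0.027646026      # prompt eval: 51 tokens in 27.65 ms
eval_rate   = 709 / 11.208734619    # generation: 709 tokens in 11.21 s

print(round(prompt_rate, 2))  # 1844.75 tokens/s
print(round(eval_rate, 2))    # 63.25 tokens/s
```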

Full request:

ollama run qwen2.5:14b "Write a detailed 600-word blog post about why Blackwell GPUs are excellent for local AI inference." --verbose

Why Blackwell GPUs Are Excellent for Local AI Inference

The rapid advancement of artificial intelligence (AI) has led to an explosion in the demand for
powerful hardware that can handle complex tasks such as machine learning, deep learning, and other
advanced computations at the edge or on-device. Among these devices, NVIDIA’s latest GPU
generation, known as Blackwell GPUs, stands out due to their exceptional capabilities tailored for
local AI inference. Here are several reasons why these GPUs have become an indispensable part of
modern AI ecosystems.

1. Superior Performance

Blackwell GPUs offer unparalleled performance thanks to their advanced architecture and increased
core count compared to previous generations like the Ampere series. These GPUs leverage the latest
NVIDIA CUDA cores, which enable faster execution of parallel computations—a critical aspect for
real-time inference in edge devices. The enhanced tensor cores further accelerate matrix
operations commonly used in deep learning models, making them highly efficient for AI workloads.

2. Power Efficiency

Energy consumption is a significant concern with traditional GPUs, especially when deployed in
mobile or embedded systems where power budgets are tight. Blackwell GPUs address this issue by
offering significantly improved power efficiency through architectural optimizations and advanced
manufacturing processes (such as TSMC’s N4 process technology). This means they can deliver high
performance without compromising on battery life, making them ideal for devices like smartphones,
tablets, and edge servers.

3. Enhanced AI Inference Capabilities

Blackwell GPUs come equipped with NVIDIA’s latest TensorRT software suite, which is specifically
designed to optimize the deployment of deep learning models in real-world applications. The
inclusion of new features such as DLBoost and INT4 data types helps reduce latency while
maintaining high accuracy levels, thus ensuring smooth user experiences across various AI-driven
services.

4. Versatility Across Use Cases

From healthcare diagnostics through computer vision applications, to autonomous vehicles requiring
continuous processing at the edge, Blackwell GPUs cater to a wide range of use cases due to their
versatility and scalability. They support both training and inference operations, allowing
developers to fine-tune models locally before deploying them in production environments. This
flexibility is crucial for industries where rapid prototyping and iterative development cycles are
commonplace.

5. Integration with Ecosystems

NVIDIA’s strong ecosystem presence ensures that Blackwell GPUs work seamlessly across various
software frameworks and tools commonly used in AI projects. Compatibility with TensorFlow,
PyTorch, ONNX (Open Neural Network Exchange), and other platforms simplifies the deployment
process for developers while providing robust support for diverse use cases.

6. Security Enhancements

In an era where data security is paramount, Blackwell GPUs introduce hardware-based protections
designed to safeguard sensitive information processed during AI inference tasks. Features like
secure boot, encrypted memory, and virtualization support ensure that confidential data remains
protected throughout its lifecycle within the GPU environment.

Conclusion

The advent of Blackwell GPUs represents a significant leap forward in local AI processing
capabilities. Their combination of superior performance, power efficiency, advanced AI-specific
optimizations, versatility across multiple applications, seamless integration with existing
ecosystems, and enhanced security features make them an ideal choice for businesses looking to
leverage cutting-edge AI technologies without the need for cloud connectivity or extensive
infrastructure investments. As more industries adopt on-device intelligence solutions, Blackwell
GPUs are poised to play a pivotal role in driving innovation forward.

In summary, whether you’re developing AI-driven applications for consumer electronics, automotive
systems, or enterprise environments, integrating Blackwell GPUs into your projects can offer
unparalleled benefits in terms of speed, efficiency, and functionality.


Can you run "gpt-oss:120b"? The model will fit in the 128 GB RAM, but I am curious how ZFS will handle the memory.

No. It's about 63 GiB, so you would need much more GPU VRAM than this card has (only 12 GB).
However, if you need this model, use:
gpt-oss:120b-cloud
You can run it from whatever hardware supports Ollama, but it will not be local!
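A minimal sketch of the memory math (hypothetical helper; it ignores KV cache and runtime overhead, which only make the shortfall worse):

```python
# Rough sketch: how much of a model spills out of VRAM into system RAM.
# Assumes the whole model must be resident; ignores KV cache and overhead.
def spill_to_ram_gib(model_gib: float, vram_gib: float) -> float:
    """Return the portion of the model (GiB) that will not fit in VRAM."""
    return max(0.0, model_gib - vram_gib)

print(spill_to_ram_gib(63.0, 12.0))   # 51.0 GiB must live in system RAM
```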

I am running "gpt-oss:120b" on my PC (Windows 11 with WSL) with an RTX 2080 8 GB VRAM and I have no issue with it. The RAM does jump to 60+ GB and it does use part of the GPU.

How many tokens/s, please?

Running the same query with "gpt-oss:120b" I get:

When running:

ollama ps

NAME            ID              SIZE     PROCESSOR         CONTEXT    UNTIL
gpt-oss:120b    a951a23b46a1    66 GB    96%/4% CPU/GPU    4096       4 minutes from now
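Reading that 96%/4% CPU/GPU split against the 66 GB size (assuming, as a simplification, that the percentages apply to the whole resident model):

```python
# Hypothetical breakdown of the "96%/4% CPU/GPU" split reported by `ollama ps`,
# assuming the percentages apply to the full 66 GB resident size.
size_gb = 66
gpu_gb = 0.04 * size_gb   # portion offloaded to the GPU
cpu_gb = 0.96 * size_gb   # portion kept in system RAM

print(round(gpu_gb, 2))  # 2.64
print(round(cpu_gb, 2))  # 63.36
```

With only ~2.6 GB of the model on the GPU, generation speed is dominated by system-RAM bandwidth, which explains the 12.25 tokens/s below.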

Please show me your stats for:

ollama run qwen2.5:14b "Write a detailed 600-word blog post about why Blackwell GPUs are excellent for local AI inference." --verbose

That will be an apples-to-apples comparison.

total duration:       58.530971809s
load duration:        56.933057ms
prompt eval count:    52 token(s)
prompt eval duration: 227.277638ms
prompt eval rate:     228.80 tokens/s
eval count:           709 token(s)
eval duration:        57.889058566s
eval rate:            12.25 tokens/s

Awesome!
Thank You!

My PowerEdge R720 DDR3 first run, load from HDD:

total duration: 1m15.849104532s
load duration: 1m0.833578187s
prompt eval count: 52 token(s)
prompt eval duration: 43.207797ms
prompt eval rate: 1203.49 tokens/s
eval count: 806 token(s)
eval duration: 12.786673117s
eval rate: 63.03 tokens/s

PowerEdge R720 DDR3, second run after the initial load:

total duration: 12.859586207s
load duration: 431.959732ms
prompt eval count: 52 token(s)
prompt eval duration: 53.396241ms
prompt eval rate: 973.85 tokens/s
eval count: 667 token(s)
eval duration: 10.555527457s
eval rate: 63.19 tokens/s

My PC DDR4 first load:

total duration: 1m9.0704993s
load duration: 57.4703324s
prompt eval count: 52 token(s)
prompt eval duration: 209.4361ms
prompt eval rate: 248.29 tokens/s
eval count: 647 token(s)
eval duration: 8.4419928s
eval rate: 76.64 tokens/s

My PC DDR4 second run after the initial load:

total duration: 12.3965363s
load duration: 153.2448ms
prompt eval count: 52 token(s)
prompt eval duration: 159.1831ms
prompt eval rate: 326.67 tokens/s
eval count: 689 token(s)
eval duration: 9.8864056s
eval rate: 69.69 tokens/s
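In both first runs the model load dominates the total wall time; a rough breakdown from the numbers above:

```python
# Fraction of total wall time spent loading the model on the cold runs above.
r720_load_frac = 60.833578187 / 75.849104532   # R720, first load from HDD
pc_load_frac   = 57.4703324   / 69.0704993     # PC, first load

print(round(r720_load_frac, 2))  # 0.8
print(round(pc_load_frac, 2))    # 0.83
```

So roughly 80% of the cold-start time is just getting the model off disk; the warm-run load durations (0.43 s and 0.15 s) confirm that.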

It is still not apples to apples, since the "eval count" differs between machines and even between runs on the same machine, but it points clearly at where the tokens/s cost optimum for local AI inference lies:
simply the GPU with the biggest possible VRAM.
I know I'm not the first to discover this, but it is nice to see where you can save money and still get the same or better results.

The cherry on top is here:

ollama run gpt-oss:120b-cloud "Write a detailed 600-word blog post about why Blackwell GPUs are excellent for local AI inference. Translate it after to Polish" --verbose

total duration: 18.610718459s
prompt eval count: 94 token(s)
eval count: 2952 token(s)

That is 18 seconds for the same request as above. So:
if you do not run inference 24/7, save your money on hardware and enjoy the free Ollama cloud.
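The cloud stats above give no eval duration, but an aggregate rate can be estimated from the total (this folds in prompt eval and network overhead, so it understates pure generation speed):

```python
# Rough aggregate throughput of the cloud run: generated tokens / total wall time.
# Includes network and prompt-eval overhead, so it is a lower bound.
agg_rate = 2952 / 18.610718459
print(round(agg_rate, 1))  # 158.6 tokens/s
```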

Why are you using:

gpt-oss:120b-cloud

You are asking Ollama to run it in the cloud, not locally.

run it as:

ollama run gpt-oss:120b "Write a detailed 600-word blog post about why Blackwell GPUs are excellent for local AI inference. Translate it after to Polish" --verbose

Sorry for the confusion. Yes, primarily I want to see local inference; I have an idea that would require lots of inference on local machines, and I don't want to spend money on a paid API.
… but I also wanted to see what the Ollama cloud layer offers.

Will you be attempting to run gpt-oss:120b on your system?

No.
The 18 seconds on Ollama Cloud for gpt-oss:120b does not justify the hardware $$$ for me.

What I'm looking for, and I'm sure it will happen this year, is today's open models running faster and faster on today's hardware, with less VRAM, locally.
I'm pretty sure this will be optimized at the software layer in 2026.
The $$$ gap between the big, expensive cloud models and local inference is too big.

This is insane!!!
I was wrong! Not software: hardware. Look at this:
AI inference 1480% faster than the best existing NVIDIA AI GPUs.
I'm returning my 2x RTX 5070 to NVIDIA.
Generated in 0.051s • 15,763 tok/s

Here is why:
My small Ollama AI performance test model: llama3.1:8b
"Write a detailed 600-word blog post about why Blackwell GPUs are excellent for local AI inference." --verbose
Result on my local hardware:
Generated in 7.5422897s • 133.93 tokens/s

Now apples to apples: the same llama3.1:8b model, but run on Taalas hardware with the model baked directly into their silicon.
Generated in 0.051s • 15,763 tok/s
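Comparing the two llama3.1:8b numbers directly (a back-of-the-envelope ratio from a vendor demo, not a controlled benchmark):

```python
# Back-of-the-envelope speedup of the Taalas llama3.1:8b demo over the local run.
taalas_rate = 15763.0   # tok/s reported by the Taalas demo
local_rate  = 133.93    # tokens/s on the local RTX 5070

print(round(taalas_rate / local_rate, 1))  # 117.7 (x faster)
```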

… the direction is VERY CLEAR!

… it's amazing nobody has noticed it yet …