Enable the prometheus metrics in Docker

Problem/Justification
The Docker engine can export Prometheus metrics about itself. This is configured in the daemon’s daemon.json file, which is fully (and properly) under TrueNAS control, and the option is off. The only ways to turn it on are hacks that violate the appliance boundary of TrueNAS.

Note that this isn’t about “prometheus in TrueNAS”. This is about prometheus anywhere getting metrics for TrueNAS’s core docker engine, the way Docker intends them to be pulled.

Impact
Engine metrics allow direct visibility into container states. That’s particularly important when containers go wrong, such as being stuck in “created” state where they’re largely invisible to cadvisor or similar metrics. Other metrics paths (cadvisor etc.) focus on container resources, not engine state.

Suggested Approaches
Simple: Just add "metrics-addr": "0.0.0.0:9323" to Docker’s daemon.json. Done. Users who don’t care won’t notice. Performance impact is negligible (Prometheus is a pull path, so if nobody pulls…), maybe a few kilobytes of RAM. (9323 is the port Docker’s own documentation uses for this endpoint.)
Better: Add a “Docker metrics address” setting somewhere in Advanced Settings (defaulting to blank or “none”) and write its value into daemon.json as "metrics-addr". Advantage: no default impact at all, more configurable. Drawback: UI change.
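
To make the “Better” option concrete, here is a minimal sketch of how a blank-or-address setting could be merged into the engine’s JSON config (normally daemon.json). The function name and the treatment of “none” are hypothetical and only for illustration. This is not how TrueNAS middleware actually renders the file.

```python
import json


def with_metrics_addr(daemon_config_text: str, metrics_addr: str) -> str:
    """Return daemon config JSON with "metrics-addr" set, or removed if blank.

    Hypothetical helper: a blank value or "none" means "metrics disabled".
    """
    # Treat an empty file as an empty config object.
    config = json.loads(daemon_config_text) if daemon_config_text.strip() else {}
    if metrics_addr and metrics_addr.strip().lower() != "none":
        config["metrics-addr"] = metrics_addr.strip()
    else:
        # Disabled: make sure no stale key is left behind.
        config.pop("metrics-addr", None)
    return json.dumps(config, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(with_metrics_addr('{"data-root": "/mnt/pool/docker"}', "127.0.0.1:9323"))
```

Every other key in the file is left untouched, so the setting stays invisible for users who leave it blank.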

Existing Workarounds
A. Hack daemon.json with a post-boot script. This violates the appliance boundary and requires a stutter-restart of the Docker engine right when everything’s flapping around during boot.
B. Rubber-band together some shell scripts with docker CLI and then feed the output into our prometheus pipes.
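
For reference, workaround B looks roughly like the sketch below: ask the docker CLI for each container’s state and print the counts in Prometheus text format. The metric name docker_container_states is made up for this example, and how the output reaches Prometheus (textfile collector, pushgateway, etc.) is left out.

```python
import subprocess
from collections import Counter


def container_state_counts(states):
    """Count non-empty state strings such as "running" or "created"."""
    return Counter(s.strip() for s in states if s.strip())


def to_prometheus_text(counts, metric="docker_container_states"):
    """Render counts as a Prometheus text-format gauge (metric name is hypothetical)."""
    lines = [f"# TYPE {metric} gauge"]
    for state in sorted(counts):
        lines.append(f'{metric}{{state="{state}"}} {counts[state]}')
    return "\n".join(lines) + "\n"


def read_states_from_docker_cli():
    """Ask the docker CLI for the state of every container, one per line."""
    result = subprocess.run(
        ["docker", "ps", "--all", "--format", "{{.State}}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    print(to_prometheus_text(container_state_counts(read_states_from_docker_cli())), end="")
```

It works, but it is exactly the kind of glue this request is trying to avoid.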

User Story
Shops using Prometheus for metrics collection want all the prometheus feeds available. Docker is at the core of your app story, so we’re highly sensitive to docker irregularities. The docker-provided metrics are highly useful to create dashboards and alerts about this. A practical example is containers stuck in odd states (like “created”) because something unusual went wrong creating them. These do show up in Portainer logs etc. if you know what to look for, but there’s no clean way to create reliable alerts from there.
Ultimately this is a blind spot in observability for TrueNAS that’s just gratuitously unnecessary. Exposing Docker engine’s own metrics is easy and cheap. Just do it.
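
As a rough illustration of what the engine endpoint would give us, the sketch below pulls /metrics and extracts per-state container counts. It assumes the engine listens on 127.0.0.1:9323 and emits a gauge named engine_daemon_container_states_containers with a state label. Check both against your engine’s actual output, since the metric names are not guaranteed to be stable.

```python
import re
import urllib.request

# Assumed metric name and label; verify against the real /metrics output.
_STATE_LINE = re.compile(
    r'^engine_daemon_container_states_containers\{state="([^"]+)"\}\s+(\S+)$'
)


def parse_container_states(metrics_text):
    """Return {state: count} from Prometheus text-format output."""
    states = {}
    for line in metrics_text.splitlines():
        match = _STATE_LINE.match(line.strip())
        if match:
            states[match.group(1)] = int(float(match.group(2)))
    return states


def fetch_metrics(url="http://127.0.0.1:9323/metrics", timeout=5):
    """Fetch the raw metrics text from the engine (address is an assumption)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    print(parse_container_states(fetch_metrics()))
```

In practice Prometheus would scrape the endpoint directly and the alert would live in a rule. The script only shows how little is needed once the endpoint exists.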


I like the concept here but there are some gotchas I’d be a bit stuck on.

Exposing the metrics on 0.0.0.0 would expose a plain-HTTP, unauthenticated endpoint. I can’t imagine this being an accepted risk, so additional work would be needed either to integrate the metrics endpoint into the API, or to always bind it to loopback and trust operators to expose it safely.
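
If the address ends up being operator-configurable, the risk could at least be surfaced at validation time. Below is a small sketch, with a made-up function name, that classifies a host:port value as disabled, loopback, wildcard, or a specific address. It only handles IP literals (not hostnames), and what the UI does with the result (warn, refuse, require confirmation) is left open.

```python
import ipaddress


def classify_metrics_addr(addr):
    """Return "disabled", "loopback", "wildcard", or "specific" for a host:port string.

    Hypothetical helper: blank or "none" means disabled; hostnames raise ValueError.
    """
    if not addr or addr.strip().lower() == "none":
        return "disabled"
    host, sep, port = addr.strip().rpartition(":")
    if not sep or not port.isdigit() or not (0 < int(port) < 65536):
        raise ValueError(f"expected host:port, got {addr!r}")
    # Strip brackets so IPv6 literals like [::1]:9323 parse as well.
    ip = ipaddress.ip_address(host.strip("[]"))
    if ip.is_loopback:
        return "loopback"
    if ip.is_unspecified:
        return "wildcard"  # 0.0.0.0 or ::, reachable on every interface
    return "specific"


if __name__ == "__main__":
    for value in ("", "127.0.0.1:9323", "0.0.0.0:9323"):
        print(repr(value), "->", classify_metrics_addr(value))
```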

Also, metrics collection is still experimental so it would additionally require running the docker daemon with the experimental flag. I doubt this will sit well considering TrueNAS caters to enterprise customers.

EDIT: You could also just run cAdvisor, which has a prometheus exporter.

Yeah, I know prometheus-in-docker-engine is still “experimental”. It’ll probably stay experimental forever. That doesn’t mean it’s not useful.

I agree that just turning on 0.0.0.0:9323 is suboptimal. I’d strongly prefer a UI affordance (defaulting to “none”) that lets us set the address: 127.0.0.1:9323 if Prometheus is running locally, or 0.0.0.0:9323 if we have good firewall/filter rules, or whatever else makes sense for the site.

Cheers
– perry