Prometheus Exporter

Problem/Justification
Prometheus and Grafana are well-known monitoring systems. We should add a converter/mapper so Prometheus can scrape TrueNAS metrics.

Impact
A user who wants to scrape TrueNAS metrics can do it easily.

Advantage: no need to deploy a separate converter somewhere just to scrape metrics.
Disadvantage: the TrueNAS team must implement it.

User Story

  1. the user selects metrics for scraping in a check-tree
  2. the user chooses which IPs may scrape these metrics
  3. the user enables it and is finished

Can Prometheus ingest data from Graphite? If yes, TrueNAS could already add a Graphite exporter. Another option would be, instead of Prometheus, to use the Graphite exporter into InfluxDB and then into Grafana.

The real problem right now is that the system provides too few metric data points—many indicators are missing.

A Graphite exporter mapping file for TrueNAS SCALE >23.10.1 metrics and some example Grafana dashboards

Starting with exporter configuration version 2.1, you must also use the Netdata configuration included in the repository to restore the pre‑25.04 metrics. Instructions are provided below. This is necessary because TrueNAS 25.04 dropped a lot of default metrics, which doesn’t make sense to me.

No, you need a Graphite-to-Prometheus converter.
Something like this: GitHub - prometheus/graphite_exporter: Server that accepts metrics via the Graphite protocol and exports them as Prometheus metrics
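For illustration, graphite_exporter takes a mapping file in the statsd_exporter mapping syntax. This is only a sketch: the source metric path below is a hypothetical example, not an actual TrueNAS/Netdata metric name.

```yaml
# Hypothetical mapping sketch; the source path "netdata.*.system.cpu.user"
# is an illustrative example, not a real TrueNAS metric name.
# Pass this file to graphite_exporter via --graphite.mapping-config=mapping.yml
mappings:
  - match: "netdata.*.system.cpu.user"
    name: "truenas_cpu_user_percent"
    labels:
      host: "$1"   # captures the wildcard segment as a "host" label
```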

You’re right, very few data points.

But I have seen in /etc/netdata/netdata.conf that you can enable more metrics.
I don’t understand why they have not enabled this Netdata plugin: ZFS Pools | Learn Netdata.

Yes, I have seen and tested suppoterino’s work.
But do you think it is normal to do all this just to scrape metrics?

Why is it that in something like OctoPrint, it takes two clicks and you’re finished?
A 3D printer software (OctoPrint) is more monitoring-friendly than professional NAS software (TrueNAS). Why?

It seems to be due to licensing issues? Although I don’t know why Netdata had to be chosen in the first place, and then some things started being removed because of other problems.

Where have you seen the licensing problem?

But I think serious professional software needs to expose metrics for monitoring.

A quick and dirty way to get Prometheus metrics:

  1. On TrueNAS, edit /etc/netdata/netdata.conf and replace “127.0.0.1” with “0.0.0.0” (or the IP of your monitoring system), then restart the netdata service.
  2. Check with a web browser: http://TruenasIP:6999/api/v1/allmetrics?format=prometheus&types=yes&help=yes&timestamps=no
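Once that endpoint responds, Prometheus can scrape it directly. A minimal sketch of a scrape job, assuming port 6999 as used above and with "TruenasIP" as a placeholder:

```yaml
# Minimal sketch of a Prometheus scrape job for the Netdata endpoint above.
# "TruenasIP" is a placeholder; adjust host and port to your setup.
scrape_configs:
  - job_name: "truenas-netdata"
    metrics_path: "/api/v1/allmetrics"
    params:
      format: ["prometheus"]     # ask Netdata for Prometheus text format
    static_configs:
      - targets: ["TruenasIP:6999"]
```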

I have done it with Alloy running as an app:

configs:
  alloy_config:
    content: |
      // ===== SYSTEM METRICS EXPORT =====
      prometheus.exporter.unix "integrations_node_exporter" {
        procfs_path          = "/host/proc"
        sysfs_path           = "/host/sys"
        include_exporter_metrics = true
        enable_collectors        = ["systemd"]
        disable_collectors       = ["mdadm"]
      }
      discovery.relabel "integrations_node_exporter" {
        targets = prometheus.exporter.unix.integrations_node_exporter.targets
        rule {
          target_label = "job"
          replacement  = "integrations/node_exporter"
        }
        rule {
          replacement  = env("INSTANCE_NAME")
          target_label = "instance"
        }
      }
      prometheus.scrape "integrations_node_exporter" {
        targets         = discovery.relabel.integrations_node_exporter.output
        forward_to      = [otelcol.receiver.prometheus.metrics.receiver]
        job_name        = "integrations/node_exporter"
        scrape_interval = "15s"
      }
      // ===== CADVISOR (DOCKER CONTAINER METRICS) =====
      prometheus.exporter.cadvisor "docker" {
        docker_host      = "unix:///var/run/docker.sock"
        storage_duration = "5m"
      }
      discovery.relabel "cadvisor" {
        targets = prometheus.exporter.cadvisor.docker.targets
        rule {
          replacement  = env("INSTANCE_NAME")
          target_label = "instance"
        }
      }
      prometheus.scrape "cadvisor" {
        targets         = discovery.relabel.cadvisor.output
        forward_to      = [prometheus.relabel.cadvisor_fix_instance.receiver]
        job_name        = "integrations/cadvisor"
        scrape_interval = "15s"
      }
      // ADDITIONAL RELABEL STEP for cAdvisor
      prometheus.relabel "cadvisor_fix_instance" {
        forward_to = [otelcol.receiver.prometheus.metrics.receiver]
        // overwrite the instance label if present
        rule {
          replacement  = env("INSTANCE_NAME")
          target_label = "instance"
        }
      }
      // ===== ALLOY SELF METRICS =====
      prometheus.exporter.self "integrations_agent" { }
      discovery.relabel "integrations_agent" {
        targets = prometheus.exporter.self.integrations_agent.targets
        rule {
          replacement  = env("INSTANCE_NAME")
          target_label = "instance"
        }
        rule {
          target_label = "job"
          replacement  = "integrations/alloy"
        }
      }
      prometheus.scrape "integrations_agent" {
        targets         = discovery.relabel.integrations_agent.output
        forward_to      = [otelcol.receiver.prometheus.metrics.receiver]
        job_name        = "integrations/alloy"
        scrape_interval = "15s"
      }
      prometheus.scrape "smartctl" {
        targets = [
          {"__address__" = "192.168.0.209:9633", "instance"="storage"},
        ]
        forward_to = [otelcol.receiver.prometheus.metrics.receiver]
        scrape_interval = "15s"
        job_name = "smartctl"
      }
      // ===== OTEL RECEIVER FOR METRICS =====
      otelcol.receiver.prometheus "metrics" {
        output {
          metrics = [otelcol.processor.attributes.metrics.input]
        }
      }
      otelcol.processor.attributes "metrics" {
        action {
          key    = "cloud.region"
          value  = env("CLOUD_REGION")
          action = "upsert"
        }
        action {
          key    = "cluster"
          value  = env("CLUSTER_NAME")
          action = "upsert"
        }
        // IMPORTANT: force the instance label once more
        action {
          key    = "instance"
          value  = env("INSTANCE_NAME")
          action = "upsert"
        }
        output {
          metrics = [otelcol.processor.batch.default.input]
        }
      }
      // ===== SYSTEM LOGS (JOURNAL) =====
      discovery.relabel "journal" {
        targets = []
        rule {
          source_labels = ["__journal__systemd_unit"]
          target_label  = "service_name"
          regex         = "(.+)\\.service"
          replacement   = "$1"
        }
        rule {
          source_labels = ["__journal_priority_keyword"]
          target_label  = "level"
        }
        rule {
          source_labels = ["__journal__transport"]
          target_label  = "transport"
        }
        rule {
          replacement  = env("INSTANCE_NAME")
          target_label = "instance"
        }
      }
      loki.source.journal "journal" {
        max_age       = "12h0m0s"
        relabel_rules = discovery.relabel.journal.rules
        forward_to    = [loki.process.journal.receiver]
        labels        = {
          job = "systemd-journal",
        }
      }
      loki.process "journal" {
        forward_to = [loki.write.logs_default.receiver]
        stage.match {
          selector = "{service_name=\"alloy\"}"
          stage.logfmt {
            mapping = {
              timestamp = "ts",
              log_level = "level",
              message   = "msg",
              service   = "service",
            }
          }
          stage.timestamp {
            source = "timestamp"
            format = "RFC3339Nano"
          }
          stage.label_drop {
            values = ["job"]
          }
          stage.static_labels {
            values = {
              job = "integrations/alloy",
            }
          }
          stage.labels {
            values = {
              level = "log_level",
            }
          }
        }
        stage.match {
          selector = "{transport=\"kernel\"}"
          stage.static_labels {
            values = {
              service_name = "kernel",
            }
          }
        }
      }
      // ===== DOCKER CONTAINER LOGS =====
      discovery.docker "containers" {
        host = "unix:///var/run/docker.sock"
      }
      loki.source.docker "containers" {
        host       = "unix:///var/run/docker.sock"
        targets    = discovery.docker.containers.targets
        forward_to = [loki.process.docker_logs.receiver]
      }
      loki.process "docker_logs" {
        forward_to = [loki.write.logs_default.receiver]
      }
      // ===== LOKI WRITE =====
      loki.write "logs_default" {
        endpoint {
          url = env("LOKI_ENDPOINT")
          headers = {
            "X-Scope-OrgID" = "langerma",
          }
        }
        external_labels = {
          cloud_region = env("CLOUD_REGION"),
          cluster      = env("CLUSTER_NAME"),
          instance     = env("INSTANCE_NAME"),
        }
      }
      // ===== TRACING =====
      tracing {
        sampling_fraction = 0.1
        write_to          = [otelcol.processor.batch.default.input]
      }
      // ===== BATCH PROCESSOR =====
      otelcol.processor.batch "default" {
        send_batch_size     = 1000
        send_batch_max_size = 1000
        timeout             = "200ms"
        output {
          traces  = [otelcol.exporter.otlphttp.cloud_traces.input]
          metrics = [otelcol.exporter.otlphttp.cloud_metrics.input]
          logs    = [otelcol.exporter.otlphttp.cloud_logs.input]
        }
      }
      // ===== OTLP EXPORTERS =====
      otelcol.exporter.otlphttp "cloud_traces" {
        client {
          endpoint = env("TEMPO_ENDPOINT")
          headers = {
            "X-Scope-OrgID" = "langerma",
          }
        }
        sending_queue {
          num_consumers = 30
          queue_size    = 1000
        }
      }
      otelcol.exporter.otlphttp "cloud_metrics" {
        client {
          endpoint = env("MIMIR_ENDPOINT")
          headers = {
            "X-Scope-OrgID" = "langerma",
          }
        }
        sending_queue {
          num_consumers = 30
          queue_size    = 1000
        }
      }
      otelcol.exporter.otlphttp "cloud_logs" {
        client {
          endpoint = env("LOKI_OTLP_ENDPOINT")
          headers = {
            "X-Scope-OrgID" = "langerma",
          }
        }
        sending_queue {
          num_consumers = 30
          queue_size    = 1000
        }
      }
      // ===== OTLP RECEIVER =====
      otelcol.receiver.otlp "local" {
        grpc {
          endpoint = "0.0.0.0:4317"
        }
        http {
          endpoint = "0.0.0.0:4318"
        }
        output {
          metrics = [otelcol.processor.batch.default.input]
          logs    = [otelcol.processor.batch.default.input]
          traces  = [otelcol.processor.batch.default.input]
        }
      }
services:
  alloy:
    command:
      - run
      - /etc/alloy/config.alloy
      - '--server.http.listen-addr=0.0.0.0:12345'
      - '--storage.path=/var/lib/alloy/data'
    configs:
      - source: alloy_config
        target: /etc/alloy/config.alloy
    container_name: alloy
    environment:
      CLOUD_REGION: somewhere
      CLUSTER_NAME: somecluster
      INSTANCE_NAME: somename
      LOKI_ENDPOINT: https://someloki/loki/api/v1/push
      LOKI_OTLP_ENDPOINT: https://someloki/otlp
      MIMIR_ENDPOINT: https://somemimir/otlp
      NVIDIA_VISIBLE_DEVICES: void
      TEMPO_ENDPOINT: https://sometempo/otlp
      TZ: Europe/Vienna
    group_add:
      - '568'
    image: grafana/alloy:latest
    network_mode: bridge
    platform: linux/amd64
    ports:
      - '12345:12345'
      - '4317:4317'
      - '4318:4318'
    privileged: True
    pull_policy: always
    restart: unless-stopped
    stdin_open: False
    tty: False
    user: '0:0'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/host/root:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - alloy-data:/var/lib/alloy/data
version: '3.8'
volumes:
  alloy-data: Null
x-portals:
  - host: 0.0.0.0
    name: Web UI
    path: /
    port: 12345
    scheme: http

and also smartctl exporter:

services:
  smartctl-exporter:
    environment:
      NVIDIA_VISIBLE_DEVICES: void
      TZ: Europe/Vienna
    group_add:
      - 568
    image: prometheuscommunity/smartctl-exporter:latest
    network_mode: host
    platform: linux/amd64
    privileged: True
    pull_policy: always
    restart: unless-stopped
    stdin_open: False
    tty: False
    user: '0:0'
x-notes: >
  # iX App
  ## Security
  **Read the following security precautions to ensure that you wish to continue
  using this application.**
  ---
  ### Container: [smartctl-exporter]
  #### Privileged mode is enabled
  - Has the same level of control as a system administrator
  - Can access and modify any part of your TrueNAS system
  #### Running user/group(s)
  - User: root
  - Group: root
  - Supplementary Groups: apps
  #### Security option [no-new-privileges] is not set
  - Processes can gain additional privileges through setuid/setgid binaries
  - Can potentially allow privilege escalation attacks within the container
  ---
  version: '3.8'
  services:
    smartctl_exporter:
      image: prometheuscommunity/smartctl-exporter:latest
      container_name: smartctl-exporter
      network_mode: host
      user: root
      devices:
        - "/dev:/dev"  # only the NVMe device
      restart: unless-stopped
  ## Bug Reports and Feature Requests
  If you find a bug in this app or have an idea for a new feature, please file
  an issue at
  https://github.com/truenas/apps
x-portals:
  - host: 0.0.0.0
    name: Web UI
    path: /metrics
    port: 9633
    scheme: http

Maybe someone finds it useful. Feedback is very much appreciated!

There is also SNMP. It’s a bit old-fashioned, but if you only need basic monitoring for pool health, CPU usage, network usage, disk temperatures, etc., it does the job.
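If you want SNMP data in Prometheus anyway, the usual pattern is to run prometheus/snmp_exporter next to Prometheus and point it at the NAS. A sketch, assuming SNMP is enabled in TrueNAS services and the exporter listens at the placeholder address "snmp-exporter:9116"; the module name depends on your generated snmp.yml:

```yaml
# Sketch of an snmp_exporter scrape job. "TruenasIP" and "snmp-exporter:9116"
# are placeholders; "if_mib" is an example module from a generated snmp.yml.
scrape_configs:
  - job_name: "truenas-snmp"
    static_configs:
      - targets: ["TruenasIP"]    # the device to query via SNMP
    metrics_path: "/snmp"
    params:
      module: ["if_mib"]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target   # tell the exporter which host to poll
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: "snmp-exporter:9116"  # scrape the exporter, not the NAS
```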