Metrics

Metrics let you track how Watchtower behaves over time — scan cadence, registry reliability, security-relevant events, and per-container watch status.

To use this feature, enable the metrics API, set an API token (or opt out via --http-api-metrics-no-auth for trusted networks), and map port 8080 from the container.

The endpoint is GET /v1/metrics, served in Prometheus exposition format.
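The setup steps above could look roughly like the following Compose service. This is a minimal sketch, not the project's reference configuration: the image name is a placeholder, and `--http-api-metrics` is assumed to be the flag that enables the metrics API (it is the name upstream Watchtower uses; check your build's flag list).

```yaml
services:
  watchtower:
    # Hypothetical image name; substitute the image you actually run.
    image: example/watchtower:latest
    command:
      # Assumed flag name for enabling the metrics API.
      - --http-api-metrics
      # Token that Prometheus must present as a bearer token.
      - --http-api-token=demotoken
    ports:
      # Map the metrics port from the container.
      - "8080:8080"
    volumes:
      # Watchtower needs the Docker socket to inspect containers.
      - /var/run/docker.sock:/var/run/docker.sock
```

On a trusted network, the token line could be swapped for `--http-api-metrics-no-auth`, in which case the scrape job needs no credentials.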

Scrape configuration

```yaml
scrape_configs:
  - job_name: watchtower
    scrape_interval: 30s
    metrics_path: /v1/metrics
    bearer_token: demotoken
    static_configs:
      - targets:
        - 'watchtower:8080'
```

Replace demotoken with the token you set via --http-api-token. Drop the bearer_token line entirely if --http-api-metrics-no-auth is active.

For homelab cadences (polling every few minutes to every few hours), scrape_interval: 30s is plenty. Tighter intervals don't help because the underlying gauges change only once per scan.

Available metrics

Metrics are grouped by what they tell you.

Scan cycle

| Metric | Type | What it tells you |
| --- | --- | --- |
| watchtower_scans_total | counter | Poll cycles since the daemon started. rate() gives you scans per second. |
| watchtower_scans_skipped | counter | Cycles where another update was still in flight. Non-zero suggests an HTTP API request is racing the scheduler or the update is stuck. |
| watchtower_containers_scanned | gauge | Containers inspected during the last scan. |
| watchtower_containers_updated | gauge | Containers recreated during the last scan. |
| watchtower_containers_failed | gauge | Containers whose update failed during the last scan. |
| watchtower_last_scan_timestamp_seconds | gauge | Unix timestamp of the most recent completed scan. Pair with time() for staleness alerts. |
| watchtower_poll_interval_seconds | gauge | Configured cadence between scans, derived from the active schedule at startup. Scale alert thresholds by this instead of hardcoding a window. |
| watchtower_poll_duration_seconds | histogram | Wall-clock duration of each scan + update cycle. Buckets from 0.5s to 5m. Use histogram_quantile(0.95, ...) for p95. |

Watch status

Published every scan regardless of whether any audit flag is set.

| Metric | Type | What it tells you |
| --- | --- | --- |
| watchtower_containers_managed | gauge | Containers with com.centurylinklabs.watchtower.enable=true. |
| watchtower_containers_excluded | gauge | Containers with com.centurylinklabs.watchtower.enable=false (intentional opt-out). |
| watchtower_containers_unmanaged | gauge | Containers with no enable label at all. Under --label-enable these are silently skipped; hit /v1/audit for names or enable --audit-unmanaged for log warnings. Excludes Docker-managed infrastructure (buildkit etc.), which is tracked separately in watchtower_containers_infrastructure. |
| watchtower_containers_infrastructure | gauge | Docker-managed scaffolding (moby/buildkit* image prefix, docker/desktop-* image prefix, com.docker.buildx.* / com.docker.desktop.* label prefixes). Not a user workload; tracked separately so transient builder containers don't show up as unmanaged noise. |

Update lifecycle

| Metric | Type | What it tells you |
| --- | --- | --- |
| watchtower_rollbacks_total | counter | Rollbacks triggered by --health-check-gated. Each increment means a replacement container failed its health check and the previous image was restored. |
| watchtower_containers_in_cooldown | gauge | Containers currently waiting out a --image-cooldown window. Non-zero right after a fresh push is expected; stuck non-zero means the author keeps re-pushing and resetting the clock. |
| watchtower_image_fallback_total | counter | Times GetContainer fell back to inspecting by image reference because the source image ID was missing locally. Sustained counts indicate external tooling is deleting images Watchtower still needs. Background: upstream#1217. |

HTTP API (/v1/* endpoints)

| Metric | Labels | What it tells you |
| --- | --- | --- |
| watchtower_api_requests_total | endpoint, status | One counter per endpoint and response status code. A burst of status="401" on endpoint="/v1/update" is usually credential stuffing. |
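A burst of 401 responses on /v1/update can be turned into an alert. The rule below is an illustrative sketch, not one of the rules shipped with the project: the group name, alert name, 5-minute window, and threshold of 10 are all assumptions to tune for your environment.

```yaml
groups:
  - name: watchtower-api            # hypothetical group name
    rules:
      - alert: WatchtowerUpdateAuthFailures   # hypothetical alert name
        # Count 401 responses on the update endpoint over the last 5 minutes.
        expr: |
          sum(increase(watchtower_api_requests_total{endpoint="/v1/update", status="401"}[5m])) > 10
        labels:
          severity: warning
        annotations:
          summary: Repeated unauthorized requests to /v1/update
```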

Registry traffic

| Metric | Labels | What it tells you |
| --- | --- | --- |
| watchtower_registry_requests_total | host, operation, outcome | Outbound requests to registries. Operations are challenge, token, digest; outcomes are success, error, retried. |
| watchtower_registry_retries_total | host | Bounded-backoff retry attempts. Zero is healthy; sustained non-zero means a flaky registry. |
| watchtower_auth_cache_hits_total | (none) | Bearer-token cache hits. High rate means the in-memory cache is sparing the OAuth endpoint. |
| watchtower_auth_cache_misses_total | (none) | Cache misses. Each miss triggers an OAuth exchange. |
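Because sustained retries point to a flaky registry, a per-host alert is one way to surface them. This is a sketch under assumed names and windows (the 30-minute lookback and 1-hour hold are arbitrary starting points), not a rule taken from the source tree.

```yaml
groups:
  - name: watchtower-registry       # hypothetical group name
    rules:
      - alert: WatchtowerRegistryRetries      # hypothetical alert name
        # Fires per registry host when retries keep occurring.
        expr: |
          sum by (host) (increase(watchtower_registry_retries_total[30m])) > 0
        # Require the condition to hold for an hour so one-off blips stay quiet.
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: Registry {{ $labels.host }} keeps requiring retries
```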

Docker daemon

| Metric | Labels | What it tells you |
| --- | --- | --- |
| watchtower_docker_api_errors_total | operation | Errors from the Docker engine API, broken down by operation. Operations: list, inspect, kill, start, create, remove, image_inspect, image_remove, image_pull, rename, network_connect, network_disconnect. Sustained non-zero rates usually mean socket permission issues or a daemon under load. |

Useful queries

Staleness: time() - watchtower_last_scan_timestamp_seconds — seconds since the last scan.
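The staleness expression can be combined with watchtower_poll_interval_seconds so the threshold follows the configured schedule instead of a hardcoded window. A minimal sketch, assuming a tolerance of two missed intervals; the names and the multiplier are illustrative choices.

```yaml
groups:
  - name: watchtower-staleness      # hypothetical group name
    rules:
      - alert: WatchtowerScanStale            # hypothetical alert name
        # Seconds since the last completed scan, compared against
        # twice the configured poll interval (multiplier is an assumption).
        expr: |
          time() - watchtower_last_scan_timestamp_seconds
            > 2 * watchtower_poll_interval_seconds
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Watchtower has not completed a scan within two poll intervals
```

Both gauges come from the same scrape target, so they carry the same instance and job labels and the comparison matches one-to-one.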

p95 scan duration: histogram_quantile(0.95, sum by (le) (rate(watchtower_poll_duration_seconds_bucket[5m]))).

Bearer-cache hit ratio: sum(increase(watchtower_auth_cache_hits_total[1h])) / clamp_min(sum(increase(watchtower_auth_cache_hits_total[1h])) + sum(increase(watchtower_auth_cache_misses_total[1h])), 1).
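If the hit ratio is charted often, it may be worth precomputing it with a recording rule. The sketch below wraps the same expression; the group name and the recorded series name are hypothetical.

```yaml
groups:
  - name: watchtower-recording      # hypothetical group name
    rules:
      # Hypothetical series name following the level:metric:operation pattern.
      - record: watchtower:auth_cache_hit_ratio:increase1h
        # clamp_min keeps the denominator at 1 or more, so an idle hour
        # yields 0 rather than a division by zero.
        expr: |
          sum(increase(watchtower_auth_cache_hits_total[1h]))
          /
          clamp_min(
            sum(increase(watchtower_auth_cache_hits_total[1h]))
            + sum(increase(watchtower_auth_cache_misses_total[1h])),
            1
          )
```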

Unmanaged containers present for more than 1 h: watchtower_containers_unmanaged > 0, with for: 1h set on the alerting rule.
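Written out as an alerting rule, the unmanaged-container check might look like this. It is a sketch with assumed group, alert, and severity names, independent of the rules that ship with the project.

```yaml
groups:
  - name: watchtower-watch-status   # hypothetical group name
    rules:
      - alert: WatchtowerUnmanagedContainers  # hypothetical alert name
        expr: watchtower_containers_unmanaged > 0
        # Only fire once the condition has held for a full hour.
        for: 1h
        labels:
          severity: info
        annotations:
          summary: Containers without an enable label have been present for over an hour
          description: Query /v1/audit for the container names.
```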

Dashboards and alerts

A ready-to-import Grafana dashboard and Prometheus alerting rules ship under observability/ in the source tree. The dashboard has three rows (overview, watch status, reliability + security), plus three annotation tracks for rollbacks, daemon restarts, and newly appeared unmanaged containers.

Demo

The repository contains a demo with Prometheus and Grafana, available via docker-compose.yml. This demo is preconfigured with the dashboard:

[Image: grafana metrics]