Matching service: unbounded Prometheus metric cardinality growth from otel gauges #9945

@Sanil2108

Description

Expected Behavior

  • Prometheus metric cardinality for Temporal matching should be roughly bounded by the set of live physical task queues and their labels (namespaces × queues × task types × partitions, etc.).
  • When a PhysicalTaskQueueManager is unloaded or a task queue becomes idle/obsolete, its gauge time series should not accumulate indefinitely over pod lifetime.
  • Long-lived clusters should not see scrape_samples_scraped increase monotonically per matching pod purely as a function of pod uptime.

Actual Behavior

Matching Service - Primary

Gauge values leak in gaugeAdapter.values — entries are never removed

The OpenTelemetry gauge implementation in Temporal uses a gaugeAdapter that stores gauge values in a map keyed by label set. Entries are only ever added, never deleted.

File: common/metrics/otel_metrics_handler.go:31-41,146-151

type gaugeAdapter struct {
    lock   sync.Mutex
    values map[attribute.Distinct]gaugeValue  // ONLY ADDS, NEVER REMOVES
}

func (g *gaugeAdapterGauge) Record(v float64, tags ...Tag) {
    set := g.omp.makeSet(tags)
    g.adapter.lock.Lock()
    defer g.adapter.lock.Unlock()
    g.adapter.values[set.Equivalent()] = gaugeValue{value: v, set: set}
}

On every Prometheus scrape, the callback iterates ALL entries (otel_metrics_handler.go:137-143), reporting every label combination that was ever recorded — including stale ones from unloaded task queues.

What creates new label combinations over time

Each PhysicalTaskQueueManager gets a uniquely-tagged metrics handler (physical_task_queue_manager.go:132-134):

taggedMetricsHandler := partitionMgr.metricsHandler.WithTags(
    metrics.OperationTag(metrics.MatchingTaskQueueMgrScope),
    metrics.WorkerBuildIdTag(buildIdTagValue, config.BreakdownMetricsByBuildID()))

New physical task queues are created for:

  • Each unique worker build ID (every deployment with a new build ID)
  • Each deployment series + build ID pair
  • Each version set (old versioning API)

Gauge metrics emitted per physical TQ include:

  • approximate_backlog_count (db.go:709)
  • approximate_backlog_age_seconds (db.go:711-713)
  • task_lag_per_tl (db.go:715)

Full cardinality = namespaces × task_queues × build_IDs × partitions × task_types × ~3 gauge metrics

History Service (Secondary)

History uses counters/histograms (not gauges) with dynamic labels, so the growth is slower but still present:

Events cache — NamespaceIDTag on every cache operation (events/cache.go:111,147,159,176)

Workflow cache — NamespaceIDTag on every access (workflow/cache/cache.go:182-185)

Workflow completion — WorkflowTypeTag on every workflow close (workflow/metrics.go:94)

Mutable state stats — NamespaceTag on every persist (workflow/transaction_impl.go:684,699)

Growth is proportional to unique namespace IDs × workflow types: a much smaller label space than matching's, but still unbounded over a pod's lifetime.

Steps to Reproduce the Problem

  1. Deploy Temporal Server v1.29.4 via temporal-operator on Kubernetes.
  2. Enable Prometheus scraping for the matching and history services (standard /metrics endpoint).
  3. Run workloads over time that:
    • Create many task queues and partitions.
    • Roll out multiple worker build IDs / deployment versions and version sets.
  4. Observe over days/weeks:
    • scrape_samples_scraped{job="temporal-cluster-matching-headless", clusterName="ahp", ...} increases steadily for long-lived matching pods.
    • A similar, but slower, increasing pattern appears for the history service.

Specifications

  • Version: Temporal Server v1.29.4 (server), using the otel metrics implementation (gaugeAdapter).
  • Platform: Kubernetes (deployed via temporal-operator), Prometheus for metrics scraping, cluster with many namespaces / task queues / worker build IDs.
