## Expected Behavior
- Prometheus metric cardinality for Temporal matching should be roughly bounded by the set of live physical task queues and their labels (namespaces × queues × task types × partitions, etc.).
- When a `PhysicalTaskQueueManager` is unloaded or a task queue becomes idle/obsolete, its gauge time series should not accumulate indefinitely over the pod's lifetime.
- Long-lived clusters should not see `scrape_samples_scraped` increase monotonically per matching pod purely as a function of pod uptime.
## Actual Behavior

### Matching Service (Primary)

**Gauge values leak in `gaugeAdapter.values`: entries are never removed.**

The OpenTelemetry gauge implementation in Temporal uses a `gaugeAdapter` that stores gauge values in a map keyed by label set. Entries are only ever added, never deleted.
File: `common/metrics/otel_metrics_handler.go:31-41,146-151`

```go
type gaugeAdapter struct {
	lock   sync.Mutex
	values map[attribute.Distinct]gaugeValue // ONLY ADDS, NEVER REMOVES
}

func (g *gaugeAdapterGauge) Record(v float64, tags ...Tag) {
	set := g.omp.makeSet(tags)
	g.adapter.lock.Lock()
	defer g.adapter.lock.Unlock()
	g.adapter.values[set.Equivalent()] = gaugeValue{value: v, set: set}
}
```
On every Prometheus scrape, the callback iterates ALL entries (otel_metrics_handler.go:137-143), reporting every label combination that was ever recorded — including stale ones from unloaded task queues.
**What creates new label combinations over time**
Each `PhysicalTaskQueueManager` gets a uniquely-tagged metrics handler (`physical_task_queue_manager.go:132-134`):

```go
taggedMetricsHandler := partitionMgr.metricsHandler.WithTags(
	metrics.OperationTag(metrics.MatchingTaskQueueMgrScope),
	metrics.WorkerBuildIdTag(buildIdTagValue, config.BreakdownMetricsByBuildID()))
```
New physical task queues are created for:
- Each unique worker build ID (every deployment with a new build ID)
- Each deployment series + build ID pair
- Each version set (old versioning API)
Gauge metrics emitted per physical TQ include:
- `approximate_backlog_count` (`db.go:709`)
- `approximate_backlog_age_seconds` (`db.go:711-713`)
- `task_lag_per_tl` (`db.go:715`)
Full cardinality = namespaces × task_queues × build_IDs × partitions × task_types × ~3 gauge metrics
### History Service (Secondary)
History uses counters/histograms (not gauges) with dynamic labels, so the growth is slower but still present:
- Events cache: `NamespaceIDTag` on every cache operation (`events/cache.go:111,147,159,176`)
- Workflow cache: `NamespaceIDTag` on every access (`workflow/cache/cache.go:182-185`)
- Workflow completion: `WorkflowTypeTag` on every workflow close (`workflow/metrics.go:94`)
- Mutable state stats: `NamespaceTag` on every persist (`workflow/transaction_impl.go:684,699`)
Growth is proportional to unique namespace IDs × workflow types: lower cardinality than matching, but still unbounded over the pod's lifetime.
## Steps to Reproduce the Problem
- Deploy Temporal Server v1.29.4 via `temporal-operator` on Kubernetes.
- Enable Prometheus scraping for the matching and history services (standard `/metrics` endpoint).
- Run workloads over time that:
- Create many task queues and partitions.
- Roll out multiple worker build IDs / deployment versions and version sets.
- Observe over days/weeks: `scrape_samples_scraped{job="temporal-cluster-matching-headless", clusterName="ahp", ...}` increases steadily for long-lived matching pods.
- A similar, but slower, increasing pattern appears for the history service.
## Specifications
- Version: Temporal Server v1.29.4, using the OTel metrics implementation (`gaugeAdapter`).
- Platform: Kubernetes (deployed via `temporal-operator`), Prometheus for metrics scraping, cluster with many namespaces / task queues / worker build IDs.