Add docs recommending autoscaling setup by carlydf · Pull Request #324 · temporalio/temporal-worker-controller

carlydf · 2026-05-14T01:52:36Z

Adds documentation outlining the tradeoffs between two autoscaling solutions:

- HPA + Prometheus adapter
- KEDA Temporal Scaler

Documentation focuses on straightforward descriptions of the pros and cons of each solution.

jaypipes

Thanks @carlydf, I've done a first pass reviewing this documentation and added (quite a few) suggested changes and removals to "de-Claude" some of it and make it (hopefully) a bit more readable for a general audience.

Shivs11

A couple of nits; looks good to me otherwise.

Shivs11 · 2026-06-03T16:00:04Z

+
+> **Note**: This is why `metricsRelistInterval: 5m` is the recommended setting: the discovery window must comfortably exceed the longest expected delay so the metric does not deregister, otherwise re-registration waits up to one more relist cycle after delivery resumes.
+
+HPA cannot scale your Worker Deployment from zero because the signal for scaling does not yet exist. The signal for scaling is the backlog metric for the task queue associated with the workers in the Worker Deployment. This metric will not exist until there is at least one worker polling the task queue.


It took me a second to understand what this really meant. At first, I thought it meant that no backlog metric is emitted if you don't have any workers running at all (which is not true, since in the unversioned world this metric is emitted without workers being present).

I know you have clearly mentioned versions in the preamble here, but do you think we can be extra clear and mention that the per-version backlog count is not emitted without a worker being present, since a worker is what creates a version in Temporal?

@Shivs11 do we really need to care about the unversioned world in TWC? I mean, TWC doesn't deal with anything that isn't versioned since it automatically creates WorkerDeploymentVersions for new worker image tags...

eniko-dif · 2026-06-09T20:26:21Z

+
+`temporal_cloud_v1_approximate_backlog_count` (or just "backlog") is a measurement of the number of pending tasks on a particular task queue that are waiting for a poller (a worker) to pull that task and process it. This is a metric provided by [Temporal Cloud's OpenMetrics aggregation service][tc-openmetrics].
+
+`temporal_slot_utilization` (or just "slot util") is emitted directly by Workers (no Temporal Cloud aggregation), scraped at the Prometheus `ServiceMonitor` interval (~10–30 s), and reflects the current state of a particular Worker. This metric rises *before* backlog accumulates. In other words, slots on the Worker saturate first, then queueing starts.


Is there documentation about worker slot setup? In our Datadog environment we ended up setting the default for slot utilization to 1000, and we noticed that we never actually get anywhere close to using them up. It might be worth mentioning for customers that if this is not properly adjusted for the worker's resources, the scaling might not work as expected.

@eniko-dif I've added a link to this doc: https://docs.temporal.io/develop/worker-performance

Unfortunately, I don't think there's going to be a one-size-fits-all recommendation for slot utilization values because this is going to be dependent on the rate of the user's workflows (and their composing activities) being executed. Obviously that's going to be highly user-specific. I've added a link to the docs about choosing an appropriate slot supplier and highlighted this callout from that doc:

> Scenarios with tasks that have variable, or very high, per-task resource needs should rely on fixed-size suppliers and manual tuning rather than resource-based suppliers.

eniko-dif · 2026-06-09T20:32:40Z

+        target:
+          type: AverageValue
+          averageValue: "1"
+  behavior:


Isn't it an issue that HPA would scale up if the system is well-sized? (meaning that the backlog count is not building up, but the workers have a high slot utilization)

If this happens, the user would want to adjust the AverageValue target value for temporal_slot_utilization so that it would not trigger a scale up, no? Or adjust the stabilization window...

eniko-dif · 2026-06-09T20:33:17Z

+                      └─ first replica added
+```
+[tc-openmetrics]: https://docs.temporal.io/cloud/metrics/openmetrics
+


nit: i'd mention here that an example configuration is available a bit below (maybe with a link to jump through keda if the user is not interested)

I moved the recommended config up.

eniko-dif · 2026-06-09T20:40:34Z

+            matchLabels:
+              worker_type: "ActivityWorker"
+        target:
+          type: Value


nit: I'd add the comment from the wrt-hpa-backlog example explaining why this is a Value, because at first look I was confused.

I took this directly from the existing examples in the demo/ directory :) But I agree with you that it should more correctly be AverageValue and a value of "0.75" not "750m". Updated.
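For readers following the thread, here is a minimal sketch of how the metric entry might look after that change. It assumes the metric is exposed to the HPA as an External metric (the thread does not say which metric type is used); the `worker_type` selector is taken from the diff above. Kubernetes parses `"0.75"` and `"750m"` as the same quantity, so the substantive change is the target type: with `AverageValue`, the metric is divided by the current replica count before it is compared to the target.

```yaml
# Fragment of an autoscaling/v2 HorizontalPodAutoscaler spec (sketch only).
# Assumption: the metric is served to the HPA as an External metric.
metrics:
  - type: External
    external:
      metric:
        name: temporal_slot_utilization
        selector:
          matchLabels:
            worker_type: "ActivityWorker"
      target:
        # AverageValue: the metric is divided by the replica count
        # before being compared to the target.
        type: AverageValue
        averageValue: "0.75"  # same quantity as "750m"
```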

eniko-dif · 2026-06-09T20:41:15Z

+              task_type: "Activity"
+        target:
+          type: AverageValue
+          averageValue: "1"


Isn't 1 a bit too low?

Not necessarily... depends on the workload... for a high-request-rate workflow/taskqueue, with workers struggling to process the incoming requests, the backlog count might be much higher than 1, but it all depends on the workload...
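To make "depends on the workload" concrete: assuming the backlog is exposed as an External metric with an `AverageValue` target, the HPA requests roughly `ceil(backlog / averageValue)` replicas, bounded by `minReplicas`, `maxReplicas`, and any scaling policies. The backlog numbers in the comments below are illustrative, not measurements.

```yaml
# Fragment of an autoscaling/v2 HorizontalPodAutoscaler spec (sketch only).
# Assumption: External metric; the task_type selector is from the diff above.
metrics:
  - type: External
    external:
      metric:
        name: temporal_cloud_v1_approximate_backlog_count
        selector:
          matchLabels:
            task_type: "Activity"
      target:
        type: AverageValue
        # desired replicas ~= ceil(backlog / averageValue)
        #   backlog 40, averageValue "1"  -> 40 replicas requested
        #   backlog 40, averageValue "10" ->  4 replicas requested
        averageValue: "1"
```

A target of "1" therefore asks for about one worker per pending task, which is aggressive for cheap tasks and reasonable for long-running ones.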

eniko-dif · 2026-06-09T20:41:59Z

+    scaleUp:
+      stabilizationWindowSeconds: 30
+      policies:
+        - type: Percent


I'd personally put quicker scale up values than scale down.

Isn't that what this example shows?
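As a point of reference, a `behavior` block where scale-up reacts faster than scale-down might look like the sketch below. Only the 30-second scale-up stabilization window and the `Percent` policy type come from the diff above; every other value is an assumption chosen for illustration.

```yaml
# Fragment of an autoscaling/v2 HorizontalPodAutoscaler spec (sketch only).
behavior:
  scaleUp:
    stabilizationWindowSeconds: 30    # from the diff above
    policies:
      - type: Percent
        value: 100                    # assumed: at most double the replicas...
        periodSeconds: 15             # ...per 15-second period
  scaleDown:
    stabilizationWindowSeconds: 300   # assumed; matches the Kubernetes default for scale-down
    policies:
      - type: Percent
        value: 50                     # assumed: remove at most half the replicas...
        periodSeconds: 60             # ...per 60-second period
```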

The backlog metric pipeline goes from prometheus-adapter directly to the raw temporal_cloud_v1_approximate_backlog_count series, eliminating the temporal_approximate_backlog_count recording rule.

Adapter rule:
- seriesQuery filters out temporal_worker_build_id="__unversioned__" so discovery doesn't choke on the 5000+ unversioned series in typical accounts.
- metricsQuery sum(...) collapses labels the HPA doesn't select on at query time (instance/job/region/task_priority/temporal_account).
- metricsRelistInterval is bumped to 5m to accommodate the ~3-minute embedded-timestamp lag in Temporal Cloud's OpenMetrics emission.

WRT example, prometheus-stack-values, and demo README are updated to match.

Add docs/scaling-recommendations.md covering the empirically measured reactivity model (steady-state ~3:15 dominated by Cloud aggregation lag), task-queue-unload behavior, scale-from-zero limits, and when to pick KEDA over the metric path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
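The adapter rule described in this commit could be sketched roughly as the following prometheus-adapter Helm values. This is a reconstruction from the commit message, not the file from the PR: the use of `rules.external` and of `resources.namespaced: false` are assumptions, and the actual rule may name or map the metric differently.

```yaml
# prometheus-adapter Helm values (sketch reconstructed from the commit message).
metricsRelistInterval: 5m   # discovery window, per the commit message

rules:
  external:                 # assumption: exposed through the external metrics API
    # Skip unversioned series so discovery does not have to list them.
    - seriesQuery: 'temporal_cloud_v1_approximate_backlog_count{temporal_worker_build_id!="__unversioned__"}'
      resources:
        namespaced: false   # assumption: the series carry no Kubernetes namespace label
      # sum() collapses labels the HPA does not select on
      # (instance, job, region, task_priority, temporal_account).
      metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>})'
```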

Initial scaling-recommendations.md framed steady-state HPA reactivity as ~3:15, citing a "Temporal Cloud aggregation lag." That was wrong. The actual sample-age distribution on the OpenMetrics endpoint is:

- p50 30s (matches ~1/min emission cadence, age oscillates 0-60s)
- p95 50s
- p99 ~tail of occasional gateway-wide stalls

So typical end-to-end reactivity is ~85s (emission + scrape + HPA poll), not ~3:15. The 3-minute figures came from observations made during the occasional periods when the OpenMetrics gateway returns frozen timestamps across every series in the account simultaneously - those stalls are real but not steady-state.

Doc now:
- Replaces the 3:15 figure with empirically-derived ~85s typical.
- Adds a "Gateway-wide stalls" caveat describing the frozen-timestamp behavior observationally (no speculation about cause).
- Keeps the metricsRelistInterval: 5m recommendation, now justified by the need to exceed stall duration rather than the misattributed "aggregation lag."
- Demo README updated to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Earlier wording implied multiple stall events ("occasional periods") when we have only directly characterized one such event during this investigation. Reword to describe exactly what was seen, note that frequency is not yet known, and that the behavior is open with the Observability team.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Verified directly: across a 3-hour window including one of the observed "stall" events, every gap between consecutive sample timestamps in Prometheus's storage is exactly 60 seconds. So the OpenMetrics endpoint isn't dropping or freezing emissions - it's delivering them late, in bursts after a delay, with their original minute-aligned timestamps.

The retrospective record looks complete (good for dashboards), but live HPA consumers see the delay as real staleness because they query the latest available timestamp at decision time.

Reframe the caveat in the scaling doc and demo README accordingly. Also note we observed two such delay events in ~2 hours of close observation - frequency in normal operation is still open with the Observability team.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-authored-by: Jay Pipes <jaypipes@gmail.com> Co-authored-by: Stefan Richter <stefan@02strich.de>

Removes a bunch of overly verbose Claude-generated stuff that will likely confuse readers. Reworded a few places where Claude was using some odd terminology -- e.g. "typical end-to-end reactivity" -- to use more straightforward verbiage.

Added a brief WRT example HPA template that shows the stabilization window that is referred to in multiple sections of the doc.

Signed-off-by: Jay Pipes <jay.pipes@temporal.io>

Signed-off-by: Jay Pipes <jay.pipes@temporal.io>

carlydf requested review from a team and jlegrone as code owners May 14, 2026 01:52

carlydf marked this pull request as draft May 14, 2026 02:03

jaypipes requested changes May 15, 2026


02strich reviewed May 19, 2026
