Add docs recommending autoscaling setup #324

Open
carlydf wants to merge 7 commits into main from demo-ga-no-recording-rule

@carlydf commented May 14, 2026

Collaborator

Adds documentation outlining the tradeoffs between two autoscaling solutions:

  1. HPA+prometheus adapter
  2. KEDA Temporal Scaler

Documentation focuses on straightforward descriptions of the pros and cons of each solution.

@carlydf requested review from a team and jlegrone as code owners May 14, 2026 01:52
@carlydf marked this pull request as draft May 14, 2026 02:03

@jaypipes left a comment

Contributor

Thanks @carlydf, I've done a first pass reviewing this documentation, adding (quite a few) suggested changes and removals to "de-Claude" some of it and make it (hopefully) a bit more readable for a general audience.

Comment thread docs/README.md
Comment thread docs/scaling-recommendations.md Outdated
@jaypipes marked this pull request as ready for review June 1, 2026 16:35
@jaypipes changed the title from "Drop backlog recording rule; consume raw temporal_cloud_v1_approximate_backlog_count" to "Add docs recommending autoscaling setup" Jun 3, 2026

@Shivs11 left a comment

Member

A couple of nits -- looks good to me otherwise.


> **Note**: This is why `metricsRelistInterval: 5m` is the recommended setting: the discovery window must comfortably exceed the longest expected delay so the metric does not deregister; otherwise, re-registration waits up to one more relist cycle after delivery resumes.

HPA cannot scale your Worker Deployment from zero because the signal for scaling does not yet exist. The signal for scaling is the backlog metric for the task queue associated with the workers in the Worker Deployment. This metric will not exist until there is at least one worker polling the task queue.

Member

It took me a second to understand what this really meant. At first, I thought it meant that no backlog metric is emitted if you don't have any workers running at all (which is not true, since this metric is emitted in the unversioned world even without workers present).

I know you have clearly mentioned versions in the preamble here, but do you think we can be extra clear and mention that the per-version backlog count is not emitted without a worker being present, since a worker is what creates a version in Temporal?

Contributor

@Shivs11 do we really need to care about the unversioned world in TWC? I mean, TWC doesn't deal with anything that isn't versioned since it automatically creates WorkerDeploymentVersions for new worker image tags...

Comment thread docs/scaling-recommendations.md
@jaypipes force-pushed the demo-ga-no-recording-rule branch from 718471b to f8335de Compare June 9, 2026 12:49
@jaypipes self-requested a review June 9, 2026 12:53

`temporal_cloud_v1_approximate_backlog_count` (or just "backlog") is a measurement of the number of pending tasks on a particular task queue that are waiting for a poller (a worker) to pull and process them. This is a metric provided by [Temporal Cloud's OpenMetrics aggregation service][tc-openmetrics].

`temporal_slot_utilization` (or just "slot util") is emitted directly by Workers (no Temporal Cloud aggregation), scraped at the Prometheus `ServiceMonitor` interval (~10–30 s), and reflects the current state of a particular Worker. This metric rises *before* backlog accumulates. In other words, slots on the Worker saturate first, then queueing starts.

Is there documentation about worker slot setup? In our Datadog environment we ended up setting the default for slot utilization to 1000, and we noticed that we never actually get anywhere close to using them up. It might be worth mentioning to customers that if this is not properly adjusted for the worker's resources, the scaling might not work as expected.

Contributor

@eniko-dif I've added a link to this doc: https://docs.temporal.io/develop/worker-performance

Unfortunately, I don't think there's going to be a one-size-fits-all recommendation for slot utilization values, because the right value depends on the rate at which the user's workflows (and their constituent activities) are executed, and that is highly user-specific. I've added a link to the docs about choosing an appropriate slot supplier and highlighted this callout from that doc:

> Scenarios with tasks that have variable, or very high, per-task resource needs should rely on fixed-size suppliers and manual tuning rather than resource-based suppliers.

    target:
      type: AverageValue
      averageValue: "1"
behavior:

Isn't it an issue that HPA would scale up if the system is well-sized? (meaning that the backlog count is not building up, but the workers have a high slot utilization)

Contributor

If this happens, the user would want to adjust the AverageValue target value for temporal_slot_utilization so that it would not trigger a scale up, no? Or adjust the stabilization window...
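
To make the exchange above concrete, here is a minimal sketch of how an HPA with two metrics arrives at a replica count. It assumes the standard Kubernetes rule (each metric proposes `ceil(currentReplicas * current / target)`, proposals within the default 10% tolerance are ignored, and the largest proposal wins). The metric values, the replica count of 4, and the function names are hypothetical, and `minReplicas`/`maxReplicas` clamping and stabilization windows are left out.

```python
import math


def desired_replicas(current_replicas, per_pod_average, target_average, tolerance=0.1):
    """Replica count that a single metric asks for, using the HPA ratio rule."""
    ratio = per_pod_average / target_average
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: this metric requests no change.
        return current_replicas
    return math.ceil(current_replicas * ratio)


def hpa_proposal(current_replicas, metrics):
    """The HPA evaluates every metric and acts on the largest proposal."""
    return max(
        desired_replicas(current_replicas, average, target)
        for average, target in metrics.values()
    )


# Hypothetical "well-sized" system: no backlog, but slots are busy.
# Each entry is (observed per-pod average, AverageValue target).
well_sized = {"backlog": (0.0, 1.0), "slot_util": (0.9, 0.75)}
print(hpa_proposal(4, well_sized))

# Same load with a higher slot-utilization target (hypothetical value).
raised_target = {"backlog": (0.0, 1.0), "slot_util": (0.9, 0.95)}
print(hpa_proposal(4, raised_target))
```

With the 0.75 target the slot-utilization metric alone drives a scale-up even though backlog is zero; with the higher target the same load falls inside the tolerance band and the replica count is left unchanged.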

└─ first replica added
```
[tc-openmetrics]: https://docs.temporal.io/cloud/metrics/openmetrics

nit: I'd mention here that an example configuration is available a bit further down (maybe with a link to jump straight to KEDA if the user is not interested)

Contributor

I moved the recommended config up.

Comment thread docs/scaling-recommendations.md Outdated
    matchLabels:
      worker_type: "ActivityWorker"
target:
  type: Value

nit: I'd add the comment from the wrt-hpa-backlog example explaining why this is a Value, because at first glance I was confused.

Contributor

I took this directly from the existing examples in the demo/ directory :) But I agree with you that it should more correctly be AverageValue and a value of "0.75" not "750m". Updated.
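
As a side note on the `"750m"` versus `"0.75"` spelling: Kubernetes quantities use the `m` suffix for thousandths, so the two strings denote the same number and the change is about readability rather than behavior. The sketch below illustrates that with a deliberately tiny parser; `parse_quantity` is a hypothetical helper that handles only plain decimals and the `m` suffix, not the full Kubernetes quantity grammar.

```python
from decimal import Decimal


def parse_quantity(text):
    """Parse a plain decimal or a milli-suffixed ('m') quantity string."""
    if text.endswith("m"):
        # "750m" means 750 thousandths.
        return Decimal(text[:-1]) / 1000
    return Decimal(text)


print(parse_quantity("750m") == parse_quantity("0.75"))
```

The switch from `Value` to `AverageValue` is the part that does change behavior: `Value` compares the raw metric to the target, while `AverageValue` divides the metric by the current number of pods before comparing.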

      task_type: "Activity"
target:
  type: AverageValue
  averageValue: "1"

Isn't 1 a bit too low?

Contributor

Not necessarily... depends on the workload... for a high-request-rate workflow/taskqueue, with workers struggling to process the incoming requests, the backlog count might be much higher than 1, but it all depends on the workload...
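
A rough way to judge whether a target of 1 is too low for a given workload is to work through what an `AverageValue` target implies: the HPA aims for roughly `ceil(total backlog / averageValue)` replicas, bounded by `minReplicas` and `maxReplicas`. The sketch below assumes that standard rule and ignores the tolerance band and stabilization windows; the backlog sizes and the 1..20 replica bounds are hypothetical.

```python
import math


def replicas_for_backlog(total_backlog, average_value_target, min_replicas, max_replicas):
    """Replica count implied by an AverageValue target on total backlog."""
    wanted = math.ceil(total_backlog / average_value_target)
    # The HPA never goes outside the configured replica bounds.
    return max(min_replicas, min(max_replicas, wanted))


# Compare a few candidate targets against a range of backlog sizes.
for target in (1, 10, 50):
    row = [replicas_for_backlog(backlog, target, 1, 20) for backlog in (0, 5, 100, 1000)]
    print(f"averageValue={target}: {row}")
```

A target of 1 asks for one replica per pending task until `maxReplicas` is reached, which suits workloads where any sustained backlog is unacceptable; larger targets tolerate more queued work per worker before adding replicas.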

scaleUp:
  stabilizationWindowSeconds: 30
  policies:
  - type: Percent

I'd personally put quicker scale up values than scale down.

Contributor

Isn't that what this example shows?

carlydf and others added 7 commits June 11, 2026 12:08
The backlog metric pipeline goes from prometheus-adapter directly to the
raw temporal_cloud_v1_approximate_backlog_count series, eliminating the
temporal_approximate_backlog_count recording rule. Adapter rule:

- seriesQuery filters out temporal_worker_build_id="__unversioned__" so
  discovery doesn't choke on the 5000+ unversioned series in typical
  accounts.
- metricsQuery sum(...) collapses labels the HPA doesn't select on at
  query time (instance/job/region/task_priority/temporal_account).
- metricsRelistInterval is bumped to 5m to accommodate the ~3-minute
  embedded-timestamp lag in Temporal Cloud's OpenMetrics emission.

WRT example, prometheus-stack-values, and demo README are updated to
match. Add docs/scaling-recommendations.md covering the empirically
measured reactivity model (steady-state ~3:15 dominated by Cloud
aggregation lag), task-queue-unload behavior, scale-from-zero limits,
and when to pick KEDA over the metric path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Initial scaling-recommendations.md framed steady-state HPA reactivity as
~3:15, citing a "Temporal Cloud aggregation lag." That was wrong. The
actual sample-age distribution on the OpenMetrics endpoint is:

  p50  30s  (matches ~1/min emission cadence, age oscillates 0-60s)
  p95  50s
  p99  ~tail of occasional gateway-wide stalls

So typical end-to-end reactivity is ~85s (emission + scrape + HPA poll),
not ~3:15. The 3-minute figures came from observations made during the
occasional periods when the OpenMetrics gateway returns frozen
timestamps across every series in the account simultaneously - those
stalls are real but not steady-state.

Doc now:
- Replaces the 3:15 figure with empirically-derived ~85s typical.
- Adds a "Gateway-wide stalls" caveat describing the frozen-timestamp
  behavior observationally (no speculation about cause).
- Keeps the metricsRelistInterval: 5m recommendation, now justified by
  the need to exceed stall duration rather than the misattributed
  "aggregation lag."
- Demo README updated to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Earlier wording implied multiple stall events ("occasional periods")
when we have only directly characterized one such event during this
investigation. Reword to describe exactly what was seen, note that
frequency is not yet known, and that the behavior is open with the
Observability team.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Verified directly: across a 3-hour window including one of the observed
"stall" events, every gap between consecutive sample timestamps in
Prometheus's storage is exactly 60 seconds. So the OpenMetrics endpoint
isn't dropping or freezing emissions - it's delivering them late, in
bursts after a delay, with their original minute-aligned timestamps.

The retrospective record looks complete (good for dashboards), but live
HPA consumers see the delay as real staleness because they query the
latest available timestamp at decision time. Reframe the caveat in the
scaling doc and demo README accordingly.
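
The distinction drawn above (complete in retrospect, stale when queried live) can be sketched as follows. The data is hypothetical: each pair is a sample's embedded timestamp and the time it actually arrived, in seconds, with four samples delivered together in one late burst. The function names are illustrative and are not part of Prometheus or the HPA.

```python
def sample_gaps(sample_timestamps):
    """Gaps between consecutive embedded sample timestamps."""
    return [later - earlier for earlier, later in zip(sample_timestamps, sample_timestamps[1:])]


def staleness_at(query_time, arrivals):
    """Age of the newest sample that had already arrived at query_time."""
    delivered = [ts for ts, arrived_at in arrivals if arrived_at <= query_time]
    return query_time - max(delivered) if delivered else None


# (embedded timestamp, arrival time): the last four samples arrive in one burst.
arrivals = [(0, 30), (60, 90), (120, 330), (180, 330), (240, 330), (300, 330)]

# Retrospective view: every gap is exactly 60 seconds.
print(sample_gaps([ts for ts, _ in arrivals]))

# Live view: what a consumer querying at each moment would have seen.
for query_time in (100, 320, 335):
    print(query_time, staleness_at(query_time, arrivals))
```

In this made-up trace the stored series has no holes, yet a consumer querying just before the burst lands sees data several minutes old.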

Also note we observed two such delay events in ~2 hours of close
observation - frequency in normal operation is still open with the
Observability team.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-authored-by: Jay Pipes <jaypipes@gmail.com>
Co-authored-by: Stefan Richter <stefan@02strich.de>

Removes a bunch of overly verbose Claude-generated stuff that will
likely confuse readers. Reworded a few places where Claude was using
some odd terminology -- e.g. "typical end-to-end reactivity" -- to use
more straightforward verbiage. Added a brief WRT example HPA template
that shows the stabilization window that is referred to in multiple
sections of the doc.

Signed-off-by: Jay Pipes <jay.pipes@temporal.io>
Signed-off-by: Jay Pipes <jay.pipes@temporal.io>
@jaypipes force-pushed the demo-ga-no-recording-rule branch from d452af5 to d1e2d02 Compare June 11, 2026 16:53