fix: make Datadog env tag per-cluster (DD_ENV) instead of build-time context by sayakmaity · Pull Request #849 · scaleapi/llm-engine

sayakmaity · 2026-06-25T21:26:06Z

Problem

model-engine's Datadog env tag is derived from the build-time Helm value .Values.context, so every deployment reports the same env (e.g. production) regardless of which cluster/environment it actually runs in. This affects two independent surfaces:

Control-plane pods (gateway, builder, cacher, celery autoscaler) — DD_ENV and the tags.datadoghq.com/env label both come straight from .Values.context.
Runtime-launched inference endpoints — DD_ENV is baked into the rendered service_template_config_map from .Values.context at helm template time, so it's frozen in the image and identical for every cluster the image is deployed to (unlike the neighboring DD_SERVICE/DD_VERSION, which are already ${...} runtime-substituted).

Approach

Introduce a dedicated, optional datadog.env value, decoupled from context (which is overloaded — it also selects the service_template_config_map variant, drives a context == "circleci" conditional, and sets non-DD labels, so it must not be repurposed). The fix mirrors the existing GIT_TAG/DD_VERSION split exactly:

Control-plane → DD_ENV / env label = {{ .Values.datadog.env | default .Values.context }} (Helm value, backwards-compatible: falls back to context when unset).
Inference endpoints → DD_ENV = ${DD_ENV} runtime substitution, populated at endpoint-creation from the gateway's own DD_ENV (so launched pods inherit the gateway's per-cluster env). This follows the same pattern as ${DD_TRACE_ENABLED} / ${GIT_TAG}.
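The control-plane fallback above can be sketched in Python. This is an illustration only, not the chart's code: `helm_default` is a hypothetical stand-in for Helm's `default` function (which treats an empty string as unset), and the values dictionaries and the `staging-us` name are made up.

```python
def helm_default(value, fallback):
    """Hypothetical stand-in for Helm's `default`: empty values fall back."""
    return value if value else fallback


def control_plane_dd_env(values):
    """Mirror of `{{ .Values.datadog.env | default .Values.context }}`."""
    datadog = values.get("datadog") or {}
    return helm_default(datadog.get("env"), values["context"])


# datadog.env left at the chart default (""): same result as before the change.
legacy = control_plane_dd_env({"context": "production", "datadog": {"env": ""}})

# datadog.env set per cluster: overrides context for Datadog tagging only.
per_cluster = control_plane_dd_env(
    {"context": "production", "datadog": {"env": "staging-us"}}
)

print(legacy, per_cluster)
```

Because `context` is never modified, its other uses (config map variant selection, the `circleci` conditional, non-DD labels) are unaffected in this sketch.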

Changes

Chart (charts/model-engine)

values.yaml: add optional datadog.env (default "").
_helpers.tpl:
- baseLabels: tags.datadoghq.com/env → datadog.env | default context.
- serviceEnvBase: remove the hardcoded DD_ENV (moved into the wrappers, like GIT_TAG).
- serviceEnvGitTagFromHelmVar (control-plane): add DD_ENV = datadog.env | default context.
- serviceEnvGitTagFromPythonReplace (endpoints): add DD_ENV = ${DD_ENV}.
- baseServiceTemplateEnv / baseForwarderTemplateEnv (legacy endpoint paths): DD_ENV → ${DD_ENV}.
celery_autoscaler_stateful_set.yaml: $env → datadog.env | default context.
Chart.yaml: 0.2.6 → 0.2.7.

Server (model-engine)

common/env_vars.py: add DD_ENV (os.environ.get("DD_ENV") or infra_config().env).
infra/gateways/resources/k8s_resource_types.py: add DD_ENV to _BaseDeploymentArguments and pass it in all 11 deployment-argument constructors (alongside DD_TRACE_ENABLED).
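The `DD_ENV` lookup in `common/env_vars.py` can be sketched as follows. Only the expression `os.environ.get("DD_ENV") or infra_config().env` comes from the description above; the `_InfraConfig` class, its `"production"` value, and the `resolve_dd_env` wrapper are hypothetical stand-ins so the example runs on its own.

```python
import os


class _InfraConfig:
    """Hypothetical stand-in for the object returned by infra_config()."""

    env = "production"  # assumed build-time deployment class


def infra_config():
    return _InfraConfig()


def resolve_dd_env(environ=None):
    """Prefer the per-cluster DD_ENV; fall back to the build-time infra env."""
    environ = os.environ if environ is None else environ
    # `or` (rather than a .get default) also falls back on an empty string.
    return environ.get("DD_ENV") or infra_config().env


print(resolve_dd_env({"DD_ENV": "dev"}))
print(resolve_dd_env({}))
```

Using `or` means an explicitly empty `DD_ENV` behaves the same as an unset one, which matches the chart-side `default` fallback.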

Validation

helm template (with values_sample.yaml): control-plane DD_ENV/label render to datadog.env when set and fall back to context when unset; inference DD_ENV renders to ${DD_ENV} for runtime substitution. No <no value> / unrendered {{ }}.
python -m ast parses both edited modules cleanly.
Substitution layer uses safe_substitute, and get_endpoint_resource_arguments_from_request now supplies DD_ENV, so ${DD_ENV} resolves at endpoint creation. The existing test_k8s_endpoint_resource_delegate.py tests render the real chart and assert structurally (no golden env-block comparison), so they remain compatible.
I did not run the full model-engine test suite locally (heavy service deps) — CI should confirm.
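The `safe_substitute` point can be illustrated with a small sketch, assuming the substitution layer behaves like Python's `string.Template.safe_substitute`. The template fragment and the `HYPOTHETICAL_SERVICE_NAME` placeholder are invented for the example and are not taken from the chart.

```python
from string import Template

# Hypothetical fragment of a rendered endpoint template.
template = Template(
    "- name: DD_ENV\n"
    '  value: "${DD_ENV}"\n'
    "- name: DD_SERVICE\n"
    '  value: "${HYPOTHETICAL_SERVICE_NAME}"\n'
)

# safe_substitute fills what it is given and leaves other placeholders intact.
rendered = template.safe_substitute(DD_ENV="staging")

# The strict variant raises KeyError for the first missing placeholder instead.
try:
    template.substitute(DD_ENV="staging")
    missing = None
except KeyError as exc:
    missing = exc.args[0]

print(rendered)
print("missing:", missing)
```

Under this assumption, a placeholder that is never supplied is left in the output rather than raising, which is why the deployment arguments need to include `DD_ENV` explicitly.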

Out of scope / known limitations

The inference pod label tags.datadoghq.com/env (baseTemplateLabels) is intentionally left on context for now — that helper is shared with job templates, so making it ${DD_ENV} would require threading DD_ENV through the job argument classes too. The DD_ENV env var is the authoritative APM env source, so this is sufficient to fix env tagging on traces/metrics; the label can follow up.
Downstream: consumers that pin a specific chart version or bake the rendered service_template_config_map_<env>.yaml into an image (e.g. model-engine-internal's just autogen-templates + image build) must regenerate templates and rebuild the image to pick up ${DD_ENV}. After regeneration, the rendered config map's DD_ENV line is a runtime placeholder rather than a literal value.
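A downstream consumer could check whether a baked template still carries a literal `DD_ENV` with something like the sketch below. It assumes the rendered file contains Kubernetes-style `- name: DD_ENV` / `value: ...` pairs on adjacent lines; the helper name and the sample strings are hypothetical.

```python
import re

# Matches a `- name: DD_ENV` entry followed by its `value:` line.
_DD_ENV_ENTRY = re.compile(r"-\s*name:\s*DD_ENV\s*\n\s*value:\s*(?P<value>.+)")


def stale_dd_env_values(rendered_yaml):
    """Return DD_ENV values that are literals rather than the ${DD_ENV} placeholder."""
    stale = []
    for match in _DD_ENV_ENTRY.finditer(rendered_yaml):
        value = match.group("value").strip().strip("'\"")
        if value != "${DD_ENV}":
            stale.append(value)
    return stale


old_render = '- name: DD_ENV\n  value: "production"\n'
new_render = '- name: DD_ENV\n  value: "${DD_ENV}"\n'

print(stale_dd_env_values(old_render))
print(stale_dd_env_values(new_render))
```

An empty result suggests the templates were regenerated from the new chart version; a non-empty one lists the frozen values that still need a rebuild.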

Draft pending CI + downstream coordination.

sayakmaity · 2026-06-25T21:27:00Z

SGP consumer wiring that populates engine.datadog.env from the per-cluster info object: scaleapi/sgp#3730 (draft).

sayakmaity · 2026-06-26T16:37:23Z

Follow-up commit (ad9c421): tag gateway custom metrics with DD_ENV instead of infra_config().env.

DatadogMonitoringMetricsGateway tagged its statsd metrics with env:{infra_config().env} — the build-time deployment class, not the cluster's Datadog env. Switched to the new per-cluster DD_ENV. (celery_autoscaler.py already reads os.getenv("DD_ENV"), so it's covered by the chart change.) This pairs with scaleapi/sgp#3730, which removes a GCP-only infra.env=dev override; without this commit, that removal would have regressed GCP-dev custom metrics to env:production.
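The effect of the tag change can be sketched with an in-memory client. `RecordingStatsd`, the metric name, and the `emit` helper are hypothetical; only the `env:{...}` tag format and the production/dev scenario come from the comment above.

```python
class RecordingStatsd:
    """Hypothetical in-memory stand-in for a statsd client."""

    def __init__(self):
        self.calls = []

    def increment(self, metric, tags=None):
        self.calls.append((metric, list(tags or [])))


def emit(statsd, env_value):
    """Emit one counter tagged with the given env value."""
    statsd.increment("hypothetical.endpoint.created", tags=[f"env:{env_value}"])


build_time_env = "production"  # assumed infra_config().env on a GCP dev cluster
cluster_dd_env = "dev"         # assumed per-cluster DD_ENV on the same cluster

statsd = RecordingStatsd()
emit(statsd, build_time_env)  # old behaviour: tagged with the build-time class
emit(statsd, cluster_dd_env)  # new behaviour: tagged with the cluster's DD_ENV

print(statsd.calls)
```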

076adf0: fix: make Datadog env tag per-cluster (DD_ENV) instead of build-time context

ad9c421: fix: tag gateway custom metrics with DD_ENV (per-cluster) instead of infra_config().env

sayakmaity wants to merge 2 commits into scaleapi:main from sayakmaity:sayakmaity/model-engine-dd-env-per-cluster
