
fix: make Datadog env tag per-cluster (DD_ENV) instead of build-time context#849

Draft
sayakmaity wants to merge 2 commits into scaleapi:main from sayakmaity:sayakmaity/model-engine-dd-env-per-cluster


@sayakmaity


Problem

model-engine's Datadog env tag is derived from the build-time Helm value .Values.context, so every deployment reports the same env (e.g. production) regardless of which cluster/environment it actually runs in. This affects two independent surfaces:

  1. Control-plane pods (gateway, builder, cacher, celery autoscaler): DD_ENV and the tags.datadoghq.com/env label both come straight from .Values.context.
  2. Runtime-launched inference endpoints: DD_ENV is baked into the rendered service_template_config_map from .Values.context at helm template time, so it's frozen in the image and identical for every cluster the image is deployed to (unlike the neighboring DD_SERVICE/DD_VERSION, which are already ${...} runtime-substituted).

Approach

Introduce a dedicated, optional datadog.env value, decoupled from context (which is overloaded — it also selects the service_template_config_map variant, drives a context == "circleci" conditional, and sets non-DD labels, so it must not be repurposed). The fix mirrors the existing GIT_TAG/DD_VERSION split exactly:

  • Control-plane: DD_ENV / env label = {{ .Values.datadog.env | default .Values.context }} (Helm value, backwards-compatible: falls back to context when unset).
  • Inference endpoints: DD_ENV = ${DD_ENV} runtime substitution, populated at endpoint creation from the gateway's own DD_ENV (so launched pods inherit the gateway's per-cluster env). This follows the same pattern as ${DD_TRACE_ENABLED} / ${GIT_TAG}.

Changes

Chart (charts/model-engine)

  • values.yaml: add optional datadog.env (default "").
  • _helpers.tpl:
    • baseLabels: tags.datadoghq.com/env → datadog.env | default context.
    • serviceEnvBase: remove the hardcoded DD_ENV (moved into the wrappers, like GIT_TAG).
    • serviceEnvGitTagFromHelmVar (control-plane): add DD_ENV = datadog.env | default context.
    • serviceEnvGitTagFromPythonReplace (endpoints): add DD_ENV = ${DD_ENV}.
    • baseServiceTemplateEnv / baseForwarderTemplateEnv (legacy endpoint paths): DD_ENV → ${DD_ENV}.
  • celery_autoscaler_stateful_set.yaml: $env → datadog.env | default context.
  • Chart.yaml: 0.2.6 → 0.2.7.
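
For reviewers less familiar with Helm's `default`, here is a rough Python model of how `datadog.env | default context` is expected to resolve. It is a sketch only: `helm_default` and `resolve_chart_dd_env` are hypothetical helpers written for illustration, and the value `"gcp-dev"` is made up. The point is that `default` treats an empty string as unset, which is why shipping `datadog.env: ""` in values.yaml stays backwards-compatible.

```python
def helm_default(fallback, value):
    """Approximate Helm/Sprig `default`: use `fallback` when `value` is empty."""
    # Sprig treats "", nil, 0, false and empty collections as empty;
    # Python truthiness is a close enough model for string values.
    return value if value else fallback


def resolve_chart_dd_env(values):
    """Model of `{{ .Values.datadog.env | default .Values.context }}`."""
    datadog_env = values.get("datadog", {}).get("env", "")
    return helm_default(values["context"], datadog_env)


# Unset (the chart default): falls back to context, as before this change.
unset = resolve_chart_dd_env({"context": "production", "datadog": {"env": ""}})

# Set per cluster: the dedicated value wins and context is left untouched.
per_cluster = resolve_chart_dd_env(
    {"context": "production", "datadog": {"env": "gcp-dev"}}
)
```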

Server (model-engine)

  • common/env_vars.py: add DD_ENV (os.environ.get("DD_ENV") or infra_config().env).
  • infra/gateways/resources/k8s_resource_types.py: add DD_ENV to _BaseDeploymentArguments and pass it in all 11 deployment-argument constructors (alongside DD_TRACE_ENABLED).
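
A minimal sketch of the `os.environ.get("DD_ENV") or infra_config().env` fallback described for common/env_vars.py. The `infra_config` below is a hypothetical stand-in that returns a fixed value, not the real model-engine function, and `resolve_dd_env` is an illustrative wrapper so the behaviour can be exercised without touching the process environment.

```python
import os
from types import SimpleNamespace


def infra_config():
    """Hypothetical stand-in for model-engine's infra_config()."""
    return SimpleNamespace(env="production")


def resolve_dd_env(environ=os.environ):
    """Prefer the pod's DD_ENV; fall back to the build-time infra env."""
    # `or` (rather than a .get() default) also covers DD_ENV="" so an
    # empty variable does not produce an empty Datadog env tag.
    return environ.get("DD_ENV") or infra_config().env


from_cluster = resolve_dd_env({"DD_ENV": "staging"})
from_fallback = resolve_dd_env({})
from_empty = resolve_dd_env({"DD_ENV": ""})
```

Using `or` means a gateway deployed from an older chart, where DD_ENV may be missing, keeps reporting the previous value.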

Validation

  • helm template (with values_sample.yaml): control-plane DD_ENV/label render to datadog.env when set and fall back to context when unset; inference DD_ENV renders to ${DD_ENV} for runtime substitution. No <no value> / unrendered {{ }}.
  • Both edited modules parse cleanly with python -m ast.
  • Substitution layer uses safe_substitute, and get_endpoint_resource_arguments_from_request now supplies DD_ENV, so ${DD_ENV} resolves at endpoint creation. The existing test_k8s_endpoint_resource_delegate.py tests render the real chart and assert structurally (no golden env-block comparison), so they remain compatible.
  • I did not run the full model-engine test suite locally (heavy service deps) — CI should confirm.
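
To illustrate the substitution point above, here is a small sketch that assumes the substitution layer is built on Python's `string.Template`. The env block and the `${ENDPOINT_NAME}` placeholder are illustrative, not copied from the rendered config map. It shows `${DD_ENV}` resolving once the argument is supplied, and `safe_substitute` leaving unrelated placeholders intact instead of raising.

```python
from string import Template

# Illustrative fragment of a rendered service template; the real config
# map is larger and its other placeholders may be named differently.
RENDERED_ENV_BLOCK = """\
- name: DD_ENV
  value: "${DD_ENV}"
- name: DD_SERVICE
  value: "${ENDPOINT_NAME}"
"""


def render(template_text, arguments):
    """Fill in known placeholders; leave unknown ones untouched."""
    return Template(template_text).safe_substitute(arguments)


# Only DD_ENV is supplied here, mimicking the new deployment argument.
rendered = render(RENDERED_ENV_BLOCK, {"DD_ENV": "staging"})

# With strict substitute(), the missing ENDPOINT_NAME raises KeyError.
try:
    Template(RENDERED_ENV_BLOCK).substitute({"DD_ENV": "staging"})
    strict_failed = False
except KeyError:
    strict_failed = True
```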

Out of scope / known limitations

  • The inference pod label tags.datadoghq.com/env (baseTemplateLabels) is intentionally left on context for now — that helper is shared with job templates, so making it ${DD_ENV} would require threading DD_ENV through the job argument classes too. The DD_ENV env var is the authoritative APM env source, so this is sufficient to fix env tagging on traces/metrics; the label can follow up.
  • Downstream: consumers that pin a specific chart version or bake the rendered service_template_config_map_<env>.yaml into an image (e.g. model-engine-internal's just autogen-templates + image build) must regenerate templates and rebuild the image to pick up ${DD_ENV}. After that, the DD_ENV line in the rendered config map is a runtime placeholder rather than a literal value.

Draft pending CI + downstream coordination.

@sayakmaity

Author

SGP consumer wiring that populates engine.datadog.env from the per-cluster info object: scaleapi/sgp#3730 (draft).

@sayakmaity

Author

Follow-up commit (ad9c421): tag gateway custom metrics with DD_ENV instead of infra_config().env.

DatadogMonitoringMetricsGateway tagged its statsd metrics with env:{infra_config().env} — the build-time deployment class, not the cluster's Datadog env. Switched to the new per-cluster DD_ENV. (celery_autoscaler.py already reads os.getenv("DD_ENV"), so it's covered by the chart change.) This pairs with scaleapi/sgp#3730, which removes a GCP-only infra.env=dev override; without this commit, that removal would have regressed GCP-dev custom metrics to env:production.
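
A rough sketch of the tagging difference, with made-up values: `metric_tags` is a hypothetical helper, not the gateway's actual method, and the `"production"` / `"dev"` pair just mirrors the GCP-dev case described above.

```python
def metric_tags(env, extra_tags=()):
    """Build a statsd-style tag list with the env tag first."""
    return [f"env:{env}", *extra_tags]


infra_env = "production"  # build-time deployment class (illustrative)
cluster_dd_env = "dev"    # per-cluster DD_ENV (illustrative)

# Before: tags follow the build-time class, so GCP-dev reports production.
before = metric_tags(infra_env)

# After: tags follow the cluster's DD_ENV.
after = metric_tags(cluster_dd_env)
```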
