fix: make Datadog env tag per-cluster (DD_ENV) instead of build-time context#849
Draft
sayakmaity wants to merge 2 commits into
Draft
fix: make Datadog env tag per-cluster (DD_ENV) instead of build-time context#849sayakmaity wants to merge 2 commits into
sayakmaity wants to merge 2 commits into
Conversation
Author
|
SGP consumer wiring that populates |
…infra_config().env
Author
|
Follow-up commit (
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
model-engine's Datadog
envtag is derived from the build-time Helm value.Values.context, so every deployment reports the sameenv(e.g.production) regardless of which cluster/environment it actually runs in. This affects two independent surfaces:DD_ENVand thetags.datadoghq.com/envlabel both come straight from.Values.context.DD_ENVis baked into the renderedservice_template_config_mapfrom.Values.contextathelm templatetime, so it's frozen in the image and identical for every cluster the image is deployed to (unlike the neighboringDD_SERVICE/DD_VERSION, which are already${...}runtime-substituted).Approach
Introduce a dedicated, optional
datadog.envvalue, decoupled fromcontext(which is overloaded — it also selects theservice_template_config_mapvariant, drives acontext == "circleci"conditional, and sets non-DD labels, so it must not be repurposed). The fix mirrors the existingGIT_TAG/DD_VERSIONsplit exactly:DD_ENV/ env label ={{ .Values.datadog.env | default .Values.context }}(Helm value, backwards-compatible: falls back tocontextwhen unset).DD_ENV=${DD_ENV}runtime substitution, populated at endpoint-creation from the gateway's ownDD_ENV(so launched pods inherit the gateway's per-cluster env). This follows the same pattern as${DD_TRACE_ENABLED}/${GIT_TAG}.Changes
Chart (
charts/model-engine)values.yaml: add optionaldatadog.env(default"")._helpers.tpl:baseLabels:tags.datadoghq.com/env→datadog.env | default context.serviceEnvBase: remove the hardcodedDD_ENV(moved into the wrappers, likeGIT_TAG).serviceEnvGitTagFromHelmVar(control-plane): addDD_ENV=datadog.env | default context.serviceEnvGitTagFromPythonReplace(endpoints): addDD_ENV=${DD_ENV}.baseServiceTemplateEnv/baseForwarderTemplateEnv(legacy endpoint paths):DD_ENV→${DD_ENV}.celery_autoscaler_stateful_set.yaml:$env→datadog.env | default context.Chart.yaml:0.2.6→0.2.7.Server (
model-engine)common/env_vars.py: addDD_ENV(os.environ.get("DD_ENV") or infra_config().env).infra/gateways/resources/k8s_resource_types.py: addDD_ENVto_BaseDeploymentArgumentsand pass it in all 11 deployment-argument constructors (alongsideDD_TRACE_ENABLED).Validation
helm template(withvalues_sample.yaml): control-planeDD_ENV/label render todatadog.envwhen set and fall back tocontextwhen unset; inferenceDD_ENVrenders to${DD_ENV}for runtime substitution. No<no value>/ unrendered{{ }}.python -m astparse clean on both edited modules.safe_substitute, andget_endpoint_resource_arguments_from_requestnow suppliesDD_ENV, so${DD_ENV}resolves at endpoint creation. The existingtest_k8s_endpoint_resource_delegate.pytests render the real chart and assert structurally (no golden env-block comparison), so they remain compatible.model-enginetest suite locally (heavy service deps) — CI should confirm.Out of scope / known limitations
tags.datadoghq.com/env(baseTemplateLabels) is intentionally left oncontextfor now — that helper is shared with job templates, so making it${DD_ENV}would require threadingDD_ENVthrough the job argument classes too. TheDD_ENVenv var is the authoritative APMenvsource, so this is sufficient to fix env tagging on traces/metrics; the label can follow up.service_template_config_map_<env>.yamlinto an image (e.g. model-engine-internal'sjust autogen-templates+ image build) must regenerate templates and rebuild the image to pick up${DD_ENV}; and the rendered config map'sDD_ENVline will become a runtime placeholder.Draft pending CI + downstream coordination.