
feat: MLflow metrics visualization, enhanced wait UI, and eval job links #5662

Open
mollyheamazon wants to merge 11 commits into aws:master from mollyheamazon:feat/mlflow-mc

mollyheamazon (Contributor) commented Mar 20, 2026

What's new

New public APIs (sagemaker.train)

get_studio_url — get a SageMaker Studio URL for any training job:

from sagemaker.train import get_studio_url

url = get_studio_url(training_job)                                          # TrainingJob object
url = get_studio_url("my-job-name")                                         # job name string
url = get_studio_url("arn:aws:sagemaker:us-west-2:123:training-job/my-job") # ARN string
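Since get_studio_url accepts either a job name or an ARN string, the two string forms have to be told apart somewhere. The sketch below shows one way that could be done, assuming the standard ARN layout shown in the example above. The names parse_training_job_arn, looks_like_arn, and JobArnParts are hypothetical and are not taken from this PR's implementation.

```python
from typing import NamedTuple


class JobArnParts(NamedTuple):
    """Pieces of a training job ARN (hypothetical helper type)."""
    region: str
    account_id: str
    job_name: str


def looks_like_arn(value: str) -> bool:
    """Cheap check used to tell an ARN string from a plain job name."""
    return value.startswith("arn:")


def parse_training_job_arn(arn: str) -> JobArnParts:
    """Split a training job ARN into region, account ID, and job name."""
    # Assumed layout: arn:<partition>:sagemaker:<region>:<account>:training-job/<name>
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "sagemaker":
        raise ValueError(f"Not a SageMaker ARN: {arn!r}")
    resource_type, _, job_name = parts[5].partition("/")
    if resource_type != "training-job" or not job_name:
        raise ValueError(f"Not a training job ARN: {arn!r}")
    return JobArnParts(region=parts[3], account_id=parts[4], job_name=job_name)
```

A caller could branch on looks_like_arn(value) and fall back to treating the string as a job name in the current session's region.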

get_mlflow_url — get a presigned MLflow experiment URL (valid 5 min):

from sagemaker.train import get_mlflow_url

url = get_mlflow_url(training_job)
url = get_mlflow_url("my-job-name")
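Because the presigned URL expires after 5 minutes, long-running callers may want to cache it and fetch a fresh one shortly before expiry. The following is a minimal sketch of that idea, not the PR's implementation. RefreshingUrl is a hypothetical helper, and the 240-second default mirrors the 4-minute refresh interval described for the wait() UI.

```python
import time
from typing import Callable, Optional


class RefreshingUrl:
    """Cache a short-lived URL and re-fetch it once it is older than max_age_seconds."""

    def __init__(
        self,
        fetch: Callable[[], str],
        max_age_seconds: float = 240.0,  # assumption: refresh 1 min before a 5-min expiry
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._max_age = max_age_seconds
        self._clock = clock
        self._url: Optional[str] = None
        self._fetched_at = 0.0

    def get(self) -> str:
        """Return the cached URL, fetching a new one if the cache is empty or stale."""
        now = self._clock()
        if self._url is None or now - self._fetched_at >= self._max_age:
            self._url = self._fetch()
            self._fetched_at = now
        return self._url
```

With the API above, this could be wired up as RefreshingUrl(lambda: get_mlflow_url(training_job)), calling .get() each time the link is displayed.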

plot_training_metrics — plot MLflow metrics from a completed training job in Jupyter (requires
sagemaker-train[notebook]):

from sagemaker.train import plot_training_metrics

plot_training_metrics(training_job)                          # all metrics
plot_training_metrics(training_job, metrics=["loss", "accuracy"])  # specific metrics

get_available_metrics — list available MLflow metrics for a job:

from sagemaker.train import get_available_metrics

metrics = get_available_metrics(training_job)
# ['loss', 'accuracy', 'eval_loss', ...]

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Enhancements

Training job wait() UI overhaul

  • Adds TrainingJob ARN row and clickable Console / Studio / CloudWatch links
  • MLflow experiment link with auto-refresh every 4 min (before 5-min presigned URL expiry)
  • Smarter render throttling — only re-renders on status change or every 2s, reducing flicker
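The throttling rule in the last bullet (re-render on a status change, otherwise at most once every 2 seconds) can be sketched as a small state holder. RenderThrottle and its method names are hypothetical, chosen for illustration only. The PR's actual rendering code may be structured differently.

```python
import time
from typing import Callable, Optional


class RenderThrottle:
    """Decide whether a polling loop should redraw its status display."""

    def __init__(
        self,
        min_interval_seconds: float = 2.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._min_interval = min_interval_seconds
        self._clock = clock
        self._last_status: Optional[str] = None
        self._last_render: Optional[float] = None

    def should_render(self, status: str) -> bool:
        """Return True on a status change or when the interval has elapsed."""
        now = self._clock()
        changed = status != self._last_status
        due = self._last_render is None or now - self._last_render >= self._min_interval
        if changed or due:
            # Record the render so the next call is measured from this point.
            self._last_status = status
            self._last_render = now
            return True
        return False
```

A polling loop would call should_render(current_status) on every iteration and skip the redraw when it returns False, which is what reduces flicker.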

Evaluation pipeline wait() UI

  • Pipeline execution Studio link in header
  • Per-step job ARN table with Console, Studio, and CloudWatch links for each pipeline step
  • Steps now rendered in chronological order (earliest first)
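Chronological ordering of steps can be illustrated with a plain sort on start time. The sketch below assumes each step is a dict with a timezone-aware "StartTime" datetime and a "StepName" string. That shape is an assumption made for illustration, and sort_steps_chronologically is a hypothetical name, not a function from this PR.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List


def sort_steps_chronologically(steps: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return steps ordered earliest first; steps without a start time go last."""
    # Assumption: StartTime values are timezone-aware datetimes.
    far_future = datetime.max.replace(tzinfo=timezone.utc)
    return sorted(steps, key=lambda step: step.get("StartTime") or far_future)
```

Because sorted is stable, steps that have not started yet keep their original relative order at the end of the list.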

Loss metric detection broadened

  • Previously matched only exact total_loss; now matches any metric containing "loss" via
    LOSS_METRIC_KEYWORDS, improving coverage across model families
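As a rough illustration of the broadened matching, the sketch below selects every metric whose name contains a keyword. The contents of LOSS_METRIC_KEYWORDS shown here, the case-insensitive comparison, and the helper names are assumptions. The PR only states that any metric containing "loss" now matches.

```python
from typing import Iterable, List

# Assumed contents; the real constant in the PR may list more keywords.
LOSS_METRIC_KEYWORDS = ("loss",)


def is_loss_metric(name: str) -> bool:
    """Return True if the metric name contains any loss keyword (case-insensitive)."""
    lowered = name.lower()
    return any(keyword in lowered for keyword in LOSS_METRIC_KEYWORDS)


def select_loss_metrics(names: Iterable[str]) -> List[str]:
    """Filter metric names down to loss-like metrics, preserving order."""
    return [name for name in names if is_loss_metric(name)]
```

Under this rule, names such as total_loss and eval_loss both match, whereas the previous exact-match behavior would have accepted only total_loss.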

Optional notebook dependencies

  • ipywidgets, rich, matplotlib added to optional extra — install with:
  pip install sagemaker-train[notebook]
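Code that depends on an optional extra usually checks for the packages up front and fails with an actionable message. The sketch below shows one such check using only the standard library. The helper names are hypothetical, and this is not necessarily how the PR guards its imports.

```python
import importlib.util
from typing import List, Sequence

# Packages the PR lists under the optional "notebook" extra.
NOTEBOOK_MODULES = ("ipywidgets", "rich", "matplotlib")


def missing_notebook_modules(modules: Sequence[str] = NOTEBOOK_MODULES) -> List[str]:
    """Return the names of top-level modules that cannot be found."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def require_notebook_extra() -> None:
    """Raise ImportError with an install hint if any notebook dependency is missing."""
    missing = missing_notebook_modules()
    if missing:
        raise ImportError(
            "Missing optional dependencies: "
            + ", ".join(missing)
            + ". Install them with: pip install sagemaker-train[notebook]"
        )
```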

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Bug fixes

  • resource_config null-safety: TrainingJob.wait() no longer crashes when resource_config is Unassigned
  • MlflowRunName removed from pipeline templates: The pipeline definition uses the MLflowConfiguration schema (pipeline-level), which only supports MlflowResourceArn and MlflowExperimentName. MlflowRunName belongs to MlflowConfig (training job-level) and is not a valid field in the pipeline definition — passing it was causing API validation errors.
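The resource_config fix amounts to treating an unset field as absent rather than dereferencing it. The sketch below illustrates that pattern with a stand-in sentinel class. Unassigned here is a local placeholder rather than the SDK's own type, and safe_get and the instance_count attribute are hypothetical, chosen only to make the example concrete.

```python
from types import SimpleNamespace
from typing import Any


class Unassigned:
    """Stand-in for an SDK sentinel marking a field that was never set."""

    def __repr__(self) -> str:
        return "Unassigned"


def safe_get(obj: Any, name: str, default: Any = None) -> Any:
    """Read obj.name, returning default if obj or the attribute is None or Unassigned."""
    if obj is None or isinstance(obj, Unassigned):
        return default
    value = getattr(obj, name, default)
    if value is None or isinstance(value, Unassigned):
        return default
    return value


# Usage: a job whose resource_config was never populated no longer raises.
job = SimpleNamespace(resource_config=Unassigned())
instance_count = safe_get(job.resource_config, "instance_count", default="N/A")
```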

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Testing

  • 15 new unit tests added in tests/unit/train/common_utils/test_metrics_visualizer.py covering _parse_job_arn, get_console_job_url, get_cloudwatch_logs_url, get_studio_url (object / ARN / job-name inputs), and get_available_metrics

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

mollyheamazon changed the title from "Feat/mlflow mc" to "feat: MLflow metrics visualization, enhanced wait UI, and eval job links" on Mar 21, 2026
mollyheamazon marked this pull request as ready for review on March 21, 2026 at 00:19