Python: Add Cosmos DB NoSQL Checkpoint Storage for Python Workflows #4916

Open
aayush3011 wants to merge 7 commits into microsoft:main from aayush3011:cosmos_checkpointer

@aayush3011 aayush3011 commented Mar 25, 2026

Motivation and Context

The .NET implementation of the Agent Framework already ships a native CosmosCheckpointStore for workflow checkpointing, but the Python side only supports in-memory and file-based storage. Cosmos DB customers building agents on Azure AI Foundry have been asking for native Cosmos DB checkpoint support so they can durably pause and resume workflows across process restarts without writing custom storage adapters.

This PR adds CosmosCheckpointStorage to the existing agent-framework-azure-cosmos Python package, bringing it to feature parity with .NET and letting Cosmos DB customers use workflow checkpointing out of the box.

Description

Core implementation (_checkpoint_storage.py):

  • CosmosCheckpointStorage implementing the CheckpointStorage protocol with all 6 methods (save, load, list_checkpoints, delete, get_latest, list_checkpoint_ids)
  • Authentication: supports both managed identity / RBAC (DefaultAzureCredential, ManagedIdentityCredential) and key-based auth, following the same pattern as CosmosHistoryProvider
  • Auto-creation of database and container on first use via create_database_if_not_exists
  • Partition key /workflow_name for efficient per-workflow queries
  • Reuses existing encode/decode_checkpoint_value for full Python object serialization fidelity
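
For readers unfamiliar with the protocol, the six methods and the per-workflow partitioning can be sketched with an in-memory stand-in. Everything below is illustrative: the `Checkpoint` dataclass, the `CheckpointStorageLike` protocol, the method signatures, and `InMemoryPartitionedStorage` are assumptions made for this sketch and are not taken from the PR's code or the framework's actual types.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class Checkpoint:
    """Hypothetical stand-in for the framework's checkpoint type."""

    checkpoint_id: str
    workflow_name: str  # plays the role of the /workflow_name partition key
    timestamp: str  # ISO 8601 UTC, so string order matches time order
    state: dict[str, Any] = field(default_factory=dict)


class CheckpointStorageLike(Protocol):
    """Assumed shape of the six methods; the real signatures may differ."""

    async def save(self, checkpoint: Checkpoint) -> str: ...
    async def load(self, checkpoint_id: str) -> Checkpoint: ...
    async def list_checkpoints(self, workflow_name: str) -> list[Checkpoint]: ...
    async def delete(self, checkpoint_id: str) -> bool: ...
    async def get_latest(self, workflow_name: str) -> Checkpoint | None: ...
    async def list_checkpoint_ids(self, workflow_name: str) -> list[str]: ...


class InMemoryPartitionedStorage:
    """Toy store that groups checkpoints by workflow_name, like a partition."""

    def __init__(self) -> None:
        # partition key value -> {checkpoint_id: Checkpoint}
        self._partitions: dict[str, dict[str, Checkpoint]] = {}

    async def save(self, checkpoint: Checkpoint) -> str:
        partition = self._partitions.setdefault(checkpoint.workflow_name, {})
        partition[checkpoint.checkpoint_id] = checkpoint
        return checkpoint.checkpoint_id

    async def load(self, checkpoint_id: str) -> Checkpoint:
        # Without the partition key, every partition has to be scanned.
        for partition in self._partitions.values():
            if checkpoint_id in partition:
                return partition[checkpoint_id]
        raise KeyError(checkpoint_id)

    async def list_checkpoints(self, workflow_name: str) -> list[Checkpoint]:
        # With the partition key, only one partition is touched.
        items = self._partitions.get(workflow_name, {}).values()
        return sorted(items, key=lambda c: c.timestamp)

    async def delete(self, checkpoint_id: str) -> bool:
        for partition in self._partitions.values():
            if partition.pop(checkpoint_id, None) is not None:
                return True
        return False

    async def get_latest(self, workflow_name: str) -> Checkpoint | None:
        items = await self.list_checkpoints(workflow_name)
        return items[-1] if items else None

    async def list_checkpoint_ids(self, workflow_name: str) -> list[str]:
        return [c.checkpoint_id for c in await self.list_checkpoints(workflow_name)]


async def demo() -> tuple[str, list[str]]:
    store: CheckpointStorageLike = InMemoryPartitionedStorage()
    await store.save(Checkpoint("cp-1", "orders", "2026-03-25T10:00:00Z", {"step": 1}))
    await store.save(Checkpoint("cp-2", "orders", "2026-03-25T10:05:00Z", {"step": 2}))
    await store.save(Checkpoint("cp-3", "billing", "2026-03-25T10:01:00Z"))
    latest = await store.get_latest("orders")
    assert latest is not None
    return latest.checkpoint_id, await store.list_checkpoint_ids("orders")


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The sketch shows why partitioning by workflow name suits the listing methods: they read a single partition, whereas a lookup by checkpoint ID alone has to search across partitions.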

Tests (test_cosmos_checkpoint_storage.py):

  • 26 unit tests covering all protocol methods, auth modes, error handling, and save/load round-trip
  • Integration test validated against live Cosmos DB

Samples:

  • cosmos_workflow_checkpointing.py — standalone workflow checkpoint and resume with Cosmos DB
  • cosmos_workflow_checkpointing_foundry.py — end-to-end Azure AI Foundry agents with Cosmos DB checkpointing

Contribution Checklist

  • The code builds clean without any errors or warnings
  • The PR follows the Contribution Guidelines
  • All unit tests pass, and I have added new tests where possible
  • Is this a breaking change? If yes, add "[BREAKING]" prefix to the title of the PR.

Aayush Kataria and others added 2 commits March 25, 2026 14:13
Add native Cosmos DB NoSQL support for workflow checkpoint storage in the
Python agent-framework-azure-cosmos package, achieving parity with the
existing .NET CosmosCheckpointStore.

New files:
- _checkpoint_storage.py: CosmosCheckpointStorage implementing the
  CheckpointStorage protocol with 6 methods (save, load, list_checkpoints,
  delete, get_latest, list_checkpoint_ids)
- test_cosmos_checkpoint_storage.py: Unit and integration tests
- workflow_checkpointing.py: Sample demonstrating Cosmos DB-backed
  workflow checkpoint/resume

Auth support:
- Managed identity / RBAC via Azure credential objects
  (DefaultAzureCredential, ManagedIdentityCredential, etc.)
- Key-based auth via account key string or AZURE_COSMOS_KEY env var
- Pre-created CosmosClient or ContainerProxy
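
A rough sketch of how the credential choice above might be resolved. The helper name `resolve_cosmos_credential` and the precedence (an explicitly passed credential wins over the environment variable) are assumptions for illustration; the PR's actual resolution logic may differ. Only the AZURE_COSMOS_KEY variable name comes from the commit message.

```python
import os
from typing import Any, Mapping, Optional


def resolve_cosmos_credential(
    credential: Optional[Any] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Any:
    """Hypothetical helper: pick the value to hand to the Cosmos client.

    `credential` may be a token credential object (for managed identity /
    RBAC) or an account key string. If it is omitted, fall back to the
    AZURE_COSMOS_KEY environment variable.
    """
    env = os.environ if env is None else env
    if credential is not None:
        return credential
    key = env.get("AZURE_COSMOS_KEY")
    if key:
        return key
    raise ValueError(
        "No Cosmos DB credential: pass a credential or set AZURE_COSMOS_KEY"
    )
```

When a pre-created client or container is supplied, no resolution of this kind would be needed, since the caller has already authenticated.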

Key design decisions:
- Partition key: /workflow_name for efficient per-workflow queries
- Serialization: Reuses encode/decode_checkpoint_value for full Python
  object fidelity (hybrid JSON + pickle approach)
- Container auto-creation via create_container_if_not_exists

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
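
To illustrate the "hybrid JSON + pickle" idea and the /workflow_name partition key from the design decisions above, here is a minimal standard-library sketch. It is not the framework's encode_checkpoint_value / decode_checkpoint_value; the function names, the `__pickled__` marker, and the document field names other than `id` and `workflow_name` are invented for this example.

```python
import base64
import json
import pickle
from typing import Any

_PICKLE_MARKER = "__pickled__"  # hypothetical marker key, chosen for this sketch


def encode_value(value: Any) -> Any:
    """Keep JSON-native values as-is; pickle and base64-encode the rest."""
    if value is None or isinstance(value, (bool, int, float, str)):
        return value
    if isinstance(value, list):
        return [encode_value(v) for v in value]
    if isinstance(value, dict) and all(isinstance(k, str) for k in value):
        return {k: encode_value(v) for k, v in value.items()}
    # Sets, tuples, custom classes, etc. have no JSON form, so pickle them.
    blob = base64.b64encode(pickle.dumps(value)).decode("ascii")
    return {_PICKLE_MARKER: blob}


def decode_value(value: Any) -> Any:
    """Reverse encode_value. Only unpickle data from a trusted store."""
    if isinstance(value, list):
        return [decode_value(v) for v in value]
    if isinstance(value, dict):
        # Limitation of this sketch: a user dict whose only key is the
        # marker would be mistaken for a pickled payload.
        if set(value) == {_PICKLE_MARKER}:
            return pickle.loads(base64.b64decode(value[_PICKLE_MARKER]))
        return {k: decode_value(v) for k, v in value.items()}
    return value


def to_document(checkpoint_id: str, workflow_name: str, state: Any) -> dict:
    """Build a JSON-serializable item; `workflow_name` is the partition key field."""
    return {
        "id": checkpoint_id,  # Cosmos DB items carry a string `id`
        "workflow_name": workflow_name,
        "state": encode_value(state),
    }


if __name__ == "__main__":
    state = {"step": 2, "seen": {"a", "b"}, "pair": (1, 2)}
    doc = to_document("cp-1", "orders", state)
    restored = decode_value(json.loads(json.dumps(doc))["state"])
    print(restored == state)
```

The trade-off in any scheme like this is that pickled payloads are opaque to queries and unsafe to load from untrusted sources, while plain JSON fields remain inspectable in the stored item.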
Copilot AI review requested due to automatic review settings March 25, 2026 22:16
@markwallace-microsoft added the documentation and python labels Mar 25, 2026
@github-actions bot changed the title from "Add Cosmos DB NoSQL Checkpoint Storage for Python Workflows" to "Python: Add Cosmos DB NoSQL Checkpoint Storage for Python Workflows" Mar 25, 2026
Contributor

Copilot AI left a comment

Pull request overview

Adds a Cosmos DB (NoSQL) checkpoint storage backend to the Python agent-framework-azure-cosmos package to enable durable workflow pause/resume (feature-parity with the .NET Cosmos checkpoint store).

Changes:

  • Introduces CosmosCheckpointStorage implementing workflow checkpoint persistence in Cosmos DB (auto-creates DB/container, partitions by workflow_name).
  • Adds unit + integration tests covering the checkpoint storage behavior.
  • Adds runnable samples + README updates showing Cosmos-backed workflow checkpointing (standalone and Azure AI Foundry).

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| python/packages/azure-cosmos/agent_framework_azure_cosmos/_checkpoint_storage.py | Implements Cosmos-backed checkpoint storage (save/load/list/delete/latest/ids). |
| python/packages/azure-cosmos/agent_framework_azure_cosmos/__init__.py | Exposes CosmosCheckpointStorage from the package. |
| python/packages/azure-cosmos/tests/test_cosmos_checkpoint_storage.py | Adds unit tests and an integration round-trip test for the new storage. |
| python/packages/azure-cosmos/samples/cosmos_workflow_checkpointing.py | Standalone workflow sample using Cosmos-backed checkpointing. |
| python/packages/azure-cosmos/samples/cosmos_workflow_checkpointing_foundry.py | Foundry multi-agent workflow sample using Cosmos checkpoint storage. |
| python/packages/azure-cosmos/samples/README.md | Documents the new samples. |
| python/packages/azure-cosmos/README.md | Documents CosmosCheckpointStorage usage and configuration. |
| python/packages/azure-cosmos/pyproject.toml | Extends the integration test task to include the new integration test. |

@markwallace-microsoft
Member

Python Test Coverage

Python Test Coverage Report

| File | Stmts | Miss | Cover | Missing |
| --- | --- | --- | --- | --- |
| packages/azure-cosmos/agent_framework_azure_cosmos/_checkpoint_storage.py | 135 | 4 | 97% | 263–264, 389, 396 |
| TOTAL | 28147 | 3416 | 87% | |

Python Unit Test Overview

| Tests | Skipped | Failures | Errors | Time |
| --- | --- | --- | --- | --- |
| 5481 | 20 💤 | 0 ❌ | 0 🔥 | 1m 21s ⏱️ |

@aayush3011
Author

@markwallace-microsoft, please review this PR.
