
@caohy1988

Summary

This PR implements a durable session persistence layer for ADK, enabling cross-process checkpoint-based recovery for long-running agent tasks. This addresses the "12-minute barrier" problem where agents lose state during long BigQuery jobs or other async operations.

Key Features

  • DurableSessionConfig: Configuration for durable cross-process checkpointing
  • BigQueryCheckpointStore: Two-phase commit checkpoint storage (BigQuery metadata + GCS blobs)
  • CheckpointableAgentState: Abstract interface for agents supporting durability
  • WorkspaceSnapshotter: GCS-based workspace directory snapshotting
  • Lease-based concurrency: Safe resume with optimistic locking
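The lease-based resume in the last bullet can be illustrated with a minimal in-memory sketch (all names here are hypothetical; in the PR the lease lives on the BigQuery session row and is updated with optimistic locking):

```python
import time
import uuid
from dataclasses import dataclass


@dataclass
class Lease:
  lease_id: str
  expiry: float  # monotonic-clock deadline


class InMemoryLeaseTable:
  """Toy stand-in for the lease columns on the session row."""

  def __init__(self):
    self._leases = {}  # session_id -> Lease

  def try_acquire(self, session_id: str, ttl_s: float = 60.0):
    """Return a new lease id if the session is free or its lease has
    expired; return None if another worker still holds a live lease."""
    now = time.monotonic()
    current = self._leases.get(session_id)
    if current is not None and current.expiry > now:
      return None  # someone else is resuming this session
    lease_id = uuid.uuid4().hex
    self._leases[session_id] = Lease(lease_id, now + ttl_s)
    return lease_id
```

A worker that fails to acquire the lease backs off instead of double-resuming; an expired lease from a crashed worker is simply taken over.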

Implementation Highlights

  • Two-phase commit: GCS blob upload → BigQuery metadata insert (atomic visibility)
  • SHA-256 verification: checkpoint integrity verified on read
  • Async-first API: all store methods are async for non-blocking I/O
  • Experimental decorators: all public classes marked @experimental
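A minimal sketch of this write/read ordering, with in-memory dicts standing in for GCS and BigQuery (class and field names are illustrative, not the PR's actual API):

```python
import hashlib
import json


class TwoPhaseCheckpointWriter:
  """Blob-then-metadata ordering, with in-memory stand-ins for GCS/BigQuery."""

  def __init__(self):
    self.blob_store = {}  # stands in for the GCS bucket
    self.metadata = {}    # stands in for the BigQuery table

  def write_checkpoint(self, checkpoint_id: str, state: dict) -> None:
    payload = json.dumps(state, sort_keys=True).encode()
    digest = hashlib.sha256(payload).hexdigest()
    # Phase 1: upload the blob. A crash here leaves an orphan blob but no
    # metadata row, so the half-written checkpoint is never visible.
    self.blob_store[checkpoint_id] = payload
    # Phase 2: insert metadata. Only now does the checkpoint become readable.
    self.metadata[checkpoint_id] = {"sha256": digest, "size": len(payload)}

  def read_checkpoint(self, checkpoint_id: str) -> dict:
    row = self.metadata[checkpoint_id]
    payload = self.blob_store[checkpoint_id]
    # SHA-256 integrity check on read, mirroring the highlight above.
    if hashlib.sha256(payload).hexdigest() != row["sha256"]:
      raise ValueError(f"checkpoint {checkpoint_id} failed integrity check")
    return json.loads(payload)
```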

Files Added

Core Module (src/google/adk/durable/)

  • config.py - DurableSessionConfig
  • checkpointable_state.py - CheckpointableAgentState ABC
  • stores/base_checkpoint_store.py - DurableSessionStore ABC
  • stores/bigquery_checkpoint_store.py - BigQuery + GCS implementation
  • workspace_snapshotter.py - GCS workspace snapshots

Demo (contributing/samples/long_running_task/)

  • agent.py - Demo agent with durable config
  • demo_server.py - FastAPI server with checkpoint APIs
  • demo_ui.html - Real-time visualization UI
  • long_running_task_design.md - Detailed design document

Tests (tests/unittests/durable/)

  • Unit tests for all components

Live Demo

A fully functional demo is deployed on Cloud Run:

URL: https://durable-demo-201486563047.us-central1.run.app

The demo showcases:

  • Real-time checkpoint visualization
  • Task failure simulation
  • Checkpoint-based recovery
  • BigQuery metadata queries
  • Final task output display

Infrastructure:

  • BigQuery Dataset: test-project-0728-467323.adk_metadata
  • GCS Bucket: gs://test-project-0728-467323-adk-checkpoints

Test plan

  • Unit tests pass for all durable components
  • End-to-end checkpoint write/read verified against real BigQuery/GCS
  • Demo deployed and functional on Cloud Run
  • Failure simulation and recovery tested
  • SHA-256 integrity verification tested

Design Document

See contributing/samples/long_running_task/long_running_task_design.md for the full design including:

  • Architecture overview
  • API contracts
  • BigQuery schema definitions
  • Security considerations
  • Cost estimation
  • Monitoring recommendations

🤖 Generated with Claude Code

@google-cla

google-cla bot commented Feb 2, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@adk-bot adk-bot added the services [Component] This issue is related to runtime services, e.g. sessions, memory, artifacts, etc label Feb 2, 2026
@adk-bot

adk-bot commented Feb 2, 2026

Response from ADK Triaging Agent

Hello @caohy1988, thank you for creating this PR!

Before we can proceed with the review, could you please address the following items from our contribution guidelines:

  • Sign our Contributor License Agreement (CLA): It looks like this may be your first contribution, and the CLA has not been signed yet. You can sign it at https://cla.developers.google.com/.
  • Associate a GitHub Issue: For new features like this, we require an associated GitHub issue to track the work. If one doesn't exist, could you please create one and link it in the PR description?

This information will help us to review your PR more efficiently. Thanks!

@gemini-code-assist

Summary of Changes

Hello @caohy1988, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances ADK's capabilities by introducing a robust durable session persistence layer. This new feature allows long-running agent tasks to maintain their state across process boundaries and system failures, ensuring continuity and reliability for complex, time-consuming operations. By leveraging BigQuery and Google Cloud Storage, it provides an auditable and scalable solution for managing agent progress, effectively overcoming limitations previously faced in cloud environments.

Highlights

  • Durable Session Persistence Layer: Introduces a durable session persistence layer for ADK, enabling cross-process checkpoint-based recovery for long-running agent tasks and addressing the '12-minute barrier' problem.
  • Key Components: Adds DurableSessionConfig for configuration, BigQueryCheckpointStore for two-phase commit checkpoint storage (BigQuery metadata + GCS blobs), CheckpointableAgentState as an abstract interface for agents, and WorkspaceSnapshotter for GCS-based workspace directory snapshotting.
  • Concurrency and Integrity: Implements lease-based concurrency for safe resume with optimistic locking and includes SHA-256 verification for checkpoint integrity on read.
  • Async-first API: All store methods are designed as async for non-blocking I/O, ensuring efficient operation.
  • Comprehensive Demo: Includes a fully functional demo with an agent, FastAPI server, and real-time visualization UI, showcasing task failure simulation, checkpoint-based recovery, and BigQuery metadata queries.
  • Design Document and Review: A detailed design document (long_running_task_design.md) is added, along with a review feedback document (REVIEW_FEEDBACK.md) that critically assesses the design against existing ADK capabilities.




@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive and well-designed durable session persistence layer, which is a significant feature for enabling long-horizon agents. The use of BigQuery for metadata and GCS for blobs is a robust pattern, and the implementation correctly includes key features like two-phase commits and lease-based concurrency. The accompanying demo is excellent for showcasing the functionality. My review identifies a few important issues to address, primarily concerning security (a hardcoded API key and a potential path traversal vulnerability), a race condition in session creation, and several opportunities for code refinement and improved maintainability. Overall, this is a strong feature addition, and addressing these points will make it even more robust.

Comment on lines 58 to 61
GOOGLE_CLOUD_API_KEY = os.environ.get(
    "GOOGLE_CLOUD_API_KEY",
    "AQ.Ab8RN6L12XpDo1x7Gf2w87EfspguWGrjZPW6XocNy2og_-z_jg",
)

severity: critical (security)

A default API key is hardcoded as a fallback value. This is a significant security risk, as it could be accidentally committed and exposed. Even for a demo, it's best practice to avoid hardcoding secrets. The application should fail explicitly if the key is not provided in the environment, rather than falling back to a hardcoded value.

GOOGLE_CLOUD_API_KEY = os.environ.get("GOOGLE_CLOUD_API_KEY")
if not GOOGLE_CLOUD_API_KEY:
    raise ValueError("GOOGLE_CLOUD_API_KEY environment variable not set.")


reverted

Comment on lines +149 to +151
existing = await self.get_session(session_id=session_id)
if existing:
  raise ValueError(f"Session {session_id} already exists")

severity: high

There is a race condition here. Two concurrent requests could both check for an existing session, find none, and then both attempt to create it. Since BigQuery PRIMARY KEY constraints are not enforced, this could lead to duplicate session entries. The session creation logic should be made idempotent. One approach is to use a unique ID for the BigQuery insert job, which makes the insertion retryable and idempotent within a certain window.
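One way to sketch the suggested fix: derive a deterministic row id from the session id and pass it to the streaming insert (google-cloud-bigquery's `insert_rows_json` accepts a `row_ids` argument for best-effort deduplication). The stand-in below simulates that dedup locally; all names are hypothetical:

```python
import hashlib


def session_row_id(session_id: str) -> str:
  """Deterministic insert id: retries of the same CREATE map to one row."""
  return hashlib.sha256(f"create-session:{session_id}".encode()).hexdigest()


class DedupingInserter:
  """In-memory stand-in for insert_rows_json(table, rows, row_ids=[...])."""

  def __init__(self):
    self.rows = {}  # row_id -> row

  def insert(self, row: dict, row_id: str) -> bool:
    """Insert unless a row with this id was already accepted."""
    if row_id in self.rows:
      return False  # duplicate create suppressed
    self.rows[row_id] = row
    return True
```

Two concurrent (or retried) creates for the same session then collapse to a single row instead of racing the check-then-insert.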

Comment on lines +128 to +130
safe_members = [
    m for m in tar.getmembers() if not m.name.startswith(("/", ".."))
]

severity: high (security)

The check to prevent path traversal attacks (tar-slip) is insufficient. An attacker could craft a filename like a/../../etc/passwd which would bypass the current check. A more robust approach is to resolve the real path of each member and ensure it is within the intended destination directory before extraction. Using tar.extractall with a filtered list is risky; it's safer to iterate through members and extract them individually with proper path validation.

workspace_root = os.path.realpath(self._workspace_dir)
for member in tar.getmembers():
  member_path = os.path.join(self._workspace_dir, member.name)
  # Resolve the absolute path and ensure it stays inside the workspace.
  # Compare against workspace_root + os.sep so a sibling directory such as
  # /workspace-evil does not pass a bare string-prefix check.
  if os.path.realpath(member_path).startswith(workspace_root + os.sep):
    tar.extract(member, self._workspace_dir)
  else:
    logger.warning("Skipping potentially unsafe path in tarball: %s", member.name)

async def list_sessions():
  """List all sessions from BigQuery."""
  try:
    client = checkpoint_store._get_bq_client()

severity: medium

Accessing a "private" member _get_bq_client from outside the class is generally discouraged as it breaks encapsulation. If external access to the client is needed, consider providing a public property or method in the BigQueryCheckpointStore class.
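A sketch of the suggested public accessor (the class shape here is hypothetical, not the PR's actual code):

```python
class BigQueryCheckpointStoreSketch:
  """Illustrates exposing the BigQuery client via a public property."""

  def __init__(self, client):
    self._bq_client = client  # the real store may create this lazily

  @property
  def bq_client(self):
    """Public, read-only access for callers such as the demo server."""
    return self._bq_client
```

Callers would then use `store.bq_client` instead of reaching into `store._get_bq_client()`.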

Comment on lines +103 to +104
except Exception as e:
  return {"sessions": [], "error": str(e)}

severity: medium

Catching a broad Exception can hide bugs and make debugging difficult. It's better to catch more specific exceptions that you expect to handle (e.g., exceptions from the BigQuery client). Additionally, returning a 200 OK status with an error message in the body for a failed API call is not standard practice. Consider raising an HTTPException with a 5xx status code to provide a more accurate API response.

Suggested change
- except Exception as e:
-   return {"sessions": [], "error": str(e)}
+ except Exception as e:
+   raise HTTPException(status_code=500, detail=f"Failed to list sessions: {e}")


async def run_task_with_checkpoints(session_id: str, duration: int, resume: bool = False):
  """Run a long-running task with periodic checkpoints."""
  import random

severity: medium

For better code organization and to adhere to standard Python style (PEP 8), imports should be placed at the top of the file. Moving import random to the top module level will improve readability and consistency.

Comment on lines +77 to +78
with open("/tmp/lifecycle.json", "w") as f:
  f.write(lifecycle_config)

severity: medium

Using a hardcoded path like /tmp/lifecycle.json can be problematic in environments where /tmp is not writable or has specific restrictions (e.g., some serverless environments). It's more robust to use Python's tempfile module to create temporary files in a secure and platform-independent manner.

import tempfile

# ...

  with tempfile.NamedTemporaryFile(mode="w", delete=False, suffix=".json") as tmp_file:
    tmp_file.write(lifecycle_config)
    lifecycle_path = tmp_file.name

  run_command(
      [
          "gsutil",
          "lifecycle",
          "set",
          lifecycle_path,
          f"gs://{GCS_BUCKET}",
      ],
      check=False,
  )
  os.remove(lifecycle_path)

active_lease_id=row.active_lease_id,
lease_expiry=row.lease_expiry,
ttl_expiry=row.ttl_expiry,
metadata=row.metadata if isinstance(row.metadata, dict) else (json.loads(row.metadata) if row.metadata else None),

severity: medium

This complex one-liner for handling JSON parsing from BigQuery is duplicated in get_session, read_checkpoint, and list_checkpoints. To improve maintainability and reduce redundancy, this logic should be extracted into a private helper method.

… agents

This PR implements a durable session persistence layer for ADK, enabling
cross-process checkpoint-based recovery for long-running agent tasks.

## Key Features

- **DurableSessionConfig**: Configuration for durable cross-process checkpointing
- **BigQueryCheckpointStore**: Two-phase commit checkpoint storage (BQ metadata + GCS blobs)
- **CheckpointableAgentState**: Abstract interface for agents supporting durability
- **WorkspaceSnapshotter**: GCS-based workspace directory snapshotting

## Implementation Details

- Two-phase commit: GCS blob upload → BigQuery metadata insert
- SHA-256 checkpoint integrity verification
- Lease-based concurrency control for safe resume
- Async-first API design for non-blocking I/O

## Demo

A fully functional demo is deployed on Cloud Run showcasing:
- Real-time checkpoint visualization
- Task failure simulation and recovery
- BigQuery metadata queries
- Final task output display

Demo URL: https://durable-demo-201486563047.us-central1.run.app

## Files Added

- src/google/adk/durable/ - Core durable module
- contributing/samples/long_running_task/ - Demo agent and UI
- tests/unittests/durable/ - Unit tests

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@caohy1988 caohy1988 force-pushed the feature/durable-session-persistence branch from 99e7726 to 7d946ed on February 2, 2026 at 09:38
@ryanaiagent ryanaiagent self-assigned this Feb 2, 2026