Skip to content

# SQLMesh FileNotFoundError on Context Initialization with Stale Cache #5712

@Tom-Fynes

Description

@Tom-Fynes

Issue Summary

When models are deleted from a SQLMesh project, the OptimizedQueryCache retains references to cache files for those deleted models. On subsequent Context initialization, SQLMesh attempts to stat() these stale cache files, resulting in a FileNotFoundError that prevents the Context from being created.

Environment Details

  • SQLMesh Version: 0.227.1
  • Python Version: 3.12
  • Deployment: Airflow on AWS ECS with persistent volumes
  • Cache Location: /tmp/airflow/sqlmesh_cache/my_project/.cache
  • State Backend: PostgreSQL (shared between multiple projects)

Error Details

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.12/site-packages/sqlmesh/core/context.py", line 468, in __init__
    self.load()
  File "/home/airflow/.local/lib/python3.12/site-packages/sqlmesh/core/context.py", line 696, in load
    update_model_schemas(
  File "/home/airflow/.local/lib/python3.12/site-packages/sqlmesh/core/model/schema.py", line 28, in update_model_schemas
    optimized_query_cache: OptimizedQueryCache = OptimizedQueryCache(cache_dir)
                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/sqlmesh/core/model/cache.py", line 97, in __init__
    self._file_cache: FileCache[OptimizedQueryCacheEntry] = FileCache(
                                                            ^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/sqlmesh/utils/cache.py", line 66, in __init__
    if not file.stem.startswith(self._cache_version) or file.stat().st_atime < threshold:
                                                        ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/pathlib.py", line 840, in stat
    return os.stat(self, follow_symlinks=follow_symlinks)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/opt/airflow/sqlmesh/my_project/.cache/optimized_query/0_227_27_28_100__my_project_my_model_name_2986343564'

Root Cause Analysis

The Problem

  1. Model Deletion Workflow:

    • When a model is removed from the SQLMesh project, the snapshot is removed from the state database
    • The physical table is dropped
    • However, the corresponding optimized_query cache files are NOT automatically cleaned up
  2. Cache Initialization:

    • During Context.__init__(), the OptimizedQueryCache is initialized
    • FileCache.__init__() scans the cache directory and calls file.stat() on every file it finds
    • This happens BEFORE any model selection or filtering logic runs
  3. Container Lifecycle:

    • In environments with persistent volumes (non-ephemeral containers), the cache accumulates over time
    • Unlike environments using /tmp that gets cleared on container restart, persistent cache directories retain stale entries indefinitely

Why Existing Solutions Don't Work

  1. sqlmesh clean: Requires creating a Context to run, which fails with the same FileNotFoundError
  2. no_auto_upstream=True: This flag only affects model selection during execution, not cache initialization
  3. Janitor process: Handles database state cleanup (interval compaction), not filesystem cache

Reproduction Steps

  1. Create a SQLMesh project with a model (e.g., my_project.my_model_name)
  2. Run the model at least once to generate cache entries
  3. Remove the model from the project
  4. In an environment with persistent cache (cache_dir configured to non-tmp location):
    • Try to create a new Context: Context(paths=project_path)
  5. Observe FileNotFoundError on stale cache file

Impact

Affected Scenarios

  • CI/CD Pipelines: Failed deployments when models are removed
  • Development: Developers unable to run SQLMesh locally after pulling changes that remove models
  • Production: Airflow DAGs failing when models are deprecated
  • Shared State Databases: When multiple projects share a state DB but have separate cache directories, cross-project model references can cause issues

Severity

High - Prevents Context initialization entirely, blocking all SQLMesh operations until manual intervention.

Current Workaround

Manual deletion of the optimized_query cache directory before Context initialization:

import shutil
from pathlib import Path

# Clear stale cache files to prevent FileNotFoundError during Context initialization.
# This happens when models are removed but cache references still exist.
# Manual deletion is required because `sqlmesh clean` itself needs Context initialization.
cache_path = Path(SQLMESH_CACHE_DIR)
if cache_path.exists():
    optimized_query_cache = cache_path / "optimized_query"
    if optimized_query_cache.exists():
        shutil.rmtree(optimized_query_cache)

# Now safe to create Context
context = Context(paths=SQLMESH_PROJECT_PATH)

Limitations of Workaround

  • Requires code changes in every place Context is created
  • Loses all cached query optimizations on every run
  • Not discoverable - developers hit the error before finding the solution

Suggested Fixes

Option 1: Graceful Degradation (Recommended)

Modify FileCache.__init__() to handle missing files gracefully:

# In sqlmesh/utils/cache.py
for file in cache_dir.glob(f"{self._cache_version}*"):
    try:
        stat_result = file.stat()
        if stat_result.st_atime < threshold:
            file.unlink(missing_ok=True)
    except FileNotFoundError:
        # File was deleted between glob and stat - this is expected for stale entries
        continue

Pros:

  • No breaking changes
  • Handles race conditions naturally
  • Self-healing behavior

Cons:

  • Leaves stale files until next access

Option 2: Automatic Cleanup on Model Deletion

Hook into the model deletion workflow to clean up associated cache files:

# When a snapshot is invalidated/removed
def _cleanup_model_cache(self, model_name: str) -> None:
    """Remove cache entries for a deleted model."""
    if self.config.cache_dir:
        cache_pattern = f"*__{model_name.replace('.', '_')}*"
        for cache_file in self.config.cache_dir.glob(cache_pattern):
            cache_file.unlink(missing_ok=True)

Pros:

  • Proactive cleanup
  • No stale files accumulate

Cons:

  • More complex implementation
  • Need to track model name → cache file mapping

Option 3: Validate Cache on Context Load

Add a validation step during Context initialization that removes invalid cache entries:

def validate_and_clean_cache(cache_dir: Path, valid_models: set[str]) -> None:
    """Remove cache entries for models that no longer exist."""
    for cache_file in cache_dir.glob("*"):
        model_name = extract_model_name(cache_file.name)
        if model_name not in valid_models:
            cache_file.unlink(missing_ok=True)

Pros:

  • Ensures cache consistency
  • Opportunity for other validation checks

Cons:

  • Adds overhead to Context initialization
  • Requires model discovery before cache is usable

Related Issues

  • Issue Pickling error for custom model kind #5187: Custom materialization type cache clearing workaround
  • Shared state database scenarios where multiple projects can create cross-project cache contamination (see Additional Context section below)

Additional Context

Multi-Project Shared State Database

When multiple SQLMesh projects share a PostgreSQL state database but maintain separate cache directories, the following can occur:

  1. Project A and Project B share state DB
  2. During plan(), the virtual layer update promotes ALL snapshots in the environment (including other project's models)
  3. If Project A's cache directory is under the project path (e.g., /opt/airflow/sqlmesh/project_a/.cache), it can accumulate references to Project B's models
  4. When Project B removes a model, Project A's cache still has references to it

This issue can be mitigated with isolated /tmp cache directories per project:

SQLMESH_CACHE_DIR = f"/tmp/airflow/sqlmesh_cache/{project_name}/.cache"

However, the core issue remains: cache entries are not tied to model lifecycle.

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions