Issue Summary
When models are deleted from a SQLMesh project, the OptimizedQueryCache retains references to cache files for those deleted models. On subsequent Context initialization, SQLMesh attempts to stat() these stale cache files, resulting in a FileNotFoundError that prevents the Context from being created.
Environment Details
- SQLMesh Version: 0.227.1
- Python Version: 3.12
- Deployment: Airflow on AWS ECS with persistent volumes
- Cache Location: `/tmp/airflow/sqlmesh_cache/my_project/.cache`
- State Backend: PostgreSQL (shared between multiple projects)
Error Details
```
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.12/site-packages/sqlmesh/core/context.py", line 468, in __init__
    self.load()
  File "/home/airflow/.local/lib/python3.12/site-packages/sqlmesh/core/context.py", line 696, in load
    update_model_schemas(
  File "/home/airflow/.local/lib/python3.12/site-packages/sqlmesh/core/model/schema.py", line 28, in update_model_schemas
    optimized_query_cache: OptimizedQueryCache = OptimizedQueryCache(cache_dir)
                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/sqlmesh/core/model/cache.py", line 97, in __init__
    self._file_cache: FileCache[OptimizedQueryCacheEntry] = FileCache(
                                                            ^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/sqlmesh/utils/cache.py", line 66, in __init__
    if not file.stem.startswith(self._cache_version) or file.stat().st_atime < threshold:
                                                        ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/pathlib.py", line 840, in stat
    return os.stat(self, follow_symlinks=follow_symlinks)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/opt/airflow/sqlmesh/my_project/.cache/optimized_query/0_227_27_28_100__my_project_my_model_name_2986343564'
```
Root Cause Analysis
The Problem
- Model Deletion Workflow:
  - When a model is removed from the SQLMesh project, the snapshot is removed from the state database
  - The physical table is dropped
  - However, the corresponding `optimized_query` cache files are NOT automatically cleaned up
- Cache Initialization:
  - During `Context.__init__()`, the `OptimizedQueryCache` is initialized
  - `FileCache.__init__()` scans the cache directory and calls `file.stat()` on every file it finds
  - This happens BEFORE any model selection or filtering logic runs
- Container Lifecycle:
  - In environments with persistent volumes (non-ephemeral containers), the cache accumulates over time
  - Unlike environments using `/tmp`, which is cleared on container restart, persistent cache directories retain stale entries indefinitely
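The failure mode can be simulated without SQLMesh: any directory scan that calls `stat()` on entries it discovered earlier will raise if an entry no longer resolves when it is reached. The sketch below is a simplified stand-in for the `FileCache.__init__()` scan, not SQLMesh's actual code; a dangling symlink plays the role of a stale cache entry:

```python
import os
import tempfile
from pathlib import Path

def scan_like_filecache(cache_dir: Path) -> list[str]:
    """Naively stat() every discovered entry, as the failing scan does."""
    names = []
    for file in cache_dir.iterdir():
        # Raises FileNotFoundError if the entry no longer resolves,
        # e.g. a dangling symlink or a file deleted mid-scan.
        file.stat()
        names.append(file.name)
    return names

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    (cache / "0_fresh_entry").write_text("ok")
    # A dangling symlink stands in for a stale entry whose target is gone.
    os.symlink(cache / "deleted_model_entry", cache / "0_stale_entry")
    try:
        scan_like_filecache(cache)
    except FileNotFoundError as exc:
        print(f"scan failed: {exc.__class__.__name__}")
```

One stale entry is enough to abort the entire scan, which is why a single deleted model blocks `Context` creation for the whole project.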
Why Existing Solutions Don't Work
- `sqlmesh clean`: Requires creating a `Context` to run, which fails with the same `FileNotFoundError`
- `no_auto_upstream=True`: This flag only affects model selection during execution, not cache initialization
- Janitor process: Handles database state cleanup (interval compaction), not filesystem cache
Reproduction Steps
- Create a SQLMesh project with a model (e.g., `my_project.my_model_name`)
- Run the model at least once to generate cache entries
- Remove the model from the project
- In an environment with persistent cache (`cache_dir` configured to a non-tmp location), try to create a new `Context`: `Context(paths=project_path)`
- Observe `FileNotFoundError` on the stale cache file
Impact
Affected Scenarios
- CI/CD Pipelines: Failed deployments when models are removed
- Development: Developers unable to run SQLMesh locally after pulling changes that remove models
- Production: Airflow DAGs failing when models are deprecated
- Shared State Databases: When multiple projects share a state DB but have separate cache directories, cross-project model references can cause issues
Severity
High - Prevents Context initialization entirely, blocking all SQLMesh operations until manual intervention.
Current Workaround
Manual deletion of the optimized_query cache directory before Context initialization:
```python
import shutil
from pathlib import Path

# Clear stale cache files to prevent FileNotFoundError during Context initialization.
# This happens when models are removed but cache references still exist.
# Manual deletion is required because `sqlmesh clean` itself needs Context initialization.
cache_path = Path(SQLMESH_CACHE_DIR)
if cache_path.exists():
    optimized_query_cache = cache_path / "optimized_query"
    if optimized_query_cache.exists():
        shutil.rmtree(optimized_query_cache)

# Now safe to create Context
context = Context(paths=SQLMESH_PROJECT_PATH)
```
Limitations of Workaround
- Requires code changes in every place Context is created
- Loses all cached query optimizations on every run
- Not discoverable - developers hit the error before finding the solution
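A less destructive variant is to clear the cache only when initialization actually fails, then retry once. This is a generic sketch, not a SQLMesh API: the `factory` callable stands in for something like `lambda: Context(paths=SQLMESH_PROJECT_PATH)`, so cached optimizations survive every run where the error does not occur:

```python
import shutil
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")

def create_with_cache_recovery(factory: Callable[[], T], cache_dir: Path) -> T:
    """Call factory(); on a stale-cache FileNotFoundError, clear the
    optimized_query cache once and retry."""
    try:
        return factory()
    except FileNotFoundError:
        stale = cache_dir / "optimized_query"
        if stale.exists():
            shutil.rmtree(stale)
        return factory()
```

This keeps the workaround in one place instead of every call site, though it still discards all cached optimizations whenever a stale entry is hit.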
Suggested Fixes
Option 1: Graceful Degradation (Recommended)
Modify FileCache.__init__() to handle missing files gracefully:
```python
# In sqlmesh/utils/cache.py
for file in cache_dir.glob(f"{self._cache_version}*"):
    try:
        stat_result = file.stat()
        if stat_result.st_atime < threshold:
            file.unlink(missing_ok=True)
    except FileNotFoundError:
        # File was deleted between glob and stat - this is expected for stale entries
        continue
```
Pros:
- No breaking changes
- Handles race conditions naturally
- Self-healing behavior
Cons:
- Leaves stale files until next access
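The pattern can be exercised in isolation. The helper below is a hypothetical standalone version (the name `scan_cache_tolerant` and the TTL parameter are illustrative, not SQLMesh's actual `FileCache` API); a dangling symlink stands in for an entry that vanished between `glob()` and `stat()`:

```python
import time
from pathlib import Path

def scan_cache_tolerant(cache_dir: Path, cache_version: str, ttl_seconds: float) -> list[Path]:
    """Scan versioned cache entries, pruning expired ones, and tolerate
    entries that no longer resolve when stat() is reached."""
    threshold = time.time() - ttl_seconds
    entries = []
    for file in cache_dir.glob(f"{cache_version}*"):
        try:
            if file.stat().st_atime < threshold:
                file.unlink(missing_ok=True)
                continue
        except FileNotFoundError:
            # Entry disappeared mid-scan (e.g. a deleted model's file) - skip it.
            continue
        entries.append(file)
    return entries
```

With this shape, a stale entry costs one skipped iteration instead of aborting the whole scan.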
Option 2: Automatic Cleanup on Model Deletion
Hook into the model deletion workflow to clean up associated cache files:
```python
# When a snapshot is invalidated/removed
def _cleanup_model_cache(self, model_name: str) -> None:
    """Remove cache entries for a deleted model."""
    if self.config.cache_dir:
        cache_pattern = f"*__{model_name.replace('.', '_')}*"
        for cache_file in self.config.cache_dir.glob(cache_pattern):
            cache_file.unlink(missing_ok=True)
```
Pros:
- Proactive cleanup
- No stale files accumulate
Cons:
- More complex implementation
- Need to track model name → cache file mapping
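A standalone sketch of such a hook, decoupled from SQLMesh internals, is shown below. The `*__{model_name}*` pattern is inferred from the cache filename in the traceback (`..._100__my_project_my_model_name_2986343564`) and would need to match SQLMesh's real naming scheme exactly; it is an assumption, not a documented format:

```python
from pathlib import Path

def cleanup_model_cache(cache_dir: Path, model_name: str) -> int:
    """Remove cache entries whose filename embeds the given model name.

    Assumes filenames embed the dotted model name with dots replaced by
    underscores, separated by a double underscore (inferred, not documented).
    Returns the number of entries removed.
    """
    removed = 0
    pattern = f"*__{model_name.replace('.', '_')}*"
    for cache_file in cache_dir.glob(pattern):
        cache_file.unlink(missing_ok=True)
        removed += 1
    return removed
```

Note that a pure substring pattern can over-match when one model name is a prefix of another (e.g. `my_model` vs. `my_model_v2`), which is part of why a tracked model-to-file mapping would be the more robust implementation.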
Option 3: Validate Cache on Context Load
Add a validation step during Context initialization that removes invalid cache entries:
```python
def validate_and_clean_cache(cache_dir: Path, valid_models: set[str]) -> None:
    """Remove cache entries for models that no longer exist."""
    for cache_file in cache_dir.glob("*"):
        model_name = extract_model_name(cache_file.name)
        if model_name not in valid_models:
            cache_file.unlink(missing_ok=True)
```
Pros:
- Ensures cache consistency
- Opportunity for other validation checks
Cons:
- Adds overhead to Context initialization
- Requires model discovery before cache is usable
Related Issues
- Issue #5187 (Pickling error for custom model kind): custom materialization type cache clearing workaround
- Shared state database scenarios where multiple projects can create cross-project cache contamination (see Additional Context section below)
Additional Context
Multi-Project Shared State Database
When multiple SQLMesh projects share a PostgreSQL state database but maintain separate cache directories, the following can occur:
- Project A and Project B share a state DB
- During `plan()`, the virtual layer update promotes ALL snapshots in the environment (including the other project's models)
- If Project A's cache directory is under the project path (e.g., `/opt/airflow/sqlmesh/project_a/.cache`), it can accumulate references to Project B's models
- When Project B removes a model, Project A's cache still has references to it

This issue can be mitigated with isolated /tmp cache directories per project:
```python
SQLMESH_CACHE_DIR = f"/tmp/airflow/sqlmesh_cache/{project_name}/.cache"
```
However, the core issue remains: cache entries are not tied to the model lifecycle.
References
- Error Location: `sqlmesh/utils/cache.py:66` in `FileCache.__init__()`
- Related Issue: #5187, Pickling error for custom model kind (custom materialization type cache clearing workaround)