A lightweight, heavily optimized Python CLI tool designed to synchronize models and metrics from a dbt project directly into an Apache Superset instance.
This project has been simplified to focus exclusively on dbt synchronization, which keeps it fast, limits its dependencies, and makes it easier to maintain.
- Automated Database Setup: Automatically provisions the `$target` database connection in Superset using your local `profiles.yml`.
- Dataset Synchronization: Converts every dbt source and model matched by your `--select` criteria into an interactive Dataset within Superset.
- Metric Integration: Extracts and attaches metrics (both standard metrics and Semantic Layer/MetricFlow configurations) directly to their corresponding Superset datasets.
- Metadata Propagation: Synchronizes column descriptions, labels, and advanced metadata from your dbt models to Superset.
- Superset-specific Metadata: Pass Superset-exclusive settings (e.g. `cache_timeout`) via the `meta` tag in your dbt YML files.
This tool requires Python 3.8+.

    # Clone the repository
    git clone https://github.com/tmph2003/superset-sdk-python.git
    cd superset-sdk-python

    # Install the package locally
    pip install -e .

The CLI acts as a bridge between your compiled dbt `manifest.json` and your Superset instance API.

    superset-cli https://superset-dev.sunhouse.com.vn/ target/manifest.json \
      --username admin \
      --password admin \
      --project=sunhouse_etl_pipeline \
      --profile=sunhouse_etl_pipeline \
      --target=dev \
      --profiles=profiles.yml \
      --select models/gold/ \
      --merge-metadata \
      --max-workers=3

The tool allows you to pass Superset-exclusive settings via the `meta` tag in your dbt configurations. This works for both Database connections (`profiles.yml`) and Datasets (`models/*.yml`).
When using `--import-db`, the CLI reads your dbt target to create the Superset database. You can customize the name of the database or enable SQL Lab features by adding a `meta` block to your target in `profiles.yml`:

    sunhouse_etl_pipeline:
      target: dev
      outputs:
        dev:
          type: bigquery
          method: oauth
          project: my_gcp_project
          dataset: my_dbt_dataset
          # Pass Superset-specific overrides here
          meta:
            superset:
              database_name: "Sunhouse Data Warehouse" # Overrides the default "{project}_{target}" name
              cache_timeout: 86400 # Database-level cache timeout
              expose_in_sqllab: true # Enable SQL Lab access

Similarly, you can specify values for Superset-only fields directly in your dbt model definitions under the `model.meta.superset.{{field_name}}` key:

    models:
      - name: my_dbt_model
        meta:
          superset:
            cache_timeout: 250 # Sets the dataset cache timeout to 250 seconds in Superset.
            filter_select_enabled: true

Run `superset-cli --help` for a full list of configuration options:
- `--jwt-token`: Authenticate via JWT token instead of username/password.
- `--import-db`: Import (or update) the database connection to Superset automatically.
- `--select` / `-s`: Select specific models or paths to sync (e.g. `models/gold/`).
- `--exclude` / `-x`: Exclude specific models from syncing.
- `--metrics` / `-m`: Select specific metrics to sync (comma-separated).
- `--merge-metadata`: Update Superset configurations based on dbt metadata while preserving Superset-only metrics.
- `--preserve-metadata`: Completely preserve existing column and metric configurations defined in Superset.
- `--disallow-edits`: Mark resources as managed externally to prevent users from editing them in the Superset UI.
- `--max-workers`: Control the number of parallel workers used for processing Semantic Layer metrics (default: 3).
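To make the difference between `--merge-metadata` and `--preserve-metadata` more concrete, here is a minimal sketch of how the two modes could reconcile metric lists. It is based only on the flag descriptions above, not on the tool's source: the function name `reconcile_metrics`, the mode strings, and the dictionary shape are all hypothetical, and the handling of brand-new dbt metrics in preserve mode is an assumption.

```python
from typing import Dict, List

Metric = Dict[str, str]


def reconcile_metrics(
    dbt_metrics: List[Metric],
    superset_metrics: List[Metric],
    mode: str = "overwrite",
) -> List[Metric]:
    """Hypothetical helper: combine metrics keyed by `metric_name`."""
    from_dbt = {m["metric_name"]: m for m in dbt_metrics}
    from_superset = {m["metric_name"]: m for m in superset_metrics}

    if mode == "merge":
        # dbt definitions win on name clashes; Superset-only metrics survive.
        return list({**from_superset, **from_dbt}.values())
    if mode == "preserve":
        # Assumption: existing Superset definitions win on name clashes,
        # and dbt metrics are only added when they are new.
        return list({**from_dbt, **from_superset}.values())
    # Assumed default: dbt is the single source of truth.
    return list(from_dbt.values())


dbt = [{"metric_name": "revenue", "expression": "SUM(amount)"}]
superset = [
    {"metric_name": "revenue", "expression": "SUM(amount_usd)"},
    {"metric_name": "orders", "expression": "COUNT(*)"},
]

for mode in ("merge", "preserve", "overwrite"):
    print(mode, reconcile_metrics(dbt, superset, mode))
```

In this sketch, the Superset-only `orders` metric survives in both merge and preserve mode, while `revenue` takes its expression from dbt only in merge mode.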
If you wish to contribute or customize this tool, understanding the highly flattened, modular architecture is critical.

    superset-sdk-python/
    ├── pyproject.toml # Modern Python build system metadata & dependencies
    ├── setup.cfg # Core packaging configs and entry point definitions
    ├── README.md # Documentation
    ├── tests/ # Pytest unit tests (run using `pytest -v`)
    └── src/ # The core source code

The core codebase is divided into specialized layers to handle API interactions, authentication, and CLI orchestration.
This is where the orchestration of the synchronization happens.
- `command.py`: The main entry point. Defines the Click CLI arguments, sets up logging, parses inputs, and directs the overall sync flow.
- `databases.py`: Handles importing or updating the main database connection inside Superset using credentials found in dbt's `profiles.yml`.
- `datasets.py`: The heavy lifter for models. Syncs physical tables/views into Superset Datasets, applies columns, data types, descriptions, and metrics.
- `metrics.py`: Extracts traditional dbt metric definitions, parses their aggregations, and structures them for Superset.
- `metricflow.py`: Specifically handles advanced MetricFlow / dbt Semantic Layer definitions (e.g. derived metrics, ratios) using concurrent workers.
- `relations.py`: Manages entity relationships, generating CTEs and complex join conditions based on Semantic Layer constraints.
- `lib.py`: Shared CLI utility functions, such as profile loading and model filtering logic.
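The concurrent processing mentioned for `metricflow.py` can be illustrated with a thread pool from the standard library. This is only a sketch of the general pattern, assuming one independent unit of work per metric; `compile_metric` and `process_metrics` are hypothetical names, and the real module may be organized differently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def compile_metric(name: str) -> Dict[str, str]:
    """Hypothetical stand-in for per-metric work (e.g. compiling SQL)."""
    return {"metric_name": name, "expression": f"SUM({name})"}


def process_metrics(
    names: List[str],
    worker: Callable[[str], Dict[str, str]] = compile_metric,
    max_workers: int = 3,  # mirrors the documented --max-workers default
) -> List[Dict[str, str]]:
    # executor.map returns results in input order even though the
    # calls themselves may run in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(worker, names))


print(process_metrics(["revenue", "orders", "margin"]))
```

Threads suit this kind of workload when each unit mostly waits on I/O, such as a subprocess or an HTTP call.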
Contains code for interpreting files and talking to external APIs.
- `clients/superset.py`: A robust REST API client for Superset. Performs paginated GETs, POSTs, and PUTs to manipulate databases and datasets.
- `clients/dbt.py`: Uses `marshmallow` schemas to safely and cleanly parse the massive dbt `manifest.json` files.
- `operators.py`: Simple filtering operators (e.g. `Equal`, `OneToMany`) used to build queries for the Superset API.
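As a rough illustration of what reading a manifest involves, the sketch below pulls the `meta.superset` block for each model out of a manifest-shaped dictionary. It deliberately uses plain `json` rather than the `marshmallow` schemas in `clients/dbt.py`, and `superset_meta_by_model` is a hypothetical helper, not part of the tool. The sample manifest is trimmed to the few keys the sketch needs.

```python
import json
from typing import Any, Dict


def superset_meta_by_model(manifest: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Map each model name to its `meta.superset` block (empty if absent)."""
    result: Dict[str, Dict[str, Any]] = {}
    for node in manifest.get("nodes", {}).values():
        if node.get("resource_type") != "model":
            continue  # skip tests, seeds, snapshots, and so on
        # `meta` may appear on the node itself or under `config`; check both.
        meta = node.get("meta") or node.get("config", {}).get("meta") or {}
        result[node["name"]] = meta.get("superset", {})
    return result


sample_manifest = json.loads("""
{
  "nodes": {
    "model.my_project.my_dbt_model": {
      "resource_type": "model",
      "name": "my_dbt_model",
      "meta": {"superset": {"cache_timeout": 250}}
    },
    "test.my_project.not_null_id": {
      "resource_type": "test",
      "name": "not_null_id",
      "meta": {}
    }
  }
}
""")

print(superset_meta_by_model(sample_manifest))
# {'my_dbt_model': {'cache_timeout': 250}}
```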
Handles the nuances of logging in and maintaining sessions.
- `main.py`: A generic base class that automatically intercepts `401 Unauthorized` responses and re-authenticates.
- `superset.py`: Implements Superset-specific login flows, including Username/Password login (which fetches a CSRF token) and JWT Token login.
- `token.py`: A lightweight implementation for simple Bearer token injection.
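The re-authentication behaviour described for the base class follows a common pattern: send the request, and if the server answers `401`, log in again and retry once. The sketch below shows that pattern with a fake transport so it runs without a network. `ReauthenticatingClient`, `fake_send`, and `fake_login` are hypothetical names, and the real class may differ in how it stores credentials and sessions.

```python
from typing import Callable, Dict, Tuple

Response = Tuple[int, str]  # (status code, body)


class ReauthenticatingClient:
    """Hypothetical sketch: retry a request once after a 401."""

    def __init__(
        self,
        send: Callable[[Dict[str, str]], Response],
        login: Callable[[], str],
    ) -> None:
        self._send = send
        self._login = login
        self._token = login()

    def _headers(self) -> Dict[str, str]:
        return {"Authorization": f"Bearer {self._token}"}

    def request(self) -> Response:
        status, body = self._send(self._headers())
        if status == 401:
            # The token was rejected: log in again and retry exactly once.
            self._token = self._login()
            status, body = self._send(self._headers())
        return status, body


# Fake login that hands out a stale token first, then a valid one.
tokens = iter(["stale", "fresh"])


def fake_login() -> str:
    return next(tokens)


def fake_send(headers: Dict[str, str]) -> Response:
    if headers["Authorization"] == "Bearer fresh":
        return 200, "ok"
    return 401, "Unauthorized"


client = ReauthenticatingClient(fake_send, fake_login)
print(client.request())  # (200, 'ok')
```

Retrying only once keeps a permanently invalid credential from causing an endless login loop.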
- `lib.py`: Generic helpers like logging initialization and error payload validation (SIP-40 compliance).
- `exceptions.py`: Custom exception classes (`CLIError`, `SupersetError`, `DatabaseNotFoundError`) for clean error handling.
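A check for SIP-40 compliance might look like the sketch below. It assumes the SIP-40 shape of an `errors` list whose items carry `message`, `error_type`, `level`, and `extra`. The function name `is_sip_40_payload` and the sample error values are hypothetical and are not taken from `lib.py`.

```python
from typing import Any

# Fields assumed to be present on every SIP-40 error object.
SIP_40_KEYS = {"message", "error_type", "level", "extra"}


def is_sip_40_payload(payload: Any) -> bool:
    """Return True if `payload` looks like a SIP-40 error response."""
    if not isinstance(payload, dict):
        return False
    errors = payload.get("errors")
    if not isinstance(errors, list) or not errors:
        return False
    return all(
        isinstance(error, dict) and SIP_40_KEYS.issubset(error)
        for error in errors
    )


good = {
    "errors": [
        {
            "message": "Database not found",
            "error_type": "GENERIC_BACKEND_ERROR",
            "level": "error",
            "extra": {},
        }
    ]
}
bad = {"message": "Something went wrong"}

print(is_sip_40_payload(good), is_sip_40_payload(bad))  # True False
```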
We use `pytest` for unit testing. The test suite is fast and covers metric parsing, dataset sync logic, and API serialization.

    # Run all tests with verbosity
    pytest -v