A complete end-to-end data platform that ingests, processes, and analyzes complex, multi-source public datasets for business intelligence.
- Introduction
- Objectives
- System Architecture
- Setup and Installation
- Considerations & Limitations
- Future Enhancements
Urban mobility is a critical concern for modern cities, requiring robust data solutions to enable informed decisions for transportation, infrastructure, and sustainability. This project delivers a scalable, modular data pipeline that ingests, processes, and analyzes diverse urban mobility datasets, empowering business intelligence and analytics teams to derive actionable insights.
- Aggregate and harmonize complex, multi-source public urban mobility datasets.
- Build a robust ELT (Extract, Load, Transform) pipeline for efficient data processing.
- Enable flexible analytics and visualization to support decision-making.
- Provide a foundation for advanced data science, reporting, and business intelligence use cases.
The repository is organized for clarity and modularity. Key components include:
```
├── 01-docker-terraform/
│   ├── 1_terraform_gcp/           # Infrastructure as Code for GCP (main.tf, variable.tf, keys/)
│   │   └── keys/                  # GCP service account credentials
│   └── 2_docker_sql/              # Docker setup for SQL ingestion
│       ├── Dockerfile             # Custom Docker image for data ingestion
│       ├── docker-compose.yaml    # Compose file for multi-container setup
│       ├── ingest_data.py         # Python script for ingesting data into SQL
│       ├── pipeline.py            # Data pipeline orchestration script
│       ├── *.parquet              # Sample Parquet data files
│       ├── *.csv                  # Lookup tables and reference data
│       └── ny_taxi_postgres_data/ # PostgreSQL data directory (volumes)
├── 02-workflow-orchestration/
│   ├── docker-compose.yml         # Compose file for workflow orchestration
│   └── workflow/                  # Kestra workflow definitions (YAML)
├── 03-data-warehouse/
│   └── bigquery.sql               # BigQuery schema and logic
├── 04-analytics-engineering/
│   └── taxi_rides_ny/             # dbt analytics engineering project
│       ├── dbt_project.yml        # dbt project config
│       ├── models/                # dbt models (core, staging)
│       ├── seeds/                 # Seed data (CSV, properties)
│       ├── macros/                # Custom dbt macros
│       ├── analyses/              # SQL analysis scripts
│       ├── snapshots/             # dbt snapshots
│       └── README.md              # Project documentation
├── docs/                          # Additional documentation and notes
│   └── note.txt                   # Project notes
├── images/                        # Architecture, workflow, and data model diagrams
├── LICENSE                        # Project license
└── README.md                      # Main project documentation
```
The pipeline leverages Docker for containerization and Kestra for orchestration, following best practices for modern data engineering:
- Data Ingestion: Collects urban mobility data from public sources and APIs, loading it into PostgreSQL for initial processing (a minimal ingestion sketch follows this list).
- Raw to Datalake: Employs Spark and Polars for transformation, storing optimized Parquet files in Google Cloud Storage (object storage).
- Data Warehouse: Loads cleansed, enriched data into Google BigQuery for advanced analytics and reporting.
- Transformations: Utilizes dbt for declarative transformations and modeling.
- Visualization: Powers business dashboards with Google Data Studio.
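As a minimal sketch of the ingestion step (the real logic lives in `ingest_data.py`; the connection string, file name, and table name here are illustrative assumptions), chunked loading with pandas and SQLAlchemy might look like:

```python
# Minimal ingestion sketch -- hypothetical credentials, file, and table names.
import pandas as pd
from sqlalchemy import create_engine

# Connection string for the local PostgreSQL container (assumed credentials).
engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")

# Stream the source file in chunks so large datasets never fully load into memory.
for chunk in pd.read_csv("yellow_tripdata_2021-01.csv", chunksize=100_000):
    chunk.to_sql("yellow_taxi_data", engine, if_exists="append", index=False)
```

Chunked writes keep memory usage bounded regardless of source file size.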
The Data Lake is structured to support scalability and performance:
- Raw Zone: Ingested data in its original format (CSV, JSON, XML).
- Processed Zone: Cleaned and transformed data in Parquet format.
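A raw-to-processed conversion with Polars (the bucket, paths, and column names below are illustrative assumptions, not the project's actual configuration) could look like:

```python
# Sketch of a raw-zone CSV -> processed-zone Parquet conversion with Polars.
import polars as pl
from google.cloud import storage

# Clean and type the raw file, then write an optimized Parquet output locally.
df = pl.read_csv("raw/yellow_tripdata_2021-01.csv", try_parse_dates=True)
df = df.drop_nulls(subset=["tpep_pickup_datetime"])  # assumed column name
df.write_parquet("processed/yellow_tripdata_2021-01.parquet")

# Upload the processed file to the data lake on Google Cloud Storage.
client = storage.Client()
bucket = client.bucket("urban-mobility-datalake")  # assumed bucket name
bucket.blob("processed/yellow_tripdata_2021-01.parquet").upload_from_filename(
    "processed/yellow_tripdata_2021-01.parquet"
)
```

Parquet's columnar layout and compression are what make the processed zone cheap to scan from BigQuery and Spark.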
The Data Warehouse uses star schema modeling:
- Staging: Raw ingested data
- Fact & Dimension: Aggregated, feature-engineered, analytics-ready data
All processed files are stored in Parquet format for performance and scalability. Media files (e.g., images, sensor logs) follow a consistent naming scheme for seamless integration.
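Loading the processed zone into the warehouse can be done with the BigQuery Python client; a sketch under assumed names (project, dataset, bucket, and table IDs are placeholders) is:

```python
# Sketch: load processed Parquet files from GCS into a BigQuery staging table.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

# Load every processed Parquet object into staging, ready for dbt to model.
load_job = client.load_table_from_uri(
    "gs://urban-mobility-datalake/processed/*.parquet",  # assumed bucket/path
    "my-project.staging.yellow_tripdata",                # placeholder table ID
    job_config=job_config,
)
load_job.result()  # block until the load job finishes
```

From here, the dbt models in `taxi_rides_ny` build the fact and dimension tables on top of staging.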
The data model is designed to support the analytical needs of urban mobility stakeholders; diagrams of the key entities are provided in the images/ directory.
Setup requires the following prerequisites:
- Git
- Docker (>= 4GB RAM, 6 cores, 16GB disk)
- CMake (for UNIX systems)
- Python 3.x (3.9.x recommended)
- pipenv or virtualenv
- Open ports: 3306, 5432, 9000, 9001, 3001, 8501, 4040, 7077, 8080, 3030
- Database client (e.g., DBeaver)
Clone the repository and configure environment variables:
```
git clone https://github.com/caogiathinh/Urban_Mobility_Pipeline.git
cd Urban_Mobility_Pipeline
cp .env.example .env
# Edit .env with your credentials
```
Start the services with Docker Compose (compose files are provided in each module), then access:
- Kestra: http://localhost:8080
- Google Data Studio Dashboard: https://lookerstudio.google.com/u/1/reporting/1a07c16e-3cef-4cbf-bac1-a8614d464323/page/yrJXF
- Development Only: Current setup is for development; production deployment requires further hardening.
- Schema Evolution: dbt transformations are modular; future schema changes should be versioned.
- Complete big data processing with Apache Spark.
- Implement testing, staging, and CI/CD pipelines.
- Expand dbt transformations for richer business logic.
- Complete real-time streaming with Apache Kafka.