Urban Mobility Pipeline

A complete end-to-end data platform that ingests, processes, and analyzes complex, multi-source public datasets for business intelligence.


Table of Contents

  1. Introduction
  2. Objectives
  3. System Architecture
  4. Setup and Installation
  5. Considerations & Limitations
  6. Future Enhancements

Introduction

Urban mobility is a critical concern for modern cities, requiring robust data solutions to enable informed decisions for transportation, infrastructure, and sustainability. This project delivers a scalable, modular data pipeline that ingests, processes, and analyzes diverse urban mobility datasets, empowering business intelligence and analytics teams to derive actionable insights.

Objectives

  • Aggregate and harmonize complex, multi-source public urban mobility datasets.
  • Build a robust ELT (Extract, Load, Transform) pipeline for efficient data processing.
  • Enable flexible analytics and visualization to support decision-making.
  • Provide a foundation for advanced data science, reporting, and business intelligence use cases.

System Architecture

(See the system architecture diagram in images/.)

Directory Structure

The repository is organized for clarity and modularity. Key components include:

├── 01-docker-terraform/
│   ├── 1_terraform_gcp/           # Infrastructure as Code for GCP (main.tf, variable.tf, keys/)
│   │   └── keys/                  # GCP service account credentials
│   └── 2_docker_sql/              # Docker setup for SQL ingest
│       ├── Dockerfile             # Custom Docker image for data ingestion
│       ├── docker-compose.yaml    # Compose file for multi-container setup
│       ├── ingest_data.py         # Python script for ingesting data into SQL
│       ├── pipeline.py            # Data pipeline orchestration script
│       ├── *.parquet              # Sample Parquet data files
│       ├── *.csv                  # Lookup tables and reference data
│       └── ny_taxi_postgres_data/ # PostgreSQL data directory (volumes)
├── 02-workflow-orchestration/
│   ├── docker-compose.yml         # Compose file for workflow orchestration
│   └── workflow/                  # Kestra workflow definitions (YAML)
├── 03-data-warehouse/
│   └── bigquery.sql               # BigQuery schema and logic
├── 04-analytics-engineering/
│   └── taxi_rides_ny/             # dbt analytics engineering project
│       ├── dbt_project.yml        # dbt project config
│       ├── models/                # dbt models (core, staging)
│       ├── seeds/                 # Seed data (CSV, properties)
│       ├── macros/                # Custom dbt macros
│       ├── analyses/              # SQL analysis scripts
│       ├── snapshots/             # dbt snapshots
│       └── README.md              # Project documentation
├── docs/                         # Additional documentation and notes
│   └── note.txt                   # Project notes
├── images/                       # Architecture, workflow, and data model diagrams
├── LICENSE                       # Project license
├── README.md                     # Main project documentation

Pipeline Overview

(See the orchestration workflow diagrams for the yellow data, green data, and merged result flows in images/.)

The pipeline leverages Docker for containerization and Kestra for orchestration, following best practices for modern data engineering:

  1. Data Ingestion: Collects urban mobility data from public sources and APIs and loads it into PostgreSQL for initial processing (a minimal ingestion sketch follows this list).
  2. Raw to Data Lake: Employs Spark and Polars for transformation, storing optimized Parquet files in Google Cloud Storage (object storage).
  3. Data Warehouse: Loads cleansed, enriched data into Google BigQuery for advanced analytics and reporting.
  4. Transformations: Utilizes dbt for declarative transformations and modeling.
  5. Visualization: Powers business dashboards with Google Data Studio.
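
As a rough illustration of step 1, the sketch below mirrors what an ingestion script such as ingest_data.py might do: stream a monthly trip file into PostgreSQL in chunks. The connection settings, table name, and input file are illustrative assumptions, not the repository's actual code.

import pandas as pd
from sqlalchemy import create_engine

# Assumed local connection settings and target table; adjust to your .env values.
PG_URL = "postgresql://root:root@localhost:5432/ny_taxi"
TABLE = "yellow_taxi_trips"

def ingest_csv(path: str, chunksize: int = 100_000) -> None:
    """Stream a large CSV into PostgreSQL in chunks to keep memory bounded."""
    engine = create_engine(PG_URL)
    with pd.read_csv(path, chunksize=chunksize) as reader:
        for i, chunk in enumerate(reader):
            # Append each chunk; the first call creates the table if it does not exist.
            chunk.to_sql(TABLE, engine, if_exists="append", index=False)
            print(f"inserted chunk {i} ({len(chunk)} rows)")

if __name__ == "__main__":
    ingest_csv("yellow_tripdata_2021-01.csv")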

Data Lake Architecture

The data lake is structured to support scalability and performance:

  • Raw Zone: Ingested data in its original format (CSV, JSON, XML).
  • Processed Zone: Cleaned and transformed data in Parquet format (see the conversion sketch after this list).
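
A minimal sketch of the raw-to-processed conversion using Polars, followed by an upload to the GCS data lake; the bucket name and object paths are assumptions, not the repository's actual layout.

import polars as pl
from google.cloud import storage

def csv_to_parquet(raw_path: str, processed_path: str) -> None:
    # Lazily scan the raw file so large inputs are not fully loaded into memory.
    pl.scan_csv(raw_path).sink_parquet(processed_path, compression="snappy")

def upload_to_gcs(local_path: str, bucket_name: str, blob_name: str) -> None:
    # Uses the service account referenced by GOOGLE_APPLICATION_CREDENTIALS.
    client = storage.Client()
    client.bucket(bucket_name).blob(blob_name).upload_from_filename(local_path)

if __name__ == "__main__":
    csv_to_parquet("raw/yellow_tripdata_2021-01.csv",
                   "processed/yellow_tripdata_2021-01.parquet")
    # "urban-mobility-datalake" is a placeholder bucket name.
    upload_to_gcs("processed/yellow_tripdata_2021-01.parquet",
                  "urban-mobility-datalake",
                  "processed/yellow/yellow_tripdata_2021-01.parquet")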

Data Warehouse Schema

(See the data warehouse schema diagram in images/.)

The data warehouse follows star schema modeling:

  • Staging: Raw ingested data
  • Fact & Dimensional: Aggregated, feature-engineered, and analytics-ready data (a BigQuery load sketch follows this list)
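
To make the staging layer concrete, the sketch below loads processed Parquet files from GCS into a BigQuery staging table with the official Python client; the project, dataset, table, and bucket names are assumptions.

from google.cloud import bigquery

# Assumed project, dataset, and bucket identifiers.
PROJECT = "urban-mobility-project"
STAGING_TABLE = f"{PROJECT}.staging.yellow_tripdata"

client = bigquery.Client(project=PROJECT)

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(
    "gs://urban-mobility-datalake/processed/yellow/*.parquet",
    STAGING_TABLE,
    job_config=job_config,
)
load_job.result()  # block until the load job finishes
print(f"Loaded {client.get_table(STAGING_TABLE).num_rows} rows into staging")

From staging, the dbt project in 04-analytics-engineering/taxi_rides_ny builds the fact and dimension models.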

Data Warehouse Partitioning & Clustering

All files are stored in Parquet format for performance and scalability. Media files (e.g., images, sensor logs) follow a consistent naming scheme for seamless integration.
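
As a hedged sketch of how partitioning and clustering could be declared with the BigQuery Python client, the snippet below partitions a fact table by pickup date and clusters it by common filter columns. The column names are assumptions based on a typical taxi-trip schema, not the repository's bigquery.sql.

from google.cloud import bigquery

client = bigquery.Client(project="urban-mobility-project")  # assumed project id

# Assumed fact-table columns; the real schema lives in 03-data-warehouse/bigquery.sql.
table = bigquery.Table(
    "urban-mobility-project.analytics.fact_trips",
    schema=[
        bigquery.SchemaField("tripid", "STRING"),
        bigquery.SchemaField("vendorid", "INTEGER"),
        bigquery.SchemaField("payment_type", "INTEGER"),
        bigquery.SchemaField("pickup_datetime", "TIMESTAMP"),
        bigquery.SchemaField("total_amount", "FLOAT"),
    ],
)
# Partition by pickup date and cluster by the most common filter columns.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="pickup_datetime",
)
table.clustering_fields = ["vendorid", "payment_type"]

client.create_table(table, exists_ok=True)

Partitioning restricts scans to the dates a query touches, and clustering co-locates rows that share the clustered column values, which lowers scan cost for dashboard queries.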

Data Model

The data model is designed to support the analytical needs of urban mobility stakeholders. The key entities are documented in the diagrams below:

(See the yellow taxi, green taxi, and merged dbt data model diagrams in images/.)

Setup and Installation

Prerequisites

  • Git
  • Docker (>= 4GB RAM, 6 cores, 16GB disk)
  • CMake (for UNIX systems)
  • Python 3.x (3.9.x recommended)
  • pipenv or virtualenv
  • Open ports: 3306, 5432, 9000, 9001, 3001, 8501, 4040, 7077, 8080, 3030
  • Database client (e.g., DBeaver)

Environment Configuration

Clone the repository and configure environment variables:

git clone https://github.com/caogiathinh/Urban_Mobility_Pipeline.git
cd Urban_Mobility_Pipeline
cp .env.example .env
# Edit the .env file with your credentials
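
The variable names below are hypothetical examples of what the .env might contain; a small pre-flight check like this can fail fast before the containers start.

import os
import sys

# Hypothetical variable names for illustration; match them to your .env.example.
REQUIRED_VARS = [
    "POSTGRES_USER",
    "POSTGRES_PASSWORD",
    "POSTGRES_DB",
    "GCP_PROJECT_ID",
    "GCS_BUCKET",
    "GOOGLE_APPLICATION_CREDENTIALS",
]

missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
if missing:
    sys.exit(f"Missing required environment variables: {', '.join(missing)}")
print("Environment looks good.")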

User Interfaces

Dashboard

Considerations & Limitations

  • Development Only: Current setup is for development; production deployment requires further hardening.
  • Schema Evolution: dbt transformations are modular; future schema changes should be versioned.

Future Enhancements

  • Complete big data processing with Apache Spark.
  • Implement testing, staging, and CI/CD pipelines.
  • Expand dbt transformations for richer business logic.
  • Add real-time streaming with Apache Kafka.
