Urban Mobility Pipeline

A complete end-to-end data platform that ingests, processes, and analyzes complex, multi-source public datasets for business intelligence.


Table of Contents

  1. Introduction
  2. Objectives
  3. System Architecture
  4. Setup and Installation
  5. Considerations & Limitations
  6. Future Enhancements

Introduction

Urban mobility is a critical concern for modern cities, requiring robust data solutions to enable informed decisions for transportation, infrastructure, and sustainability. This project delivers a scalable, modular data pipeline that ingests, processes, and analyzes diverse urban mobility datasets, empowering business intelligence and analytics teams to derive actionable insights.

Objectives

  • Aggregate and harmonize complex, multi-source public urban mobility datasets.
  • Build a robust ELT (Extract, Load, Transform) pipeline for efficient data processing.
  • Enable flexible analytics and visualization to support decision-making.
  • Provide a foundation for advanced data science, reporting, and business intelligence use cases.

System Architecture

(See the system architecture diagram in images/.)

Directory Structure

The repository is organized for clarity and modularity. Key components include:

├── 01-docker-terraform/
│   ├── 1_terraform_gcp/           # Infrastructure as Code for GCP (main.tf, variable.tf, keys/)
│   │   └── keys/                  # GCP service account credentials
│   └── 2_docker_sql/              # Docker setup for SQL ingest
│       ├── Dockerfile             # Custom Docker image for data ingestion
│       ├── docker-compose.yaml    # Compose file for multi-container setup
│       ├── ingest_data.py         # Python script for ingesting data into SQL
│       ├── pipeline.py            # Data pipeline orchestration script
│       ├── *.parquet              # Sample Parquet data files
│       ├── *.csv                  # Lookup tables and reference data
│       └── ny_taxi_postgres_data/ # PostgreSQL data directory (volumes)
├── 02-workflow-orchestration/
│   ├── docker-compose.yml         # Compose file for workflow orchestration
│   └── workflow/                  # Kestra workflow definitions (YAML)
├── 03-data-warehouse/
│   └── bigquery.sql               # BigQuery schema and logic
├── 04-analytics-engineering/
│   └── taxi_rides_ny/             # dbt analytics engineering project
│       ├── dbt_project.yml        # dbt project config
│       ├── models/                # dbt models (core, staging)
│       ├── seeds/                 # Seed data (CSV, properties)
│       ├── macros/                # Custom dbt macros
│       ├── analyses/              # SQL analysis scripts
│       ├── snapshots/             # dbt snapshots
│       └── README.md              # Project documentation
├── docs/                         # Additional documentation and notes
│   └── note.txt                   # Project notes
├── images/                       # Architecture, workflow, and data model diagrams
├── LICENSE                       # Project license
├── README.md                     # Main project documentation

Pipeline Overview

(See the orchestration workflow diagrams for the yellow data, green data, and merged result flows in images/.)

The pipeline leverages Docker for containerization and Kestra for orchestration, following best practices for modern data engineering:

  1. Data Ingestion: Collects urban mobility data from public sources and APIs and loads it into PostgreSQL for initial processing (a minimal ingestion sketch follows this list).
  2. Raw to Data Lake: Employs Spark and Polars for transformation, storing optimized Parquet files in Google Cloud Storage (object storage).
  3. Data Warehouse: Loads cleansed, enriched data into Google BigQuery for advanced analytics and reporting.
  4. Transformations: Utilizes dbt for declarative transformations and modeling.
  5. Visualization: Powers business dashboards with Google Data Studio.
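
As a rough illustration of step 1, the sketch below mirrors what an ingestion script such as ingest_data.py might do: stream a monthly trip file into PostgreSQL in chunks. The connection settings, table name, and input file are illustrative assumptions, not the repository's actual code.

import pandas as pd
from sqlalchemy import create_engine

# Assumed local connection settings and target table; adjust to your .env values.
PG_URL = "postgresql://root:root@localhost:5432/ny_taxi"
TABLE = "yellow_taxi_trips"

def ingest_csv(path: str, chunksize: int = 100_000) -> None:
    """Stream a large CSV into PostgreSQL in chunks to keep memory bounded."""
    engine = create_engine(PG_URL)
    with pd.read_csv(path, chunksize=chunksize) as reader:
        for i, chunk in enumerate(reader):
            # Append each chunk; the first call creates the table if it does not exist.
            chunk.to_sql(TABLE, engine, if_exists="append", index=False)
            print(f"inserted chunk {i} ({len(chunk)} rows)")

if __name__ == "__main__":
    ingest_csv("yellow_tripdata_2021-01.csv")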

Data Lake Architecture

The data lake is structured to support scalability and performance:

  • Raw Zone: Ingested data in its original format (CSV, JSON, XML).
  • Processed Zone: Cleaned and transformed data in Parquet format (see the conversion sketch after this list).
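
A minimal sketch of the raw-to-processed conversion using Polars, followed by an upload to the GCS data lake; the bucket name and object paths are assumptions, not the repository's actual layout.

import polars as pl
from google.cloud import storage

def csv_to_parquet(raw_path: str, processed_path: str) -> None:
    # Lazily scan the raw file so large inputs are not fully loaded into memory.
    pl.scan_csv(raw_path).sink_parquet(processed_path, compression="snappy")

def upload_to_gcs(local_path: str, bucket_name: str, blob_name: str) -> None:
    # Uses the service account referenced by GOOGLE_APPLICATION_CREDENTIALS.
    client = storage.Client()
    client.bucket(bucket_name).blob(blob_name).upload_from_filename(local_path)

if __name__ == "__main__":
    csv_to_parquet("raw/yellow_tripdata_2021-01.csv",
                   "processed/yellow_tripdata_2021-01.parquet")
    # "urban-mobility-datalake" is a placeholder bucket name.
    upload_to_gcs("processed/yellow_tripdata_2021-01.parquet",
                  "urban-mobility-datalake",
                  "processed/yellow/yellow_tripdata_2021-01.parquet")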

Data Warehouse Schema

(See the data warehouse schema diagram in images/.)

The data warehouse follows star schema modeling:

  • Staging: Raw ingested data
  • Fact & Dimensional: Aggregated, feature-engineered, and analytics-ready data (a BigQuery load sketch follows this list)
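
To make the staging layer concrete, the sketch below loads processed Parquet files from GCS into a BigQuery staging table with the official Python client; the project, dataset, table, and bucket names are assumptions.

from google.cloud import bigquery

# Assumed project, dataset, and bucket identifiers.
PROJECT = "urban-mobility-project"
STAGING_TABLE = f"{PROJECT}.staging.yellow_tripdata"

client = bigquery.Client(project=PROJECT)

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(
    "gs://urban-mobility-datalake/processed/yellow/*.parquet",
    STAGING_TABLE,
    job_config=job_config,
)
load_job.result()  # block until the load job finishes
print(f"Loaded {client.get_table(STAGING_TABLE).num_rows} rows into staging")

From staging, the dbt project in 04-analytics-engineering/taxi_rides_ny builds the fact and dimension models.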

Data Warehouse Partitioning & Clustering

All files are stored in Parquet format for performance and scalability. Media files (e.g., images, sensor logs) follow a consistent naming scheme for seamless integration.
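
As a hedged sketch of how partitioning and clustering could be declared with the BigQuery Python client, the snippet below partitions a fact table by pickup date and clusters it by common filter columns. The column names are assumptions based on a typical taxi-trip schema, not the repository's bigquery.sql.

from google.cloud import bigquery

client = bigquery.Client(project="urban-mobility-project")  # assumed project id

# Assumed fact-table columns; the real schema lives in 03-data-warehouse/bigquery.sql.
table = bigquery.Table(
    "urban-mobility-project.analytics.fact_trips",
    schema=[
        bigquery.SchemaField("tripid", "STRING"),
        bigquery.SchemaField("vendorid", "INTEGER"),
        bigquery.SchemaField("payment_type", "INTEGER"),
        bigquery.SchemaField("pickup_datetime", "TIMESTAMP"),
        bigquery.SchemaField("total_amount", "FLOAT"),
    ],
)
# Partition by pickup date and cluster by the most common filter columns.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="pickup_datetime",
)
table.clustering_fields = ["vendorid", "payment_type"]

client.create_table(table, exists_ok=True)

Partitioning restricts scans to the dates a query touches, and clustering co-locates rows that share the clustered column values, which lowers scan cost for dashboard queries.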

Data Model

The data model is designed to support the analytical needs of urban mobility stakeholders. The key entities are documented in the diagrams below:

(See the yellow taxi, green taxi, and merged dbt data model diagrams in images/.)

Setup and Installation

Prerequisites

  • Git
  • Docker (>= 4GB RAM, 6 cores, 16GB disk)
  • CMake (for UNIX systems)
  • Python 3.x (3.9.x recommended)
  • pipenv or virtualenv
  • Open ports: 3306, 5432, 9000, 9001, 3001, 8501, 4040, 7077, 8080, 3030
  • Database client (e.g., DBeaver)

Environment Configuration

Clone the repository and configure environment variables:

git clone https://github.com/caogiathinh/Urban_Mobility_Pipeline.git
cd Urban_Mobility_Pipeline
cp .env.example .env
# Edit the .env file with your credentials
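
The variable names below are hypothetical examples of what the .env might contain; a small pre-flight check like this can fail fast before the containers start.

import os
import sys

# Hypothetical variable names for illustration; match them to your .env.example.
REQUIRED_VARS = [
    "POSTGRES_USER",
    "POSTGRES_PASSWORD",
    "POSTGRES_DB",
    "GCP_PROJECT_ID",
    "GCS_BUCKET",
    "GOOGLE_APPLICATION_CREDENTIALS",
]

missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
if missing:
    sys.exit(f"Missing required environment variables: {', '.join(missing)}")
print("Environment looks good.")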

User Interfaces

Dashboard

Considerations & Limitations

  • Development Only: Current setup is for development; production deployment requires further hardening.
  • Schema Evolution: dbt transformations are modular; future schema changes should be versioned.

Future Enhancements

  • Complete big data processing with Apache Spark.
  • Implement testing, staging, and CI/CD pipelines.
  • Expand dbt transformations for richer business logic.
  • Add real-time streaming with Apache Kafka.
