fix: Colab compatibility, NYC TLC dataset, new README and AGENTS.md#1

Open
CLAV88 wants to merge 1 commit into coder2j:main from CLAV88:main

CLAV88 commented May 2, 2026

What this PR fixes

This PR makes the tutorial work correctly on Google Colab,
which is how most beginners will run it.

Problems fixed

Environment cell breaks Colab: The os.environ SPARK_HOME
cell in every notebook hardcodes a local Mac path that does not
exist on Colab's VM. Replaced it with a pip install pyspark setup
cell, which needs no machine-specific path because pip-installed
pyspark bundles its own binaries.
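
For readers who still have SPARK_HOME set from an older copy of the notebooks, a defensive guard might look like the sketch below. It is not the setup cell from this PR (which simply drops the variable and runs pip install pyspark); the helper name clear_stale_spark_home is hypothetical, and the sketch assumes pyspark honours SPARK_HOME when it is set.

```python
import os


def clear_stale_spark_home(environ=os.environ):
    """Drop SPARK_HOME when it points at a directory that does not exist.

    Returns the removed value, or None if nothing was changed.
    """
    spark_home = environ.get("SPARK_HOME")
    if spark_home and not os.path.isdir(spark_home):
        # A path copied from another machine would otherwise be picked up
        # instead of the binaries bundled with the pip-installed pyspark.
        return environ.pop("SPARK_HOME")
    return None


# In Colab, `%pip install pyspark` would run in a cell before this one.
removed = clear_stale_spark_home()
```

A SPARK_HOME that points at a real directory is left untouched, so the guard should be harmless on a local install.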

Relative data paths fail: Replaced all ./data/ paths with
/content/data/ absolute paths that Colab can resolve.
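
One way to keep the absolute prefix in a single place, rather than repeating it in every cell, is a small path helper. This is only a sketch: the names DATA_ROOT and data_path and the file name trips.parquet are illustrative and not taken from the notebooks.

```python
from pathlib import PurePosixPath

# Colab data location named in this PR.
DATA_ROOT = PurePosixPath("/content/data")


def data_path(relative):
    """Map a tutorial-style './data/...' path to an absolute path under DATA_ROOT."""
    parts = PurePosixPath(relative).parts
    # Drop a leading 'data' component so './data/x.csv' and 'x.csv' agree.
    if parts and parts[0] == "data":
        parts = parts[1:]
    return str(DATA_ROOT.joinpath(*parts))


print(data_path("./data/trips.parquet"))
```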

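Write cells fail on re-run: Added .mode("overwrite") to all
df.write cells. Spark's default save mode is error (errorifexists),
so without it every write fails once the output path exists, that
is, on every re-run after the first. Re-running cells is the normal
learner workflow.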
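
The pattern can be wrapped in a tiny helper, sketched below. The function name save_parquet and the output path in the usage comment are hypothetical; df is assumed to be a PySpark DataFrame, and Parquet is chosen only for illustration.

```python
def save_parquet(df, path):
    """Write df to path, replacing any output left by a previous run."""
    # DataFrameWriter.mode("overwrite") replaces existing output; the default
    # mode ("error" / "errorifexists") raises if the path already exists.
    df.write.mode("overwrite").parquet(path)


# Usage in a notebook cell, with df created earlier:
# save_parquet(df, "/content/data/output/trips")
```

Note that overwrite deletes whatever is already at the target path, so it suits tutorial output directories rather than paths holding data that must be kept.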

Clone fails on re-run: Added an existence check before git clone
calls. Without it, re-running any notebook fails with exit code 128
because the destination directory already exists.
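
The commit message describes this guard as based on shutil.rmtree. A minimal sketch of that idea follows; the function name fresh_clone, its parameters, and the injectable run argument are assumptions made for illustration, not the code in the notebooks.

```python
import os
import shutil
import subprocess


def fresh_clone(repo_url, dest, run=subprocess.run):
    """Clone repo_url into dest, removing any earlier copy first."""
    if os.path.exists(dest):
        # git clone refuses to write into an existing non-empty directory
        # (exit code 128), so clear the leftover from the previous run.
        shutil.rmtree(dest)
    run(["git", "clone", repo_url, dest], check=True)
```

An alternative is to skip the clone when the directory already exists, which avoids the repeated download but keeps whatever state the earlier run left behind.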

Dataset replacement

Replaced all synthetic sample data (5-20 rows) with the NYC TLC
Yellow Taxi dataset (Jan 2023, ~3M rows, 19 columns). Real financial
columns — fare_amount, tip_amount, payment_type — make every
exercise more meaningful and representative of production data work.

New files

  • README.md — rewritten with stage structure, Colab setup
    instructions, per-stage test questions, and embedded diagrams
  • AGENTS.md — AI-assisted teaching guidance and common error
    reference for learners using AI tools to work through the tutorial
  • assets/ — 4 SVG diagrams illustrating key concepts

Tested on

Google Colab, Spark 4.0.2, pip-installed pyspark

- Remove os.environ SPARK_HOME cell from all notebooks
  (hardcoded local Mac path breaks Colab — pip install pyspark
  bundles its own binaries, no SPARK_HOME needed)

- Replace all ./data/ relative paths with /content/data/
  (Colab resolves relative paths against a temp kernel directory
  that does not exist — absolute paths required)

- Add mode('overwrite') to all df.write cells
  (default write mode is error — fails on every re-run after first,
  which is the normal learner workflow)

- Add idempotent git clone guard in data bootstrap cells
  (exit code 128 on re-run because destination directory already
  exists — shutil.rmtree guard makes bootstrap safe to re-run)

- Replace synthetic sample data with NYC TLC Yellow Taxi dataset
  (Jan 2023, ~3M rows, 19 columns including fare_amount, tip_amount,
  payment_type, datetime fields — more representative of production
  data engineering than 5-row synthetic samples)

- Rewrite README.md with stage structure, Colab setup instructions,
  per-stage test questions, and embedded SVG diagrams

- Add AGENTS.md for AI-assisted learning discovery and teaching
  walkthrough guidance

- Add assets/ folder with 4 SVG diagrams:
  partition-vs-table, rdd-vs-dataframe, csv-vs-parquet,
  groupby-vs-window