
Panel pipeline: full plumbing for per-year snapshots (closes #345) #346

Draft
vahid-ahmadi wants to merge 15 commits into main from feat/panel-persistent-ids-345

Collaborator

@vahid-ahmadi commented Apr 17, 2026

Summary

Turns the single-year cross-sectional pipeline into a full dynamic panel microsimulation: real ONS demographic rates, real UKHLS-estimated within-person transitions, and a one-call composer that ages the UK population forward year-by-year. Closes #345.

No change to the default build. Every new behaviour is opt-in via create_yearly_snapshots(mode="full_panel") or by calling advance_year() directly.

The PR grew into four layers:

  1. Panel-ID plumbing (steps 1-2 of #345, "Generate panel data across years (synthetic panel on FRS 2023-24 base)"): ID contract + yearly-snapshot helper.
  2. Dynamics: demographic ageing + the five life-cycle transitions (marriage, separation, children leaving home, migration, within-person income/employment).
  3. Real data: ONS mortality / fertility / marriage / divorce / leaving-home / migration statistics, and UKHLS panel microdata (Waves 1-15) for within-person transitions.
  4. Composition: advance_year and a mode="full_panel" on create_yearly_snapshots to chain it all together deterministically under one seed.

What's in the pipeline

Every component below ships as opt-in machinery — callers decide which rates to use and when to run transitions.

Panel identity and snapshots

  • utils/panel_ids.py: PANEL_ID_COLUMNS, get_panel_ids, assert_panel_id_consistency, classify_panel_ids, PanelIDTransition. Documents household_id / benunit_id / person_id as the panel keys.
  • datasets/yearly_snapshots.py: create_yearly_snapshots(base, years, output_dir, *, mode, ...).
    • mode="uprate_only" (default): uprate each target year, IDs preserved byte-for-byte.
    • mode="full_panel": roll the population forward year-by-year via advance_year; IDs evolve (deaths / births / migration).
  • storage/upload_yearly_snapshots.py: parallel uploader to the existing private HF destination. Destination constants are not exposed as arguments — redirecting the upload requires a code edit reviewed under CLAUDE.md data-protection rules.
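
The survivors / deaths / births split that `classify_panel_ids` reports can be illustrated with plain set arithmetic. This is a minimal sketch, not the code on the branch: `IDTransition` and `classify_ids` are hypothetical stand-ins, and the real `PanelIDTransition` field names may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IDTransition:
    """Hypothetical stand-in for PanelIDTransition (field names are assumed)."""
    survivors: frozenset
    deaths: frozenset
    births: frozenset


def classify_ids(base_ids, other_ids):
    """Split two ID collections into survivors, deaths and births."""
    base, other = frozenset(base_ids), frozenset(other_ids)
    return IDTransition(
        survivors=base & other,  # present in both snapshots
        deaths=base - other,     # in the base snapshot only
        births=other - base,     # in the later snapshot only
    )


transition = classify_ids([1, 2, 3, 4], [2, 3, 4, 5, 6])
```

Under `mode="uprate_only"` the deaths and births sets would both be empty, which is the invariant `assert_panel_id_consistency` is described as enforcing.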

Demographic ageing

  • utils/demographic_ageing.py: age_dataset(base, years, *, mortality_rates, fertility_rates, seed).
  • Mortality removes person rows, fertility appends newborns attached to the mother's benunit + household, age increments.
  • Accepts either age-only Mapping[int, float] or sex-specific Mapping[str, Mapping[int, float]] for mortality.
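
A toy version of one ageing step may help make the order of operations concrete. This is a sketch under stated assumptions, not `age_dataset` itself: people are plain dicts rather than FRS tables, `mortality_probability` and `age_one_year` are hypothetical names, and a missing age or sex is assumed to mean a rate of zero (the PR does not state the real default).

```python
import random


def mortality_probability(rates, age, sex):
    """Look up q(x) from an age-only or a sex-specific mapping.

    The shape is detected from the key type: string keys mean
    {sex: {age: qx}}, integer keys mean {age: qx}.
    """
    if rates and isinstance(next(iter(rates)), str):
        return rates.get(sex, {}).get(age, 0.0)  # assumed default: 0.0
    return rates.get(age, 0.0)


def age_one_year(people, mortality_rates, fertility_rates, seed):
    """Toy one-year step: mortality, then fertility, then age increment."""
    rng = random.Random(seed)
    next_id = max(p["person_id"] for p in people) + 1
    survivors, newborns = [], []
    for person in people:
        q = mortality_probability(mortality_rates, person["age"], person["sex"])
        if rng.random() < q:
            continue  # the person row is removed
        survivors.append({**person, "age": person["age"] + 1})
        birth_p = fertility_rates.get(person["age"], 0.0)
        if person["sex"] == "FEMALE" and rng.random() < birth_p:
            newborns.append({
                "person_id": next_id,  # fresh, non-colliding ID
                "age": 0,
                "sex": rng.choice(["MALE", "FEMALE"]),
                "benunit_id": person["benunit_id"],      # mother's benunit
                "household_id": person["household_id"],  # mother's household
            })
            next_id += 1
    return survivors + newborns


people = [
    {"person_id": 1, "age": 80, "sex": "MALE", "benunit_id": 1, "household_id": 1},
    {"person_id": 2, "age": 30, "sex": "FEMALE", "benunit_id": 2, "household_id": 2},
]
aged = age_one_year(
    people,
    mortality_rates={"MALE": {80: 1.0}, "FEMALE": {30: 0.0}},
    fertility_rates={30: 1.0},
    seed=0,
)
```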

Household-composition transitions

All in utils/household_transitions.py:

  • apply_marriages: pair unmarried adults within region, closest-age match, merge benunits (weights folded).
  • apply_separations: split two-adult benunits; children default to the mother.
  • apply_children_leaving_home: move adult dependents (both FRS shapes — dependent young adults on parents' benunit, and separate benunits sharing the parental household) into fresh benunit + household.
  • apply_migration: Poisson-distributed net inflow / outflow per age, inflow cloned from same-age donors, outflow removed cleanly.
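
The closest-age matching described for `apply_marriages` could look roughly like the greedy pairing below. This is an illustrative sketch only: `pair_closest_age` is a hypothetical helper, people are plain dicts, and the greedy order is an assumption (the PR does not say how ties or ordering are handled). The opposite-sex restriction follows the commit message for the household transitions.

```python
def pair_closest_age(candidates, pool):
    """Pair each candidate with the closest-age unused partner in the same region."""
    used, pairs = set(), []
    for person in candidates:
        if person["person_id"] in used:
            continue  # already matched as someone else's partner
        options = [
            other for other in pool
            if other["person_id"] not in used
            and other["person_id"] != person["person_id"]
            and other["region"] == person["region"]
            and other["sex"] != person["sex"]
        ]
        if not options:
            continue  # nobody eligible in this region
        partner = min(options, key=lambda o: abs(o["age"] - person["age"]))
        used.update({person["person_id"], partner["person_id"]})
        pairs.append((person["person_id"], partner["person_id"]))
    return pairs


singles = [
    {"person_id": 1, "sex": "MALE", "age": 30, "region": "LONDON"},
    {"person_id": 2, "sex": "FEMALE", "age": 31, "region": "LONDON"},
    {"person_id": 3, "sex": "FEMALE", "age": 45, "region": "LONDON"},
    {"person_id": 4, "sex": "FEMALE", "age": 30, "region": "WALES"},
]
pairs = pair_closest_age(singles, singles)
```

In the real function the matched pair's benunits are then merged and their weights folded; that step is omitted here.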

Labour-market transitions

  • apply_employment_transitions(dataset, *, ukhls_rates=None, ...):
    • Rule-based (default): retirement at SPA → CPI-plus wage drift → job loss / gain.
    • UKHLS-driven (opt-in): classifies each working-age person into a four-state label and draws the next state from the empirical age-band × sex transition matrix estimated from UKHLS.
  • apply_income_decile_transitions: ranks workers into deciles within (age_band, sex), draws a destination decile from the UKHLS matrix, rescales income by the ratio of target/origin decile medians.
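
The two UKHLS-driven draws can be sketched with the standard library. The row below reuses the MALE 25-29 IN_WORK probabilities quoted in the UKHLS commit message; `draw_next_state` and `rescale_income` are hypothetical helper names, and the real functions operate on whole datasets rather than single values.

```python
import random

# One (age_band, sex, state_from) row of the employment transition matrix.
# Values are the MALE 25-29 IN_WORK figures quoted in this PR.
ROW = {"IN_WORK": 0.956, "UNEMPLOYED": 0.027, "INACTIVE": 0.017}


def draw_next_state(row, rng):
    """Draw the next labour-market state from one transition row."""
    states = list(row)
    return rng.choices(states, weights=[row[s] for s in states], k=1)[0]


def rescale_income(income, origin_median, target_median):
    """Scale income by the ratio of target to origin decile medians."""
    if origin_median <= 0:
        return income  # assumption: leave income untouched if the ratio is undefined
    return income * target_median / origin_median


rng = random.Random(0)
draws = [draw_next_state(ROW, rng) for _ in range(2000)]
share_in_work = draws.count("IN_WORK") / len(draws)
new_income = rescale_income(20000, origin_median=25000, target_median=30000)
```

Multiplying by the median ratio keeps a worker's relative position within the decile, which is the property the PR text highlights.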

Year-aware calibration plumbing

  • targets/build_loss_matrix.py: resolve_target_value(target, year, *, tolerance=3), constant YEAR_FALLBACK_TOLERANCE = 3.
  • Two real bug fixes picked up along the way:
    • local_authorities/loss.py was reading household_weight at a hard-coded 2025.
    • constituencies/loss.py was reading dataset.time_period instead of the passed time_period argument in two places.
      Both now honour the argument.
  • docs/targets_coverage.md: year coverage across every target category plus the remaining data-sourcing gaps.
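
The fallback policy behind `resolve_target_value` (exact match, else nearest past year within the tolerance, never a later year) can be sketched as follows. `resolve_value` is a hypothetical simplification that takes a plain `{year: value}` dict; returning `None` when nothing resolves is an assumption, and the VOA population scaling mentioned in the commit message is not modelled.

```python
YEAR_TOLERANCE = 3  # mirrors YEAR_FALLBACK_TOLERANCE in the PR


def resolve_value(values_by_year, year, *, tolerance=YEAR_TOLERANCE):
    """Return the value for `year`, or the nearest past year within tolerance."""
    if year in values_by_year:
        return values_by_year[year]  # exact match wins
    past = [y for y in values_by_year if y < year and year - y <= tolerance]
    if not past:
        return None  # assumption: no value rather than an error
    return values_by_year[max(past)]  # nearest past year


targets = {2022: 10.0, 2023: 11.0}
resolved = {year: resolve_value(targets, year) for year in range(2021, 2028)}
```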

Cross-year smoothness

  • utils/calibrate.py: compute_log_weight_smoothness_penalty, plus two keyword-only kwargs on calibrate_local_areas (prior_weights, smoothness_penalty). When both are supplied, a log-space L2 penalty pulls the current year's weights towards the prior year's. Off by default — pre-step-5 behaviour is reproduced bit-for-bit.
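
A dependency-free sketch of the penalty, assuming it is the mean squared difference between the current log-weights and the log of the prior weights, taken over entries with a non-zero prior. The function name and the flat-list inputs are illustrative; the helper on the branch works on the optimiser's tensors.

```python
import math


def log_weight_smoothness_penalty(log_weights, prior_weights):
    """Mean squared log-deviation from the prior, ignoring zero-prior entries."""
    if len(log_weights) != len(prior_weights):
        raise ValueError("log_weights and prior_weights must have the same length")
    squared = [
        (lw - math.log(pw)) ** 2
        for lw, pw in zip(log_weights, prior_weights)
        if pw > 0  # zero-prior entries are masked out of the mean
    ]
    return sum(squared) / len(squared) if squared else 0.0


unchanged = log_weight_smoothness_penalty([math.log(500), math.log(50)], [500, 50])
doubled = log_weight_smoothness_penalty([math.log(200)], [100])
halved = log_weight_smoothness_penalty([math.log(50)], [100])
masked = log_weight_smoothness_penalty([math.log(200), 5.0], [100, 0])
```

Working in log space makes doubling and halving a weight cost the same, which is the symmetry property listed among the unit tests.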

Top-level composer

  • utils/advance_year.py: advance_year(dataset, *, target_year, seed, ...). Chains migration → separation → leaving-home → marriage → employment transitions → income-decile transitions → mortality → fertility → age increment → uprating. One seeded RNG so the whole year is reproducible.
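
The composition pattern (pure steps, one seeded RNG threaded through all of them) can be sketched in a few lines. `compose_year` and the two toy steps are hypothetical; they stand in for the real transition functions only to show why a single seed reproduces the whole year.

```python
import random


def compose_year(dataset, steps, seed):
    """Run each step in order, threading one seeded RNG through all of them."""
    rng = random.Random(seed)
    for step in steps:
        dataset = step(dataset, rng)  # each step returns a fresh object
    return dataset


def toy_mortality(ages, rng):
    """Drop roughly 10% of people at random (illustrative rate)."""
    return [age for age in ages if rng.random() > 0.1]


def toy_age_increment(ages, rng):
    """Add one year to every survivor; the RNG is unused here."""
    return [age + 1 for age in ages]


base_ages = list(range(20, 60))
first = compose_year(base_ages, [toy_mortality, toy_age_increment], seed=0)
second = compose_year(base_ages, [toy_mortality, toy_age_increment], seed=0)
```

Because every step consumes draws from the same generator, reordering the steps changes the result even under the same seed, which is one reason the ordering has to be fixed.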

Real-data backing

All seven rate sources are either public ONS data (committed alongside the existing ONS xlsx fixtures in storage/) or aggregated UKHLS derivatives (cell-suppression ≥ 10 applied, raw microdata never committed).

| Transition | Source | Location on branch |
|---|---|---|
| Mortality | ONS National Life Tables UK (1980–2024, age × sex × qx) | targets/sources/ons_mortality.py + storage/ons_national_life_tables.xlsx |
| Fertility | ONS Births in England and Wales: registrations Table 10 | targets/sources/ons_fertility.py + storage/ons_asfr.xlsx |
| Marriage | ONS Marriage Statistics England & Wales | utils/household_transitions.py |
| Separation | ONS Divorce Statistics England & Wales | utils/household_transitions.py |
| Children leaving home | ONS LFS Young adults living with parents | utils/household_transitions.py |
| Migration | ONS Long-Term International Migration | utils/household_transitions.py |
| Within-person income / employment | UKHLS Waves 1–15 (SN 6614), 601,795 person-wave obs | datasets/ukhls.py + utils/ukhls_transitions.py + committed aggregated CSVs |

UKHLS pipeline detail

  • datasets/ukhls.py reads the UKDS-licensed .dta files from storage/ukhls/ (git-ignored; raw microdata never leaves the machine).
  • utils/ukhls_transitions.py pairs consecutive waves on pidp (cross-wave person key) and estimates:
    • P(state_{t+1} | state_t, age_band, sex) — four-state (IN_WORK / UNEMPLOYED / RETIRED / INACTIVE)
    • P(decile_{t+1} | decile_t, age_band, sex) — within-wave income deciles
  • Cells with fewer than 10 observations are dropped (ONS Safe Setting convention). Probabilities re-normalised within surviving (age_band, sex, state_from) row groups.
  • Aggregated outputs committed to storage/:
    • ukhls_employment_state_transitions.csv (≈16 KB)
    • ukhls_income_decile_transitions.csv (≈94 KB)
  • Spot-check validation vs published statistics:
    • MALE 25–29 IN_WORK → IN_WORK = 95.6 % (within noise of the ONS LFS 2-quarter flow rate).
    • MALE 25–29 UNEMPLOYED → IN_WORK = 36.5 % (consistent with ONS published ~35–40 %).
    • Avg probability of staying in the same income decile year-on-year = 39.9 % (consistent with IFS Living Standards mobility estimates).
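
The suppress-then-renormalise step described above can be sketched on a plain dict of counts. `transition_probabilities` is a hypothetical helper and the counts are made up for illustration; only the threshold of 10 comes from the PR.

```python
from collections import defaultdict

MIN_CELL = 10  # suppression threshold quoted in the PR


def transition_probabilities(counts, min_cell=MIN_CELL):
    """Turn {(group, state_from, state_to): n} into per-row probabilities.

    Cells with fewer than `min_cell` observations are dropped, then each
    surviving (group, state_from) row is renormalised to sum to 1.
    """
    rows = defaultdict(dict)
    for (group, state_from, state_to), n in counts.items():
        if n >= min_cell:
            rows[(group, state_from)][state_to] = n
    return {
        key: {state: n / sum(row.values()) for state, n in row.items()}
        for key, row in rows.items()
    }


# Illustrative counts only, not UKHLS figures.
counts = {
    ("MALE 25-29", "IN_WORK", "IN_WORK"): 90,
    ("MALE 25-29", "IN_WORK", "UNEMPLOYED"): 6,   # below threshold, dropped
    ("MALE 25-29", "IN_WORK", "INACTIVE"): 10,
}
probabilities = transition_probabilities(counts)
```

A row whose cells are all suppressed disappears entirely, which matches the note elsewhere in the PR that missing or suppressed-cell tuples pass through the transition unchanged.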

Data-protection safeguards

  • **/*.dta and storage/ukhls/ are gitignored (both directory and symlink forms).
  • Only aggregated, cell-suppressed transition tables are committed.
  • storage/upload_yearly_snapshots.py locks the private HF repo + GCS bucket constants at function level; redirecting the upload requires a reviewable code edit.

What's not in this PR (explicit non-goals)

  • create_datasets.py:main() refactor to invoke create_yearly_snapshots automatically — kept as a follow-up so the change-of-default conversation happens in its own PR.
  • Per-year calibration against year-specific targets — plumbing is here (resolve_target_value, smoothness penalty), but the full year-loop wiring belongs in a separate step.
  • Smoothness-penalty coefficient tuning — empirical question best answered against full loss matrices, deferred.
  • policyengine-uk changes — runtime-uprating skip flags, fixture migration — documented in docs/panel_downstream.md, left to the sibling repo.
  • Same-sex marriage, cohabitation, second-marriage modelling — deferred.
  • Admin-linked data (DWP Safe Access UKHLS variants) — require Secure Access, not a practical next step.
  • End-to-end smoke test on the real enhanced_frs_2023_24.h5 — 176 in-tree tests cover every unit and a composed advance_year run, but the full-panel dry-run through create_yearly_snapshots is a follow-up I'll attach before moving this out of Draft.

Test plan

176 new tests across the stack, all green locally; no real FRS or UKHLS microdata touched in CI (every test uses in-memory synthetic fixtures or committed aggregate CSVs).

test_advance_year.py                          7
test_age_dataset_sex_specific_mortality.py    5
test_apply_children_leaving_home.py           8
test_apply_employment_transitions.py         14
test_apply_income_decile_transitions.py       7
test_apply_marriages.py                       9
test_apply_migration.py                       9
test_apply_separations.py                     8
test_calibrate_smoothness_integration.py      5
test_conftest_fixtures.py                     7
test_demographic_ageing.py                   23
test_ons_fertility.py                         9
test_ons_mortality.py                         7
test_panel_ids.py                            10
test_resolve_target_value.py                 12
test_smoothness_penalty.py                   10
test_ukhls_transitions.py                     9
test_upload_yearly_snapshots.py               7
test_yearly_snapshots.py                     10

Notable coverage

  • Upload safety: the private HF repo, HF repo type and GCS bucket are locked by test; the function signature is asserted to only allow years and storage_folder as kwargs.
  • No-partial-upload invariant: a missing file aborts the upload call before any network traffic.
  • Smoothness regulariser: gradient masking on zero-prior entries, log-space symmetry, large-penalty integration test that proves pull-towards-prior.
  • Panel-ID contract: survivors, deaths, and births classified correctly across a single age_dataset run.
  • Disclosure control: UKHLS transition tests build a tiny cohort and confirm every low-count cell is suppressed before the CSV is written.
  • UKHLS integration: hermetic synthetic panel recovers planted transition probabilities within Monte-Carlo noise, plus a @pytest.mark.skipif path that exercises the committed aggregate CSVs when present.
  • Determinism: every stochastic transition has a same-seed → same-output test.
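
The no-partial-upload invariant amounts to validating every path before any network call is made. A minimal sketch, assuming the `enhanced_frs_<year>.h5` naming used elsewhere in this PR; `snapshot_paths` is a hypothetical helper and no upload code is shown.

```python
import tempfile
from pathlib import Path


def snapshot_paths(storage_folder, years):
    """Resolve every snapshot path up front; raise before anything is uploaded."""
    years = list(years)  # accept any iterable
    if not years:
        raise ValueError("years must not be empty")
    paths = [Path(storage_folder) / f"enhanced_frs_{year}.h5" for year in years]
    missing = [path.name for path in paths if not path.exists()]
    if missing:
        raise FileNotFoundError(f"missing snapshots: {missing}")
    return paths


with tempfile.TemporaryDirectory() as folder:
    (Path(folder) / "enhanced_frs_2024.h5").touch()
    found = [path.name for path in snapshot_paths(folder, [2024])]
```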

Running a one-year panel roll

from policyengine_uk_data.utils.advance_year import advance_year
from policyengine_uk_data.utils.ukhls_transitions import (
    load_employment_transitions, load_income_decile_transitions,
)
from policyengine_uk_data.targets.sources.ons_mortality import get_mortality_rates
from policyengine_uk_data.targets.sources.ons_fertility import get_fertility_rates

# `base` is an already-imputed base dataset, loaded beforehand (not shown here).
out = advance_year(
    base,
    target_year=2024,
    seed=0,  # one seed makes the whole year reproducible
    mortality_rates=get_mortality_rates(2024),
    fertility_rates=get_fertility_rates(2024),
    ukhls_employment_rates=load_employment_transitions(),
    ukhls_decile_rates=load_income_decile_transitions(),
)

Tracks and closes: #345.

vahid-ahmadi and others added 3 commits April 17, 2026 11:11
First step towards the per-year panel pipeline described in #345: document
that household_id, benunit_id and person_id are the panel keys that must
be preserved across yearly snapshots, and add a reusable
`assert_panel_id_consistency` utility so future year-loop code can enforce
the invariant at save time and in tests.

No behaviour change to the current single-year pipeline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Unblocks the Lint check on #346.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a standalone helper that takes an already-imputed base dataset and
produces one `enhanced_frs_<year>.h5` file per requested year by calling
`uprate_dataset` and saving. Every snapshot is verified against the base
with `assert_panel_id_consistency` at save time, so any future step that
mutates the person/benunit/household tables (e.g. demographic ageing in
step 3) cannot silently break the panel key contract.

Deliberately out of scope for this PR — tracked in #345:
- per-year calibration (needs year-specific targets, step 4)
- demographic ageing (step 3)
- restructuring `create_datasets.py:main()` to call this helper

The existing single-year pipeline is untouched; callers opt in to panel
output by invoking `create_yearly_snapshots` directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@vahid-ahmadi changed the title from "Add panel ID contract and consistency utility (step 1 of #345)" to "Panel pipeline: ID contract + yearly snapshot helper (steps 1-2 of #345)" on Apr 17, 2026

Introduces `age_dataset(base, years, *, seed, mortality_rates, fertility_rates)`
— the minimum-viable demographic ageing described in the plan. Per year
step:

- every surviving person's `age` column is incremented,
- persons sampled as dying are removed,
- new babies are appended with fresh, non-colliding `person_id` values and
  attached to the mother's existing benefit unit and household.

Deterministic via the `seed` argument. Placeholder mortality and fertility
tables ship with the module so it runs end-to-end in tests — they are
explicitly named `_PLACEHOLDER` and are due to be replaced by real ONS
life tables and fertility rates in a follow-up.

Also extends `utils/panel_ids` with `classify_panel_ids(base, other)`
and a `PanelIDTransition` dataclass so tests and diagnostics can describe
the survivors / deaths / births move without tripping the strict
`assert_panel_id_consistency` check (which remains the right tool for
uprating-style transforms that must not change ID sets).

Out of scope, tracked in #345:
- real ONS life tables and fertility rates,
- marriage, separation and leaving-home dynamics,
- migration,
- integration into `create_yearly_snapshots` — callers chain `age_dataset`
  and `uprate_dataset` themselves for now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@vahid-ahmadi changed the title from "Panel pipeline: ID contract + yearly snapshot helper (steps 1-2 of #345)" to "Panel pipeline: ID contract, yearly snapshots, demographic ageing (steps 1-3 of #345)" on Apr 17, 2026

Fixes two concrete bugs that would have prevented calibrating the same
base dataset at a year other than its stored `time_period`:

- `local_authorities/loss.py` read household weights at a hard-coded 2025
  when computing the national-total fallbacks used for LAs missing ONS
  data. Now uses the explicit `time_period` argument.
- `constituencies/loss.py` passed `dataset.time_period` to
  `get_national_income_projections` and `sim.default_calculation_period`
  even when the caller supplied a different `time_period`. Same fix.

Also extracts the year-resolution logic from `build_loss_matrix._resolve_value`
into a documented public function `resolve_target_value`, names the
three-year tolerance as a constant, and adds 12 unit tests covering the
fallback policy (exact match, nearest past year, tolerance limit, no
backwards extrapolation, VOA population scaling).

Ships `docs/targets_coverage.md` documenting year coverage across every
target category and where the real gaps are (DWP 2026+, local-area CSV
refreshes). No new data sourced in this PR — sourcing is deferred.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@vahid-ahmadi changed the title from "Panel pipeline: ID contract, yearly snapshots, demographic ageing (steps 1-3 of #345)" to "Panel pipeline: IDs, yearly snapshots, ageing, year-aware targets (steps 1-4 of #345)" on Apr 17, 2026

Adds an opt-in log-space L2 penalty to the training loss in
`calibrate_local_areas` that pulls the optimised weights towards a prior
year's weights. This is the regulariser that makes a sequence of
per-year calibrations statistically coherent as a panel — without it,
the same household can represent, say, 500 units in 2024 and 50 in 2025.

Design choices:

- The penalty is factored out into a pure helper
  `compute_log_weight_smoothness_penalty(log_weights, prior_weights)`
  so it can be unit-tested thoroughly. Entries where the prior is zero
  (households outside an area's country) are excluded from the mean so
  they neither pull nor inflate the penalty.
- `calibrate_local_areas` gains two keyword-only kwargs, `prior_weights`
  and `smoothness_penalty`, both defaulting to values that reproduce the
  pre-step-5 training loop exactly.
- Shape mismatches raise a clear `ValueError` rather than failing
  deep inside the optimiser.
- The penalty is computed from the underlying log-space weights (not
  the dropout-augmented tensor fed into the fit loss) so the regulariser
  does not double-count the dropout noise.

Tests (15 new, all in two files):

- 10 unit tests on the helper covering zero-when-equal, quadratic
  scaling, masking of zero-prior entries, gradient masking, shape
  validation, symmetric log deviation, differentiability, dtype
  round-trip and a hand-computed heterogeneous case.
- 5 integration tests on `calibrate_local_areas` with a three-household
  fake dataset: default kwargs reproduce pre-step-5 behaviour, shape
  mismatch raises, `None` prior + penalty is a no-op, zero penalty +
  prior is a no-op, and a large penalty measurably pulls weights
  towards the prior versus a no-smoothness run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@vahid-ahmadi changed the title from "Panel pipeline: IDs, yearly snapshots, ageing, year-aware targets (steps 1-4 of #345)" to "Panel pipeline: IDs, snapshots, ageing, year-aware targets, smoothness (steps 1-5 of #345)" on Apr 17, 2026

@vahid-ahmadi self-assigned this on Apr 17, 2026

Final step of the plan in #345: the consumer-facing plumbing that lets
something actually use per-year panel snapshots.

- `policyengine_uk_data/storage/upload_yearly_snapshots.py` — a
  deliberately parallel uploader to `upload_completed_datasets.py`.
  Pointed at the same private HuggingFace repo and GCS bucket; the
  destination constants are not exposed as function arguments, so
  redirecting the upload requires a code edit reviewed under
  CLAUDE.md's data-protection rules. The existing
  `upload_completed_datasets.py` is untouched.
- `policyengine_uk_data/tests/conftest.py` — adds `enhanced_frs_for_year`
  factory fixture. Resolves `enhanced_frs_<year>.h5` and falls back to
  the legacy `enhanced_frs_2023_24.h5` for the 2023 base year so existing
  tests keep passing without modification. Skips (rather than errors) if
  the requested year's file is missing.
- `docs/panel_downstream.md` — coordination note for sibling
  `policyengine-uk` repo: runtime-uprating skip options, fixture
  migration pattern, sensible default year set.

Tests (14 new):

- 7 on the uploader: pure path construction, iterable acceptance,
  empty-list rejection, missing-file rejection with no partial upload,
  upload-arguments lock to the private destination, destination
  constants locked to private repo, function signature does not allow
  redirect via kwargs.
- 7 on the fixture factory: resolves `enhanced_frs_<year>.h5`, skips
  cleanly when year is missing, falls back to legacy filename for 2023,
  prefers the new filename when both exist, accepts int and str years,
  existing `enhanced_frs` fixture still points at legacy name,
  `STORAGE_FOLDER` export is not accidentally shadowed.

Out of scope, flagged in `docs/panel_downstream.md`:

- Modifying `policyengine-uk` itself (separate repo).
- Changing the audited upload destinations.
- Actual decision on skip-vs-always-uprate at simulation time — the doc
  presents the two options and the tradeoffs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@vahid-ahmadi changed the title from "Panel pipeline: IDs, snapshots, ageing, year-aware targets, smoothness (steps 1-5 of #345)" to "Panel pipeline: full plumbing for per-year snapshots (closes #345)" on Apr 17, 2026

CI's pandas returned the panel gender column as StringDtype rather than
the plain object dtype I saw locally. `numpy.ndarray.astype` only
accepts numpy-compatible dtypes, so passing `StringDtype` on line 186
raised TypeError: Cannot interpret '<StringDtype(...)>' as a data type.

Fix: use object arrays when building newborn rows and let pandas coerce
them back to the template's extension dtype during `pd.concat`. Same
pattern applied to other non-numeric template columns in
`_build_newborn_rows`. Adds a regression test that explicitly casts
`gender` to pandas StringDtype before calling `age_dataset`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vahid-ahmadi and others added 7 commits April 20, 2026 12:43

Adds `targets/sources/ons_mortality.py`, which parses the UK National
Life Tables workbook (~600 KB, shipped alongside the other ONS xlsx
fixtures in `policyengine_uk_data/storage/`) into a long-format frame
keyed by period / sex / age / qx. Exposes `get_mortality_rates(year)`
returning `{sex: {age: qx}}` for the 3-year rolling period covering a
calendar year (with nearest-past fallback), plus a unisex helper.

Extends `age_dataset` in `utils/demographic_ageing.py` to accept the
sex-specific mapping shape in addition to the existing age-only
mapping. Detection is by key type, so the existing placeholder rates
and every current test continue to work unchanged.

Placeholder mortality rates are kept as the fallback default, but the
docstring now points callers at `get_mortality_rates` for real data.

Test coverage: 7 loader tests against a synthetic in-tree workbook
(period resolution, nearest-past fallback, unisex averaging, non-
period sheet filtering) plus 5 age_dataset integration tests
(backwards-compat, sex-specific kill/spare behaviour, missing-sex
fallback, missing-age default, real-rate shape sanity on a toy pop).
All 35 tests in the affected modules pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds `targets/sources/ons_fertility.py`, which parses Table 10 of the
ONS *Births in England and Wales: registrations* workbook (~840 KB,
shipped alongside the other ONS xlsx fixtures in
`policyengine_uk_data/storage/`). Exposes
`load_ons_fertility_rates()` returning a long-format frame keyed by
year / country / age_low / age_high / rate_per_1000, and
`get_fertility_rates(year, country=...)` returning a single-year
`{age: probability}` map that plugs straight into
`age_dataset(..., fertility_rates=...)`.

Handles the ONS band format:

- "Under 20" → ages 15-19 (conventional start of the fertility window).
- "20 to 24" ... "35 to 39" → ages 20-24 ... 35-39, uniform within band.
- "40 and over" → ages 40-44 only (5-year cap). Expanding an open band
  uniformly across the whole fertility window would otherwise overstate
  ASFR at ages 45+ by an order of magnitude, since the overwhelming
  majority of 40+ births happen at 40-44.
- Rates converted from births-per-1 000 to per-woman-per-year
  probability.
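
The band handling listed above can be sketched as a small pure function. `expand_band` is a hypothetical name and the rates in the example are made up; the band limits and the per-1 000 conversion follow the list.

```python
def expand_band(label, rate_per_1000):
    """Map an ONS age-band label to {age: annual birth probability}."""
    if label == "Under 20":
        low, high = 15, 19  # conventional start of the fertility window
    elif label == "40 and over":
        low, high = 40, 44  # open band capped at five years
    else:
        low_text, high_text = label.split(" to ")
        low, high = int(low_text), int(high_text)
    probability = rate_per_1000 / 1000.0  # births per 1 000 women -> probability
    return {age: probability for age in range(low, high + 1)}


# Illustrative rates only, not ONS figures.
under_20 = expand_band("Under 20", 10.0)
late_twenties = expand_band("25 to 29", 90.0)
open_band = expand_band("40 and over", 15.0)
```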

Year resolution: exact match preferred, nearest past year as fallback
(mirrors the mortality loader). Future years silently fall back to the
latest available; pre-1938 requests raise a clear KeyError.

Test coverage: 9 new loader/integration tests against a synthetic
in-tree workbook (year fallback, open-band cap, under-20 lower bound,
country filter, probability scaling, end-to-end age_dataset
integration). Zero network access in CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…345)

Introduces `utils/household_transitions.py` with five life-cycle
transitions that complement the existing mortality + fertility
mechanics:

- `apply_marriages`: pair single adults with an in-region opposite-sex
  partner (closest-age match), merge benunits and fold weights. Uses
  age × sex rates (ONS Marriage Statistics smoothed averages).
- `apply_separations`: split two-adult benunits with the couple's
  mean-age rate (ONS Divorce Statistics). Children attach to the
  mother by default; overridable. New benunit + household rows are
  minted for the mover and regions are preserved.
- `apply_children_leaving_home`: move adult dependents out of their
  parents' benunit + household. Handles both FRS shapes (dependent
  young adult on parents' benunit, or adult child with their own
  single benunit inside the parental household). Uses age-indexed
  rates (ONS LFS "Young adults living with parents").
- `apply_migration`: Poisson-distributed net inflow/outflow by age
  (ONS Long-Term International Migration estimates). Immigration
  clones donor rows at the same age; emigration randomly removes
  rows and cleans up orphaned benunit/household rows.
- `apply_employment_transitions`: rule-based placeholder for
  within-person labour-market moves — retirement at state-pension
  age, CPI-plus wage drift, and configurable job loss/gain rates
  with nearest-age income donor for gainers. Will be replaced with
  UKHLS-estimated rates in a follow-up.

All functions are pure (no mutation), deterministic given an explicit
RNG, and use only columns and ID shapes already present in the FRS /
pe-uk schema. `is_married`, `is_single`, `is_couple` etc. pick up the
changes automatically because they are derived from adult counts in
benunits.

44 new tests across the five modules cover: rate-zero is a no-op,
rate-one produces maximal transitions, derived boolean flags flip
correctly, benunit/household rows stay consistent, deterministic
under a fixed seed, and the default rates produce sensible aggregates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds *.dta to the gitignore pattern and explicitly ignores
policyengine_uk_data/storage/ukhls (both directory and symlink forms).
UKHLS is UKDS-licensed under the same EUL regime as the FRS; per
CLAUDE.md, individual-level records must never enter the repo. Only
aggregate, non-disclosive outputs derived from this data may be
tracked.

Adds the UKHLS side of the panel pipeline, the last piece needed for
real within-person year-on-year dynamics, complementing the ONS
mortality / fertility / marriage / separation / leaving-home /
migration rates already on the branch.

Three pieces:

- `datasets/ukhls.py`: loader for the UKDS-licensed Main Survey
  indresp files (Waves 1-15, study SN 6614). Normalises wave-letter
  prefixes away, harmonises the rich jbstat enum into a four-state
  labour market label (IN_WORK / UNEMPLOYED / RETIRED / INACTIVE).
  The raw .dta files live under `storage/ukhls/` and are git-ignored
  along with all .dta patterns; only this loader sees individual
  rows.

- `utils/ukhls_transitions.py`: estimator that pairs consecutive
  waves on `pidp` (the cross-wave person key) and computes
  P(state_t+1 | state_t, age_band, sex) for employment states and
  P(decile_t+1 | decile_t, age_band, sex) for income deciles. Cells
  with fewer than 10 observations are suppressed to meet ONS Safe
  Setting disclosure-control conventions; probabilities are
  renormalised post-suppression so each (age_band, sex, state_from)
  group still sums to 1.

- Aggregated outputs, committed to `storage/`:
    - `ukhls_employment_state_transitions.csv` (15.6 KB)
    - `ukhls_income_decile_transitions.csv` (94 KB)
  Both are safe to ship. No individual-level information is included;
  the smallest surviving cell size is 10.

Sanity check against published statistics (15-wave run, 601 K
person-wave rows, 38 K+ panel retention per wave transition):
  MALE 25-29 IN_WORK → 95.6% stay employed, 2.7% unemployed, 1.7%
  inactive — within noise of ONS LFS 2-quarter flow rates.
  Average probability of staying in the same income decile year-on-
  year: 39.9% — consistent with IFS Living Standards publications.

Test coverage: 9 hermetic tests using a synthetic panel (no real
microdata touched in CI). Probability-sum invariants, planted-rate
recovery, disclosure-control suppression, decile round-trip, and
optional tests against the committed aggregate CSVs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…refs #345)

Lands steps 1-4 of the remaining panel work:

- **UKHLS rates into `apply_employment_transitions`** (step 2):
  new `ukhls_rates=` kwarg accepts the nested dict produced by
  `utils.ukhls_transitions.load_employment_transitions`. When
  supplied, each working-age person is reclassified into the
  four-state labour market label (IN_WORK / UNEMPLOYED / RETIRED /
  INACTIVE), their next state is drawn from the empirical age-band
  × sex transition matrix, and income / employment_status columns
  are updated accordingly. Retirement at SPA and wage drift still
  run in addition. The rule-based job_loss/job_gain path is
  bypassed, so pass both rates even if you set them to zero elsewhere.

- **`apply_income_decile_transitions`** (step 3): new function in
  `utils/household_transitions.py` that, given the UKHLS decile
  transition table, ranks workers into deciles within each
  (age_band, sex) cell, draws a destination decile, and rescales
  their income by the ratio of target / origin decile medians.
  Preserves relative within-decile position; missing or
  suppressed-cell tuples pass through unchanged.

- **`advance_year` composer** (step 1): new
  `utils/advance_year.py` chains migration → separation →
  leaving-home → marriage → employment transitions → decile
  transitions → mortality / fertility / age increment → uprating
  under a single seeded RNG. Ordering is load-bearing and
  documented inline; every step returns a fresh dataset so nothing
  mutates the caller's input.

- **`create_yearly_snapshots` mode kwarg** (step 4): adds
  `mode="uprate_only"` (default, unchanged behaviour) and
  `mode="full_panel"` which rolls forward year-by-year via
  `advance_year`. Panel IDs evolve under full_panel mode (deaths,
  births, migration), so the byte-for-byte ID assertion only runs
  in uprate-only mode. Per-year seed is derived as
  `seed + (year - base_year)` so runs are reproducible from a
  single scalar.
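
The per-year seed rule quoted above is simple enough to show directly. `per_year_seeds` is a hypothetical helper; only the formula `seed + (year - base_year)` comes from this commit message.

```python
def per_year_seeds(seed, base_year, years):
    """Derive one reproducible seed per target year from a single scalar."""
    return {year: seed + (year - base_year) for year in years}


seeds = per_year_seeds(seed=42, base_year=2023, years=[2024, 2025, 2026])
```

Each year gets a distinct seed, and re-running any single year with the same scalar gives the same seed for that year.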

Test coverage: 38 tests across the four areas.
- 14 total in apply_employment_transitions (4 new for UKHLS rates).
- 7 in apply_income_decile_transitions.
- 7 in advance_year (mutation-free, age increment, determinism,
  all-disabled smoke test, UKHLS-rates wiring, uprating, default year).
- 10 in yearly_snapshots (3 new for full_panel mode: writes each
  year, year-over-year ageing, unknown-mode error).
135-test suite across the whole panel pipeline stays green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Conflict was in `policyengine_uk_data/utils/calibrate.py`: main added
`load_weights` (#351 defensive h5-weight loader) at the same file
position where this branch added `compute_log_weight_smoothness_penalty`
(#346 step 5). Both functions are independent and both stay.

All 150 tests in the panel-pipeline suite + the calibrate smoothness
tests pass after the merge, including `load_weights` consumers pulled
in from main (test_calibrate_save, test_la_land_value_targets).

@vahid-ahmadi
Collaborator Author

Follow-up PRs (explicitly not this one)

  1. Wire create_yearly_snapshots(mode="full_panel") into create_datasets.py:main() — the default build starts producing enhanced_frs_<year>.h5 for every year in a declared range. Kept separate per the PR body so the change-of-default conversation happens in its own PR.

  2. Per-year calibration loop with smoothness penalty on — run calibrate_local_areas on each produced snapshot, passing the prior year's final weights as prior_weights and a tuned smoothness_penalty coefficient. Plumbing is here; the orchestration is not.

  3. Tune the smoothness-penalty coefficient — small sweep on real loss matrices (the empirical piece the PR body explicitly defers).

  4. Validation harness — script that compares panel-rolled output year-on-year aggregates against DWP Income Dynamics published transition tables + IFS income mobility figures. Signals when any transition rate is drifting from published statistics.

  5. policyengine-uk sibling-repo changes — runtime-uprating skip options and fixture migration, per docs/panel_downstream.md.

  6. LFS 5-quarter rotation panel as a cross-check source — UKHLS covers the long horizon; LFS gives a richer short-horizon view of employment-state transitions. Useful as an independent validator of the employment-state matrix in this PR.


Development

Successfully merging this pull request may close these issues.

Generate panel data across years (synthetic panel on FRS 2023-24 base)

1 participant