Panel pipeline: full plumbing for per-year snapshots (closes #345) by vahid-ahmadi · Pull Request #346 · PolicyEngine/policyengine-uk-data

vahid-ahmadi · 2026-04-17T10:12:13Z

Summary

Turns the single-year cross-sectional pipeline into a full dynamic panel microsimulation: real ONS demographic rates, real UKHLS-estimated within-person transitions, and a one-call composer that ages the UK population forward year-by-year. Closes #345.

No change to the default build. Every new behaviour is opt-in via create_yearly_snapshots(mode="full_panel") or by calling advance_year() directly.

The PR grew into four layers:

Panel-ID plumbing (steps 1-2 of #345, "Generate panel data across years (synthetic panel on FRS 2023-24 base)"): ID contract + yearly-snapshot helper.
Dynamics: demographic ageing + the five life-cycle transitions (marriage, separation, children leaving home, migration, within-person income/employment).
Real data: ONS mortality / fertility / marriage / divorce / leaving-home / migration statistics, and UKHLS panel microdata (Waves 1-15) for within-person transitions.
Composition: advance_year and a mode="full_panel" on create_yearly_snapshots to chain it all together deterministically under one seed.

What's in the pipeline

Every component below ships as opt-in machinery — callers decide which rates to use and when to run transitions.

Panel identity and snapshots

utils/panel_ids.py: PANEL_ID_COLUMNS, get_panel_ids, assert_panel_id_consistency, classify_panel_ids, PanelIDTransition. Documents household_id / benunit_id / person_id as the panel keys.
datasets/yearly_snapshots.py: create_yearly_snapshots(base, years, output_dir, *, mode, ...).
- mode="uprate_only" (default): uprate each target year, IDs preserved byte-for-byte.
- mode="full_panel": roll the population forward year-by-year via advance_year; IDs evolve (deaths / births / migration).
storage/upload_yearly_snapshots.py: parallel uploader to the existing private HF destination. Destination constants are not exposed as arguments — redirecting the upload requires a code edit reviewed under CLAUDE.md data-protection rules.
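The difference between the strict consistency check and the survivors / deaths / births classification can be sketched with plain sets. This is a minimal illustration, not the code in utils/panel_ids.py: `IDTransition`, `classify_ids` and `assert_ids_consistent` are hypothetical stand-ins for `PanelIDTransition`, `classify_panel_ids` and `assert_panel_id_consistency`, and they work on one ID column at a time rather than on a dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IDTransition:
    """Hypothetical stand-in for PanelIDTransition (one ID column only)."""
    survivors: frozenset  # IDs present in both snapshots
    deaths: frozenset     # IDs in the base that are missing from the other
    births: frozenset     # IDs in the other that are missing from the base


def classify_ids(base_ids, other_ids):
    """Describe how an ID set changed between two snapshots."""
    base, other = frozenset(base_ids), frozenset(other_ids)
    return IDTransition(base & other, base - other, other - base)


def assert_ids_consistent(base_ids, other_ids):
    """Strict check for uprating-style transforms: the ID set must not change."""
    t = classify_ids(base_ids, other_ids)
    if t.deaths or t.births:
        raise AssertionError(
            f"ID set changed: {len(t.deaths)} removed, {len(t.births)} added"
        )


# An ageing-style step: person 1 leaves, person 4 arrives.
transition = classify_ids([1, 2, 3], [2, 3, 4])

# An uprating-style step must pass the strict check.
assert_ids_consistent([1, 2, 3], [3, 2, 1])
```

The strict assertion suits mode="uprate_only", while the classification is the tool for describing a mode="full_panel" step where the ID set is expected to move.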

Demographic ageing

utils/demographic_ageing.py: age_dataset(base, years, *, mortality_rates, fertility_rates, seed).
Mortality removes person rows, fertility appends newborns attached to the mother's benunit + household, age increments.
Accepts either age-only Mapping[int, float] or sex-specific Mapping[str, Mapping[int, float]] for mortality.
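One ageing step can be sketched with the standard library alone. This is an illustration under stated assumptions, not the `age_dataset` implementation: people are plain dicts rather than dataset tables, `lookup_mortality` and `age_one_year` are hypothetical names, the "MALE" / "FEMALE" labels are assumed, and the newborn's sex is drawn 50/50 for simplicity. The shape detection by key type follows the description in the mortality commit message.

```python
import random


def lookup_mortality(rates, age, sex, default=0.0):
    """Accept {age: qx} or {sex: {age: qx}}; the shape is detected by key type."""
    if rates and isinstance(next(iter(rates)), str):
        return rates.get(sex, {}).get(age, default)
    return rates.get(age, default)


def age_one_year(people, mortality, fertility, seed):
    """Return a new list: deaths removed, newborns appended, ages incremented."""
    rng = random.Random(seed)
    next_id = max((p["person_id"] for p in people), default=0) + 1
    survivors, newborns = [], []
    for p in people:
        if rng.random() < lookup_mortality(mortality, p["age"], p["sex"]):
            continue  # sampled as dying: the row is dropped
        if p["sex"] == "FEMALE" and rng.random() < fertility.get(p["age"], 0.0):
            # The newborn gets a fresh ID and joins the mother's benunit and household.
            newborns.append({
                "person_id": next_id,
                "age": 0,
                "sex": rng.choice(["MALE", "FEMALE"]),
                "benunit_id": p["benunit_id"],
                "household_id": p["household_id"],
            })
            next_id += 1
        survivors.append({**p, "age": p["age"] + 1})  # copy, so the input is not mutated
    return survivors + newborns


people = [
    {"person_id": 1, "age": 80, "sex": "MALE", "benunit_id": 10, "household_id": 100},
    {"person_id": 2, "age": 30, "sex": "FEMALE", "benunit_id": 11, "household_id": 101},
]
# Degenerate rates make the outcome certain: the 80-year-old man dies, the woman gives birth.
result = age_one_year(people, {"MALE": {80: 1.0}}, {30: 1.0}, seed=0)
```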

Household-composition transitions

All in utils/household_transitions.py:

apply_marriages: pair unmarried adults within region, closest-age match, merge benunits (weights folded).
apply_separations: split two-adult benunits; children default to the mother.
apply_children_leaving_home: move adult dependents (both FRS shapes — dependent young adults on parents' benunit, and separate benunits sharing the parental household) into fresh benunit + household.
apply_migration: Poisson-distributed net inflow / outflow per age, inflow cloned from same-age donors, outflow removed cleanly.
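The closest-age, within-region pairing used for marriages can be illustrated with a greedy match. This is a rough sketch and not necessarily the algorithm in `apply_marriages`: `pair_closest_age` is a hypothetical helper, it assumes the rate-based draw of who marries has already happened, it pairs opposite-sex partners only (as the commit message describes), and it leaves the benunit merge and weight folding out.

```python
def pair_closest_age(singles):
    """Greedily pair each single adult with the closest-age, opposite-sex,
    same-region partner who is still unmatched. Returns (id_a, id_b) tuples."""
    ordered = sorted(singles, key=lambda p: (p["age"], p["person_id"]))
    matched = set()
    pairs = []
    for p in ordered:
        if p["person_id"] in matched:
            continue
        candidates = [
            q for q in ordered
            if q["person_id"] not in matched
            and q["person_id"] != p["person_id"]
            and q["region"] == p["region"]
            and q["sex"] != p["sex"]
        ]
        if not candidates:
            continue  # nobody eligible in this region: the person stays single
        partner = min(
            candidates,
            key=lambda q: (abs(q["age"] - p["age"]), q["person_id"]),
        )
        matched.update({p["person_id"], partner["person_id"]})
        pairs.append((p["person_id"], partner["person_id"]))
    return pairs


singles = [
    {"person_id": 1, "age": 30, "sex": "MALE", "region": "LONDON"},
    {"person_id": 2, "age": 31, "sex": "FEMALE", "region": "LONDON"},
    {"person_id": 3, "age": 45, "sex": "FEMALE", "region": "LONDON"},
    {"person_id": 4, "age": 44, "sex": "MALE", "region": "WALES"},
]
pairs = pair_closest_age(singles)
```

Person 4 is closest in age to person 3 but lives in a different region, so neither is paired.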

Labour-market transitions

apply_employment_transitions(dataset, *, ukhls_rates=None, ...):
- Rule-based (default): retirement at SPA → CPI-plus wage drift → job loss / gain.
- UKHLS-driven (opt-in): classifies each working-age person into a four-state label and draws the next state from the empirical age-band × sex transition matrix estimated from UKHLS.
apply_income_decile_transitions: ranks workers into deciles within (age_band, sex), draws a destination decile from the UKHLS matrix, rescales income by the ratio of target/origin decile medians.
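The two UKHLS-driven mechanics, drawing the next labour-market state and rescaling income by a ratio of decile medians, can be sketched as follows. The flat `(age_band, sex, state_from)` key used here is an assumption made for brevity; the PR describes the loader output only as a nested dict, and `draw_next_state` and `rescale_income` are hypothetical names.

```python
import random


def draw_next_state(rates, age_band, sex, state, rng):
    """Draw the next state from the empirical row for this cell.
    A missing or suppressed row passes the current state through unchanged."""
    row = rates.get((age_band, sex, state))
    if not row:
        return state
    targets = list(row)
    return rng.choices(targets, weights=[row[t] for t in targets], k=1)[0]


def rescale_income(income, origin_median, target_median):
    """Move a worker between deciles while keeping their relative position:
    multiply income by target median / origin median."""
    if origin_median <= 0:
        return income
    return income * target_median / origin_median


rng = random.Random(0)
# Illustrative, made-up row with a certain outcome so the draw is predictable.
rates = {("25-29", "MALE", "IN_WORK"): {"IN_WORK": 1.0, "UNEMPLOYED": 0.0}}

stays = draw_next_state(rates, "25-29", "MALE", "IN_WORK", rng)
passthrough = draw_next_state(rates, "60-64", "FEMALE", "RETIRED", rng)
new_income = rescale_income(20_000.0, origin_median=18_000.0, target_median=27_000.0)
```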

Year-aware calibration plumbing

targets/build_loss_matrix.py: resolve_target_value(target, year, *, tolerance=3), constant YEAR_FALLBACK_TOLERANCE = 3.
Two real bug fixes picked up along the way:
- local_authorities/loss.py was reading household_weight at a hard-coded 2025.
- constituencies/loss.py was reading dataset.time_period instead of the passed time_period argument in two places.
  Both now honour the argument.
docs/targets_coverage.md: year coverage across every target category plus the remaining data-sourcing gaps.
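The fallback policy covered by the tests (exact match, nearest past year, tolerance limit, no backwards extrapolation) can be sketched on a plain year-to-value mapping. This is an assumption-laden illustration: the real `resolve_target_value` takes a target object, its behaviour on a miss is not stated in this PR (returning `None` here is a guess), and the VOA population scaling is omitted.

```python
YEAR_FALLBACK_TOLERANCE = 3


def resolve_value(values_by_year, year, *, tolerance=YEAR_FALLBACK_TOLERANCE):
    """Hypothetical helper: pick the value for `year`, falling back to the
    nearest earlier year within `tolerance` years. Later years are never used
    for an earlier request (no backwards extrapolation)."""
    if year in values_by_year:
        return values_by_year[year]
    past = [y for y in values_by_year if y < year and year - y <= tolerance]
    if not past:
        return None  # assumption: a miss is signalled with None
    return values_by_year[max(past)]


targets = {2022: 10.0, 2023: 11.0}
exact = resolve_value(targets, 2023)
nearest_past = resolve_value(targets, 2025)
too_far = resolve_value(targets, 2027)
before_data = resolve_value(targets, 2021)
```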

Cross-year smoothness

utils/calibrate.py: compute_log_weight_smoothness_penalty, plus two keyword-only arguments on calibrate_local_areas (prior_weights, smoothness_penalty). When both are supplied, a log-space L2 penalty pulls the current year's weights towards the prior year's. Off by default, in which case pre-step-5 behaviour is reproduced bit-for-bit.
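The penalty itself is small enough to sketch in pure Python: the mean squared difference between the current log-weights and the log of the prior weights, taken only over entries whose prior is positive. The real helper presumably operates on tensors so that it stays differentiable; the list-based `log_weight_smoothness_penalty` below is a hypothetical stand-in.

```python
import math


def log_weight_smoothness_penalty(log_weights, prior_weights):
    """Mean of (log w - log prior)^2 over entries with a positive prior.
    Zero-prior entries are excluded so they neither pull nor inflate the mean."""
    if len(log_weights) != len(prior_weights):
        raise ValueError(
            f"shape mismatch: {len(log_weights)} log-weights "
            f"vs {len(prior_weights)} prior weights"
        )
    squared = [
        (lw - math.log(pw)) ** 2
        for lw, pw in zip(log_weights, prior_weights)
        if pw > 0
    ]
    return sum(squared) / len(squared) if squared else 0.0


prior = [500.0, 50.0, 0.0]

# Weights equal to the prior: no penalty, and the zero-prior entry is ignored.
unchanged = log_weight_smoothness_penalty([math.log(500.0), math.log(50.0), 3.0], prior)

# Doubling and halving a weight are penalised equally (symmetry in log space).
doubled = log_weight_smoothness_penalty([math.log(1000.0)], [500.0])
halved = log_weight_smoothness_penalty([math.log(250.0)], [500.0])
```

Working in log space means a household moving from 500 to 50 is treated as the same size of jump as one moving from 50 to 5, which matches the motivating example in the commit message.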

Top-level composer

utils/advance_year.py: advance_year(dataset, *, target_year, seed, ...). Chains migration → separation → leaving-home → marriage → employment transitions → income-decile transitions → mortality → fertility → age increment → uprating. One seeded RNG so the whole year is reproducible.
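The composition pattern, pure steps chained under one seeded RNG with a per-year seed of `seed + (year - base_year)`, can be sketched with toy steps. Everything here is illustrative: the dataset is a list of ages, `advance` and `roll_forward` are hypothetical names, the two steps stand in for the real transitions, and the years are assumed to be consecutive.

```python
import random


def per_year_seed(seed, year, base_year):
    """Per-year seed as described in the PR: seed + (year - base_year)."""
    return seed + (year - base_year)


def advance(dataset, steps, seed):
    """Run every step in order under one seeded RNG; each step returns a new dataset."""
    rng = random.Random(seed)
    for step in steps:
        dataset = step(dataset, rng)
    return dataset


def roll_forward(base, base_year, years, steps, seed):
    """Chain one advance per year (years assumed consecutive), keeping each snapshot."""
    snapshots = {}
    current = base
    for year in sorted(years):
        current = advance(current, steps, per_year_seed(seed, year, base_year))
        snapshots[year] = current
    return snapshots


# Toy stand-ins for the real transitions: one stochastic, one deterministic.
def jitter(ages, rng):
    return [a + rng.random() for a in ages]


def increment(ages, rng):
    return [a + 1 for a in ages]


base = [30.0, 40.0]
run_a = roll_forward(base, 2023, [2024, 2025], [jitter, increment], seed=0)
run_b = roll_forward(base, 2023, [2024, 2025], [jitter, increment], seed=0)
```

Because each step returns a fresh object and the RNG is rebuilt from a scalar per year, two runs with the same seed produce identical snapshots and the base is left untouched.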

Real-data backing

All seven rate sources are either public ONS data (committed alongside the existing ONS xlsx fixtures in storage/) or aggregated UKHLS derivatives (cell-suppression ≥ 10 applied, raw microdata never committed).

| Transition | Source | Location on branch |
| --- | --- | --- |
| Mortality | ONS National Life Tables UK (1980–2024, age × sex × qx) | `targets/sources/ons_mortality.py` + `storage/ons_national_life_tables.xlsx` |
| Fertility | ONS Births in England and Wales: registrations Table 10 | `targets/sources/ons_fertility.py` + `storage/ons_asfr.xlsx` |
| Marriage | ONS Marriage Statistics England & Wales | `utils/household_transitions.py` |
| Separation | ONS Divorce Statistics England & Wales | `utils/household_transitions.py` |
| Children leaving home | ONS LFS Young adults living with parents | `utils/household_transitions.py` |
| Migration | ONS Long-Term International Migration | `utils/household_transitions.py` |
| Within-person income / employment | UKHLS Waves 1–15 (SN 6614), 601,795 person-wave obs | `datasets/ukhls.py` + `utils/ukhls_transitions.py` + committed aggregated CSVs |

UKHLS pipeline detail

datasets/ukhls.py reads the UKDS-licensed .dta files from storage/ukhls/ (git-ignored; raw microdata never leaves the machine).
utils/ukhls_transitions.py pairs consecutive waves on pidp (cross-wave person key) and estimates:
- P(state_{t+1} | state_t, age_band, sex) — four-state (IN_WORK / UNEMPLOYED / RETIRED / INACTIVE)
- P(decile_{t+1} | decile_t, age_band, sex) — within-wave income deciles
Cells with fewer than 10 observations are dropped (ONS Safe Setting convention). Probabilities re-normalised within surviving (age_band, sex, state_from) row groups.
Aggregated outputs committed to storage/:
- ukhls_employment_state_transitions.csv (≈16 KB)
- ukhls_income_decile_transitions.csv (≈94 KB)
Spot-check validation vs published statistics:
- MALE 25–29 IN_WORK → IN_WORK = 95.6 % (matches ONS LFS 2-quarter flow rate).
- MALE 25–29 UNEMPLOYED → IN_WORK = 36.5 % (consistent with ONS published ~35–40 %).
- Avg probability of staying in the same income decile year-on-year = 39.9 % (consistent with IFS Living Standards mobility estimates).
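The suppress-then-renormalise step described above can be sketched on raw transition counts. This is an illustration only: the flat `(age_band, sex, state_from, state_to)` key and the name `transition_probabilities` are assumptions, and the counts are made up.

```python
from collections import defaultdict

MIN_CELL = 10  # cells with fewer observations than this are dropped


def transition_probabilities(counts, min_cell=MIN_CELL):
    """counts maps (age_band, sex, state_from, state_to) to an observation count.
    Low-count cells are suppressed first; the survivors are then renormalised
    so each (age_band, sex, state_from) group sums to 1."""
    kept = {key: n for key, n in counts.items() if n >= min_cell}
    totals = defaultdict(int)
    for (age_band, sex, state_from, _), n in kept.items():
        totals[(age_band, sex, state_from)] += n
    return {key: n / totals[key[:3]] for key, n in kept.items()}


# Made-up counts for one row group; the INACTIVE cell is below the threshold.
counts = {
    ("25-29", "MALE", "IN_WORK", "IN_WORK"): 90,
    ("25-29", "MALE", "IN_WORK", "UNEMPLOYED"): 10,
    ("25-29", "MALE", "IN_WORK", "INACTIVE"): 4,
}
probs = transition_probabilities(counts)
```

Renormalising after suppression keeps every surviving row a valid probability distribution, at the cost of slightly overstating the retained transitions.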

Data-protection safeguards

**/*.dta and storage/ukhls/ are gitignored (both directory and symlink forms).
Only aggregated, cell-suppressed transition tables are committed.
storage/upload_yearly_snapshots.py locks the private HF repo + GCS bucket constants at function level; redirecting the upload requires a reviewable code edit.

What's not in this PR (explicit non-goals)

create_datasets.py:main() refactor to invoke create_yearly_snapshots automatically — kept as a follow-up so the change-of-default conversation happens in its own PR.
Per-year calibration against year-specific targets — plumbing is here (resolve_target_value, smoothness penalty), but the full year-loop wiring belongs in a separate step.
Smoothness-penalty coefficient tuning — empirical question best answered against full loss matrices, deferred.
policyengine-uk changes — runtime-uprating skip flags, fixture migration — documented in docs/panel_downstream.md, left to the sibling repo.
Same-sex marriage, cohabitation, second-marriage modelling — deferred.
Admin-linked data (DWP Safe Access UKHLS variants) — require Secure Access, not a practical next step.
End-to-end smoke test on the real enhanced_frs_2023_24.h5 — 176 in-tree tests cover every unit and a composed advance_year run, but the full-panel dry-run through create_yearly_snapshots is a follow-up I'll attach before moving this out of Draft.

Test plan

176 new tests across the stack, all green locally; no real FRS or UKHLS microdata touched in CI (every test uses in-memory synthetic fixtures or committed aggregate CSVs).

test_advance_year.py                          7
test_age_dataset_sex_specific_mortality.py    5
test_apply_children_leaving_home.py           8
test_apply_employment_transitions.py         14
test_apply_income_decile_transitions.py       7
test_apply_marriages.py                       9
test_apply_migration.py                       9
test_apply_separations.py                     8
test_calibrate_smoothness_integration.py      5
test_conftest_fixtures.py                     7
test_demographic_ageing.py                   23
test_ons_fertility.py                         9
test_ons_mortality.py                         7
test_panel_ids.py                            10
test_resolve_target_value.py                 12
test_smoothness_penalty.py                   10
test_ukhls_transitions.py                     9
test_upload_yearly_snapshots.py               7
test_yearly_snapshots.py                     10

Notable coverage

Upload safety: the private HF repo, HF repo type and GCS bucket are locked by test; the function signature is asserted to only allow years and storage_folder as kwargs.
No-partial-upload invariant: a missing file aborts the upload call before any network traffic.
Smoothness regulariser: gradient masking on zero-prior entries, log-space symmetry, large-penalty integration test that proves pull-towards-prior.
Panel-ID contract: survivors, deaths, and births classified correctly across a single age_dataset run.
Disclosure control: UKHLS transition tests build a tiny cohort and confirm every low-count cell is suppressed before the CSV is written.
UKHLS integration: hermetic synthetic panel recovers planted transition probabilities within Monte-Carlo noise, plus a @pytest.mark.skipif path that exercises the committed aggregate CSVs when present.
Determinism: every stochastic transition has a same-seed → same-output test.
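The no-partial-upload invariant amounts to checking every file before sending any. A minimal sketch, assuming the `enhanced_frs_<year>.h5` naming from the PR: `snapshot_paths`, `upload_snapshots` and the injected `upload_one` callable are hypothetical and exist only to make the behaviour observable here. The real uploader deliberately takes no destination argument.

```python
import tempfile
from pathlib import Path


def snapshot_paths(storage_folder, years):
    """Build the expected snapshot path for each year; an empty list is rejected."""
    years = list(years)
    if not years:
        raise ValueError("at least one year is required")
    return [Path(storage_folder) / f"enhanced_frs_{year}.h5" for year in years]


def upload_snapshots(storage_folder, years, upload_one):
    """Check that every file exists before uploading any of them."""
    paths = snapshot_paths(storage_folder, years)
    missing = [p for p in paths if not p.exists()]
    if missing:
        raise FileNotFoundError(f"missing snapshots: {[p.name for p in missing]}")
    for path in paths:
        upload_one(path)  # only reached once the whole batch is known to be complete


# Demo: 2024 exists, 2025 does not, so nothing at all is uploaded.
with tempfile.TemporaryDirectory() as folder:
    (Path(folder) / "enhanced_frs_2024.h5").touch()
    uploaded = []
    try:
        upload_snapshots(folder, [2024, 2025], uploaded.append)
        aborted = False
    except FileNotFoundError:
        aborted = True
```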

Running a one-year panel roll

from policyengine_uk_data.utils.advance_year import advance_year
from policyengine_uk_data.utils.ukhls_transitions import (
    load_employment_transitions, load_income_decile_transitions,
)
from policyengine_uk_data.targets.sources.ons_mortality import get_mortality_rates
from policyengine_uk_data.targets.sources.ons_fertility import get_fertility_rates

# `base` is an already-imputed base dataset loaded by the caller (loading not shown).
out = advance_year(
    base,
    target_year=2024,
    seed=0,
    mortality_rates=get_mortality_rates(2024),
    fertility_rates=get_fertility_rates(2024),
    ukhls_employment_rates=load_employment_transitions(),
    ukhls_decile_rates=load_income_decile_transitions(),
)

Tracks and closes: #345.

First step towards the per-year panel pipeline described in #345: document that household_id, benunit_id and person_id are the panel keys that must be preserved across yearly snapshots, and add a reusable `assert_panel_id_consistency` utility so future year-loop code can enforce the invariant at save time and in tests. No behaviour change to the current single-year pipeline. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Unblocks the Lint check on #346. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a standalone helper that takes an already-imputed base dataset and produces one `enhanced_frs_<year>.h5` file per requested year by calling `uprate_dataset` and saving. Every snapshot is verified against the base with `assert_panel_id_consistency` at save time, so any future step that mutates the person/benunit/household tables (e.g. demographic ageing in step 3) cannot silently break the panel key contract.

Deliberately out of scope for this PR, tracked in #345:
- per-year calibration (needs year-specific targets, step 4)
- demographic ageing (step 3)
- restructuring `create_datasets.py:main()` to call this helper

The existing single-year pipeline is untouched; callers opt in to panel output by invoking `create_yearly_snapshots` directly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Introduces `age_dataset(base, years, *, seed, mortality_rates, fertility_rates)`, the minimum-viable demographic ageing described in the plan. Per year step:
- every surviving person's `age` column is incremented,
- persons sampled as dying are removed,
- new babies are appended with fresh, non-colliding `person_id` values and attached to the mother's existing benefit unit and household.

Deterministic via the `seed` argument. Placeholder mortality and fertility tables ship with the module so it runs end-to-end in tests. They are explicitly named `_PLACEHOLDER` and are due to be replaced by real ONS life tables and fertility rates in a follow-up.

Also extends `utils/panel_ids` with `classify_panel_ids(base, other)` and a `PanelIDTransition` dataclass so tests and diagnostics can describe the survivors / deaths / births move without tripping the strict `assert_panel_id_consistency` check (which remains the right tool for uprating-style transforms that must not change ID sets).

Out of scope, tracked in #345:
- real ONS life tables and fertility rates,
- marriage, separation and leaving-home dynamics,
- migration,
- integration into `create_yearly_snapshots`: callers chain `age_dataset` and `uprate_dataset` themselves for now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fixes two concrete bugs that would have prevented calibrating the same base dataset at a year other than its stored `time_period`:
- `local_authorities/loss.py` read household weights at a hard-coded 2025 when computing the national-total fallbacks used for LAs missing ONS data. Now uses the explicit `time_period` argument.
- `constituencies/loss.py` passed `dataset.time_period` to `get_national_income_projections` and `sim.default_calculation_period` even when the caller supplied a different `time_period`. Same fix.

Also extracts the year-resolution logic from `build_loss_matrix._resolve_value` into a documented public function `resolve_target_value`, names the three-year tolerance as a constant, and adds 12 unit tests covering the fallback policy (exact match, nearest past year, tolerance limit, no backwards extrapolation, VOA population scaling).

Ships `docs/targets_coverage.md` documenting year coverage across every target category and where the real gaps are (DWP 2026+, local-area CSV refreshes). No new data sourced in this PR; sourcing is deferred. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds an opt-in log-space L2 penalty to the training loss in `calibrate_local_areas` that pulls the optimised weights towards a prior year's weights. This is the regulariser that makes a sequence of per-year calibrations statistically coherent as a panel: without it, the same household can represent, say, 500 units in 2024 and 50 in 2025.

Design choices:
- The penalty is factored out into a pure helper `compute_log_weight_smoothness_penalty(log_weights, prior_weights)` so it can be unit-tested thoroughly. Entries where the prior is zero (households outside an area's country) are excluded from the mean so they neither pull nor inflate the penalty.
- `calibrate_local_areas` gains two keyword-only kwargs, `prior_weights` and `smoothness_penalty`, both defaulting to values that reproduce the pre-step-5 training loop exactly.
- Shape mismatches raise a clear `ValueError` rather than failing deep inside the optimiser.
- The penalty is computed from the underlying log-space weights (not the dropout-augmented tensor fed into the fit loss) so the regulariser does not double-count the dropout noise.

Tests (15 new, all in two files):
- 10 unit tests on the helper covering zero-when-equal, quadratic scaling, masking of zero-prior entries, gradient masking, shape validation, symmetric log deviation, differentiability, dtype round-trip and a hand-computed heterogeneous case.
- 5 integration tests on `calibrate_local_areas` with a three-household fake dataset: default kwargs reproduce pre-step-5 behaviour, shape mismatch raises, `None` prior + penalty is a no-op, zero penalty + prior is a no-op, and a large penalty measurably pulls weights towards the prior versus a no-smoothness run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Final step of the plan in #345: the consumer-facing plumbing that lets something actually use per-year panel snapshots.
- `policyengine_uk_data/storage/upload_yearly_snapshots.py`: a deliberately parallel uploader to `upload_completed_datasets.py`. Pointed at the same private HuggingFace repo and GCS bucket; the destination constants are not exposed as function arguments, so redirecting the upload requires a code edit reviewed under CLAUDE.md's data-protection rules. The existing `upload_completed_datasets.py` is untouched.
- `policyengine_uk_data/tests/conftest.py`: adds `enhanced_frs_for_year` factory fixture. Resolves `enhanced_frs_<year>.h5` and falls back to the legacy `enhanced_frs_2023_24.h5` for the 2023 base year so existing tests keep passing without modification. Skips (rather than errors) if the requested year's file is missing.
- `docs/panel_downstream.md`: coordination note for sibling `policyengine-uk` repo: runtime-uprating skip options, fixture migration pattern, sensible default year set.

Tests (14 new):
- 7 on the uploader: pure path construction, iterable acceptance, empty-list rejection, missing-file rejection with no partial upload, upload-arguments lock to the private destination, destination constants locked to private repo, function signature does not allow redirect via kwargs.
- 7 on the fixture factory: resolves `enhanced_frs_<year>.h5`, skips cleanly when year is missing, falls back to legacy filename for 2023, prefers the new filename when both exist, accepts int and str years, existing `enhanced_frs` fixture still points at legacy name, `STORAGE_FOLDER` export is not accidentally shadowed.

Out of scope, flagged in `docs/panel_downstream.md`:
- Modifying `policyengine-uk` itself (separate repo).
- Changing the audited upload destinations.
- Actual decision on skip-vs-always-uprate at simulation time: the doc presents the two options and the tradeoffs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CI's pandas returned the panel gender column as StringDtype rather than the plain object dtype I saw locally. `numpy.ndarray.astype` only accepts numpy-compatible dtypes, so passing `StringDtype` on line 186 raised TypeError: Cannot interpret '<StringDtype(...)>' as a data type. Fix: use object arrays when building newborn rows and let pandas coerce them back to the template's extension dtype during `pd.concat`. Same pattern applied to other non-numeric template columns in `_build_newborn_rows`. Adds a regression test that explicitly casts `gender` to pandas StringDtype before calling `age_dataset`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds `targets/sources/ons_mortality.py`, which parses the UK National Life Tables workbook (~600 KB, shipped alongside the other ONS xlsx fixtures in `policyengine_uk_data/storage/`) into a long-format frame keyed by period / sex / age / qx. Exposes `get_mortality_rates(year)` returning `{sex: {age: qx}}` for the 3-year rolling period covering a calendar year (with nearest-past fallback), plus a unisex helper.

Extends `age_dataset` in `utils/demographic_ageing.py` to accept the sex-specific mapping shape in addition to the existing age-only mapping. Detection is by key type, so the existing placeholder rates and every current test continue to work unchanged. Placeholder mortality rates are kept as the fallback default, but the docstring now points callers at `get_mortality_rates` for real data.

Test coverage: 7 loader tests against a synthetic in-tree workbook (period resolution, nearest-past fallback, unisex averaging, non-period sheet filtering) plus 5 age_dataset integration tests (backwards-compat, sex-specific kill/spare behaviour, missing-sex fallback, missing-age default, real-rate shape sanity on a toy pop). All 35 tests in the affected modules pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds `targets/sources/ons_fertility.py`, which parses Table 10 of the ONS *Births in England and Wales: registrations* workbook (~840 KB, shipped alongside the other ONS xlsx fixtures in `policyengine_uk_data/storage/`). Exposes `load_ons_fertility_rates()` returning a long-format frame keyed by year / country / age_low / age_high / rate_per_1000, and `get_fertility_rates(year, country=...)` returning a single-year `{age: probability}` map that plugs straight into `age_dataset(..., fertility_rates=...)`.

Handles the ONS band format:
- "Under 20" → ages 15-19 (conventional start of the fertility window).
- "20 to 24" ... "35 to 39" → ages 20-24 ... 35-39, uniform within band.
- "40 and over" → ages 40-44 only (5-year cap). Expanding an open band uniformly across the whole fertility window would otherwise overstate ASFR at ages 45+ by an order of magnitude, since the overwhelming majority of 40+ births happen at 40-44.
- Rates converted from births-per-1,000 to per-woman-per-year probability.

Year resolution: exact match preferred, nearest past year as fallback (mirrors the mortality loader). Future years silently fall back to the latest available; pre-1938 requests raise a clear KeyError.

Test coverage: 9 new loader/integration tests against a synthetic in-tree workbook (year fallback, open-band cap, under-20 lower bound, country filter, probability scaling, end-to-end age_dataset integration). Zero network access in CI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
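The band handling in this commit can be sketched as a small function that expands one ONS age band into single-year probabilities. `expand_band` is a hypothetical helper written from the rules listed in the commit message, not the parser in `targets/sources/ons_fertility.py`.

```python
def expand_band(label, rate_per_1000):
    """Expand one ONS age band into {age: per-woman-per-year probability}."""
    if label == "Under 20":
        low, high = 15, 19   # conventional start of the fertility window
    elif label == "40 and over":
        low, high = 40, 44   # open band capped at five years
    else:
        low_text, high_text = label.split(" to ")
        low, high = int(low_text), int(high_text)
    probability = rate_per_1000 / 1000.0  # births per 1,000 women to a probability
    return {age: probability for age in range(low, high + 1)}


# Illustrative rates only; these are not figures from the ONS workbook.
under_20 = expand_band("Under 20", 10.0)
mid_band = expand_band("20 to 24", 50.0)
open_band = expand_band("40 and over", 15.0)
```

Capping the open band means ages 45 and over receive no entry at all, so a lookup with a default of zero leaves them with no births.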

…345)

Introduces `utils/household_transitions.py` with five life-cycle transitions that complement the existing mortality + fertility mechanics:
- `apply_marriages`: pair single adults with an in-region opposite-sex partner (closest-age match), merge benunits and fold weights. Uses age × sex rates (ONS Marriage Statistics smoothed averages).
- `apply_separations`: split two-adult benunits with the couple's mean-age rate (ONS Divorce Statistics). Children attach to the mother by default; overridable. New benunit + household rows are minted for the mover and regions are preserved.
- `apply_children_leaving_home`: move adult dependents out of their parents' benunit + household. Handles both FRS shapes (dependent young adult on parents' benunit, or adult child with their own single benunit inside the parental household). Uses age-indexed rates (ONS LFS "Young adults living with parents").
- `apply_migration`: Poisson-distributed net inflow/outflow by age (ONS Long-Term International Migration estimates). Immigration clones donor rows at the same age; emigration randomly removes rows and cleans up orphaned benunit/household rows.
- `apply_employment_transitions`: rule-based placeholder for within-person labour-market moves: retirement at state-pension age, CPI-plus wage drift, and configurable job loss/gain rates with nearest-age income donor for gainers. Will be replaced with UKHLS-estimated rates in a follow-up.

All functions are pure (no mutation), deterministic given an explicit RNG, and use only columns and ID shapes already present in the FRS / pe-uk schema. `is_married`, `is_single`, `is_couple` etc. pick up the changes automatically because they are derived from adult counts in benunits.

44 new tests across the five modules cover: rate-zero is a no-op, rate-one produces maximal transitions, derived boolean flags flip correctly, benunit/household rows stay consistent, deterministic under a fixed seed, and the default rates produce sensible aggregates.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds *.dta to the gitignore pattern and explicitly ignores policyengine_uk_data/storage/ukhls (both directory and symlink forms). UKHLS is UKDS-licensed under the same EUL regime as the FRS; per CLAUDE.md, individual-level records must never enter the repo. Only aggregate, non-disclosive outputs derived from this data may be tracked.

Adds the UKHLS side of the panel pipeline, the last piece needed for real within-person year-on-year dynamics, complementing the ONS mortality / fertility / marriage / separation / leaving-home / migration rates already on the branch. Three pieces:
- `datasets/ukhls.py`: loader for the UKDS-licensed Main Survey indresp files (Waves 1-15, study SN 6614). Normalises wave-letter prefixes away, harmonises the rich jbstat enum into a four-state labour market label (IN_WORK / UNEMPLOYED / RETIRED / INACTIVE). The raw .dta files live under `storage/ukhls/` and are git-ignored along with all .dta patterns; only this loader sees individual rows.
- `utils/ukhls_transitions.py`: estimator that pairs consecutive waves on `pidp` (the cross-wave person key) and computes P(state_t+1 | state_t, age_band, sex) for employment states and P(decile_t+1 | decile_t, age_band, sex) for income deciles. Cells with fewer than 10 observations are suppressed to meet ONS Safe Setting disclosure-control conventions; probabilities are renormalised post-suppression so each (age_band, sex, state_from) group still sums to 1.
- Aggregated outputs, committed to `storage/`:
  - `ukhls_employment_state_transitions.csv` (15.6 KB)
  - `ukhls_income_decile_transitions.csv` (94 KB)

  Both are safe to ship. No individual-level information is included; the smallest surviving cell size is 10.

Sanity check against published statistics (15-wave run, 601 K person-wave rows, 38 K+ panel retention per wave transition):
- MALE 25-29 IN_WORK → 95.6% stay employed, 2.7% unemployed, 1.7% inactive, within noise of ONS LFS 2-quarter flow rates.
- Average probability of staying in the same income decile year-on-year: 39.9%, consistent with IFS Living Standards publications.

Test coverage: 9 hermetic tests using a synthetic panel (no real microdata touched in CI). Probability-sum invariants, planted-rate recovery, disclosure-control suppression, decile round-trip, and optional tests against the committed aggregate CSVs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…refs #345)

Lands steps 1-4 of the remaining panel work:
- **UKHLS rates into `apply_employment_transitions`** (step 2): new `ukhls_rates=` kwarg accepts the nested dict produced by `utils.ukhls_transitions.load_employment_transitions`. When supplied, each working-age person is reclassified into the four-state labour market label (IN_WORK / UNEMPLOYED / RETIRED / INACTIVE), their next state is drawn from the empirical age-band × sex transition matrix, and income / employment_status columns are updated accordingly. Retirement at SPA and wage drift still run in addition. The rule-based job_loss/job_gain path is bypassed, so pass both rates even if you set them to zero elsewhere.
- **`apply_income_decile_transitions`** (step 3): new function in `utils/household_transitions.py` that, given the UKHLS decile transition table, ranks workers into deciles within each (age_band, sex) cell, draws a destination decile, and rescales their income by the ratio of target / origin decile medians. Preserves relative within-decile position; missing or suppressed-cell tuples pass through unchanged.
- **`advance_year` composer** (step 1): new `utils/advance_year.py` chains migration → separation → leaving-home → marriage → employment transitions → decile transitions → mortality / fertility / age increment → uprating under a single seeded RNG. Ordering is load-bearing and documented inline; every step returns a fresh dataset so nothing mutates the caller's input.
- **`create_yearly_snapshots` mode kwarg** (step 4): adds `mode="uprate_only"` (default, unchanged behaviour) and `mode="full_panel"` which rolls forward year-by-year via `advance_year`. Panel IDs evolve under full_panel mode (deaths, births, migration), so the byte-for-byte ID assertion only runs in uprate-only mode. Per-year seed is derived as `seed + (year - base_year)` so runs are reproducible from a single scalar.

Test coverage: 38 tests across the four areas.
- 14 total in apply_employment_transitions (4 new for UKHLS rates).
- 7 in apply_income_decile_transitions.
- 7 in advance_year (mutation-free, age increment, determinism, all-disabled smoke test, UKHLS-rates wiring, uprating, default year).
- 10 in yearly_snapshots (3 new for full_panel mode: writes each year, year-over-year ageing, unknown-mode error).

135-test suite across the whole panel pipeline stays green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Conflict was in `policyengine_uk_data/utils/calibrate.py`: main added `load_weights` (#351 defensive h5-weight loader) at the same file position where this branch added `compute_log_weight_smoothness_penalty` (#346 step 5). Both functions are independent and both stay. All 150 tests in the panel-pipeline suite + the calibrate smoothness tests pass after the merge, including `load_weights` consumers pulled in from main (test_calibrate_save, test_la_land_value_targets).

vahid-ahmadi · 2026-04-21T11:15:50Z

Follow-up PRs (explicitly not this one)

Wire create_yearly_snapshots(mode="full_panel") into create_datasets.py:main() — the default build starts producing enhanced_frs_<year>.h5 for every year in a declared range. Kept separate per the PR body so the change-of-default conversation happens in its own PR.
Per-year calibration loop with smoothness penalty on — run calibrate_local_areas on each produced snapshot, passing the prior year's final weights as prior_weights and a tuned smoothness_penalty coefficient. Plumbing is here; the orchestration is not.
Tune the smoothness-penalty coefficient — small sweep on real loss matrices (the empirical piece the PR body explicitly defers).
Validation harness — script that compares panel-rolled output year-on-year aggregates against DWP Income Dynamics published transition tables + IFS income mobility figures. Signals when any transition rate is drifting from published statistics.
policyengine-uk sibling-repo changes — runtime-uprating skip options and fixture migration, per docs/panel_downstream.md.
LFS 5-quarter rotation panel as a cross-check source — UKHLS covers the long horizon; LFS gives a richer short-horizon view of employment-state transitions. Useful as an independent validator of the employment-state matrix in this PR.

vahid-ahmadi and others added 3 commits April 17, 2026 11:11

Apply ruff format to test_panel_ids.py

78e8376


vahid-ahmadi changed the title ~~Add panel ID contract and consistency utility (step 1 of #345)~~ Panel pipeline: ID contract + yearly snapshot helper (steps 1-2 of #345) Apr 17, 2026

vahid-ahmadi changed the title ~~Panel pipeline: ID contract + yearly snapshot helper (steps 1-2 of #345)~~ Panel pipeline: ID contract, yearly snapshots, demographic ageing (steps 1-3 of #345) Apr 17, 2026

vahid-ahmadi changed the title ~~Panel pipeline: ID contract, yearly snapshots, demographic ageing (steps 1-3 of #345)~~ Panel pipeline: IDs, yearly snapshots, ageing, year-aware targets (steps 1-4 of #345) Apr 17, 2026

vahid-ahmadi changed the title ~~Panel pipeline: IDs, yearly snapshots, ageing, year-aware targets (steps 1-4 of #345)~~ Panel pipeline: IDs, snapshots, ageing, year-aware targets, smoothness (steps 1-5 of #345) Apr 17, 2026

vahid-ahmadi self-assigned this Apr 17, 2026

vahid-ahmadi changed the title ~~Panel pipeline: IDs, snapshots, ageing, year-aware targets, smoothness (steps 1-5 of #345)~~ Panel pipeline: full plumbing for per-year snapshots (closes #345) Apr 17, 2026

vahid-ahmadi requested a review from MaxGhenis April 17, 2026 12:45

vahid-ahmadi marked this pull request as draft April 17, 2026 15:32

vahid-ahmadi mentioned this pull request Apr 20, 2026

Real ONS mortality rates for demographic ageing (stacks on #346) #373

Closed

vahid-ahmadi and others added 7 commits April 20, 2026 12:43

vahid-ahmadi wants to merge 15 commits into main from feat/panel-persistent-ids-345
