Panel pipeline: full plumbing for per-year snapshots (closes #345) #346
vahid-ahmadi wants to merge 15 commits into main
First step towards the per-year panel pipeline described in #345: document that household_id, benunit_id and person_id are the panel keys that must be preserved across yearly snapshots, and add a reusable `assert_panel_id_consistency` utility so future year-loop code can enforce the invariant at save time and in tests. No behaviour change to the current single-year pipeline. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
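A minimal sketch of the invariant this commit describes, assuming each snapshot can be viewed as a mapping from ID column name to a sequence of IDs. The function name `check_panel_ids_match` and the dict-based layout are hypothetical stand-ins for illustration; they are not the signature of the real `assert_panel_id_consistency`.

```python
from typing import Mapping, Sequence

# Column names come from the description above; the rest is illustrative.
PANEL_ID_COLUMNS = ("household_id", "benunit_id", "person_id")


def check_panel_ids_match(
    base: Mapping[str, Sequence[int]],
    other: Mapping[str, Sequence[int]],
) -> None:
    """Raise AssertionError if any panel key set differs between snapshots."""
    for column in PANEL_ID_COLUMNS:
        base_ids = set(base[column])
        other_ids = set(other[column])
        if base_ids != other_ids:
            missing = sorted(base_ids - other_ids)
            added = sorted(other_ids - base_ids)
            raise AssertionError(
                f"{column} changed: missing={missing}, added={added}"
            )
```

Comparing ID sets rather than row order is a design assumption here: it tolerates re-sorted tables while still catching dropped or newly minted IDs.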
Unblocks the Lint check on #346. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a standalone helper that takes an already-imputed base dataset and produces one `enhanced_frs_<year>.h5` file per requested year by calling `uprate_dataset` and saving. Every snapshot is verified against the base with `assert_panel_id_consistency` at save time, so any future step that mutates the person/benunit/household tables (e.g. demographic ageing in step 3) cannot silently break the panel key contract.
Deliberately out of scope for this PR, tracked in #345:
- per-year calibration (needs year-specific targets, step 4)
- demographic ageing (step 3)
- restructuring `create_datasets.py:main()` to call this helper
The existing single-year pipeline is untouched; callers opt in to panel output by invoking `create_yearly_snapshots` directly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduces `age_dataset(base, years, *, seed, mortality_rates, fertility_rates)`, the minimum-viable demographic ageing described in the plan. Per year step:
- every surviving person's `age` column is incremented,
- persons sampled as dying are removed,
- new babies are appended with fresh, non-colliding `person_id` values and attached to the mother's existing benefit unit and household.
Deterministic via the `seed` argument. Placeholder mortality and fertility tables ship with the module so it runs end-to-end in tests; they are explicitly named `_PLACEHOLDER` and are due to be replaced by real ONS life tables and fertility rates in a follow-up.
Also extends `utils/panel_ids` with `classify_panel_ids(base, other)` and a `PanelIDTransition` dataclass so tests and diagnostics can describe the survivors / deaths / births move without tripping the strict `assert_panel_id_consistency` check (which remains the right tool for uprating-style transforms that must not change ID sets).
Out of scope, tracked in #345:
- real ONS life tables and fertility rates,
- marriage, separation and leaving-home dynamics,
- migration,
- integration into `create_yearly_snapshots`: callers chain `age_dataset` and `uprate_dataset` themselves for now.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
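The per-year step can be sketched as follows. This is an illustration under stated assumptions, not the module's code: people are plain dicts rather than dataset tables, the function name `age_one_year` is hypothetical, fertility is drawn on the post-increment age, and newborn sex is a coin flip.

```python
import random
from typing import Dict, List


def age_one_year(
    people: List[dict],
    mortality: Dict[int, float],
    fertility: Dict[int, float],
    rng: random.Random,
) -> List[dict]:
    """Return a new list of people after one year of deaths, ageing and births."""
    survivors = []
    for person in people:
        if rng.random() < mortality.get(person["age"], 0.0):
            continue  # sampled as dying: dropped from the next snapshot
        survivors.append({**person, "age": person["age"] + 1})

    # Fresh IDs start above every ID ever seen, including the deceased,
    # so a newborn can never reuse a dead person's panel key.
    next_id = max((p["person_id"] for p in people), default=0) + 1
    babies = []
    for mother in survivors:
        if mother["gender"] != "FEMALE":
            continue
        if rng.random() < fertility.get(mother["age"], 0.0):
            babies.append(
                {
                    "person_id": next_id,
                    "age": 0,
                    "gender": rng.choice(["MALE", "FEMALE"]),
                    "benunit_id": mother["benunit_id"],
                    "household_id": mother["household_id"],
                }
            )
            next_id += 1
    return survivors + babies
```

Passing an explicit `random.Random(seed)` keeps the draw sequence reproducible, which is the property the `seed` argument is described as providing.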
Fixes two concrete bugs that would have prevented calibrating the same base dataset at a year other than its stored `time_period`:
- `local_authorities/loss.py` read household weights at a hard-coded 2025 when computing the national-total fallbacks used for LAs missing ONS data. Now uses the explicit `time_period` argument.
- `constituencies/loss.py` passed `dataset.time_period` to `get_national_income_projections` and `sim.default_calculation_period` even when the caller supplied a different `time_period`. Same fix.
Also extracts the year-resolution logic from `build_loss_matrix._resolve_value` into a documented public function `resolve_target_value`, names the three-year tolerance as a constant, and adds 12 unit tests covering the fallback policy (exact match, nearest past year, tolerance limit, no backwards extrapolation, VOA population scaling).
Ships `docs/targets_coverage.md` documenting year coverage across every target category and where the real gaps are (DWP 2026+, local-area CSV refreshes). No new data sourced in this PR; sourcing is deferred.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
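The fallback policy named in the tests (exact match, nearest past year, tolerance limit, no backwards extrapolation) might look roughly like the sketch below. The helper name `resolve_year_value`, the plain `{year: value}` input, the inclusive tolerance bound and the `None` return for unresolved years are all assumptions made for illustration; the real `resolve_target_value` takes a target object and also handles VOA population scaling, which is omitted here.

```python
from typing import Dict, Optional

YEAR_FALLBACK_TOLERANCE = 3  # constant name and value as stated in the PR


def resolve_year_value(
    values_by_year: Dict[int, float],
    year: int,
    tolerance: int = YEAR_FALLBACK_TOLERANCE,
) -> Optional[float]:
    """Pick the value for `year`, falling back to the nearest past year."""
    if year in values_by_year:
        return values_by_year[year]  # exact match always wins
    # Only earlier years qualify: a later value is never projected backwards.
    candidates = [
        y for y in values_by_year if y < year and year - y <= tolerance
    ]
    if not candidates:
        return None  # assumption: unresolved targets are signalled with None
    return values_by_year[max(candidates)]
```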
Adds an opt-in log-space L2 penalty to the training loss in `calibrate_local_areas` that pulls the optimised weights towards a prior year's weights. This is the regulariser that makes a sequence of per-year calibrations statistically coherent as a panel: without it, the same household can represent, say, 500 units in 2024 and 50 in 2025.
Design choices:
- The penalty is factored out into a pure helper `compute_log_weight_smoothness_penalty(log_weights, prior_weights)` so it can be unit-tested thoroughly. Entries where the prior is zero (households outside an area's country) are excluded from the mean so they neither pull nor inflate the penalty.
- `calibrate_local_areas` gains two keyword-only kwargs, `prior_weights` and `smoothness_penalty`, both defaulting to values that reproduce the pre-step-5 training loop exactly.
- Shape mismatches raise a clear `ValueError` rather than failing deep inside the optimiser.
- The penalty is computed from the underlying log-space weights (not the dropout-augmented tensor fed into the fit loss) so the regulariser does not double-count the dropout noise.
Tests (15 new, all in two files):
- 10 unit tests on the helper covering zero-when-equal, quadratic scaling, masking of zero-prior entries, gradient masking, shape validation, symmetric log deviation, differentiability, dtype round-trip and a hand-computed heterogeneous case.
- 5 integration tests on `calibrate_local_areas` with a three-household fake dataset: default kwargs reproduce pre-step-5 behaviour, shape mismatch raises, `None` prior + penalty is a no-op, zero penalty + prior is a no-op, and a large penalty measurably pulls weights towards the prior versus a no-smoothness run.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
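The shape of the penalty can be illustrated with a pure-Python sketch. It assumes a flat list of weights and plain floats, whereas the real helper operates on tensors inside the training loop; the function name below is hypothetical, and the mean-of-squared-log-differences form is inferred from the description (log-space L2, zero-prior entries excluded from the mean).

```python
import math
from typing import Sequence


def log_weight_smoothness_penalty(
    log_weights: Sequence[float],
    prior_weights: Sequence[float],
) -> float:
    """Mean squared gap between current log-weights and log(prior weights)."""
    if len(log_weights) != len(prior_weights):
        raise ValueError(
            f"shape mismatch: {len(log_weights)} log-weights "
            f"vs {len(prior_weights)} prior weights"
        )
    squared_gaps = [
        (log_w - math.log(prior)) ** 2
        for log_w, prior in zip(log_weights, prior_weights)
        if prior > 0  # zero priors are masked out of the mean entirely
    ]
    if not squared_gaps:
        return 0.0  # assumption: nothing to compare means no penalty
    return sum(squared_gaps) / len(squared_gaps)
```

Working in log space makes the penalty symmetric in ratios: doubling a weight costs the same as halving it, which matches the "symmetric log deviation" test named above.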
Final step of the plan in #345: the consumer-facing plumbing that lets something actually use per-year panel snapshots.
- `policyengine_uk_data/storage/upload_yearly_snapshots.py`: a deliberately parallel uploader to `upload_completed_datasets.py`. Pointed at the same private HuggingFace repo and GCS bucket; the destination constants are not exposed as function arguments, so redirecting the upload requires a code edit reviewed under CLAUDE.md's data-protection rules. The existing `upload_completed_datasets.py` is untouched.
- `policyengine_uk_data/tests/conftest.py`: adds `enhanced_frs_for_year` factory fixture. Resolves `enhanced_frs_<year>.h5` and falls back to the legacy `enhanced_frs_2023_24.h5` for the 2023 base year so existing tests keep passing without modification. Skips (rather than errors) if the requested year's file is missing.
- `docs/panel_downstream.md`: coordination note for sibling `policyengine-uk` repo: runtime-uprating skip options, fixture migration pattern, sensible default year set.
Tests (14 new):
- 7 on the uploader: pure path construction, iterable acceptance, empty-list rejection, missing-file rejection with no partial upload, upload-arguments lock to the private destination, destination constants locked to private repo, function signature does not allow redirect via kwargs.
- 7 on the fixture factory: resolves `enhanced_frs_<year>.h5`, skips cleanly when year is missing, falls back to legacy filename for 2023, prefers the new filename when both exist, accepts int and str years, existing `enhanced_frs` fixture still points at legacy name, `STORAGE_FOLDER` export is not accidentally shadowed.
Out of scope, flagged in `docs/panel_downstream.md`:
- Modifying `policyengine-uk` itself (separate repo).
- Changing the audited upload destinations.
- Actual decision on skip-vs-always-uprate at simulation time; the doc presents the two options and the tradeoffs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI's pandas returned the panel gender column as StringDtype rather than the plain object dtype I saw locally. `numpy.ndarray.astype` only accepts numpy-compatible dtypes, so passing `StringDtype` on line 186 raised TypeError: Cannot interpret '<StringDtype(...)>' as a data type. Fix: use object arrays when building newborn rows and let pandas coerce them back to the template's extension dtype during `pd.concat`. Same pattern applied to other non-numeric template columns in `_build_newborn_rows`. Adds a regression test that explicitly casts `gender` to pandas StringDtype before calling `age_dataset`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `targets/sources/ons_mortality.py`, which parses the UK National
Life Tables workbook (~600 KB, shipped alongside the other ONS xlsx
fixtures in `policyengine_uk_data/storage/`) into a long-format frame
keyed by period / sex / age / qx. Exposes `get_mortality_rates(year)`
returning `{sex: {age: qx}}` for the 3-year rolling period covering a
calendar year (with nearest-past fallback), plus a unisex helper.
Extends `age_dataset` in `utils/demographic_ageing.py` to accept the
sex-specific mapping shape in addition to the existing age-only
mapping. Detection is by key type, so the existing placeholder rates
and every current test continue to work unchanged.
Placeholder mortality rates are kept as the fallback default, but the
docstring now points callers at `get_mortality_rates` for real data.
Test coverage: 7 loader tests against a synthetic in-tree workbook
(period resolution, nearest-past fallback, unisex averaging, non-
period sheet filtering) plus 5 age_dataset integration tests
(backwards-compat, sex-specific kill/spare behaviour, missing-sex
fallback, missing-age default, real-rate shape sanity on a toy pop).
All 35 tests in the affected modules pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `targets/sources/ons_fertility.py`, which parses Table 10 of the
ONS *Births in England and Wales: registrations* workbook (~840 KB,
shipped alongside the other ONS xlsx fixtures in
`policyengine_uk_data/storage/`). Exposes
`load_ons_fertility_rates()` returning a long-format frame keyed by
year / country / age_low / age_high / rate_per_1000, and
`get_fertility_rates(year, country=...)` returning a single-year
`{age: probability}` map that plugs straight into
`age_dataset(..., fertility_rates=...)`.
Handles the ONS band format:
- "Under 20" → ages 15-19 (conventional start of the fertility window).
- "20 to 24" ... "35 to 39" → ages 20-24 ... 35-39, uniform within band.
- "40 and over" → ages 40-44 only (5-year cap). Expanding an open band
uniformly across the whole fertility window would otherwise overstate
ASFR at ages 45+ by an order of magnitude, since the overwhelming
majority of 40+ births happen at 40-44.
- Rates converted from births-per-1 000 to per-woman-per-year
probability.
Year resolution: exact match preferred, nearest past year as fallback
(mirrors the mortality loader). Future years silently fall back to the
latest available; pre-1938 requests raise a clear KeyError.
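The band handling listed above can be sketched as a small expansion function. The name `expand_fertility_band` is hypothetical and the label strings are taken from the bullets in this message; the loader's actual parsing of the workbook is not shown in this PR text.

```python
from typing import Dict


def expand_fertility_band(label: str, rate_per_1000: float) -> Dict[int, float]:
    """Expand one ONS age band into a per-age annual birth probability."""
    if label == "Under 20":
        low, high = 15, 19  # conventional start of the fertility window
    elif label == "40 and over":
        low, high = 40, 44  # open band capped at five years
    else:
        low_text, high_text = label.split(" to ")  # e.g. "25 to 29"
        low, high = int(low_text), int(high_text)
    probability = rate_per_1000 / 1000.0  # births per 1,000 women to per woman
    return {age: probability for age in range(low, high + 1)}
```

For example, a band of "30 to 34" at 100 births per 1,000 women yields a probability of 0.1 for each of the ages 30 through 34.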
Test coverage: 9 new loader/integration tests against a synthetic
in-tree workbook (year fallback, open-band cap, under-20 lower bound,
country filter, probability scaling, end-to-end age_dataset
integration). Zero network access in CI.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…345) Introduces `utils/household_transitions.py` with five life-cycle transitions that complement the existing mortality + fertility mechanics:
- `apply_marriages`: pair single adults with an in-region opposite-sex partner (closest-age match), merge benunits and fold weights. Uses age × sex rates (ONS Marriage Statistics smoothed averages).
- `apply_separations`: split two-adult benunits with the couple's mean-age rate (ONS Divorce Statistics). Children attach to the mother by default; overridable. New benunit + household rows are minted for the mover and regions are preserved.
- `apply_children_leaving_home`: move adult dependents out of their parents' benunit + household. Handles both FRS shapes (dependent young adult on parents' benunit, or adult child with their own single benunit inside the parental household). Uses age-indexed rates (ONS LFS "Young adults living with parents").
- `apply_migration`: Poisson-distributed net inflow/outflow by age (ONS Long-Term International Migration estimates). Immigration clones donor rows at the same age; emigration randomly removes rows and cleans up orphaned benunit/household rows.
- `apply_employment_transitions`: rule-based placeholder for within-person labour-market moves: retirement at state-pension age, CPI-plus wage drift, and configurable job loss/gain rates with nearest-age income donor for gainers. Will be replaced with UKHLS-estimated rates in a follow-up.
All functions are pure (no mutation), deterministic given an explicit RNG, and use only columns and ID shapes already present in the FRS / pe-uk schema. `is_married`, `is_single`, `is_couple` etc. pick up the changes automatically because they are derived from adult counts in benunits.
44 new tests across the five modules cover: rate-zero is a no-op, rate-one produces maximal transitions, derived boolean flags flip correctly, benunit/household rows stay consistent, deterministic under a fixed seed, and the default rates produce sensible aggregates.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
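As an illustration of the closest-age matching described for `apply_marriages`, the sketch below pairs candidates greedily within region. It is an assumption-laden simplification: the function name `match_partners` and the dict fields are hypothetical, candidates are taken as already selected by the marriage rates, and the benunit merge and weight folding are left out.

```python
from typing import List, Tuple


def match_partners(singles: List[dict]) -> List[Tuple[int, int]]:
    """Greedily pair each single with the closest-age opposite-sex
    candidate in the same region; returns (person_id, partner_id) pairs."""
    matched = set()
    pairs = []
    for person in singles:
        if person["person_id"] in matched:
            continue
        candidates = [
            other
            for other in singles
            if other["person_id"] not in matched
            and other["person_id"] != person["person_id"]
            and other["region"] == person["region"]
            and other["gender"] != person["gender"]
        ]
        if not candidates:
            continue  # nobody eligible in this region: stays single
        # Ties on age gap are broken by person_id to keep the result stable.
        partner = min(
            candidates,
            key=lambda c: (abs(c["age"] - person["age"]), c["person_id"]),
        )
        matched.update({person["person_id"], partner["person_id"]})
        pairs.append((person["person_id"], partner["person_id"]))
    return pairs
```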
Adds *.dta to the gitignore pattern and explicitly ignores policyengine_uk_data/storage/ukhls (both directory and symlink forms). UKHLS is UKDS-licensed under the same EUL regime as the FRS; per CLAUDE.md, individual-level records must never enter the repo. Only aggregate, non-disclosive outputs derived from this data may be tracked.
Adds the UKHLS side of the panel pipeline — the last piece needed for
real within-person year-on-year dynamics, complementing the ONS
mortality / fertility / marriage / separation / leaving-home /
migration rates already on the branch.
Three pieces:
- `datasets/ukhls.py`: loader for the UKDS-licensed Main Survey
indresp files (Waves 1-15, study SN 6614). Normalises wave-letter
prefixes away, harmonises the rich jbstat enum into a four-state
labour market label (IN_WORK / UNEMPLOYED / RETIRED / INACTIVE).
The raw .dta files live under `storage/ukhls/` and are git-ignored
along with all .dta patterns; only this loader sees individual
rows.
- `utils/ukhls_transitions.py`: estimator that pairs consecutive
waves on `pidp` (the cross-wave person key) and computes
P(state_t+1 | state_t, age_band, sex) for employment states and
P(decile_t+1 | decile_t, age_band, sex) for income deciles. Cells
with fewer than 10 observations are suppressed to meet ONS Safe
Setting disclosure-control conventions; probabilities are
renormalised post-suppression so each (age_band, sex, state_from)
group still sums to 1.
- Aggregated outputs, committed to `storage/`:
- `ukhls_employment_state_transitions.csv` (15.6 KB)
- `ukhls_income_decile_transitions.csv` (94 KB)
Both are safe to ship. No individual-level information is included;
the smallest surviving cell size is 10.
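The suppress-then-renormalise step can be sketched for a single `(age_band, sex, state_from)` group as below. The function name and the count-dictionary input are hypothetical; the threshold of 10 is the one stated above, and an empty result for a fully suppressed group is an assumption.

```python
from typing import Dict

MIN_CELL_COUNT = 10  # cells with fewer observations are suppressed


def transition_probabilities(counts: Dict[str, int]) -> Dict[str, float]:
    """Turn destination-state counts for one origin group into probabilities,
    dropping small cells and renormalising so the survivors sum to 1."""
    kept = {
        state: n for state, n in counts.items() if n >= MIN_CELL_COUNT
    }
    total = sum(kept.values())
    if total == 0:
        return {}  # assumption: a fully suppressed group yields no rates
    return {state: n / total for state, n in kept.items()}
```

Renormalising over the surviving cells means a suppressed destination's probability mass is spread proportionally across the remaining destinations rather than being lost.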
Sanity check against published statistics (15-wave run, 601 K
person-wave rows, 38 K+ panel retention per wave transition):
MALE 25-29 IN_WORK → 95.6% stay employed, 2.7% unemployed, 1.7%
inactive — within noise of ONS LFS 2-quarter flow rates.
Average probability of staying in the same income decile year-on-
year: 39.9% — consistent with IFS Living Standards publications.
Test coverage: 9 hermetic tests using a synthetic panel (no real
microdata touched in CI). Probability-sum invariants, planted-rate
recovery, disclosure-control suppression, decile round-trip, and
optional tests against the committed aggregate CSVs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…refs #345) Lands steps 1-4 of the remaining panel work:
- **UKHLS rates into `apply_employment_transitions`** (step 2): new `ukhls_rates=` kwarg accepts the nested dict produced by `utils.ukhls_transitions.load_employment_transitions`. When supplied, each working-age person is reclassified into the four-state labour market label (IN_WORK / UNEMPLOYED / RETIRED / INACTIVE), their next state is drawn from the empirical age-band × sex transition matrix, and income / employment_status columns are updated accordingly. Retirement at SPA and wage drift still run in addition. The rule-based job_loss/job_gain path is bypassed, so pass both rates even if you set them to zero elsewhere.
- **`apply_income_decile_transitions`** (step 3): new function in `utils/household_transitions.py` that, given the UKHLS decile transition table, ranks workers into deciles within each (age_band, sex) cell, draws a destination decile, and rescales their income by the ratio of target / origin decile medians. Preserves relative within-decile position; missing or suppressed-cell tuples pass through unchanged.
- **`advance_year` composer** (step 1): new `utils/advance_year.py` chains migration → separation → leaving-home → marriage → employment transitions → decile transitions → mortality / fertility / age increment → uprating under a single seeded RNG. Ordering is load-bearing and documented inline; every step returns a fresh dataset so nothing mutates the caller's input.
- **`create_yearly_snapshots` mode kwarg** (step 4): adds `mode="uprate_only"` (default, unchanged behaviour) and `mode="full_panel"` which rolls forward year-by-year via `advance_year`. Panel IDs evolve under full_panel mode (deaths, births, migration), so the byte-for-byte ID assertion only runs in uprate-only mode. Per-year seed is derived as `seed + (year - base_year)` so runs are reproducible from a single scalar.
Test coverage: 38 tests across the four areas.
- 14 total in apply_employment_transitions (4 new for UKHLS rates).
- 7 in apply_income_decile_transitions.
- 7 in advance_year (mutation-free, age increment, determinism, all-disabled smoke test, UKHLS-rates wiring, uprating, default year).
- 10 in yearly_snapshots (3 new for full_panel mode: writes each year, year-over-year ageing, unknown-mode error).
135-test suite across the whole panel pipeline stays green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
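Two pieces of arithmetic from this commit, sketched in isolation. The seed formula is the one stated above; the helper names, and the choice to leave income unchanged when the origin median is not positive, are assumptions for illustration.

```python
from typing import Dict, Iterable


def per_year_seeds(seed: int, base_year: int, years: Iterable[int]) -> Dict[int, int]:
    """Derive one seed per target year as seed + (year - base_year)."""
    return {year: seed + (year - base_year) for year in years}


def rescale_income(
    income: float, origin_median: float, target_median: float
) -> float:
    """Move an income to a destination decile by the ratio of decile medians,
    which keeps the person's relative position within the decile."""
    if origin_median <= 0:
        return income  # assumption: no meaningful ratio, pass through
    return income * target_median / origin_median
```

For instance, with `seed=42` and `base_year=2023`, the years 2024 and 2025 get seeds 43 and 44, and an income of 20,000 moving from a decile with median 25,000 to one with median 30,000 becomes 24,000.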
Conflict was in `policyengine_uk_data/utils/calibrate.py`: main added `load_weights` (#351 defensive h5-weight loader) at the same file position where this branch added `compute_log_weight_smoothness_penalty` (#346 step 5). Both functions are independent and both stay. All 150 tests in the panel-pipeline suite + the calibrate smoothness tests pass after the merge, including `load_weights` consumers pulled in from main (test_calibrate_save, test_la_land_value_targets).
Follow-up PRs (explicitly not this one)
Summary
Turns the single-year cross-sectional pipeline into a full dynamic panel microsimulation: real ONS demographic rates, real UKHLS-estimated within-person transitions, and a one-call composer that ages the UK population forward year-by-year. Closes #345.
No change to the default build. Every new behaviour is opt-in via
`create_yearly_snapshots(mode="full_panel")` or by calling `advance_year()` directly.
The PR grew into four layers:
- `advance_year` and a `mode="full_panel"` on `create_yearly_snapshots` to chain it all together deterministically under one seed.
What's in the pipeline
Every component below ships as opt-in machinery — callers decide which rates to use and when to run transitions.
Panel identity and snapshots
- `utils/panel_ids.py`: `PANEL_ID_COLUMNS`, `get_panel_ids`, `assert_panel_id_consistency`, `classify_panel_ids`, `PanelIDTransition`. Documents `household_id` / `benunit_id` / `person_id` as the panel keys.
- `datasets/yearly_snapshots.py`: `create_yearly_snapshots(base, years, output_dir, *, mode, ...)`.
  - `mode="uprate_only"` (default): uprate each target year, IDs preserved byte-for-byte.
  - `mode="full_panel"`: roll the population forward year-by-year via `advance_year`; IDs evolve (deaths / births / migration).
- `storage/upload_yearly_snapshots.py`: parallel uploader to the existing private HF destination. Destination constants are not exposed as arguments; redirecting the upload requires a code edit reviewed under CLAUDE.md data-protection rules.
Demographic ageing
- `utils/demographic_ageing.py`: `age_dataset(base, years, *, mortality_rates, fertility_rates, seed)`.
- Accepts an age-only `Mapping[int, float]` or a sex-specific `Mapping[str, Mapping[int, float]]` for mortality.
Household-composition transitions
All in `utils/household_transitions.py`:
- `apply_marriages`: pair unmarried adults within region, closest-age match, merge benunits (weights folded).
- `apply_separations`: split two-adult benunits; children default to the mother.
- `apply_children_leaving_home`: move adult dependents (both FRS shapes: dependent young adults on parents' benunit, and separate benunits sharing the parental household) into fresh benunit + household.
- `apply_migration`: Poisson-distributed net inflow / outflow per age, inflow cloned from same-age donors, outflow removed cleanly.
Labour-market transitions
- `apply_employment_transitions(dataset, *, ukhls_rates=None, ...)`
- `apply_income_decile_transitions`: ranks workers into deciles within (age_band, sex), draws a destination decile from the UKHLS matrix, rescales income by the ratio of target/origin decile medians.
Year-aware calibration plumbing
- `targets/build_loss_matrix.py`: `resolve_target_value(target, year, *, tolerance=3)`, constant `YEAR_FALLBACK_TOLERANCE = 3`.
- `local_authorities/loss.py` was reading `household_weight` at a hard-coded `2025`. `constituencies/loss.py` was reading `dataset.time_period` instead of the passed `time_period` argument in two places. Both now honour the argument.
- `docs/targets_coverage.md`: year coverage across every target category plus the remaining data-sourcing gaps.
Cross-year smoothness
- `utils/calibrate.py`: `compute_log_weight_smoothness_penalty`, plus two keyword-only kwargs on `calibrate_local_areas` (`prior_weights`, `smoothness_penalty`). When both are supplied, a log-space L2 penalty pulls the current year's weights towards the prior year's. Off by default; pre-step-5 behaviour is reproduced bit-for-bit.
Top-level composer
- `utils/advance_year.py`: `advance_year(dataset, *, target_year, seed, ...)`. Chains migration → separation → leaving-home → marriage → employment transitions → income-decile transitions → mortality → fertility → age increment → uprating. One seeded RNG so the whole year is reproducible.
Real-data backing
All seven rate sources are either public ONS data (committed alongside the existing ONS xlsx fixtures in `storage/`) or aggregated UKHLS derivatives (cell-suppression ≥ 10 applied, raw microdata never committed).
- Mortality: `targets/sources/ons_mortality.py` + `storage/ons_national_life_tables.xlsx`
- Fertility: `targets/sources/ons_fertility.py` + `storage/ons_asfr.xlsx`
- Marriage: `utils/household_transitions.py`
- Separation: `utils/household_transitions.py`
- Children leaving home: `utils/household_transitions.py`
- Migration: `utils/household_transitions.py`
- UKHLS employment and income-decile transitions: `datasets/ukhls.py` + `utils/ukhls_transitions.py` + committed aggregated CSVs
UKHLS pipeline detail
- `datasets/ukhls.py` reads the UKDS-licensed `.dta` files from `storage/ukhls/` (git-ignored; raw microdata never leaves the machine).
- `utils/ukhls_transitions.py` pairs consecutive waves on `pidp` (cross-wave person key) and estimates:
  - `P(state_{t+1} | state_t, age_band, sex)`: four-state (IN_WORK / UNEMPLOYED / RETIRED / INACTIVE)
  - `P(decile_{t+1} | decile_t, age_band, sex)`: within-wave income deciles
- Cells with fewer than 10 observations are suppressed, and probabilities are renormalised within `(age_band, sex, state_from)` row groups.
- Aggregated outputs committed to `storage/`:
  - `ukhls_employment_state_transitions.csv` (≈16 KB)
  - `ukhls_income_decile_transitions.csv` (≈94 KB)
Data-protection safeguards
- `**/*.dta` and `storage/ukhls/` are gitignored (both directory and symlink forms).
- `storage/upload_yearly_snapshots.py` locks the private HF repo + GCS bucket constants at function level; redirecting the upload requires a reviewable code edit.
What's not in this PR (explicit non-goals)
- `create_datasets.py:main()` refactor to invoke `create_yearly_snapshots` automatically: kept as a follow-up so the change-of-default conversation happens in its own PR.
- Per-year calibration: the plumbing is in place (`resolve_target_value`, smoothness penalty), but the full year-loop wiring belongs in a separate step.
- `policyengine-uk` changes (runtime-uprating skip flags, fixture migration): documented in `docs/panel_downstream.md`, left to the sibling repo.
- `enhanced_frs_2023_24.h5`: 176 in-tree tests cover every unit and a composed `advance_year` run, but the full-panel dry-run through `create_yearly_snapshots` is a follow-up I'll attach before moving this out of Draft.
Test plan
176 new tests across the stack, all green locally; no real FRS or UKHLS microdata touched in CI (every test uses in-memory synthetic fixtures or committed aggregate CSVs).
Notable coverage
- `years` and `storage_folder` as kwargs.
- `age_dataset` run.
- `@pytest.mark.skipif` path that exercises the committed aggregate CSVs when present.
Running a one-year panel roll
Tracks and closes: #345.