From e300142f3f30c1a873f11ef11cfaf7d49b02ad75 Mon Sep 17 00:00:00 2001
From: Max Ghenis
Date: Thu, 4 Jun 2026 23:52:56 +0100
Subject: [PATCH] Document weighting rule in AGENTS.md

Never read or sum weight arrays directly, and never report unweighted
record counts or raw HDF5 column sums as population figures. Compute
population aggregates via Microsimulation (microdf auto-weights with the
household weight); if a weight must be referenced at all, it is
household_weight only.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 AGENTS.md | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/AGENTS.md b/AGENTS.md
index 86602e9d..98e61b79 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -81,6 +81,27 @@ To avoid rebuilding long prompts in chat:
 3. Write the full review to a dated file under [`/Users/maxghenis/PolicyEngine/microplex-us/reviews/`](/Users/maxghenis/PolicyEngine/microplex-us/reviews/).
 4. Append only a concise summary to [`/Users/maxghenis/PolicyEngine/microplex-us/_BUILD_LOG.md`](/Users/maxghenis/PolicyEngine/microplex-us/_BUILD_LOG.md).
 
+## Weighting / population aggregates (CRITICAL)
+
+When computing any population figure from a PolicyEngine dataset/H5 (eCPS, MP, candidate
+comparisons, coverage checks — everything):
+
+- **NEVER** read or sum a weight array directly — not `person_weight`, `tax_unit_weight`,
+  `family_weight`, `marital_unit_weight`, nor even `household_weight` — and **never** report an
+  unweighted record count or a raw HDF5 column `.sum()` as a population number. Both are wrong.
+- **Always** aggregate through `Microsimulation`, which auto-weights via microdf (you never touch a
+  weight):
+
+  ```python
+  from policyengine_us import Microsimulation
+  sim = Microsimulation(dataset=path)
+  total = sim.calculate("taxable_private_pension_income", 2024).sum()  # weighted $
+  recipients = (sim.calculate("taxable_private_pension_income", 2024) > 0).sum()  # weighted count
+  ```
+
+- If you ever must reference a weight at all, it is **`household_weight` ONLY**; the other entity
+  weights are derived and must never be used directly.
+
 # GitNexus — Code Intelligence