feat(diagnostics): PlaceboTests methodology validation (PR-B) — exact RI p-value + R parity by igerber · Pull Request #555 · igerber/diff-diff

igerber · 2026-06-26T20:53:28Z

Summary

PR-B (the closer) of the PlaceboTests methodology promotion — source validation + methodology tests + R parity + tracker → Complete. Anchored to Bertrand-Duflo-Mullainathan (2004); paper review landed in PR-A (docs(methodology): add Bertrand-Duflo-Mullainathan (2004) paper review #553).
Source fixes (diff_diff/diagnostics.py):
- permutation_test: exact randomization-inference p-value (1 + count)/(B + 1) (Phipson & Smyth 2010), replacing count/B floored at 1/(B+1); the +1 includes the observed statistic (floor now intrinsic). Docstring qualified to "exact under the sharp null" (BDM fn 12). The sampled value converges to the exact full-enumeration count/total.
- placebo_group_test: optional treatment parameter that drops ever-treated units for an uncontaminated placebo, with fail-closed validation (column existence, binary 0/1, no NaN — groupby().max() would otherwise silently skip filtering) + degenerate-design ValueError guards (replacing a cryptic LinAlgError) + misuse UserWarning. The run_placebo_test dispatcher's fake_group path now filters ever-treated units by default; the result records the units actually used.
Tests: tests/test_methodology_placebo.py (23 tests, 5 BDM-anchored classes, 2 @pytest.mark.slow).
R parity: base-R combn exact enumeration — benchmarks/R/generate_placebo_golden.R → benchmarks/data/placebo_golden.json (+ placebo_test_panel.csv). Deterministic leave-one-out / fake-group / observed ATT match R (atol≤1e-10); the sampled permutation p-value converges to the exact enumeration.
Docs/tracker: full REGISTRY ## PlaceboTests entry; METHODOLOGY_REVIEW.md row → Complete (only Survey Data Support now remains In Progress); references.rst (Phipson & Smyth 2010, Rosenbaum 1996); doc-deps.yaml mapping; CHANGELOG.md.

Methodology references (required if estimator / math changes)

Method name(s): PlaceboTests (placebo-law / randomization-inference diagnostics on the base 2×2 DiD)
Paper / source link(s): Bertrand, Duflo & Mullainathan (2004), QJE 119(1):249-275 (https://doi.org/10.1162/003355304772839588); Phipson & Smyth (2010) for the (1+count)/(B+1) RI p-value; Rosenbaum (1996) for randomization inference.
Intentional deviations (documented in REGISTRY ## PlaceboTests with **Note:** labels): the permutation path bypasses safe_inference (RI p-value + null-distribution percentile interval — not an effect CI — + null-mean placebo_effect); leave_one_out_test's se is a dispersion spread, not a design-based jackknife SE; the permutation NaN-decoupling contract (count-based p-value stays valid when se is degenerate); BDM's serial-correlation SE corrections are out of scope for this diagnostic surface.

Validation

Tests added/updated: tests/test_methodology_placebo.py (new, 23 tests); tests/test_diagnostics.py (32 existing, unchanged, green under the new p-value formula). R-parity class is pytest.skip-guarded when the golden is absent.
Backtest / simulation / notebook evidence: docs/tutorials/04_parallel_trends.ipynb re-validated under the new permutation p-value formula.

Security / privacy

Confirm no secrets/PII in this PR: Yes

🤖 Generated with Claude Code

github-actions · 2026-06-26T20:57:04Z

Overall Assessment

⚠️ Needs changes — one unmitigated P1 methodology/documentation mismatch.

Executive Summary

Affected methods: permutation_test, placebo_group_test, and run_placebo_test(test_type="fake_group").
placebo_group_test’s never-treated filtering matches the new REGISTRY contract and has reasonable fail-closed guards.
The documented non-safe_inference permutation path is explicitly labeled in REGISTRY.md, so I do not treat it as an anti-pattern defect.
P1: the PR repeatedly describes (1 + count)/(B + 1) as the exact sampled RI p-value, but the implementation samples assignments with replacement; Phipson-Smyth distinguish this from the exact without-replacement case.
I could not run tests in this environment: python -m pytest failed because pytest is not installed.

Methodology

Finding: P1 — “exact” Phipson-Smyth claim does not match the sampling scheme.
Location: diff_diff/diagnostics.py:L645-L695, docs/methodology/REGISTRY.md:L3114-L3120, CHANGELOG.md:L36-L42, tests/test_methodology_placebo.py:L112-L168
Impact: permutation_test draws treatment assignments independently each iteration, so the same assignment, including the observed assignment, can recur. Under Phipson-Smyth, (b + 1)/(m + 1) is the exact p-value for random permutations drawn without replacement; for with-replacement draws it is a valid but conservative upper bound, while the exact with-replacement p-value is different. The implementation’s p-value is defensible, but the PR’s “exact” wording and methodology-promotion evidence overstate what the code computes. (arxiv.org)
Concrete fix: either sample unique assignments without replacement from the finite assignment set when using the (1 + count)/(B + 1) exact claim, or update REGISTRY.md, changelog, docstrings, and test names/comments to call the current value a valid/conservative with-replacement Monte Carlo RI p-value that converges to full enumeration.

No defect: placebo_group_test(..., treatment=...) correctly uses groupby(unit)[treatment].max() after binary/NaN validation to drop ever-treated units, and run_placebo_test(..., test_type="fake_group") propagates the real treatment column into that path.

Code Quality

Finding: P3 — used_fake_treated = sorted(...) can reject mixed-type unit IDs.
Location: diff_diff/diagnostics.py:L552-L556
Impact: Sorting is only metadata cleanup, but mixed comparable/non-comparable unit labels can raise TypeError after successful estimation.
Concrete fix: preserve the user-provided order with a filtered list, or sort with key=str if deterministic display ordering is required.

Performance

No findings. The new fake-group filter is linear in rows, and the permutation path keeps the prior computational profile.

Maintainability

No P1/P2 findings. The new treatment parameter is propagated through the main dispatcher. run_all_placebo_tests() still excludes fake-group by design, so I am not treating that as a propagation bug.

Tech Debt

No blocking tech-debt findings. I did not find a new PlaceboTests deferral in TODO.md; the only issue above should be resolved in-code/docs rather than deferred because it affects methodology claims.

Security

No findings. The added benchmark data and R script do not appear to introduce secrets or sensitive data.

Documentation/Tests

Finding: P2 — methodology tests assert the implementation formula, not the source distinction.
Location: tests/test_methodology_placebo.py:L112-L168
Impact: The tests lock in (1 + count)/(B + 1) for sampled draws but do not catch the with-replacement vs without-replacement exactness distinction.
Concrete fix: add a test that either verifies unique sampled assignments when claiming exactness, or verifies/document the conservative with-replacement contract.

Path to Approval

Update the permutation methodology contract: choose exact without-replacement sampling, or document the current p-value as valid/conservative with-replacement Monte Carlo RI.
Align REGISTRY.md, changelog, docstrings, and test names/comments with that chosen contract.
Add one test covering the chosen contract: unique assignment sampling for exactness, or duplicate-assignment tolerance plus conservative wording for with-replacement sampling.

github-actions · 2026-06-26T21:05:14Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 761a309d9586c9e1da4b85e215764d106e3cc1f2

Overall Assessment

✅ Looks good — the previous P1 methodology issue is resolved in the implementation, REGISTRY.md, changelog, tests, and permutation_test docstring. I found only P3 documentation cleanup items.

Executive Summary

Affected methods: permutation_test, placebo_group_test, and run_placebo_test(test_type="fake_group").
The sampled permutation p-value is now documented in REGISTRY.md and code as with-replacement Monte Carlo RI, not exact finite-sample enumeration.
placebo_group_test(..., treatment=...) now filters ever-treated units fail-closed and the dispatcher propagates treatment.
Prior P3 mixed-type sorting concern is fixed by preserving caller order.
Could not run tests: python -m pytest tests/test_methodology_placebo.py -q -p no:cacheprovider failed because pytest is not installed.

Methodology

Finding: P3 — stale “exact (1 + count)/(B + 1)” wording remains outside the Registry.
Location: METHODOLOGY_REVIEW.md:L1322-L1347, docs/references.rst:L106-L108
Impact: The authoritative REGISTRY.md correctly says sampled assignments are with replacement and the p-value is valid/conservative rather than exact finite-sample enumeration (docs/methodology/REGISTRY.md:L3114-L3120), so this is not an unmitigated methodology defect. The remaining wording can still confuse future reviewers.
Concrete fix: Replace “Exact randomization-inference p-value (1 + count)/(B + 1)” with “valid/conservative sampled RI p-value (1 + count)/(B + 1); exact value is full enumeration count/total.”

No P0/P1 methodology findings.

Code Quality

No findings. fake_group now records retained fake-treated units without sorting, avoiding the prior mixed-type metadata failure.

Performance

No findings. The new filtering is linear in rows and does not change the permutation loop’s asymptotic behavior.

Maintainability

No findings. The new treatment parameter is propagated through the dispatcher path that calls placebo_group_test.

Tech Debt

No blocking findings. No new untracked correctness debt found in TODO.md.

Security

No findings. The added benchmark CSV/JSON/R script do not expose secrets or sensitive data.

Documentation/Tests

Finding: P3 — methodology test count in tracker is stale.
Location: METHODOLOGY_REVIEW.md:L1330-L1332, tests/test_methodology_placebo.py:L115-L507
Impact: The tracker says 19 tests, but the file contains 24 test_ methods. This is informational and does not affect behavior.
Concrete fix: Update the tracker count to match the committed test file.

github-actions · 2026-06-26T21:10:57Z

🔁 AI review rerun (requested by @igerber)

Head SHA: e0e6452036ebbae2b27c23aa44cc6f5ff352d039

Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

Affected methods: permutation_test, placebo_group_test, and run_placebo_test(test_type="fake_group").
The prior stale “exact sampled p-value” wording is corrected; sampled RI is now documented as valid/conservative, with exact reserved for full enumeration.
The prior stale test-count item is fixed: METHODOLOGY_REVIEW.md now says 24 tests, matching the file.
The inline permutation inference path is documented in REGISTRY.md as an intentional deviation, so it is not a P1 anti-pattern.
I could not run tests in this environment: pytest and numpy are not installed.

Methodology

Finding: P3 informational — documented permutation inference deviation, no action required.
Location: diff_diff/diagnostics.py:L694-L707, docs/methodology/REGISTRY.md:L3123-L3140
Impact: permutation_test intentionally bypasses safe_inference, keeps a finite count-based RI p-value when se is degenerate, and uses a null-distribution percentile interval rather than an effect CI. This is explicitly documented with Note labels in the registry, so it is mitigated.
Concrete fix: None required.

No P0/P1 methodology findings.

Code Quality

No findings. The new treatment parameter is propagated through the dispatcher path and the fake-group metadata preserves caller order after filtering.

Performance

No findings. The ever-treated filtering is linear in the input rows and does not change permutation-loop complexity.

Maintainability

No findings. The registry, changelog, reference docs, R generator, and tests describe the same p-value and fake-group filtering contracts.

Tech Debt

No blocking findings. I did not find new untracked correctness debt for PlaceboTests in TODO.md.

Security

No findings. The added benchmark CSV/JSON/R files do not appear to contain secrets or sensitive data.

Documentation/Tests

Finding: P3 verification limitation — tests not executable in this review environment.
Location: tests/test_methodology_placebo.py, tests/test_diagnostics.py
Impact: I could not run the focused tests because python -m pytest ... failed with No module named pytest, and a plain import smoke check failed with No module named numpy. Static review found the prior documentation issues resolved.
Concrete fix: Install the project test dependencies and run python -m pytest tests/test_methodology_placebo.py tests/test_diagnostics.py -q.

…alue PR-B of the PlaceboTests methodology promotion (Bertrand-Duflo-Mullainathan 2004). Source validation + methodology tests + R parity + tracker -> Complete. Source fixes (diff_diff/diagnostics.py): - permutation_test: exact randomization-inference p-value (1 + count)/(B + 1) (Phipson & Smyth 2010), replacing count/B floored at 1/(B+1); the +1 includes the observed statistic, making the docstring's "exact" claim faithful (now qualified to "exact under the sharp null", BDM fn 12). - placebo_group_test: optional `treatment` parameter that drops ever-treated units for an uncontaminated placebo; degenerate-design ValueError guards (replacing a cryptic LinAlgError) + misuse UserWarning; docstring corrected for both modes. The run_placebo_test dispatcher's fake_group path now filters ever-treated units by default. Validation: - tests/test_methodology_placebo.py (19 tests, 5 BDM-anchored classes). - R parity via base-R combn exact enumeration: benchmarks/R/generate_placebo_golden.R -> benchmarks/data/placebo_golden.json (+ placebo_test_panel.csv). - REGISTRY.md ## PlaceboTests full entry; METHODOLOGY_REVIEW.md row -> Complete. - references.rst: Phipson & Smyth (2010), Rosenbaum (1996); doc-deps mapping; CHANGELOG. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-26T21:39:59Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 05aebbe03f314eeb12f36cbc292130c704ca7634

Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

Affected methods: permutation_test, placebo_group_test, and run_placebo_test(test_type="fake_group").
The RI p-value change to (1 + count)/(B + 1) matches the new PlaceboTests registry contract and is tested against exact enumeration.
The inline permutation inference path is explicitly documented as an intentional deviation, so it is P3 informational, not a P1 anti-pattern.
The new fake-group treatment filter is propagated through the dispatcher and has fail-closed validation for missing, NaN, and non-binary treatment columns.
I found one P2 documentation drift: existing API/tutorial examples still show placebo_group_test on full data without treatment=... or pre-filtering to controls.
I could not run pytest because pytest and numpy are not installed; I did verify the committed golden values with a standard-library recomputation.

Methodology

Finding: P3 informational — documented permutation inference deviation, no action required.
Location: diff_diff/diagnostics.py:L693-L707, docs/methodology/REGISTRY.md:L3114-L3128, docs/methodology/REGISTRY.md:L3138-L3141
Impact: permutation_test intentionally bypasses safe_inference, keeps a count-based RI p-value when se is degenerate, and reports a null-distribution percentile interval rather than an effect CI. This is explicitly documented with Note labels in the registry and matches the BDM paper-review discussion that the placebo/randomization distribution is the reference distribution rather than a normal/t approximation (docs/methodology/papers/bertrand-duflo-mullainathan-2004-review.md:L30-L39, L62-L70).
Concrete fix: None required.

No P0/P1 methodology findings.

Code Quality

No findings. The new treatment parameter is not a constructor parameter, so get_params() propagation is not applicable. The dispatcher passes it through for fake_group at diff_diff/diagnostics.py:L308-L317, and direct calls remain backward-compatible.

Performance

No findings. The fake-group ever-treated filtering is linear in the input rows, and the permutation loop complexity is unchanged.

Maintainability

No findings. The registry, methodology tracker, golden generator, and tests now describe the same RI p-value and fake-group filtering contracts.

Tech Debt

No blocking findings. I did not find a PlaceboTests-specific deferred correctness issue in TODO.md; the remaining item below is documentation-only and not a silent correctness bug.

Security

No findings. I searched the changed source, docs, fixtures, and benchmark files for common secret patterns; no hits.

Documentation/Tests

Finding: P2 — stale direct placebo_group_test examples can demonstrate a contaminated fake-group design.
Location: docs/api/diagnostics.rst:L55-L62, docs/tutorials/04_parallel_trends.ipynb:L434-L440
Impact: The registry and docstring now say direct placebo_group_test calls must either pass treatment= or supply control-only data. These examples select fake-treated units from controls but still pass the full panel without treatment=..., so real-treated units remain in the comparison group.
Concrete fix: Add treatment='treated' to the examples, or pre-filter the input to never-treated/control-only rows before calling placebo_group_test.

Finding: P3 verification limitation — tests not executable in this environment.
Location: tests/test_methodology_placebo.py, tests/test_diagnostics.py
Impact: python -m pytest tests/test_methodology_placebo.py tests/test_diagnostics.py -q failed with No module named pytest; importing numpy also failed. Static review plus a standard-library recomputation matched the committed golden ATT, exact count/total p-value, LOO spread, and fake-group ATT.
Concrete fix: Install test dependencies and run python -m pytest tests/test_methodology_placebo.py tests/test_diagnostics.py -q.

igerber force-pushed the feature/placebo-methodology-validation branch from 9c0a8ea to 761a309 Compare June 26, 2026 21:02

igerber force-pushed the feature/placebo-methodology-validation branch from 761a309 to e0e6452 Compare June 26, 2026 21:07

igerber added the ready-for-ci Triggers CI test workflows label Jun 26, 2026

igerber force-pushed the feature/placebo-methodology-validation branch from e0e6452 to 85cd44e Compare June 26, 2026 21:29

igerber force-pushed the feature/placebo-methodology-validation branch from 85cd44e to 05aebbe Compare June 26, 2026 21:31

igerber merged commit c48b5b8 into main Jun 27, 2026
26 checks passed

igerber deleted the feature/placebo-methodology-validation branch June 27, 2026 00:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(diagnostics): PlaceboTests methodology validation (PR-B) — exact RI p-value + R parity#555

feat(diagnostics): PlaceboTests methodology validation (PR-B) — exact RI p-value + R parity#555
igerber merged 1 commit into
mainfrom
feature/placebo-methodology-validation

igerber commented Jun 26, 2026

Uh oh!

github-actions Bot commented Jun 26, 2026

Uh oh!

github-actions Bot commented Jun 26, 2026

Uh oh!

github-actions Bot commented Jun 26, 2026

Uh oh!

github-actions Bot commented Jun 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

igerber commented Jun 26, 2026

Summary

Methodology references (required if estimator / math changes)

Validation

Security / privacy

Uh oh!

github-actions Bot commented Jun 26, 2026

Uh oh!

github-actions Bot commented Jun 26, 2026

Uh oh!

github-actions Bot commented Jun 26, 2026

Uh oh!

github-actions Bot commented Jun 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant