
Add survey support to dCDH estimator #307

Open
igerber wants to merge 13 commits into main from feature/dcdh-survey

Conversation

@igerber

igerber commented Apr 16, 2026

Summary

  • Add survey support (pweight + strata/PSU/FPC) to ChaisemartinDHaultfoeuille via Taylor Series Linearization
  • Weighted cell aggregation: survey weights produce weighted cell means for the point estimate
  • IF expansion from group to observation level for design-based variance (library extension, documented in REGISTRY.md)
  • Survey-aware SE at all call sites: overall, joiners, leavers, multi-horizon, placebos
  • Bootstrap + survey emits warning (PSU-level bootstrap deferred)
  • Observation-level varying weights/PSU/strata within groups are supported
  • 13 new tests in test_survey_dcdh.py; 171 total dCDH tests pass
  • Documentation updates: REGISTRY.md, ROADMAP.md, llms-full.txt, choosing_estimator.rst
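
The weighted cell aggregation in the first bullet can be sketched as follows. This is an illustration of the idea only; the column names (`group`, `time`, `y`, `w`) and the helper are placeholders, not the library's API.

```python
import pandas as pd

df = pd.DataFrame({
    "group": [1, 1, 2, 2],
    "time": [0, 0, 0, 0],
    "y": [1.0, 3.0, 2.0, 4.0],
    "w": [1.0, 3.0, 2.0, 2.0],
})

def weighted_cell_means(frame):
    """Collapse rows to (group, time) cells with pweight-weighted outcome means."""
    tmp = frame.assign(wy=frame["y"] * frame["w"])
    g = tmp.groupby(["group", "time"])
    return (g["wy"].sum() / g["w"].sum()).rename("y_gt").reset_index()

cells = weighted_cell_means(df)
# group 1: (1*1 + 3*3) / 4 = 2.5; group 2: (2*2 + 2*4) / 4 = 3.0
```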

Methodology references (required if estimator / math changes)

  • Method name(s): ChaisemartinDHaultfoeuille survey extension
  • Paper / source link(s): de Chaisemartin & D'Haultfoeuille (2020, 2024); Binder (1983) for TSL variance
  • Any intentional deviations from the source (and why): The IF expansion psi_i = U[g] * (w_i / W_g) is a library extension not in the dCDH papers (the paper assumes iid sampling). Documented as a Note: in REGISTRY.md.
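
The documented IF expansion can be sketched in a few lines. This is a toy illustration of psi_i = U[g] * (w_i / W_g), assuming group contributions in a dict and one weight per observation; the function name and shapes are hypothetical, not the library's implementation.

```python
import numpy as np

def expand_group_if(U_g, weights, group_ids):
    """Spread each group contribution U[g] over its rows as psi_i = U[g] * w_i / W_g."""
    psi = np.empty_like(weights, dtype=float)
    for g, u in U_g.items():
        mask = group_ids == g
        psi[mask] = u * weights[mask] / weights[mask].sum()
    return psi

psi = expand_group_if(
    {0: 2.0, 1: -1.0},
    np.array([1.0, 3.0, 2.0, 2.0]),
    np.array([0, 0, 1, 1]),
)
# within-group sums recover U[g]: psi[:2].sum() == 2.0, psi[2:].sum() == -1.0
```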

Validation

  • Tests added/updated: tests/test_survey_dcdh.py (new, 13 tests), tests/test_chaisemartin_dhaultfoeuille.py (updated gate test)
  • Backtest / simulation / notebook evidence (if applicable): N/A

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

@github-actions

Overall Assessment

⚠️ Needs changes

Static review only. I could not execute the test suite in this shell because the available Python environment is missing numpy.

Executive Summary

  • Affected methods: the new survey extension for ChaisemartinDHaultfoeuille touches DID_M, DID_+, DID_-, multi-horizon DID_l, and multi-horizon placebo inference.
  • The new REGISTRY note documents the observation-level IF expansion as a library extension, so I am not treating the existence of that deviation as a defect by itself docs/methodology/REGISTRY.md:651.
  • P1: the analytical survey SE path appears to drop the N_S / N_l normalization and feeds numerator-scale dCDH group IFs into compute_survey_if_variance(), which is documented to expect estimator-scale psi.
  • P1: weighted cell aggregation has no zero-total-weight guard, but downstream availability/control logic still keys off raw n_gt, so zero-weight cells can be treated as valid switcher/control cells with undefined y_gt.
  • P2: the new tests do not pin survey SE scale or zero-weight-cell behavior, so both blocking issues can pass CI.
  • P3: survey.py now contains an unused helper whose docstring contradicts the new supported contract on within-group survey variation.

Methodology

The documented extension itself is acceptable; the blockers below are implementation errors inside that extension.

  • Severity: P1. Impact: the survey SE path is mis-scaled. dCDH’s own IF builders return numerator-scale contributions, not estimator-scale IFs: _compute_full_per_group_contributions() documents U.sum() == N_S * DID_M, and _compute_per_group_if_multi_horizon() documents sum(U_l) == N_l * DID_l diff_diff/chaisemartin_dhaultfoeuille.py:4242 diff_diff/chaisemartin_dhaultfoeuille.py:3686. _plugin_se() then divides by divisor to get the estimator SE diff_diff/chaisemartin_dhaultfoeuille.py:4494. But _compute_se() ignores divisor on the survey branch, and _survey_se_from_group_if() expands observations so that sum_i psi_i = U[g] before calling compute_survey_if_variance() diff_diff/chaisemartin_dhaultfoeuille.py:4535 diff_diff/chaisemartin_dhaultfoeuille.py:4558. The shared survey contract says compute_survey_if_variance() expects score/estimator-scale psi already normalized, analogous to w_i * eif_i / sum(w) diff_diff/survey.py:1504 docs/methodology/survey-theory.md:489. As written, the code is estimating the variance of N_* * estimand, not the estimand, across overall/joiners/leavers/event-study/placebo analytical outputs. Concrete fix: scale U_centered by 1 / divisor before expansion, or equivalently divide the final sqrt(variance) by divisor, and add a regression test on a hand-checkable panel.
  • Severity: P1. Impact: zero-weight cells are not handled safely. SurveyDesign.resolve() explicitly allows zero weights as long as not all weights are zero diff_diff/survey.py:173. The new weighted aggregation computes y_gt = _wy_sum / w_gt with no guard for w_gt <= 0 diff_diff/chaisemartin_dhaultfoeuille.py:217. Downstream, the estimator still builds N_mat from raw n_gt counts and uses N_mat > 0 for presence/A11/control-pool logic diff_diff/chaisemartin_dhaultfoeuille.py:1032 diff_diff/chaisemartin_dhaultfoeuille.py:2701. That means a cell with zero survey weight can still count as an observed joiner/leaver/control even though its weighted mean is undefined, which breaks the new survey control-availability logic and can propagate NaNs into estimates. Concrete fix: reject any (group, time) cell with w_gt <= 0 before pivoting, or redefine “present” everywhere on positive survey weight rather than raw observation count.
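
The two concrete fixes above can be sketched together. Both helpers are hypothetical stand-ins for the suggested changes, not the library's code: the first rescales numerator-scale group contributions by the relevant divisor before variance computation, the second guards the weighted cell mean against zero total weight.

```python
import numpy as np

def estimator_scale_if(U_centered, divisor):
    """Rescale numerator-scale group contributions to estimator scale."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    # Dividing psi by the divisor shrinks the implied variance by divisor**2,
    # so the SE targets the estimand rather than divisor * estimand.
    return np.asarray(U_centered, dtype=float) / divisor

def safe_cell_mean(wy_sum, w_gt):
    """Weighted cell mean with an explicit zero-total-weight guard."""
    if w_gt <= 0:
        return float("nan")  # treat the cell as absent, never divide by zero
    return wy_sum / w_gt

psi = estimator_scale_if([2.0, -1.0, 0.5], divisor=10.0)
# psi == [0.2, -0.1, 0.05]
```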

Code Quality

No material findings beyond the methodology issues above.

Performance

No material findings in the changed code.

Maintainability

  • Severity: P3. Impact: _validate_group_constant_survey() says dCDH survey columns “must not vary within groups” diff_diff/survey.py:914, but fit() now documents the opposite and the new tests assert that varying weights within groups are supported diff_diff/chaisemartin_dhaultfoeuille.py:653 tests/test_survey_dcdh.py:267. Because the helper is unused, this is not a current runtime bug, but it is dead contradictory guidance that can cause a future regression. Concrete fix: delete the helper or rewrite it to match the actual supported contract.

Tech Debt

  • Severity: P3. Impact: these blockers are not mitigated by existing tracked debt. The dCDH entries in TODO.md cover placebo SE and parity limitations, not survey SE scaling or zero-weight-cell handling TODO.md:59. Concrete fix: fix these in the PR; a TODO entry would not downgrade them because they are correctness issues.

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: the new test file would not catch either blocker. test_uniform_weights_match_unweighted() only checks ATT equality, not SE scale tests/test_survey_dcdh.py:69, and test_strata_psu_changes_se() only checks that two SEs differ, not that either is correct tests/test_survey_dcdh.py:192. There is also no zero-weight-cell test in the new suite. Concrete fix: add one hand-checkable analytical survey-SE test that pins the divisor normalization, plus a zero-weight-cell test that asserts explicit failure or consistent exclusion.
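
The scale defect the suggested test should pin down is easy to demonstrate numerically. This is an illustrative calculation, not the repository's test: with an estimator-scale IF, the plug-in variance of the mean is inflated by exactly n**2 if the 1/n divisor is dropped, which is why an ATT-only equality check cannot catch it.

```python
import numpy as np

rng = np.random.default_rng(0)
psi = rng.normal(size=50)                 # estimator-scale IF, one entry per obs
n = psi.size
iid_var = psi.var(ddof=1) / n             # correct plug-in variance of the mean
unscaled_var = (n * psi).var(ddof=1) / n  # what a dropped 1/n divisor produces
# unscaled_var == n**2 * iid_var, so the SE is off by a factor of n
```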

Path to Approval

  1. Normalize the survey IF passed to compute_survey_if_variance() by the relevant dCDH divisor (N_S, joiner_total, leaver_total, N_l, N_pl_l) for every analytical survey SE path.
  2. Add explicit handling for w_gt <= 0 cells and make the presence/control-availability logic consistent with that rule.
  3. Add regression tests for the two cases above, plus at least one survey multi-horizon/placebo inference case so the affected analytical survey paths are covered.

@igerber

igerber commented Apr 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 1ed49bce753bbee3d52cfa613b9e89b315d174cf


Overall Assessment
⚠️ Needs changes

Static review only; I could not run pytest here because this shell is missing numpy and pytest.

Executive Summary

  • Affected methods: the new survey path touches DID_M, DID_+, DID_-, multi-horizon DID_l, multi-horizon placebos, and derived delta / DID^n_l / DID^{fd} inference in diff_diff/chaisemartin_dhaultfoeuille.py:505.
  • The REGISTRY note documenting the observation-level IF expansion psi_i = U[g] * (w_i / W_g) is sufficient; I am not treating that library extension itself as a defect docs/methodology/REGISTRY.md:651.
  • The prior survey-SE normalization blocker appears fixed: _compute_se() now rescales numerator-scale dCDH IFs before calling the survey TSL helper, and the new scale-pinning test matches that intent diff_diff/chaisemartin_dhaultfoeuille.py:4541, tests/test_survey_dcdh.py:388.
  • The earlier zero-weight-cell issue is only partially addressed: zero-weight cells no longer count in N_mat, but the ragged-panel validator still treats their rows as observed periods.
  • [Newly identified] survey support is not fully propagated to controls or to derived analytical inference surfaces (delta, normalized_effects, linear_trends_effects).
  • The new tests add good coverage for the main survey path, but they do not cover the blocking interactions above.

Methodology
The documented deviation itself is acceptable: the survey IF expansion is recorded in the Methodology Registry as a **Note:** and aligns with the shared compute_survey_if_variance() score-scale contract docs/methodology/REGISTRY.md:652, diff_diff/survey.py:1463.

Code Quality

  • No additional findings beyond the methodology propagation issues above.

Performance

  • No material performance findings in the changed code.

Maintainability

  • Severity: P3. Impact: the dCDH REGISTRY overview still says survey support is “deferred,” while the checklist later says it shipped, so the designated methodology source of truth is internally inconsistent docs/methodology/REGISTRY.md:466, docs/methodology/REGISTRY.md:651. Concrete fix: update the overview paragraph and add an explicit dCDH survey-bootstrap **Note:** in the method section.

Tech Debt

  • Severity: P3. Impact: the blocking items above are not mitigated by tracked debt; the existing dCDH TODO entries only cover single-period placebo SE and parity-test coverage TODO.md:51, TODO.md:59. Concrete fix: resolve them in this PR rather than deferring.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new suite exercises overall survey fitting, SE scale pinning, and a basic zero-weight case, but it does not cover survey_design + controls, survey L_max>1 top-level delta inference, survey linear_trends_effects, or zero-weight baseline/interior-gap behavior; test_zero_weight_cell_excluded currently encodes the inconsistent expectation that a zero-weight interior cell should “just be absent” without triggering the documented ragged-panel handling tests/test_survey_dcdh.py:267, tests/test_survey_dcdh.py:347, tests/test_survey_dcdh.py:388, tests/test_survey_dcdh.py:427. Concrete fix: add targeted tests for those paths and assert either agreement with explicit row removal or the documented warning/error contract.

Path to Approval

  1. Make zero-weight cells fully obey the “absent cell” contract before Step 5b: either drop w_gt <= 0 rows immediately after aggregation or make baseline/interior-gap validation key off n_gt > 0. Add one test for a zero-weight baseline cell and one for a zero-weight interior-gap cell.
  2. Apply survey-weighted cell aggregation to controls in the DID^X path. Add a regression test with duplicated observations and within-cell-varying weights where weighted and unweighted cell covariates differ.
  3. Propagate resolved_survey.df_survey into the safe_inference() calls for cost_benefit_delta, normalized_effects, and linear_trends_effects, and add small-PSU tests that verify those outputs use survey t-based inference rather than z-based inference.

@igerber

igerber commented Apr 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: cf17cb2fdeb2a87807b5dff6b24d24e18bb5b43b


Overall Assessment
⛔ Blocker

Static review only; this shell does not have pytest or numpy, so I could not execute the new survey tests.

Executive Summary

Methodology
Methods affected: dCDH survey support via Binder-style TSL on the dCDH IF, Phase 2 delta, and dCDH HonestDiD integration. The documented survey IF expansion is covered by the registry note and is not a defect. docs/methodology/REGISTRY.md:L651-L652

  • Severity: P0. Impact: for survey-backed L_max >= 2 fits, the public overall_* surface is delta, but its inference still calls safe_inference(..., df=None) even though the same fit threads _df_survey into the other analytical survey surfaces. That silently reverts delta to z-based inference, and when df_survey <= 0 it bypasses safe_inference()’s NaN guard and can return finite overall_p_value / overall_conf_int where survey inference should be undefined. diff_diff/chaisemartin_dhaultfoeuille.py:L1740-L1775 diff_diff/chaisemartin_dhaultfoeuille.py:L2146-L2150 diff_diff/utils.py:L152-L182 Concrete fix: pass _df_survey into the safe_inference(delta_val, delta_se, ...) call and add regressions for both low positive survey d.f. and df_survey=0.
  • Severity: P1 [Newly identified]. Impact: fit(..., honest_did=True, survey_design=...) now reaches compute_honest_did(results), but the dCDH extraction branch still hard-codes df_survey=None with the comment “dCDH has no survey support”; the existing HonestDiD test locks that behavior in. That means survey-backed dCDH HonestDiD bounds silently use non-survey critical values instead of the survey-d.f. logic already used for other estimators. diff_diff/chaisemartin_dhaultfoeuille.py:L2538-L2545 diff_diff/honest_did.py:L970-L977 tests/test_honest_did.py:L1129-L1222 tests/test_honest_did.py:L1372-L1384 Concrete fix: return results.survey_metadata.df_survey for dCDH in _extract_event_study_params() (or explicitly reject the survey_design + honest_did combination until supported) and add a survey-backed dCDH HonestDiD regression.
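
The df-threading issue in both items can be sketched with a hypothetical `safe_inference`-like helper (the real signature in diff_diff/utils.py may differ; this only illustrates why z-based and t-based inference diverge at small survey degrees of freedom, and why df_survey <= 0 should yield NaN rather than finite output).

```python
import numpy as np
from scipy import stats

def safe_inference(est, se, df=None, alpha=0.05):
    """Return (p-value, CI): t-based when df is given, z-based otherwise."""
    if se is None or not np.isfinite(se) or se <= 0:
        return np.nan, (np.nan, np.nan)
    if df is not None:
        if df <= 0:
            return np.nan, (np.nan, np.nan)  # survey d.f. exhausted: undefined
        crit = stats.t.ppf(1 - alpha / 2, df)
        p = 2 * stats.t.sf(abs(est / se), df)
    else:
        crit = stats.norm.ppf(1 - alpha / 2)
        p = 2 * stats.norm.sf(abs(est / se))
    return p, (est - crit * se, est + crit * se)

p_t, ci_t = safe_inference(1.0, 0.5, df=3)
p_z, ci_z = safe_inference(1.0, 0.5, df=None)
# with df=3 the t critical value (~3.18) far exceeds z (~1.96), so ci_t is wider
```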

Code Quality

  • No additional findings beyond the inference-propagation issues above.

Performance

  • No material performance findings in the changed code.

Maintainability

  • No additional findings beyond the stale HonestDiD survey assumption already called out above.

Tech Debt

  • No mitigating TODO.md entry covers the remaining survey-delta or survey-HonestDiD issues, so they are unmitigated and should not be deferred in this PR.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new survey suite stops at “multi-horizon runs”, bootstrap warning, SE scale pinning, and zero-weight behavior; it does not assert survey inference on top-level delta, does not exercise survey + controls, and does not cover survey + honest_did=True. The stale HonestDiD extraction test still asserts df_s is None for dCDH, which is why the new survey/HonestDiD propagation hole survived. tests/test_survey_dcdh.py:L345-L444 tests/test_honest_did.py:L1372-L1384 Concrete fix: add targeted regressions for survey delta inference (overall_* when L_max=2), survey + controls, and survey + honest_did=True, and replace the stale df_s is None expectation with survey-aware assertions.

Path to Approval

  1. Thread _df_survey into the top-level delta safe_inference() call and add regressions showing that overall_p_value / overall_conf_int match survey-t inference for small positive d.f. and become NaN when df_survey <= 0.
  2. Make dCDH HonestDiD survey-aware by propagating results.survey_metadata.df_survey through _extract_event_study_params() (or by rejecting survey_design + honest_did explicitly), and add a survey-backed dCDH HonestDiD test to replace the stale df_s is None expectation.

@igerber

igerber commented Apr 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 4eca2322577294d8bfaefbb3c9d2f5a26bc5e4c4


Overall Assessment

⚠️ Needs changes

Static review only; this shell does not have numpy, so I could not execute the new survey tests.

Executive Summary

Methodology

Affected methods: the core dCDH survey extension, the Phase 3 heterogeneity test, and the TWFE diagnostic surface.

  • Severity: P1 [Newly identified]. Impact: survey_design + heterogeneity is now accepted, but the heterogeneity branch is still plain OLS with non-survey inference. fit() only rejects controls, trends_linear, and trends_nonparam; it does not reject survey-backed heterogeneity. _compute_heterogeneity_test() then calls solve_ols(..., return_vcov=True) with no weights and finishes with safe_inference(..., df=None), so results.heterogeneity_effects silently ignore the survey design even though the main dCDH surface is survey-aware and the registry’s heterogeneity note does not list survey as unsupported. diff_diff/chaisemartin_dhaultfoeuille.py:L2333-L2395 diff_diff/chaisemartin_dhaultfoeuille.py:L3279-L3292 docs/methodology/REGISTRY.md:L618-L618 Concrete fix: either reject heterogeneity when survey_design is present, or implement a survey-aware heterogeneity regression and document that contract in REGISTRY.md.
  • Severity: P1 [Newly identified]. Impact: survey support breaks the documented parity between the fitted TWFE diagnostic and the standalone helper. fit() now aggregates cells with weights=survey_weights, so results.twfe_* can depend on survey-weighted cell means, but twowayfeweights() still has no survey_design input and always uses unweighted aggregation. The registry and inline comments still say the fitted and standalone APIs “always produce identical numbers on the same input,” which is no longer true for survey-backed fits. This is an incomplete parameter propagation across a public helper. diff_diff/chaisemartin_dhaultfoeuille.py:L679-L685 diff_diff/chaisemartin_dhaultfoeuille.py:L766-L772 diff_diff/chaisemartin_dhaultfoeuille.py:L4887-L4944 docs/methodology/REGISTRY.md:L600-L600 Concrete fix: either add survey_design support to twowayfeweights() and route the same weighted aggregation path, or explicitly disable/reject twfe_diagnostic=True under survey_design and update the registry note.
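
The survey-aware heterogeneity option in the first item amounts to replacing unweighted OLS with pweighted WLS. A minimal sketch, with toy data and names that are not the library's API:

```python
import numpy as np

def wls(X, y, w):
    """Weighted least squares via the normal equations (X' W X) b = X' W y."""
    Xw = X * w[:, None]
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)

X = np.column_stack([np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])])
y = np.array([1.0, 2.0, 3.0, 10.0])
b_ols = wls(X, y, np.ones(4))                      # uniform weights = plain OLS
b_wls = wls(X, y, np.array([1.0, 1.0, 1.0, 0.01])) # downweight the last row
# the slope moves materially once the survey weights are honored, which is
# exactly why silently ignoring the design biases heterogeneity_effects
```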

Code Quality

No additional findings beyond the methodology propagation issues above.

Performance

No material performance findings in the changed code.

Maintainability

Tech Debt

No relevant TODO.md entry mitigates the P1 items above; the tracked dCDH review debt only covers older placebo/parity limitations. TODO.md:L51-L60

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: the new survey suite now covers core survey ATT, bootstrap warning, delta, controls, and HonestDiD, but it still does not cover survey heterogeneity or survey TWFE diagnostic/helper parity. Existing heterogeneity and Design-2 tests remain non-survey only. That is why the two propagation gaps above can land without a failing regression. tests/test_survey_dcdh.py:L345-L575 tests/test_chaisemartin_dhaultfoeuille.py:L2775-L2940 Concrete fix: add one survey-backed heterogeneity test (or explicit rejection test) and one survey TWFE diagnostic test that checks either fit-vs-helper parity or the intended rejection/warning behavior.

Path to Approval

  1. Make survey_design + heterogeneity explicit and correct: either reject that combination in fit() with a clear ValueError and registry note, or implement survey-aware heterogeneity estimation/inference and add a regression test for it.
  2. Make the TWFE diagnostic surface consistent under survey input: either extend twowayfeweights() to accept survey_design and match the fitted path, or reject twfe_diagnostic=True when survey_design is present and update the registry contract accordingly.
  3. Add targeted survey regressions for the chosen behavior in items 1 and 2 so CI will fail if those propagation paths regress again.

@igerber

igerber commented Apr 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 97789f5d5193522127e6cc8f5cffe2c2ca372c20


Overall Assessment

⚠️ Needs changes

Static review only. I could not run the test suite in this shell because pytest is not installed.

Executive Summary

  • The prior re-review blockers look resolved: survey heterogeneity now threads _obs_survey_info into _compute_heterogeneity_test diff_diff/chaisemartin_dhaultfoeuille.py#L2386, the standalone helper now accepts survey_design diff_diff/chaisemartin_dhaultfoeuille.py#L5018, and df_survey now reaches both delta and HonestDiD diff_diff/chaisemartin_dhaultfoeuille.py#L2146, diff_diff/honest_did.py#L970.
  • The survey IF expansion psi_i = U[g] * (w_i / W_g) and the survey+bootstrap fallback are explicitly documented Notes in the registry, so I am not treating those library extensions as defects docs/methodology/REGISTRY.md#L651.
  • Severity P1: the new survey TWFE diagnostic is still a hybrid. Survey-weighted cell means are computed, but the FE regressions and Theorem 1 decomposition still use raw n_gt counts instead of w_gt, so twfe_weights, twfe_beta_fe, twfe_fraction_negative, and twfe_sigma_fe do not correspond to an observation-level pweighted TWFE estimator on survey-backed inputs.
  • Severity P1: zero-weight observations still participate in cell validation/counting. Because _validate_and_aggregate_to_cells() computes d_min, d_max, and n_gt on all rows before only dropping all-zero cells, valid survey/subpopulation inputs can spuriously trip the fuzzy-DiD guard or leak excluded rows into downstream counts.
  • Severity P2: the new survey tests mostly prove path execution and fit/helper parity, not weighted-correctness or mixed zero-weight-row behavior, so the two P1s above can still pass.
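
The hybrid-weighting point in the first P1 is easy to see on cell-level data. This toy calculation (not the library's code) shows that weighting the same regression by raw counts n_gt versus survey totals w_gt yields different coefficients, so mixing the two produces a diagnostic that matches neither estimator:

```python
import numpy as np

def weighted_slope(d, y, w):
    """Slope of y on d (with intercept) under cell weights w."""
    X = np.column_stack([np.ones_like(d), d])
    Xw = X * w[:, None]
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)[1]

d = np.array([0.0, 0.0, 1.0, 1.0])          # cell treatment status
y = np.array([1.0, 1.0, 2.0, 4.0])          # cell outcome means
n_gt = np.array([10.0, 10.0, 10.0, 10.0])   # raw observation counts
w_gt = np.array([10.0, 10.0, 30.0, 10.0])   # survey weight totals
beta_n = weighted_slope(d, y, n_gt)  # treated mean 3.0 - control mean 1.0 = 2.0
beta_w = weighted_slope(d, y, w_gt)  # treated mean 2.5 - control mean 1.0 = 1.5
```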

Methodology

Code Quality

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity P3. Impact: neither P1 issue is tracked in TODO.md, so neither is mitigated by the project’s deferred-work policy TODO.md#L51. Concrete fix: address them in this PR rather than deferring them.

Security

No findings.

Documentation/Tests

  • Severity P2. Impact: the new tests do not pin the survey math that changed. TestSurveyTWFEParity only checks fit-vs-helper parity tests/test_survey_dcdh.py#L685, TestZeroWeightCells only checks an all-zero cell drop tests/test_survey_dcdh.py#L427, and the varying-weight test only exercises positive weights tests/test_survey_dcdh.py#L270. That is why the two survey-weight propagation bugs above can pass. Concrete fix: add one oracle test comparing twfe_beta_fe / decomposition weights to an observation-level pweighted FE regression, and one mixed zero/positive-weight test (ideally via SurveyDesign.subpopulation()) proving excluded rows do not affect treatment-constancy checks or cell counts.

Path to Approval

  1. Make the survey TWFE diagnostic mathematically consistent: when w_gt exists, use it instead of n_gt for the FE regressions, treated-side normalization denominator, exported contribution weights, and sigma_fe shares; otherwise reject survey-backed TWFE diagnostics.
  2. Make zero-weight observations truly out-of-sample for dCDH survey aggregation by excluding them from d_min/d_max/n_gt and the within-cell treatment-constancy check.
  3. Add regression tests for the two behaviors above so CI fails if survey TWFE weighting or zero-weight-row handling regresses again.

@igerber

igerber commented Apr 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: aeca1d8e5a6fd7e6202c2e0300e5284ba189c3fa


Overall Assessment

✅ Looks good

Static review only. I could not execute the test suite in this shell because the available Python environment is missing core deps like numpy/pandas, so this assessment is based on source review of the diff and surrounding code.

Executive Summary

  • The two prior re-review blockers appear resolved: zero-weight observations are now filtered before cell validation/counting in diff_diff/chaisemartin_dhaultfoeuille.py:L212, and the survey TWFE diagnostic now uses w_gt throughout the FE regressions, normalization, and sigma_fe shares in diff_diff/chaisemartin_dhaultfoeuille.py:L4843.
  • Those fixes are backed by targeted regressions in tests/test_survey_dcdh.py:L733 and tests/test_survey_dcdh.py:L791.
  • The main survey deviations from the dCDH papers, psi_i = U[g] * (w_i / W_g) and the survey+bootstrap fallback, are explicitly documented in docs/methodology/REGISTRY.md:L651, so they are P3-informational rather than defects under the review rules.
  • I did not find any new unmitigated P0/P1 issues in the changed estimator, inference, or HonestDiD propagation paths.
  • I did find two P3 items: the standalone twowayfeweights() helper still does not mirror fit()’s replicate-weight rejection, and two touched docs still describe the old NotImplementedError gating behavior.

Methodology

Code Quality

  • Severity P3. Impact: fit() explicitly rejects replicate-weight survey designs in diff_diff/chaisemartin_dhaultfoeuille.py:L656, but twowayfeweights() only checks weight_type and otherwise proceeds in diff_diff/chaisemartin_dhaultfoeuille.py:L5055. That leaves the new fit/helper parity contract incomplete for unsupported replicate-weight inputs. Concrete fix: add the same NotImplementedError guard to twowayfeweights(), or explicitly document that the helper ignores replicate columns and is full-sample-pweight-only.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: two touched docs still describe the pre-PR gating behavior. docs/llms-full.txt:L273 still says “Phase 3 parameters raise NotImplementedError,” and ROADMAP.md:L194 still says the Phase 2/3 fit() params raise until implemented, even though survey support and the other Phase 3 surfaces are now live and docs/methodology/REGISTRY.md:L630 reflects that. Concrete fix: update both lines to say that only aggregate remains gated.
  • Severity P3. Impact: the new tests cover the repaired TWFE/zero-weight paths and the non-pweight helper rejection in tests/test_survey_dcdh.py:L714, tests/test_survey_dcdh.py:L733, and tests/test_survey_dcdh.py:L791, but there is still no negative test for a replicate-weight survey_design passed to twowayfeweights(). Concrete fix: add a helper-level regression that asserts the same rejection or explicitly documented behavior as fit().

@igerber

igerber commented Apr 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 8b6a6b1fcfd7191dcfcfa6240522fb72e6b065cb


Overall Assessment

✅ Looks good

Static review only. I could not run the test suite in this shell because pytest is unavailable, but the touched Python files parse successfully. The prior re-review items appear addressed: zero-weight rows are filtered before weighted cell validation/counting in diff_diff/chaisemartin_dhaultfoeuille.py:L212-L247, the survey TWFE diagnostic now uses w_gt consistently in diff_diff/chaisemartin_dhaultfoeuille.py:L4843-L4949, and twowayfeweights() now rejects replicate-weight designs in parity with fit() at diff_diff/chaisemartin_dhaultfoeuille.py:L5055-L5080.

Executive Summary

  • The methodology-sensitive survey extensions that depart from the dCDH papers are explicitly documented in the registry, especially the IF expansion and survey+bootstrap fallback at docs/methodology/REGISTRY.md:L651-L653, so I did not treat them as defects.
  • The survey-aware heterogeneity path is also documented and the implementation matches that stated contract in docs/methodology/REGISTRY.md:L618-L618 and diff_diff/chaisemartin_dhaultfoeuille.py:L3232-L3408.
  • HonestDiD now correctly receives df_survey from dCDH results via diff_diff/honest_did.py:L970-L988.
  • I did not find any unmitigated P0/P1 issue in the changed estimator, weighting, variance/SE, control-pool logic, or inference surfaces.
  • One P3 documentation issue remains: the dCDH results dataclass docstring still says survey_metadata is always None and survey support is deferred.

Methodology

  • Severity P3. Impact: The main survey variance path (psi_i = U[g] * (w_i / W_g)), the survey-aware heterogeneity regression, and the survey+bootstrap fallback are library extensions beyond the dCDH source papers, but they are explicitly documented in docs/methodology/REGISTRY.md:L618-L618 and docs/methodology/REGISTRY.md:L651-L653. Under the review rules, these are informational rather than defects. Concrete fix: none.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

@igerber

igerber commented Apr 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 2e6cbe278d493f23f727a5296a215e5c88ee26e7


Overall Assessment

⚠️ Needs changes

Static review only. This shell does not have numpy/pytest, so I could not execute the test suite, but the touched Python files parse in memory. The prior re-review’s docstring issue is fixed in diff_diff/chaisemartin_dhaultfoeuille_results.py.

Executive Summary

Methodology

  • Severity P1. Impact: the new survey subpopulation contract is only partially implemented. _validate_and_aggregate_to_cells() validates group, time, treatment, and outcome on the full raw frame before the zero-weight filter runs, and later survey-backed validators for controls, trends_nonparam, and heterogeneity also scan full data. That means rows that are explicitly zero-weighted and meant to be out-of-sample can still raise ValueError, which breaks legitimate SurveyDesign.subpopulation() usage rather than ignoring excluded rows. Concrete fix: build a positive-weight mask once when survey_design is active and apply all content validation/time-invariance checks to that effective sample, not the raw frame; add regressions with zero-weight excluded rows containing NaN/non-numeric outcome, treatment, controls, set IDs, and heterogeneity values. See diff_diff/chaisemartin_dhaultfoeuille.py and tests/test_survey_dcdh.py.
  • Severity P3. Impact: the survey IF expansion, survey-aware heterogeneity regression, and survey+bootstrap fallback are explicitly documented in the methodology registry as library extensions/deviations, so I am not treating those as defects under the review rubric. Concrete fix: none. See docs/methodology/REGISTRY.md.
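
The positive-weight-mask fix proposed for the P1 can be sketched as follows. The function, column names, and error message are all hypothetical placeholders for the suggested contract, not the library's validators:

```python
import pandas as pd

def validate_effective_sample(frame, weight_col="w", outcome_col="y"):
    """Validate only weights > 0 rows, so excluded rows cannot abort the fit."""
    eff = frame[frame[weight_col] > 0]
    if eff[outcome_col].isna().any():
        raise ValueError("missing outcome in effective sample")
    return eff

df = pd.DataFrame({
    "y": [1.0, float("nan"), 2.0],
    "w": [1.0, 0.0, 2.0],  # the NaN outcome sits on a zero-weight excluded row
})
eff = validate_effective_sample(df)
# the zero-weight NaN row is ignored; two valid rows survive validation
```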

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. I also did not find an existing mitigation entry for the two P1 items under TODO.md.

Security

  • No findings.

Documentation/Tests

  • Severity P1. Impact: the new survey test file covers core survey paths, but it does not exercise survey_design together with trends_linear=True, trends_nonparam=..., or design2=True. Because survey_design is a new public parameter and those branches still run through the same survey-weighted cell setup and inference surfaces, this leaves real interaction risk untested. Concrete fix: add one survey-backed regression each for trends_linear, trends_nonparam, and design2 that asserts the documented output contract (linear_trends_effects, trimmed event-study/placebo behavior, and design2_effects respectively) and survey-aware inference/metadata behavior where applicable. See diff_diff/chaisemartin_dhaultfoeuille.py and tests/test_survey_dcdh.py.

Path to Approval

  1. Apply the zero-weight subpopulation contract consistently: when survey_design is active, run outcome/treatment/group/time validation and the controls/trends_nonparam/heterogeneity validators on weights > 0 rows only.
  2. Add regression tests where excluded zero-weight rows contain invalid analysis values, proving they no longer abort fit().
  3. Add survey-backed tests for trends_linear=True, trends_nonparam=..., and design2=True, drop_larger_lower=False.

@igerber
Owner Author

igerber commented Apr 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: c1a590227ddc92c20e75fec641cef9209fdca6ea


Overall Assessment

⚠️ Needs changes

Static review only. I could not run the test suite in this environment because pytest is unavailable, and /usr/bin/python also lacks numpy, so the assessment below is from code inspection.

Executive Summary

  • The prior re-review’s survey interaction coverage gap for trends_linear, trends_nonparam, and design2 appears resolved by the new tests in tests/test_survey_dcdh.py:L895, tests/test_survey_dcdh.py:L922, and tests/test_survey_dcdh.py:L973.
  • The main zero-weight subpopulation fix is only partial. Outcome/treatment/set/heterogeneity validation now respects positive weights, but the new DID^X controls path still breaks when excluded zero-weight rows are present.
  • The methodology-sensitive survey deviations themselves are documented in docs/methodology/REGISTRY.md:L618, docs/methodology/REGISTRY.md:L651, docs/methodology/REGISTRY.md:L652, and docs/methodology/REGISTRY.md:L653, so I am not counting those as defects.
  • I did not find a new NaN/Inf inference anti-pattern in the touched analytical surfaces.

Methodology

  • Severity P3 | Impact: The observation-level IF expansion, survey-aware heterogeneity WLS/TSL path, and the survey+bootstrap fallback are explicitly documented in docs/methodology/REGISTRY.md:L618, docs/methodology/REGISTRY.md:L651, docs/methodology/REGISTRY.md:L652, and docs/methodology/REGISTRY.md:L653. Under the review rubric, these are documented deviations/extensions, not defects. Concrete fix: none.

Code Quality

  • Severity P1 | Impact: survey_design combined with controls still violates the zero-weight subpopulation contract. The controls branch validates only positive-weight rows at diff_diff/chaisemartin_dhaultfoeuille.py:L738, but then writes those shorter control arrays back into an unfiltered x_agg_input built from the full frame at diff_diff/chaisemartin_dhaultfoeuille.py:L764 and diff_diff/chaisemartin_dhaultfoeuille.py:L765. Any legitimate SurveyDesign.subpopulation() / zero-weight excluded row will therefore raise before covariate aggregation instead of being ignored, which breaks the newly documented survey DID^X path. Concrete fix: build x_agg_input from the same positive-weight mask, or keep full-length control arrays and scope only validation to the mask; in either case, centralize one effective-sample view and reuse it throughout the survey subpopulation-sensitive code.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. I did not find a mitigating entry for the P1 above under TODO.md:L51, so it remains unmitigated for assessment purposes.

Security

No findings.

Documentation/Tests

  • Severity P2 | Impact: The new survey test file exercises controls only on all-positive-weight data at tests/test_survey_dcdh.py:L528, so it cannot catch the broken zero-weight DID^X interaction above. The added zero-weight regressions at tests/test_survey_dcdh.py:L816, tests/test_survey_dcdh.py:L845, and tests/test_survey_dcdh.py:L864 never enter the controls branch. Concrete fix: add a regression for survey_design + controls=["x"] + zero-weight excluded row, ideally with a missing or invalid control value on the excluded row, and assert that fit() succeeds.

Path to Approval

  1. Fix the DID^X controls branch so the positive-weight filter and the frame used for covariate aggregation stay aligned under survey_design.
  2. Add a regression test for survey_design + controls + zero-weight excluded row, preferably with malformed excluded control data to pin the subpopulation contract explicitly.

@igerber
Owner Author

igerber commented Apr 17, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 36243059c0eb0640cbe5813af72b6d4e32103340


Overall Assessment

✅ Looks good

Static review only. I could not run the test suite here because pytest is unavailable and the default python lacks numpy, so this assessment is from diff inspection and local source review.

Executive Summary

Methodology

Code Quality

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. I did not see a new correctness limitation that needed TODO.md tracking.

Security

  • No findings.

Documentation/Tests

igerber and others added 11 commits April 17, 2026 06:04
- Weighted cell aggregation in _validate_and_aggregate_to_cells()
- Survey resolution via _resolve_survey_for_fit() with pweight-only
  and group-constant validation
- IF expansion from group to observation level for TSL variance
- Survey-aware SE at all call sites (overall, joiners, leavers,
  multi-horizon, placebos) via _compute_se() dispatcher
- Bootstrap + survey warning (PSU-level deferred)
- 12 new tests in test_survey_dcdh.py
- Documentation updates: REGISTRY.md, ROADMAP.md, llms-full.txt,
  choosing_estimator.rst

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…torize IF expansion

- Remove _validate_group_constant_survey() call - the IF expansion
  psi_i = U[g] * (w_i / W_g) handles observation-level variation in
  weights, strata, and PSU within groups correctly
- Vectorize _survey_se_from_group_if using np.bincount + np.unique
  (was Python loops over all observations)
- Replace test_rejects_varying_weights_within_group with two positive
  tests: varying weights accepted, and varying weights change ATT
  (time-varying noise to survive first-differencing)
- Remove unused survey_weight_type variable

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
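The vectorized IF expansion this commit describes can be sketched as follows; array names are illustrative, and the group-level IF values are assumed to be ordered by sorted unique group ID as np.unique returns them:

```python
import numpy as np

def expand_group_if(U, group_ids, w):
    """psi_i = U[g(i)] * w_i / W_g(i): each group-level IF value is spread
    over that group's observations in proportion to survey weight, so the
    within-group psi sum reproduces U[g] exactly."""
    _, inv = np.unique(group_ids, return_inverse=True)   # groups in sorted order
    W_g = np.bincount(inv, weights=w)                    # total weight per group
    return U[inv] * w / W_g[inv]

U = np.array([2.0, -1.0])                  # one IF value per group, sorted-group order
groups = np.array(["a", "a", "b", "b"])
w = np.array([1.0, 3.0, 2.0, 2.0])
psi = expand_group_if(U, groups, w)
assert np.allclose(psi, [0.5, 1.5, -0.5, -0.5])
```

The np.bincount over the inverse index replaces the earlier per-observation Python loops with two vectorized passes.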
- P1-A: Scale U_centered by 1/divisor before survey IF expansion.
  dCDH IFs are numerator-scale (U.sum() == N_S * DID_M), but
  compute_survey_if_variance() expects estimator-scale psi.
- P1-B: Zero-weight cells (w_gt <= 0) now treated as absent by
  setting n_gt=0, preventing NaN propagation into estimates.
- P2: Add SE-pinning test (uniform weights + PSU=group matches
  plug-in SE) and zero-weight cell exclusion test.
- P3: Delete unused _validate_group_constant_survey() from survey.py
  that contradicted the supported within-group variation contract.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ey propagation

- P1-A: Drop zero-weight rows entirely from cell DataFrame (instead
  of just zeroing n_gt) so ragged-panel validator doesn't see them
- P1-B: Survey-weighted covariate aggregation in DID^X path
  (sum(w*x)/sum(w) instead of unweighted mean)
- P1-C: Thread _df_survey to all remaining safe_inference() calls:
  bootstrap t-stats, normalized effects, cost-benefit delta, placebo
  bootstrap t-stats
- P3: Fix REGISTRY overview paragraph (was still saying survey deferred)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
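The survey-weighted covariate aggregation from P1-B above (sum(w*x)/sum(w) per cell instead of the unweighted mean) reduces to a weighted groupby; column names here are illustrative:

```python
import numpy as np
import pandas as pd

# Survey-weighted cell mean of a control x within each (g, t) cell.
df = pd.DataFrame({
    "g": [1, 1, 1, 1],
    "t": [0, 0, 1, 1],
    "x": [1.0, 3.0, 2.0, 4.0],
    "w": [1.0, 3.0, 2.0, 2.0],
})
wx = (df["x"] * df["w"]).groupby([df["g"], df["t"]]).sum()
wsum = df["w"].groupby([df["g"], df["t"]]).sum()
x_cell = wx / wsum
assert np.isclose(x_cell.loc[(1, 0)], 2.5)   # (1*1 + 3*3) / 4
assert np.isclose(x_cell.loc[(1, 1)], 3.0)   # (2*2 + 4*2) / 4
```

With uniform weights this collapses to the unweighted mean, which is why the non-survey path is unchanged.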
- P0: delta overall surface now uses _df_survey instead of df=None at both
  safe_inference sites (primary delta path + placebo NaN-SE fallback).
  This makes overall_* under L_max>=2 use survey-t inference and respects
  safe_inference's df<=0 NaN guard.
- P1: HonestDiD dCDH extraction now propagates df_survey from
  survey_metadata (mirrors CS pattern). Survey-backed dCDH HonestDiD
  bounds now use survey-aware critical values.
- P2: Add 4 regressions (survey delta t-matches-reported, t-vs-z differs,
  survey+controls, survey+honest_did df propagation). Update stale comment
  in test_dcdh_extraction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- P1 #1: _compute_heterogeneity_test now accepts obs_survey_info and
  runs survey-aware WLS + Binder TSL IF when survey_design is active.
  Point estimate via solve_ols(weights=W_elig, weight_type='pweight');
  group-level IF ψ_g[X] = inv(X'WX)[1,:] @ x_g * W_g * r_g, expanded
  to obs-level via w_i/W_g ratio, then compute_survey_if_variance for
  stratified/PSU variance. safe_inference uses df_survey.
  Rank-deficiency short-circuits to NaN to avoid point-estimate/IF
  mismatch between solve_ols's R-style drop and pinv's minimum-norm.
- P1 #2: twowayfeweights() now accepts Optional[SurveyDesign]. When
  provided, resolves weights via _resolve_survey_for_fit and passes
  them to _validate_and_aggregate_to_cells, restoring fit-vs-helper
  parity under survey-backed inputs. fweight/aweight rejected.
- P3: REGISTRY updates — TWFE parity sentence now includes survey;
  heterogeneity Note documents the TSL IF mechanics and library
  extension disclaimer; checklist line-651 lists survey-aware
  surfaces; new survey+bootstrap-fallback Note after line 652.
- P2: 5 new regression tests in test_survey_dcdh.py:
  TestSurveyHeterogeneity (uniform-weights match, non-uniform beta
  change, t-dist df_survey) and TestSurveyTWFEParity (fit-vs-helper
  match, non-pweight rejection).

All 254 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- P1 #1: _compute_twfe_diagnostic now uses cell_weight (w_gt when
  available, else n_gt) for FE regressions, the normalization
  denominator, contribution weights, and the Corollary 1 observation
  shares. On survey-backed inputs the outputs now match the
  observation-level pweighted TWFE estimand; non-survey path is
  byte-identical.
- P1 #2: Zero-weight rows are dropped before the groupby in
  _validate_and_aggregate_to_cells when weights are provided, so that
  d_min/d_max/n_gt reflect the effective sample. Prevents zero-weight
  subpopulation rows from tripping the fuzzy-DiD guard or inflating
  downstream n_gt counts.
- P2: 2 new regression tests in test_survey_dcdh.py —
  TestSurveyTWFEOracle.test_survey_twfe_matches_obs_level_pweighted_ols
  verifies beta_fe matches an observation-level pweighted OLS under
  survey (would fail if n_gt was still used), and
  TestZeroWeightSubpopulation.test_mixed_zero_weight_row_excluded_from_validation
  verifies an injected zero-weight row with opposite treatment value
  doesn't trip the within-cell constancy check.

All 256 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- twowayfeweights() now raises NotImplementedError when the resolved
  survey design carries replicate weights, matching fit()'s contract.
  The fit/helper parity promise now holds for unsupported inputs too.
- docs/llms-full.txt and ROADMAP.md no longer claim that Phase 3
  parameters raise NotImplementedError; both now correctly note that
  only `aggregate` remains gated.
- Added a helper-level regression that pins the replicate-weight
  rejection, complementing the existing fit()-level test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The results dataclass docstring still described survey_metadata as
'always None' and survey integration as deferred. survey_design is now
implemented and results.survey_metadata is populated whenever a
SurveyDesign is passed to fit(). Docstring now describes the populated
field and its role in survey-aware inference and HonestDiD propagation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…vey branch tests

- P1 #1: The R5 zero-weight filter only ran inside the cell aggregation
  step, after the NaN/coercion checks for group/time/treatment/outcome.
  Moved the filter to the very top of _validate_and_aggregate_to_cells
  so validation only sees the effective sample. fit()'s controls,
  trends_nonparam, and heterogeneity blocks now also scope their
  NaN/time-invariance checks to positive-weight rows when
  survey_weights is active. Legitimate SurveyDesign.subpopulation()
  inputs with NaN in excluded rows now fit cleanly. TSL variance path
  is unchanged (zero-weight obs still contribute zero psi).
- P2: 5 new regression tests in test_survey_dcdh.py —
  TestZeroWeightSubpopulation now covers NaN outcome and NaN het
  columns in excluded rows; new TestSurveyTrendsLinear /
  TestSurveyTrendsNonparam / TestSurveyDesign2 classes exercise
  survey_design combined with those previously-untested branches.

All 262 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R8's controls-block fix scoped NaN/Inf validation to the positive-weight
subset (shorter data_controls) but then assigned those shorter arrays
into an x_agg_input built from the full-length frame, causing a
length-mismatch on any SurveyDesign.subpopulation() / zero-weight
excluded row before covariate aggregation could run.

Root-caused fix: derive both the validation frame AND the aggregation
frame from the same positive-weight effective sample (data_eff,
survey_weights_eff). Zero-weight rows are genuinely out-of-sample
throughout the DID^X path now. Non-survey fits unchanged.

Added TestZeroWeightSubpopulation.test_zero_weight_row_with_nan_control
pinning the subpopulation contract for the DID^X path — injects a
zero-weight row with NaN control value and asserts fit() succeeds.

All 263 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- diff_diff/guides/llms-practitioner.txt: add reversible-treatment branch
  to the Step 4 decision tree pointing at ChaisemartinDHaultfoeuille
  (alias DCDH), and a short Step 5 example showing the
  survey_design + L_max + TSL fallback for dCDH. The practitioner guide
  previously had zero dCDH coverage despite the full API being live.
- diff_diff/guides/llms-full.txt: fix stale SyntheticDiDResults table
  (line 1024) — variance_method now accepts "bootstrap", "jackknife",
  or "placebo" (jackknife was added for synthdid::vcov parity but the
  table only listed bootstrap/placebo).

Bundled-guide paths only. The old docs/llms-*.txt were moved to
diff_diff/guides/ in main (PR #310); these edits target the new
canonical location consumed by get_llm_guide().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber igerber force-pushed the feature/dcdh-survey branch from 3624305 to d045987 Compare April 17, 2026 10:10
@igerber
Owner Author

igerber commented Apr 17, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: d0459878da098c4467398f68846bda0413359e7d


Overall Assessment

⚠️ Needs changes

Static review only. I could not run pytest here because pytest is not installed and the local python environment is missing numpy.

Executive Summary

  • The prior re-review items look resolved: the zero-weight DID^X effective-sample fix is now in chaisemartin_dhaultfoeuille.py:732, and dCDH HonestDiD now propagates df_survey in honest_did.py:970, with matching regressions at test_survey_dcdh.py:864 and test_survey_dcdh.py:552.
  • The registry now documents the high-level survey IF expansion and the survey+bootstrap fallback in REGISTRY.md:651, REGISTRY.md:652, and REGISTRY.md:653, so those deviations themselves are informational rather than blockers.
  • Two new P1 methodology issues remain: the survey SE code for DID_l / dynamic placebos and for the Lemma 7 heterogeneity regression spreads each group-level IF across all rows in the group, not the cell-period rows that actually enter the estimand.
  • A separate P1 edge-case issue remains: zero-weight excluded rows still leak into survey SE factorization through raw group_ids, so valid subpopulation inputs can still fail on mixed-type / missing group IDs.
  • The new tests materially improve coverage for zero-weight rows, trends_linear, trends_nonparam, design2, and HonestDiD, but they still keep PSU/strata constant within group, so the highest-risk new survey path is not exercised.

Methodology

Affected methods: survey-backed DID_l, DID_l^{pl}, any top-level overall_* surface derived from them when L_max >= 1, and the survey-backed Lemma 7 heterogeneity regression. The dynamic paper’s variance object is time-indexed, with U^G_{g,\ell} built from period-specific coefficients on Y_{g,t}, and Binder-style TSL is meant to linearize the estimator through its actual estimating contributions. (nber.org)

  • Severity P1 | Impact: Even accepting the documented survey-extension note in REGISTRY.md:652, the implementation drops the paper’s time dimension before variance estimation. fit() stores only raw group_ids and weights in _obs_survey_info, and _survey_se_from_group_if() then expands each group IF as U[g] * (w_i / W_g) across every row in that group in chaisemartin_dhaultfoeuille.py:4728. That helper is what supplies survey SEs for multi-horizon effects in chaisemartin_dhaultfoeuille.py:1531 and placebo horizons in chaisemartin_dhaultfoeuille.py:1658. Because the paper’s U^G_{g,\ell} is built from period-specific λ_{g,t,\ell} Y_{g,t} terms, rows from unused periods should not be able to change a horizon-specific SE just by moving across PSU/strata; here they can. Concrete fix: keep per-row time or cell IDs in _obs_survey_info and expand the exact (g,t) coefficients onto observations as a_{g,t} * w_i / W_{g,t} before calling compute_survey_if_variance(). (nber.org)
  • Severity P1 | Impact: The survey heterogeneity path has the same defect. _compute_heterogeneity_test() defines the horizon-l regression on S_g * (Y_{g,F_g-1+l} - Y_{g,F_g-1}) in chaisemartin_dhaultfoeuille.py:3223 and builds dep_var from only ref_idx / out_idx in chaisemartin_dhaultfoeuille.py:3303, but the survey branch again allocates psi_g to all rows in the group in chaisemartin_dhaultfoeuille.py:3415. That makes heterogeneity_effects[l]["se"] and its t-based inference depend on periods that never enter the Lemma 7 regression. Concrete fix: allocate the WLS score only across observations in the reference and outcome cells, or store exact per-cell WLS score contributions and linearize those.
  • Severity P3 | Impact: The existence of the survey IF expansion and the survey+bootstrap fallback are explicitly documented in REGISTRY.md:651, REGISTRY.md:652, and REGISTRY.md:653, so I did not count those high-level deviations themselves as defects. Concrete fix: none.
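The cell-period allocator proposed in the P1 fixes above can be sketched as follows; this is the reviewer's suggested shape (psi_i = a_{g,t} * w_i / W_{g,t}), not the current implementation, and the per-cell coefficients are assumed ordered by sorted cell ID:

```python
import numpy as np

def expand_cell_if(a_gt, cell_ids, w):
    """Spread each cell-level coefficient a_{g,t} only over that (g, t)
    cell's rows. Rows in cells that do not enter the horizon-l estimand
    (a_{g,t} = 0) receive zero IF mass, so moving them across PSUs or
    strata cannot change the horizon-specific SE."""
    _, inv = np.unique(cell_ids, return_inverse=True)
    W_gt = np.bincount(inv, weights=w)       # total weight per cell
    return a_gt[inv] * w / W_gt[inv]

# Two cells of one group; only the first enters the horizon-l estimand.
cells = np.array(["g1_t0", "g1_t0", "g1_t1"])
a = np.array([1.0, 0.0])                     # per-cell coefficients, sorted-cell order
w = np.array([2.0, 2.0, 5.0])
psi = expand_cell_if(a, cells, w)
assert np.allclose(psi, [0.5, 0.5, 0.0])     # off-horizon row gets no IF mass
```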

Code Quality

  • Severity P1 | Impact: Zero-weight excluded rows are correctly skipped during cell validation, but they are not fully sanitized out of the survey SE path. _obs_survey_info still captures raw data[group] in chaisemartin_dhaultfoeuille.py:704, and _survey_se_from_group_if() immediately runs np.unique(group_ids, return_inverse=True) on that raw array in chaisemartin_dhaultfoeuille.py:4772. With string-keyed groups, a zero-weight row whose group is NaN or otherwise non-comparable can still fail inside SE factorization even though the row is supposed to be out-of-sample under SurveyDesign.subpopulation(). The new zero-weight tests cover bad outcome, control, and heterogeneity values, but not bad group IDs in test_survey_dcdh.py:813, test_survey_dcdh.py:845, test_survey_dcdh.py:864, and test_survey_dcdh.py:886. Concrete fix: build the group-total factorization from a positive-weight mask, or use a dict/groupby accumulation that ignores zero-weight rows and then zero-pads psi back to full survey length; add a regression with string group IDs plus a zero-weight row whose group is NaN.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. I did not see a new TODO.md entry that would mitigate the unaddressed P1 issues above.

Security

No findings.

Documentation/Tests

  • Severity P2 | Impact: The implementation explicitly claims the proportional IF expansion “handles observation-level variation in weights, strata, and PSU within groups” in chaisemartin_dhaultfoeuille.py:669, but the new fixture still assigns strata and PSU constant within group in test_survey_dcdh.py:32, and the only within-group variation that is exercised is weights in test_survey_dcdh.py:267. That leaves the most methodology-sensitive new branch untested. Concrete fix: add a survey fixture with within-group varying PSU/strata and assert that off-horizon rows do not change the reported SE once the cell-level IF expansion is corrected.

Path to Approval

  1. Replace the current group-wide survey IF allocator with a cell-period allocator for DID_l / DID_l^{pl} / top-level overall_* surfaces on L_max >= 1, so only the (g,t) cells that actually enter the estimator receive IF mass.
  2. Apply the same cell-aware score allocation to the survey heterogeneity path, so heterogeneity_effects[l]["se"] is driven only by the reference and outcome cells used in the horizon-l WLS regression.
  3. Sanitize zero-weight excluded rows before group-ID factorization in _survey_se_from_group_if() and add a regression with string group IDs plus a zero-weight NaN group row.
  4. Add tests with within-group varying PSU/strata and with off-horizon rows moved across PSUs to pin the corrected survey-SE contract.

- P1 #1/#2: Add _validate_group_constant_strata_psu() helper and call
  it from fit() after the weight_type/replicate-weights checks. The
  dCDH IF expansion psi_i = U[g] * (w_i / W_g) treats each group as
  the effective sampling unit; when strata or PSU vary within group it
  silently spreads horizon-specific IF mass across observations in
  different PSUs, contaminating the stratified-PSU variance. Walk back
  the overstated claim at the old line 669 comment to match. Within-
  group-varying weights remain supported.
- P1 #3: _survey_se_from_group_if now filters zero-weight rows before
  np.unique/np.bincount so NaN / non-comparable group IDs on excluded
  subpopulation rows cannot crash SE factorization. psi stays full-
  length with zeros in excluded positions to preserve alignment with
  resolved.strata / resolved.psu inside compute_survey_if_variance.
- REGISTRY.md line 652 Note updated: explicitly states the
  within-group-constant strata/PSU requirement and the
  within-group-varying weights support.
- Tests: new TestSurveyWithinGroupValidation class (4 tests — rejects
  varying PSU, rejects varying strata, accepts varying weights, and
  ignores zero-weight rows during the constancy check) plus
  TestZeroWeightSubpopulation.test_zero_weight_row_with_nan_group_id.

All 268 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
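The within-group constancy check this commit adds can be sketched with a groupby-nunique pass; the helper name and error text here are illustrative, not the library's exact implementation:

```python
import pandas as pd

def validate_group_constant(df, group, cols):
    """Reject survey designs where any of `cols` (e.g. strata, PSU)
    takes more than one value within a group, since the dCDH IF
    expansion treats each group as the effective sampling unit."""
    for col in cols:
        n_levels = df.groupby(group)[col].nunique(dropna=False)
        bad = n_levels[n_levels > 1]
        if not bad.empty:
            raise ValueError(
                f"{col!r} must be constant within each {group!r}; "
                f"varies in groups: {list(bad.index)}"
            )

df = pd.DataFrame({"g": [1, 1, 2], "psu": [10, 11, 20], "stratum": [1, 1, 2]})
validate_group_constant(df, "g", ["stratum"])     # passes: constant within group
try:
    validate_group_constant(df, "g", ["psu"])     # group 1 spans PSUs 10 and 11
except ValueError as e:
    msg = str(e)
assert "psu" in msg
```

Restricting the check to the zero-weight-filtered effective sample (as the test list above describes) keeps it consistent with the subpopulation contract.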
@igerber
Owner Author

igerber commented Apr 17, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 0266d59ac66cb2e947fee3929121353a830aad38


Overall Assessment

⚠️ Needs changes

Static review only. I could not run pytest here because the local Python environment is missing numpy.

Executive Summary

  • The prior zero-weight SE-factorization issue looks fixed: _survey_se_from_group_if() now removes zero-weight rows before group-ID factorization, so excluded NaN / mixed-type group IDs no longer reach np.unique diff_diff/chaisemartin_dhaultfoeuille.py:4826-4852.
  • The prior HonestDiD df_survey propagation issue also looks fixed: dCDH results now pass survey_metadata.df_survey into HonestDiD extraction diff_diff/honest_did.py:970-977.
  • The previous within-group PSU/strata contamination case is partially addressed: mixed PSU/strata within a group are now rejected, and that restriction is documented diff_diff/chaisemartin_dhaultfoeuille.py:669-677, docs/methodology/REGISTRY.md:651-653.
  • [Newly identified] weights-only and stratified-no-PSU dCDH survey fits still contradict that same methodology note: the implementation says the group is the effective sampling unit, but the shared survey variance code falls back to implicit per-observation PSUs when psu is omitted. That affects SEs, df_survey, and all t-based inference surfaces on a supported input class.
  • The new tests materially improve coverage, but they currently codify the conflicting no-PSU behavior instead of catching it.
  • The high-level survey IF extension and the survey+bootstrap fallback are documented in REGISTRY.md, so I did not count those deviations themselves as defects.

Methodology

  • Severity P1 | Impact: [Newly identified] Supported dCDH survey fits with SurveyDesign(weights=...) and no explicit PSU still use implicit per-observation PSUs, even though the new dCDH methodology note and fit-time validation say the survey IF expansion treats each group as the effective sampling unit. fit() stores the resolved design unchanged diff_diff/chaisemartin_dhaultfoeuille.py:705-713; _survey_se_from_group_if() and the heterogeneity survey path expand group IFs to observation level and pass that unchanged design to compute_survey_if_variance() diff_diff/chaisemartin_dhaultfoeuille.py:4796-4858, diff_diff/chaisemartin_dhaultfoeuille.py:3383-3433; and the shared survey helper explicitly treats psu=None as per-observation PSUs diff_diff/survey.py:1201-1218. That directly conflicts with the estimator-specific registry note docs/methodology/REGISTRY.md:651-652, and it changes overall_se, event-study/placebo SEs, heterogeneity SEs, survey_metadata.df_survey, and downstream t-based inference/HonestDiD critical values on no-PSU designs. Concrete fix: create a dCDH-specific effective survey design that injects group as the PSU whenever survey_design.psu is None (preserving strata/FPC), use that effective design for every compute_survey_if_variance() call, and compute survey_metadata.df_survey from the same effective design; otherwise reject no-PSU dCDH survey designs explicitly.
  • Severity P3 | Impact: The survey IF expansion itself (psi_i = U[g] * (w_i / W_g)) and the group-level bootstrap fallback are explicitly documented dCDH library extensions, so those deviations are informational rather than blockers docs/methodology/REGISTRY.md:651-653. Concrete fix: none.

Code Quality

No additional findings beyond the methodology issue above.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No mitigating TODO.md entry covers the P1 above; the existing dCDH TODO items are unrelated TODO.md:55-60.

Security

No findings.

Documentation/Tests

  • Severity P2 | Impact: The new tests currently lock in the conflicting no-PSU contract instead of guarding the documented dCDH one. test_uniform_survey_se_matches_plugin explicitly states that, without psu="group", the survey path should treat each observation as independent tests/test_survey_dcdh.py:388-395, and many new survey-path smoke tests exercise SurveyDesign(weights="pw") with no oracle for unused-period row-count invariance tests/test_survey_dcdh.py:347-358, tests/test_survey_dcdh.py:937-944, tests/test_survey_dcdh.py:1015-1026. Concrete fix: replace/add regressions for weights-only and stratified-no-PSU fits that change unused-period row counts within a group while holding cell means fixed, then assert unchanged SEs/df_survey; if no-PSU designs are made unsupported instead, add explicit rejection tests and update docs accordingly.

Path to Approval

  1. Make dCDH build an effective survey design with psu=group whenever survey_design.psu is None, and derive survey_metadata.df_survey from that effective design.
  2. Use that same effective design in every dCDH survey inference path: _survey_se_from_group_if(), the heterogeneity survey branch, and any downstream consumer of survey_metadata.df_survey such as HonestDiD.
  3. Add regression tests for no-PSU survey fits showing that unused-period row duplication within a group does not change overall_se, event-study/placebo SEs, heterogeneity SEs, or df_survey once cell means and group IFs are unchanged; if you choose to reject no-PSU dCDH designs, replace those with explicit ValueError tests and align the docs.
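Step 1 above amounts to a small wrapper around the design; a minimal sketch, using a stand-in dataclass since the real SurveyDesign fields are not shown in this diff:

```python
from dataclasses import dataclass, replace
from typing import Optional
import numpy as np

@dataclass(frozen=True)
class Design:
    """Stand-in for the library's SurveyDesign; field names are assumptions."""
    weights: np.ndarray
    psu: Optional[np.ndarray] = None
    strata: Optional[np.ndarray] = None

def effective_design(design, group_ids):
    """When no PSU is supplied, inject the dCDH group as the PSU so the
    shared variance helper treats the group as the sampling unit instead
    of falling back to implicit per-observation PSUs. Explicit PSU and
    strata/FPC settings pass through unchanged."""
    if design.psu is None:
        return replace(design, psu=np.asarray(group_ids))
    return design

d = Design(weights=np.ones(4))
eff = effective_design(d, np.array([1, 1, 2, 2]))
assert list(eff.psu) == [1, 1, 2, 2]   # group became the effective PSU
```

Deriving survey_metadata.df_survey from this effective design (rather than the raw one) then keeps fit(), the heterogeneity branch, and HonestDiD consistent.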
