
Add survey support to dCDH estimator #307

Open
igerber wants to merge 13 commits into main from feature/dcdh-survey

Conversation

@igerber

igerber commented Apr 16, 2026

Summary

  • Add survey support (pweight + strata/PSU/FPC) to ChaisemartinDHaultfoeuille via Taylor Series Linearization
  • Weighted cell aggregation: survey weights produce weighted cell means for the point estimate
  • IF expansion from group to observation level for design-based variance (library extension, documented in REGISTRY.md)
  • Survey-aware SE at all call sites: overall, joiners, leavers, multi-horizon, placebos
  • Bootstrap + survey emits warning (PSU-level bootstrap deferred)
  • Observation-level varying weights/PSU/strata within groups are supported
  • 13 new tests in test_survey_dcdh.py; 171 total dCDH tests pass
  • Documentation updates: REGISTRY.md, ROADMAP.md, llms-full.txt, choosing_estimator.rst
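
The weighted cell aggregation in the first bullet can be sketched as follows. This is an illustration of the idea only; the column names (`group`, `time`, `y`, `w`) and the helper are placeholders, not the library's API.

```python
import pandas as pd

df = pd.DataFrame({
    "group": [1, 1, 2, 2],
    "time": [0, 0, 0, 0],
    "y": [1.0, 3.0, 2.0, 4.0],
    "w": [1.0, 3.0, 2.0, 2.0],
})

def weighted_cell_means(frame):
    """Collapse rows to (group, time) cells with pweight-weighted outcome means."""
    tmp = frame.assign(wy=frame["y"] * frame["w"])
    g = tmp.groupby(["group", "time"])
    return (g["wy"].sum() / g["w"].sum()).rename("y_gt").reset_index()

cells = weighted_cell_means(df)
# group 1: (1*1 + 3*3) / 4 = 2.5; group 2: (2*2 + 2*4) / 4 = 3.0
```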

Methodology references (required if estimator / math changes)

  • Method name(s): ChaisemartinDHaultfoeuille survey extension
  • Paper / source link(s): de Chaisemartin & D'Haultfoeuille (2020, 2024); Binder (1983) for TSL variance
  • Any intentional deviations from the source (and why): The IF expansion psi_i = U[g] * (w_i / W_g) is a library extension not in the dCDH papers (the paper assumes iid sampling). Documented as a Note: in REGISTRY.md.
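
The documented IF expansion can be sketched in a few lines. This is a toy illustration of psi_i = U[g] * (w_i / W_g), assuming group contributions in a dict and one weight per observation; the function name and shapes are hypothetical, not the library's implementation.

```python
import numpy as np

def expand_group_if(U_g, weights, group_ids):
    """Spread each group contribution U[g] over its rows as psi_i = U[g] * w_i / W_g."""
    psi = np.empty_like(weights, dtype=float)
    for g, u in U_g.items():
        mask = group_ids == g
        psi[mask] = u * weights[mask] / weights[mask].sum()
    return psi

psi = expand_group_if(
    {0: 2.0, 1: -1.0},
    np.array([1.0, 3.0, 2.0, 2.0]),
    np.array([0, 0, 1, 1]),
)
# within-group sums recover U[g]: psi[:2].sum() == 2.0, psi[2:].sum() == -1.0
```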

Validation

  • Tests added/updated: tests/test_survey_dcdh.py (new, 13 tests), tests/test_chaisemartin_dhaultfoeuille.py (updated gate test)
  • Backtest / simulation / notebook evidence (if applicable): N/A

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

@github-actions

Overall Assessment

⚠️ Needs changes

Static review only. I could not execute the test suite in this shell because the available Python environment is missing numpy.

Executive Summary

  • Affected methods: the new survey extension for ChaisemartinDHaultfoeuille touches DID_M, DID_+, DID_-, multi-horizon DID_l, and multi-horizon placebo inference.
  • The new REGISTRY note documents the observation-level IF expansion as a library extension, so I am not treating the existence of that deviation as a defect by itself docs/methodology/REGISTRY.md:651.
  • P1: the analytical survey SE path appears to drop the N_S / N_l normalization and feeds numerator-scale dCDH group IFs into compute_survey_if_variance(), which is documented to expect estimator-scale psi.
  • P1: weighted cell aggregation has no zero-total-weight guard, but downstream availability/control logic still keys off raw n_gt, so zero-weight cells can be treated as valid switcher/control cells with undefined y_gt.
  • P2: the new tests do not pin survey SE scale or zero-weight-cell behavior, so both blocking issues can pass CI.
  • P3: survey.py now contains an unused helper whose docstring contradicts the new supported contract on within-group survey variation.

Methodology

The documented extension itself is acceptable; the blockers below are implementation errors inside that extension.

  • Severity: P1. Impact: the survey SE path is mis-scaled. dCDH’s own IF builders return numerator-scale contributions, not estimator-scale IFs: _compute_full_per_group_contributions() documents U.sum() == N_S * DID_M, and _compute_per_group_if_multi_horizon() documents sum(U_l) == N_l * DID_l diff_diff/chaisemartin_dhaultfoeuille.py:4242 diff_diff/chaisemartin_dhaultfoeuille.py:3686. _plugin_se() then divides by divisor to get the estimator SE diff_diff/chaisemartin_dhaultfoeuille.py:4494. But _compute_se() ignores divisor on the survey branch, and _survey_se_from_group_if() expands observations so that sum_i psi_i = U[g] before calling compute_survey_if_variance() diff_diff/chaisemartin_dhaultfoeuille.py:4535 diff_diff/chaisemartin_dhaultfoeuille.py:4558. The shared survey contract says compute_survey_if_variance() expects score/estimator-scale psi already normalized, analogous to w_i * eif_i / sum(w) diff_diff/survey.py:1504 docs/methodology/survey-theory.md:489. As written, the code is estimating the variance of N_* * estimand, not the estimand, across overall/joiners/leavers/event-study/placebo analytical outputs. Concrete fix: scale U_centered by 1 / divisor before expansion, or equivalently divide the final sqrt(variance) by divisor, and add a regression test on a hand-checkable panel.
  • Severity: P1. Impact: zero-weight cells are not handled safely. SurveyDesign.resolve() explicitly allows zero weights as long as not all weights are zero diff_diff/survey.py:173. The new weighted aggregation computes y_gt = _wy_sum / w_gt with no guard for w_gt <= 0 diff_diff/chaisemartin_dhaultfoeuille.py:217. Downstream, the estimator still builds N_mat from raw n_gt counts and uses N_mat > 0 for presence/A11/control-pool logic diff_diff/chaisemartin_dhaultfoeuille.py:1032 diff_diff/chaisemartin_dhaultfoeuille.py:2701. That means a cell with zero survey weight can still count as an observed joiner/leaver/control even though its weighted mean is undefined, which breaks the new survey control-availability logic and can propagate NaNs into estimates. Concrete fix: reject any (group, time) cell with w_gt <= 0 before pivoting, or redefine “present” everywhere on positive survey weight rather than raw observation count.
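
The two concrete fixes above can be sketched together. Both helpers are hypothetical stand-ins for the suggested changes, not the library's code: the first rescales numerator-scale group contributions by the relevant divisor before variance computation, the second guards the weighted cell mean against zero total weight.

```python
import numpy as np

def estimator_scale_if(U_centered, divisor):
    """Rescale numerator-scale group contributions to estimator scale."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    # Dividing psi by the divisor shrinks the implied variance by divisor**2,
    # so the SE targets the estimand rather than divisor * estimand.
    return np.asarray(U_centered, dtype=float) / divisor

def safe_cell_mean(wy_sum, w_gt):
    """Weighted cell mean with an explicit zero-total-weight guard."""
    if w_gt <= 0:
        return float("nan")  # treat the cell as absent, never divide by zero
    return wy_sum / w_gt

psi = estimator_scale_if([2.0, -1.0, 0.5], divisor=10.0)
# psi == [0.2, -0.1, 0.05]
```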

Code Quality

No material findings beyond the methodology issues above.

Performance

No material findings in the changed code.

Maintainability

  • Severity: P3. Impact: _validate_group_constant_survey() says dCDH survey columns “must not vary within groups” diff_diff/survey.py:914, but fit() now documents the opposite and the new tests assert that varying weights within groups are supported diff_diff/chaisemartin_dhaultfoeuille.py:653 tests/test_survey_dcdh.py:267. Because the helper is unused, this is not a current runtime bug, but it is dead contradictory guidance that can cause a future regression. Concrete fix: delete the helper or rewrite it to match the actual supported contract.

Tech Debt

  • Severity: P3. Impact: these blockers are not mitigated by existing tracked debt. The dCDH entries in TODO.md cover placebo SE and parity limitations, not survey SE scaling or zero-weight-cell handling TODO.md:59. Concrete fix: fix these in the PR; a TODO entry would not downgrade them because they are correctness issues.

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: the new test file would not catch either blocker. test_uniform_weights_match_unweighted() only checks ATT equality, not SE scale tests/test_survey_dcdh.py:69, and test_strata_psu_changes_se() only checks that two SEs differ, not that either is correct tests/test_survey_dcdh.py:192. There is also no zero-weight-cell test in the new suite. Concrete fix: add one hand-checkable analytical survey-SE test that pins the divisor normalization, plus a zero-weight-cell test that asserts explicit failure or consistent exclusion.
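
The scale defect the suggested test should pin down is easy to demonstrate numerically. This is an illustrative calculation, not the repository's test: with an estimator-scale IF, the plug-in variance of the mean is inflated by exactly n**2 if the 1/n divisor is dropped, which is why an ATT-only equality check cannot catch it.

```python
import numpy as np

rng = np.random.default_rng(0)
psi = rng.normal(size=50)                 # estimator-scale IF, one entry per obs
n = psi.size
iid_var = psi.var(ddof=1) / n             # correct plug-in variance of the mean
unscaled_var = (n * psi).var(ddof=1) / n  # what a dropped 1/n divisor produces
# unscaled_var == n**2 * iid_var, so the SE is off by a factor of n
```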

Path to Approval

  1. Normalize the survey IF passed to compute_survey_if_variance() by the relevant dCDH divisor (N_S, joiner_total, leaver_total, N_l, N_pl_l) for every analytical survey SE path.
  2. Add explicit handling for w_gt <= 0 cells and make the presence/control-availability logic consistent with that rule.
  3. Add regression tests for the two cases above, plus at least one survey multi-horizon/placebo inference case so the affected analytical survey paths are covered.

@igerber

igerber commented Apr 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 1ed49bce753bbee3d52cfa613b9e89b315d174cf


Overall Assessment
⚠️ Needs changes

Static review only; I could not run pytest here because this shell is missing numpy and pytest.

Executive Summary

  • Affected methods: the new survey path touches DID_M, DID_+, DID_-, multi-horizon DID_l, multi-horizon placebos, and derived delta / DID^n_l / DID^{fd} inference in diff_diff/chaisemartin_dhaultfoeuille.py:505.
  • The REGISTRY note documenting the observation-level IF expansion psi_i = U[g] * (w_i / W_g) is sufficient; I am not treating that library extension itself as a defect docs/methodology/REGISTRY.md:651.
  • The prior survey-SE normalization blocker appears fixed: _compute_se() now rescales numerator-scale dCDH IFs before calling the survey TSL helper, and the new scale-pinning test matches that intent diff_diff/chaisemartin_dhaultfoeuille.py:4541, tests/test_survey_dcdh.py:388.
  • The earlier zero-weight-cell issue is only partially addressed: zero-weight cells no longer count in N_mat, but the ragged-panel validator still treats their rows as observed periods.
  • [Newly identified] survey support is not fully propagated to controls or to derived analytical inference surfaces (delta, normalized_effects, linear_trends_effects).
  • The new tests add good coverage for the main survey path, but they do not cover the blocking interactions above.

Methodology
The documented deviation itself is acceptable: the survey IF expansion is recorded in the Methodology Registry as a **Note:** and aligns with the shared compute_survey_if_variance() score-scale contract docs/methodology/REGISTRY.md:652, diff_diff/survey.py:1463.

Code Quality

  • No additional findings beyond the methodology propagation issues above.

Performance

  • No material performance findings in the changed code.

Maintainability

  • Severity: P3. Impact: the dCDH REGISTRY overview still says survey support is “deferred,” while the checklist later says it shipped, so the designated methodology source of truth is internally inconsistent docs/methodology/REGISTRY.md:466, docs/methodology/REGISTRY.md:651. Concrete fix: update the overview paragraph and add an explicit dCDH survey-bootstrap **Note:** in the method section.

Tech Debt

  • Severity: P3. Impact: the blocking items above are not mitigated by tracked debt; the existing dCDH TODO entries only cover single-period placebo SE and parity-test coverage TODO.md:51, TODO.md:59. Concrete fix: resolve them in this PR rather than deferring.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new suite exercises overall survey fitting, SE scale pinning, and a basic zero-weight case, but it does not cover survey_design + controls, survey L_max>1 top-level delta inference, survey linear_trends_effects, or zero-weight baseline/interior-gap behavior; test_zero_weight_cell_excluded currently encodes the inconsistent expectation that a zero-weight interior cell should “just be absent” without triggering the documented ragged-panel handling tests/test_survey_dcdh.py:267, tests/test_survey_dcdh.py:347, tests/test_survey_dcdh.py:388, tests/test_survey_dcdh.py:427. Concrete fix: add targeted tests for those paths and assert either agreement with explicit row removal or the documented warning/error contract.

Path to Approval

  1. Make zero-weight cells fully obey the “absent cell” contract before Step 5b: either drop w_gt <= 0 rows immediately after aggregation or make baseline/interior-gap validation key off n_gt > 0. Add one test for a zero-weight baseline cell and one for a zero-weight interior-gap cell.
  2. Apply survey-weighted cell aggregation to controls in the DID^X path. Add a regression test with duplicated observations and within-cell-varying weights where weighted and unweighted cell covariates differ.
  3. Propagate resolved_survey.df_survey into the safe_inference() calls for cost_benefit_delta, normalized_effects, and linear_trends_effects, and add small-PSU tests that verify those outputs use survey t-based inference rather than z-based inference.

@igerber

igerber commented Apr 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: cf17cb2fdeb2a87807b5dff6b24d24e18bb5b43b


Overall Assessment
⛔ Blocker

Static review only; this shell does not have pytest or numpy, so I could not execute the new survey tests.

Executive Summary

Methodology
Methods affected: dCDH survey support via Binder-style TSL on the dCDH IF, Phase 2 delta, and dCDH HonestDiD integration. The documented survey IF expansion is covered by the registry note and is not a defect. docs/methodology/REGISTRY.md:L651-L652

  • Severity: P0. Impact: for survey-backed L_max >= 2 fits, the public overall_* surface is delta, but its inference still calls safe_inference(..., df=None) even though the same fit threads _df_survey into the other analytical survey surfaces. That silently reverts delta to z-based inference, and when df_survey <= 0 it bypasses safe_inference()’s NaN guard and can return finite overall_p_value / overall_conf_int where survey inference should be undefined. diff_diff/chaisemartin_dhaultfoeuille.py:L1740-L1775 diff_diff/chaisemartin_dhaultfoeuille.py:L2146-L2150 diff_diff/utils.py:L152-L182 Concrete fix: pass _df_survey into the safe_inference(delta_val, delta_se, ...) call and add regressions for both low positive survey d.f. and df_survey=0.
  • Severity: P1 [Newly identified]. Impact: fit(..., honest_did=True, survey_design=...) now reaches compute_honest_did(results), but the dCDH extraction branch still hard-codes df_survey=None with the comment “dCDH has no survey support”; the existing HonestDiD test locks that behavior in. That means survey-backed dCDH HonestDiD bounds silently use non-survey critical values instead of the survey-d.f. logic already used for other estimators. diff_diff/chaisemartin_dhaultfoeuille.py:L2538-L2545 diff_diff/honest_did.py:L970-L977 tests/test_honest_did.py:L1129-L1222 tests/test_honest_did.py:L1372-L1384 Concrete fix: return results.survey_metadata.df_survey for dCDH in _extract_event_study_params() (or explicitly reject the survey_design + honest_did combination until supported) and add a survey-backed dCDH HonestDiD regression.
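
The df-threading issue in both items can be sketched with a hypothetical `safe_inference`-like helper (the real signature in diff_diff/utils.py may differ; this only illustrates why z-based and t-based inference diverge at small survey degrees of freedom, and why df_survey <= 0 should yield NaN rather than finite output).

```python
import numpy as np
from scipy import stats

def safe_inference(est, se, df=None, alpha=0.05):
    """Return (p-value, CI): t-based when df is given, z-based otherwise."""
    if se is None or not np.isfinite(se) or se <= 0:
        return np.nan, (np.nan, np.nan)
    if df is not None:
        if df <= 0:
            return np.nan, (np.nan, np.nan)  # survey d.f. exhausted: undefined
        crit = stats.t.ppf(1 - alpha / 2, df)
        p = 2 * stats.t.sf(abs(est / se), df)
    else:
        crit = stats.norm.ppf(1 - alpha / 2)
        p = 2 * stats.norm.sf(abs(est / se))
    return p, (est - crit * se, est + crit * se)

p_t, ci_t = safe_inference(1.0, 0.5, df=3)
p_z, ci_z = safe_inference(1.0, 0.5, df=None)
# with df=3 the t critical value (~3.18) far exceeds z (~1.96), so ci_t is wider
```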

Code Quality

  • No additional findings beyond the inference-propagation issues above.

Performance

  • No material performance findings in the changed code.

Maintainability

  • No additional findings beyond the stale HonestDiD survey assumption already called out above.

Tech Debt

  • No mitigating TODO.md entry covers the remaining survey-delta or survey-HonestDiD issues, so they are unmitigated and should not be deferred in this PR.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new survey suite stops at “multi-horizon runs”, bootstrap warning, SE scale pinning, and zero-weight behavior; it does not assert survey inference on top-level delta, does not exercise survey + controls, and does not cover survey + honest_did=True. The stale HonestDiD extraction test still asserts df_s is None for dCDH, which is why the new survey/HonestDiD propagation hole survived. tests/test_survey_dcdh.py:L345-L444 tests/test_honest_did.py:L1372-L1384 Concrete fix: add targeted regressions for survey delta inference (overall_* when L_max=2), survey + controls, and survey + honest_did=True, and replace the stale df_s is None expectation with survey-aware assertions.

Path to Approval

  1. Thread _df_survey into the top-level delta safe_inference() call and add regressions showing that overall_p_value / overall_conf_int match survey-t inference for small positive d.f. and become NaN when df_survey <= 0.
  2. Make dCDH HonestDiD survey-aware by propagating results.survey_metadata.df_survey through _extract_event_study_params() (or by rejecting survey_design + honest_did explicitly), and add a survey-backed dCDH HonestDiD test to replace the stale df_s is None expectation.

@igerber

igerber commented Apr 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 4eca2322577294d8bfaefbb3c9d2f5a26bc5e4c4


Overall Assessment

⚠️ Needs changes

Static review only; this shell does not have numpy, so I could not execute the new survey tests.

Executive Summary

Methodology

Affected methods: the core dCDH survey extension, the Phase 3 heterogeneity test, and the TWFE diagnostic surface.

  • Severity: P1 [Newly identified]. Impact: survey_design + heterogeneity is now accepted, but the heterogeneity branch is still plain OLS with non-survey inference. fit() only rejects controls, trends_linear, and trends_nonparam; it does not reject survey-backed heterogeneity. _compute_heterogeneity_test() then calls solve_ols(..., return_vcov=True) with no weights and finishes with safe_inference(..., df=None), so results.heterogeneity_effects silently ignore the survey design even though the main dCDH surface is survey-aware and the registry’s heterogeneity note does not list survey as unsupported. diff_diff/chaisemartin_dhaultfoeuille.py:L2333-L2395 diff_diff/chaisemartin_dhaultfoeuille.py:L3279-L3292 docs/methodology/REGISTRY.md:L618-L618 Concrete fix: either reject heterogeneity when survey_design is present, or implement a survey-aware heterogeneity regression and document that contract in REGISTRY.md.
  • Severity: P1 [Newly identified]. Impact: survey support breaks the documented parity between the fitted TWFE diagnostic and the standalone helper. fit() now aggregates cells with weights=survey_weights, so results.twfe_* can depend on survey-weighted cell means, but twowayfeweights() still has no survey_design input and always uses unweighted aggregation. The registry and inline comments still say the fitted and standalone APIs “always produce identical numbers on the same input,” which is no longer true for survey-backed fits. This is an incomplete parameter propagation across a public helper. diff_diff/chaisemartin_dhaultfoeuille.py:L679-L685 diff_diff/chaisemartin_dhaultfoeuille.py:L766-L772 diff_diff/chaisemartin_dhaultfoeuille.py:L4887-L4944 docs/methodology/REGISTRY.md:L600-L600 Concrete fix: either add survey_design support to twowayfeweights() and route the same weighted aggregation path, or explicitly disable/reject twfe_diagnostic=True under survey_design and update the registry note.
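
The survey-aware heterogeneity option in the first item amounts to replacing unweighted OLS with pweighted WLS. A minimal sketch, with toy data and names that are not the library's API:

```python
import numpy as np

def wls(X, y, w):
    """Weighted least squares via the normal equations (X' W X) b = X' W y."""
    Xw = X * w[:, None]
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)

X = np.column_stack([np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])])
y = np.array([1.0, 2.0, 3.0, 10.0])
b_ols = wls(X, y, np.ones(4))                      # uniform weights = plain OLS
b_wls = wls(X, y, np.array([1.0, 1.0, 1.0, 0.01])) # downweight the last row
# the slope moves materially once the survey weights are honored, which is
# exactly why silently ignoring the design biases heterogeneity_effects
```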

Code Quality

No additional findings beyond the methodology propagation issues above.

Performance

No material performance findings in the changed code.

Maintainability

Tech Debt

No relevant TODO.md entry mitigates the P1 items above; the tracked dCDH review debt only covers older placebo/parity limitations. TODO.md:L51-L60

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: the new survey suite now covers core survey ATT, bootstrap warning, delta, controls, and HonestDiD, but it still does not cover survey heterogeneity or survey TWFE diagnostic/helper parity. Existing heterogeneity and Design-2 tests remain non-survey only. That is why the two propagation gaps above can land without a failing regression. tests/test_survey_dcdh.py:L345-L575 tests/test_chaisemartin_dhaultfoeuille.py:L2775-L2940 Concrete fix: add one survey-backed heterogeneity test (or explicit rejection test) and one survey TWFE diagnostic test that checks either fit-vs-helper parity or the intended rejection/warning behavior.

Path to Approval

  1. Make survey_design + heterogeneity explicit and correct: either reject that combination in fit() with a clear ValueError and registry note, or implement survey-aware heterogeneity estimation/inference and add a regression test for it.
  2. Make the TWFE diagnostic surface consistent under survey input: either extend twowayfeweights() to accept survey_design and match the fitted path, or reject twfe_diagnostic=True when survey_design is present and update the registry contract accordingly.
  3. Add targeted survey regressions for the chosen behavior in items 1 and 2 so CI will fail if those propagation paths regress again.

@igerber

igerber commented Apr 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 97789f5d5193522127e6cc8f5cffe2c2ca372c20


Overall Assessment

⚠️ Needs changes

Static review only. I could not run the test suite in this shell because pytest is not installed.

Executive Summary

  • The prior re-review blockers look resolved: survey heterogeneity now threads _obs_survey_info into _compute_heterogeneity_test diff_diff/chaisemartin_dhaultfoeuille.py#L2386, the standalone helper now accepts survey_design diff_diff/chaisemartin_dhaultfoeuille.py#L5018, and df_survey now reaches both delta and HonestDiD diff_diff/chaisemartin_dhaultfoeuille.py#L2146, diff_diff/honest_did.py#L970.
  • The survey IF expansion psi_i = U[g] * (w_i / W_g) and the survey+bootstrap fallback are explicitly documented Notes in the registry, so I am not treating those library extensions as defects docs/methodology/REGISTRY.md#L651.
  • Severity P1: the new survey TWFE diagnostic is still a hybrid. Survey-weighted cell means are computed, but the FE regressions and Theorem 1 decomposition still use raw n_gt counts instead of w_gt, so twfe_weights, twfe_beta_fe, twfe_fraction_negative, and twfe_sigma_fe do not correspond to an observation-level pweighted TWFE estimator on survey-backed inputs.
  • Severity P1: zero-weight observations still participate in cell validation/counting. Because _validate_and_aggregate_to_cells() computes d_min, d_max, and n_gt on all rows before only dropping all-zero cells, valid survey/subpopulation inputs can spuriously trip the fuzzy-DiD guard or leak excluded rows into downstream counts.
  • Severity P2: the new survey tests mostly prove path execution and fit/helper parity, not weighted-correctness or mixed zero-weight-row behavior, so the two P1s above can still pass.
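
The hybrid-weighting point in the first P1 is easy to see on cell-level data. This toy calculation (not the library's code) shows that weighting the same regression by raw counts n_gt versus survey totals w_gt yields different coefficients, so mixing the two produces a diagnostic that matches neither estimator:

```python
import numpy as np

def weighted_slope(d, y, w):
    """Slope of y on d (with intercept) under cell weights w."""
    X = np.column_stack([np.ones_like(d), d])
    Xw = X * w[:, None]
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)[1]

d = np.array([0.0, 0.0, 1.0, 1.0])          # cell treatment status
y = np.array([1.0, 1.0, 2.0, 4.0])          # cell outcome means
n_gt = np.array([10.0, 10.0, 10.0, 10.0])   # raw observation counts
w_gt = np.array([10.0, 10.0, 30.0, 10.0])   # survey weight totals
beta_n = weighted_slope(d, y, n_gt)  # treated mean 3.0 - control mean 1.0 = 2.0
beta_w = weighted_slope(d, y, w_gt)  # treated mean 2.5 - control mean 1.0 = 1.5
```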

Methodology

Code Quality

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity P3. Impact: neither P1 issue is tracked in TODO.md, so neither is mitigated by the project’s deferred-work policy TODO.md#L51. Concrete fix: address them in this PR rather than deferring them.

Security

No findings.

Documentation/Tests

  • Severity P2. Impact: the new tests do not pin the survey math that changed. TestSurveyTWFEParity only checks fit-vs-helper parity tests/test_survey_dcdh.py#L685, TestZeroWeightCells only checks an all-zero cell drop tests/test_survey_dcdh.py#L427, and the varying-weight test only exercises positive weights tests/test_survey_dcdh.py#L270. That is why the two survey-weight propagation bugs above can pass. Concrete fix: add one oracle test comparing twfe_beta_fe / decomposition weights to an observation-level pweighted FE regression, and one mixed zero/positive-weight test (ideally via SurveyDesign.subpopulation()) proving excluded rows do not affect treatment-constancy checks or cell counts.

Path to Approval

  1. Make the survey TWFE diagnostic mathematically consistent: when w_gt exists, use it instead of n_gt for the FE regressions, treated-side normalization denominator, exported contribution weights, and sigma_fe shares; otherwise reject survey-backed TWFE diagnostics.
  2. Make zero-weight observations truly out-of-sample for dCDH survey aggregation by excluding them from d_min/d_max/n_gt and the within-cell treatment-constancy check.
  3. Add regression tests for the two behaviors above so CI fails if survey TWFE weighting or zero-weight-row handling regresses again.

@igerber

igerber commented Apr 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: aeca1d8e5a6fd7e6202c2e0300e5284ba189c3fa


Overall Assessment

✅ Looks good

Static review only. I could not execute the test suite in this shell because the available Python environment is missing core deps like numpy/pandas, so this assessment is based on source review of the diff and surrounding code.

Executive Summary

  • The two prior re-review blockers appear resolved: zero-weight observations are now filtered before cell validation/counting in diff_diff/chaisemartin_dhaultfoeuille.py:L212, and the survey TWFE diagnostic now uses w_gt throughout the FE regressions, normalization, and sigma_fe shares in diff_diff/chaisemartin_dhaultfoeuille.py:L4843.
  • Those fixes are backed by targeted regressions in tests/test_survey_dcdh.py:L733 and tests/test_survey_dcdh.py:L791.
  • The main survey deviations from the dCDH papers, psi_i = U[g] * (w_i / W_g) and the survey+bootstrap fallback, are explicitly documented in docs/methodology/REGISTRY.md:L651, so they are P3-informational rather than defects under the review rules.
  • I did not find any new unmitigated P0/P1 issues in the changed estimator, inference, or HonestDiD propagation paths.
  • I did find two P3 items: the standalone twowayfeweights() helper still does not mirror fit()’s replicate-weight rejection, and two touched docs still describe the old NotImplementedError gating behavior.

Methodology

Code Quality

  • Severity P3. Impact: fit() explicitly rejects replicate-weight survey designs in diff_diff/chaisemartin_dhaultfoeuille.py:L656, but twowayfeweights() only checks weight_type and otherwise proceeds in diff_diff/chaisemartin_dhaultfoeuille.py:L5055. That leaves the new fit/helper parity contract incomplete for unsupported replicate-weight inputs. Concrete fix: add the same NotImplementedError guard to twowayfeweights(), or explicitly document that the helper ignores replicate columns and is full-sample-pweight-only.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: two touched docs still describe the pre-PR gating behavior. docs/llms-full.txt:L273 still says “Phase 3 parameters raise NotImplementedError,” and ROADMAP.md:L194 still says the Phase 2/3 fit() params raise until implemented, even though survey support and the other Phase 3 surfaces are now live and docs/methodology/REGISTRY.md:L630 reflects that. Concrete fix: update both lines to say that only aggregate remains gated.
  • Severity P3. Impact: the new tests cover the repaired TWFE/zero-weight paths and the non-pweight helper rejection in tests/test_survey_dcdh.py:L714, tests/test_survey_dcdh.py:L733, and tests/test_survey_dcdh.py:L791, but there is still no negative test for a replicate-weight survey_design passed to twowayfeweights(). Concrete fix: add a helper-level regression that asserts the same rejection or explicitly documented behavior as fit().

@igerber

igerber commented Apr 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 8b6a6b1fcfd7191dcfcfa6240522fb72e6b065cb


Overall Assessment

✅ Looks good

Static review only. I could not run the test suite in this shell because pytest is unavailable, but the touched Python files parse successfully. The prior re-review items appear addressed: zero-weight rows are filtered before weighted cell validation/counting in diff_diff/chaisemartin_dhaultfoeuille.py:L212-L247, the survey TWFE diagnostic now uses w_gt consistently in diff_diff/chaisemartin_dhaultfoeuille.py:L4843-L4949, and twowayfeweights() now rejects replicate-weight designs in parity with fit() at diff_diff/chaisemartin_dhaultfoeuille.py:L5055-L5080.

Executive Summary

  • The methodology-sensitive survey extensions that depart from the dCDH papers are explicitly documented in the registry, especially the IF expansion and survey+bootstrap fallback at docs/methodology/REGISTRY.md:L651-L653, so I did not treat them as defects.
  • The survey-aware heterogeneity path is also documented and the implementation matches that stated contract in docs/methodology/REGISTRY.md:L618-L618 and diff_diff/chaisemartin_dhaultfoeuille.py:L3232-L3408.
  • HonestDiD now correctly receives df_survey from dCDH results via diff_diff/honest_did.py:L970-L988.
  • I did not find any unmitigated P0/P1 issue in the changed estimator, weighting, variance/SE, control-pool logic, or inference surfaces.
  • One P3 documentation issue remains: the dCDH results dataclass docstring still says survey_metadata is always None and survey support is deferred.

Methodology

  • Severity P3. Impact: The main survey variance path (psi_i = U[g] * (w_i / W_g)), the survey-aware heterogeneity regression, and the survey+bootstrap fallback are library extensions beyond the dCDH source papers, but they are explicitly documented in docs/methodology/REGISTRY.md:L618-L618 and docs/methodology/REGISTRY.md:L651-L653. Under the review rules, these are informational rather than defects. Concrete fix: none.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

@igerber

igerber commented Apr 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 2e6cbe278d493f23f727a5296a215e5c88ee26e7


Overall Assessment

⚠️ Needs changes

Static review only. This shell does not have numpy/pytest, so I could not execute the test suite, but the touched Python files parse in memory. The prior re-review’s docstring issue is fixed in diff_diff/chaisemartin_dhaultfoeuille_results.py.

Executive Summary

Methodology

  • Severity P1. Impact: the new survey subpopulation contract is only partially implemented. _validate_and_aggregate_to_cells() validates group, time, treatment, and outcome on the full raw frame before the zero-weight filter runs, and later survey-backed validators for controls, trends_nonparam, and heterogeneity also scan full data. That means rows that are explicitly zero-weighted and meant to be out-of-sample can still raise ValueError, which breaks legitimate SurveyDesign.subpopulation() usage rather than ignoring excluded rows. Concrete fix: build a positive-weight mask once when survey_design is active and apply all content validation/time-invariance checks to that effective sample, not the raw frame; add regressions with zero-weight excluded rows containing NaN/non-numeric outcome, treatment, controls, set IDs, and heterogeneity values. See diff_diff/chaisemartin_dhaultfoeuille.py and tests/test_survey_dcdh.py.
  • Severity P3. Impact: the survey IF expansion, survey-aware heterogeneity regression, and survey+bootstrap fallback are explicitly documented in the methodology registry as library extensions/deviations, so I am not treating those as defects under the review rubric. Concrete fix: none. See docs/methodology/REGISTRY.md.
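
The positive-weight-mask fix proposed for the P1 can be sketched as follows. The function, column names, and error message are all hypothetical placeholders for the suggested contract, not the library's validators:

```python
import pandas as pd

def validate_effective_sample(frame, weight_col="w", outcome_col="y"):
    """Validate only weights > 0 rows, so excluded rows cannot abort the fit."""
    eff = frame[frame[weight_col] > 0]
    if eff[outcome_col].isna().any():
        raise ValueError("missing outcome in effective sample")
    return eff

df = pd.DataFrame({
    "y": [1.0, float("nan"), 2.0],
    "w": [1.0, 0.0, 2.0],  # the NaN outcome sits on a zero-weight excluded row
})
eff = validate_effective_sample(df)
# the zero-weight NaN row is ignored; two valid rows survive validation
```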

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. I also did not find an existing mitigation entry for the two P1 items under TODO.md.

Security

  • No findings.

Documentation/Tests

  • Severity P1. Impact: the new survey test file covers core survey paths, but it does not exercise survey_design together with trends_linear=True, trends_nonparam=..., or design2=True. Because survey_design is a new public parameter and those branches still run through the same survey-weighted cell setup and inference surfaces, this leaves real interaction risk untested. Concrete fix: add one survey-backed regression each for trends_linear, trends_nonparam, and design2 that asserts the documented output contract (linear_trends_effects, trimmed event-study/placebo behavior, and design2_effects respectively) and survey-aware inference/metadata behavior where applicable. See diff_diff/chaisemartin_dhaultfoeuille.py and tests/test_survey_dcdh.py.

Path to Approval

  1. Apply the zero-weight subpopulation contract consistently: when survey_design is active, run outcome/treatment/group/time validation and the controls/trends_nonparam/heterogeneity validators on weights > 0 rows only.
  2. Add regression tests where excluded zero-weight rows contain invalid analysis values, proving they no longer abort fit().
  3. Add survey-backed tests for trends_linear=True, trends_nonparam=..., and design2=True, drop_larger_lower=False.

@igerber
Owner Author

igerber commented Apr 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: c1a590227ddc92c20e75fec641cef9209fdca6ea


Overall Assessment

⚠️ Needs changes

Static review only. I could not run the test suite in this environment because pytest is unavailable, and /usr/bin/python also lacks numpy, so the assessment below is from code inspection.

Executive Summary

  • The prior re-review’s survey interaction coverage gap for trends_linear, trends_nonparam, and design2 appears resolved by the new tests in tests/test_survey_dcdh.py:L895, tests/test_survey_dcdh.py:L922, and tests/test_survey_dcdh.py:L973.
  • The main zero-weight subpopulation fix is only partial. Outcome/treatment/set/heterogeneity validation now respects positive weights, but the new DID^X controls path still breaks when excluded zero-weight rows are present.
  • The methodology-sensitive survey deviations themselves are documented in docs/methodology/REGISTRY.md:L618, docs/methodology/REGISTRY.md:L651, docs/methodology/REGISTRY.md:L652, and docs/methodology/REGISTRY.md:L653, so I am not counting those as defects.
  • I did not find a new NaN/Inf inference anti-pattern in the touched analytical surfaces.

Methodology

  • Severity P3 | Impact: The observation-level IF expansion, survey-aware heterogeneity WLS/TSL path, and the survey+bootstrap fallback are explicitly documented in docs/methodology/REGISTRY.md:L618, docs/methodology/REGISTRY.md:L651, docs/methodology/REGISTRY.md:L652, and docs/methodology/REGISTRY.md:L653. Under the review rubric, these are documented deviations/extensions, not defects. Concrete fix: none.

Code Quality

  • Severity P1 | Impact: survey_design combined with controls still violates the zero-weight subpopulation contract. The controls branch validates only positive-weight rows at diff_diff/chaisemartin_dhaultfoeuille.py:L738, but then writes those shorter control arrays back into an unfiltered x_agg_input built from the full frame at diff_diff/chaisemartin_dhaultfoeuille.py:L764 and diff_diff/chaisemartin_dhaultfoeuille.py:L765. Any legitimate SurveyDesign.subpopulation() / zero-weight excluded row will therefore raise before covariate aggregation instead of being ignored, which breaks the newly documented survey DID^X path. Concrete fix: build x_agg_input from the same positive-weight mask, or keep full-length control arrays and scope only validation to the mask; in either case, centralize one effective-sample view and reuse it throughout the survey subpopulation-sensitive code.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. I did not find a mitigating entry for the P1 above under TODO.md:L51, so it remains unmitigated for assessment purposes.

Security

No findings.

Documentation/Tests

  • Severity P2 | Impact: The new survey test file exercises controls only on all-positive-weight data at tests/test_survey_dcdh.py:L528, so it cannot catch the broken zero-weight DID^X interaction above. The added zero-weight regressions at tests/test_survey_dcdh.py:L816, tests/test_survey_dcdh.py:L845, and tests/test_survey_dcdh.py:L864 never enter the controls branch. Concrete fix: add a regression for survey_design + controls=["x"] + zero-weight excluded row, ideally with a missing or invalid control value on the excluded row, and assert that fit() succeeds.

Path to Approval

  1. Fix the DID^X controls branch so the positive-weight filter and the frame used for covariate aggregation stay aligned under survey_design.
  2. Add a regression test for survey_design + controls + zero-weight excluded row, preferably with malformed excluded control data to pin the subpopulation contract explicitly.

@igerber
Owner Author

igerber commented Apr 17, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 36243059c0eb0640cbe5813af72b6d4e32103340


Overall Assessment

✅ Looks good

Static review only. I could not run the test suite here because pytest is unavailable and the default python lacks numpy, so this assessment is from diff inspection and local source review.

Executive Summary

Methodology

Code Quality

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. I did not see a new correctness limitation that needed TODO.md tracking.

Security

  • No findings.

Documentation/Tests

igerber and others added 11 commits April 17, 2026 06:04
- Weighted cell aggregation in _validate_and_aggregate_to_cells()
- Survey resolution via _resolve_survey_for_fit() with pweight-only
  and group-constant validation
- IF expansion from group to observation level for TSL variance
- Survey-aware SE at all call sites (overall, joiners, leavers,
  multi-horizon, placebos) via _compute_se() dispatcher
- Bootstrap + survey warning (PSU-level deferred)
- 12 new tests in test_survey_dcdh.py
- Documentation updates: REGISTRY.md, ROADMAP.md, llms-full.txt,
  choosing_estimator.rst

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…torize IF expansion

- Remove _validate_group_constant_survey() call - the IF expansion
  psi_i = U[g] * (w_i / W_g) handles observation-level variation in
  weights, strata, and PSU within groups correctly
- Vectorize _survey_se_from_group_if using np.bincount + np.unique
  (was Python loops over all observations)
- Replace test_rejects_varying_weights_within_group with two positive
  tests: varying weights accepted, and varying weights change ATT
  (time-varying noise to survive first-differencing)
- Remove unused survey_weight_type variable

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
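The vectorized IF expansion this commit describes can be sketched as follows; array names are illustrative, and the group-level IF values are assumed to be ordered by sorted unique group ID as np.unique returns them:

```python
import numpy as np

def expand_group_if(U, group_ids, w):
    """psi_i = U[g(i)] * w_i / W_g(i): each group-level IF value is spread
    over that group's observations in proportion to survey weight, so the
    within-group psi sum reproduces U[g] exactly."""
    _, inv = np.unique(group_ids, return_inverse=True)   # groups in sorted order
    W_g = np.bincount(inv, weights=w)                    # total weight per group
    return U[inv] * w / W_g[inv]

U = np.array([2.0, -1.0])                  # one IF value per group, sorted-group order
groups = np.array(["a", "a", "b", "b"])
w = np.array([1.0, 3.0, 2.0, 2.0])
psi = expand_group_if(U, groups, w)
assert np.allclose(psi, [0.5, 1.5, -0.5, -0.5])
```

The np.bincount over the inverse index replaces the earlier per-observation Python loops with two vectorized passes.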
- P1-A: Scale U_centered by 1/divisor before survey IF expansion.
  dCDH IFs are numerator-scale (U.sum() == N_S * DID_M), but
  compute_survey_if_variance() expects estimator-scale psi.
- P1-B: Zero-weight cells (w_gt <= 0) now treated as absent by
  setting n_gt=0, preventing NaN propagation into estimates.
- P2: Add SE-pinning test (uniform weights + PSU=group matches
  plug-in SE) and zero-weight cell exclusion test.
- P3: Delete unused _validate_group_constant_survey() from survey.py
  that contradicted the supported within-group variation contract.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ey propagation

- P1-A: Drop zero-weight rows entirely from cell DataFrame (instead
  of just zeroing n_gt) so ragged-panel validator doesn't see them
- P1-B: Survey-weighted covariate aggregation in DID^X path
  (sum(w*x)/sum(w) instead of unweighted mean)
- P1-C: Thread _df_survey to all remaining safe_inference() calls:
  bootstrap t-stats, normalized effects, cost-benefit delta, placebo
  bootstrap t-stats
- P3: Fix REGISTRY overview paragraph (was still saying survey deferred)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
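The survey-weighted covariate aggregation from P1-B above (sum(w*x)/sum(w) per cell instead of the unweighted mean) reduces to a weighted groupby; column names here are illustrative:

```python
import numpy as np
import pandas as pd

# Survey-weighted cell mean of a control x within each (g, t) cell.
df = pd.DataFrame({
    "g": [1, 1, 1, 1],
    "t": [0, 0, 1, 1],
    "x": [1.0, 3.0, 2.0, 4.0],
    "w": [1.0, 3.0, 2.0, 2.0],
})
wx = (df["x"] * df["w"]).groupby([df["g"], df["t"]]).sum()
wsum = df["w"].groupby([df["g"], df["t"]]).sum()
x_cell = wx / wsum
assert np.isclose(x_cell.loc[(1, 0)], 2.5)   # (1*1 + 3*3) / 4
assert np.isclose(x_cell.loc[(1, 1)], 3.0)   # (2*2 + 4*2) / 4
```

With uniform weights this collapses to the unweighted mean, which is why the non-survey path is unchanged.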
- P0: delta overall surface now uses _df_survey instead of df=None at both
  safe_inference sites (primary delta path + placebo NaN-SE fallback).
  This makes overall_* under L_max>=2 use survey-t inference and respects
  safe_inference's df<=0 NaN guard.
- P1: HonestDiD dCDH extraction now propagates df_survey from
  survey_metadata (mirrors CS pattern). Survey-backed dCDH HonestDiD
  bounds now use survey-aware critical values.
- P2: Add 4 regressions (survey delta t-matches-reported, t-vs-z differs,
  survey+controls, survey+honest_did df propagation). Update stale comment
  in test_dcdh_extraction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- P1 #1: _compute_heterogeneity_test now accepts obs_survey_info and
  runs survey-aware WLS + Binder TSL IF when survey_design is active.
  Point estimate via solve_ols(weights=W_elig, weight_type='pweight');
  group-level IF ψ_g[X] = inv(X'WX)[1,:] @ x_g * W_g * r_g, expanded
  to obs-level via w_i/W_g ratio, then compute_survey_if_variance for
  stratified/PSU variance. safe_inference uses df_survey.
  Rank-deficiency short-circuits to NaN to avoid point-estimate/IF
  mismatch between solve_ols's R-style drop and pinv's minimum-norm.
- P1 #2: twowayfeweights() now accepts Optional[SurveyDesign]. When
  provided, resolves weights via _resolve_survey_for_fit and passes
  them to _validate_and_aggregate_to_cells, restoring fit-vs-helper
  parity under survey-backed inputs. fweight/aweight rejected.
- P3: REGISTRY updates — TWFE parity sentence now includes survey;
  heterogeneity Note documents the TSL IF mechanics and library
  extension disclaimer; checklist line-651 lists survey-aware
  surfaces; new survey+bootstrap-fallback Note after line 652.
- P2: 5 new regression tests in test_survey_dcdh.py:
  TestSurveyHeterogeneity (uniform-weights match, non-uniform beta
  change, t-dist df_survey) and TestSurveyTWFEParity (fit-vs-helper
  match, non-pweight rejection).

All 254 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- P1 #1: _compute_twfe_diagnostic now uses cell_weight (w_gt when
  available, else n_gt) for FE regressions, the normalization
  denominator, contribution weights, and the Corollary 1 observation
  shares. On survey-backed inputs the outputs now match the
  observation-level pweighted TWFE estimand; non-survey path is
  byte-identical.
- P1 #2: Zero-weight rows are dropped before the groupby in
  _validate_and_aggregate_to_cells when weights are provided, so that
  d_min/d_max/n_gt reflect the effective sample. Prevents zero-weight
  subpopulation rows from tripping the fuzzy-DiD guard or inflating
  downstream n_gt counts.
- P2: 2 new regression tests in test_survey_dcdh.py —
  TestSurveyTWFEOracle.test_survey_twfe_matches_obs_level_pweighted_ols
  verifies beta_fe matches an observation-level pweighted OLS under
  survey (would fail if n_gt was still used), and
  TestZeroWeightSubpopulation.test_mixed_zero_weight_row_excluded_from_validation
  verifies an injected zero-weight row with opposite treatment value
  doesn't trip the within-cell constancy check.

All 256 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- twowayfeweights() now raises NotImplementedError when the resolved
  survey design carries replicate weights, matching fit()'s contract.
  The fit/helper parity promise now holds for unsupported inputs too.
- docs/llms-full.txt and ROADMAP.md no longer claim that Phase 3
  parameters raise NotImplementedError; both now correctly note that
  only `aggregate` remains gated.
- Added a helper-level regression that pins the replicate-weight
  rejection, complementing the existing fit()-level test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The results dataclass docstring still described survey_metadata as
'always None' and survey integration as deferred. survey_design is now
implemented and results.survey_metadata is populated whenever a
SurveyDesign is passed to fit(). Docstring now describes the populated
field and its role in survey-aware inference and HonestDiD propagation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…vey branch tests

- P1 #1: The R5 zero-weight filter only ran inside the cell aggregation
  step, after the NaN/coercion checks for group/time/treatment/outcome.
  Moved the filter to the very top of _validate_and_aggregate_to_cells
  so validation only sees the effective sample. fit()'s controls,
  trends_nonparam, and heterogeneity blocks now also scope their
  NaN/time-invariance checks to positive-weight rows when
  survey_weights is active. Legitimate SurveyDesign.subpopulation()
  inputs with NaN in excluded rows now fit cleanly. TSL variance path
  is unchanged (zero-weight obs still contribute zero psi).
- P2: 5 new regression tests in test_survey_dcdh.py —
  TestZeroWeightSubpopulation now covers NaN outcome and NaN het
  columns in excluded rows; new TestSurveyTrendsLinear /
  TestSurveyTrendsNonparam / TestSurveyDesign2 classes exercise
  survey_design combined with those previously-untested branches.

All 262 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R8's controls-block fix scoped NaN/Inf validation to the positive-weight
subset (shorter data_controls) but then assigned those shorter arrays
into an x_agg_input built from the full-length frame, causing a
length-mismatch on any SurveyDesign.subpopulation() / zero-weight
excluded row before covariate aggregation could run.

Root-caused fix: derive both the validation frame AND the aggregation
frame from the same positive-weight effective sample (data_eff,
survey_weights_eff). Zero-weight rows are genuinely out-of-sample
throughout the DID^X path now. Non-survey fits unchanged.

Added TestZeroWeightSubpopulation.test_zero_weight_row_with_nan_control
pinning the subpopulation contract for the DID^X path — injects a
zero-weight row with NaN control value and asserts fit() succeeds.

All 263 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- diff_diff/guides/llms-practitioner.txt: add reversible-treatment branch
  to the Step 4 decision tree pointing at ChaisemartinDHaultfoeuille
  (alias DCDH), and a short Step 5 example showing the
  survey_design + L_max + TSL fallback for dCDH. The practitioner guide
  previously had zero dCDH coverage despite the full API being live.
- diff_diff/guides/llms-full.txt: fix stale SyntheticDiDResults table
  (line 1024) — variance_method now accepts "bootstrap", "jackknife",
  or "placebo" (jackknife was added for synthdid::vcov parity but the
  table only listed bootstrap/placebo).

Bundled-guide paths only. The old docs/llms-*.txt were moved to
diff_diff/guides/ in main (PR #310); these edits target the new
canonical location consumed by get_llm_guide().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber igerber force-pushed the feature/dcdh-survey branch from 3624305 to d045987 Compare April 17, 2026 10:10
@igerber
Owner Author

igerber commented Apr 17, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: d0459878da098c4467398f68846bda0413359e7d


Overall Assessment

⚠️ Needs changes

Static review only. I could not run pytest here because pytest is not installed and the local python environment is missing numpy.

Executive Summary

  • The prior re-review items look resolved: the zero-weight DID^X effective-sample fix is now in chaisemartin_dhaultfoeuille.py:732, and dCDH HonestDiD now propagates df_survey in honest_did.py:970, with matching regressions at test_survey_dcdh.py:864 and test_survey_dcdh.py:552.
  • The registry now documents the high-level survey IF expansion and the survey+bootstrap fallback in REGISTRY.md:651, REGISTRY.md:652, and REGISTRY.md:653, so those deviations themselves are informational rather than blockers.
  • Two new P1 methodology issues remain: the survey SE code for DID_l / dynamic placebos and for the Lemma 7 heterogeneity regression spreads each group-level IF across all rows in the group, not the cell-period rows that actually enter the estimand.
  • A separate P1 edge-case issue remains: zero-weight excluded rows still leak into survey SE factorization through raw group_ids, so valid subpopulation inputs can still fail on mixed-type / missing group IDs.
  • The new tests materially improve coverage for zero-weight rows, trends_linear, trends_nonparam, design2, and HonestDiD, but they still keep PSU/strata constant within group, so the highest-risk new survey path is not exercised.

Methodology

Affected methods: survey-backed DID_l, DID_l^{pl}, any top-level overall_* surface derived from them when L_max >= 1, and the survey-backed Lemma 7 heterogeneity regression. The dynamic paper’s variance object is time-indexed, with U^G_{g,\ell} built from period-specific coefficients on Y_{g,t}, and Binder-style TSL is meant to linearize the estimator through its actual estimating contributions. (nber.org)

  • Severity P1 | Impact: Even accepting the documented survey-extension note in REGISTRY.md:652, the implementation drops the paper’s time dimension before variance estimation. fit() stores only raw group_ids and weights in _obs_survey_info, and _survey_se_from_group_if() then expands each group IF as U[g] * (w_i / W_g) across every row in that group in chaisemartin_dhaultfoeuille.py:4728. That helper is what supplies survey SEs for multi-horizon effects in chaisemartin_dhaultfoeuille.py:1531 and placebo horizons in chaisemartin_dhaultfoeuille.py:1658. Because the paper’s U^G_{g,\ell} is built from period-specific λ_{g,t,\ell} Y_{g,t} terms, rows from unused periods should not be able to change a horizon-specific SE just by moving across PSU/strata; here they can. Concrete fix: keep per-row time or cell IDs in _obs_survey_info and expand the exact (g,t) coefficients onto observations as a_{g,t} * w_i / W_{g,t} before calling compute_survey_if_variance(). (nber.org)
  • Severity P1 | Impact: The survey heterogeneity path has the same defect. _compute_heterogeneity_test() defines the horizon-l regression on S_g * (Y_{g,F_g-1+l} - Y_{g,F_g-1}) in chaisemartin_dhaultfoeuille.py:3223 and builds dep_var from only ref_idx / out_idx in chaisemartin_dhaultfoeuille.py:3303, but the survey branch again allocates psi_g to all rows in the group in chaisemartin_dhaultfoeuille.py:3415. That makes heterogeneity_effects[l]["se"] and its t-based inference depend on periods that never enter the Lemma 7 regression. Concrete fix: allocate the WLS score only across observations in the reference and outcome cells, or store exact per-cell WLS score contributions and linearize those.
  • Severity P3 | Impact: The existence of the survey IF expansion and the survey+bootstrap fallback are explicitly documented in REGISTRY.md:651, REGISTRY.md:652, and REGISTRY.md:653, so I did not count those high-level deviations themselves as defects. Concrete fix: none.
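The cell-period allocator proposed in the P1 fixes above can be sketched as follows; this is the reviewer's suggested shape (psi_i = a_{g,t} * w_i / W_{g,t}), not the current implementation, and the per-cell coefficients are assumed ordered by sorted cell ID:

```python
import numpy as np

def expand_cell_if(a_gt, cell_ids, w):
    """Spread each cell-level coefficient a_{g,t} only over that (g, t)
    cell's rows. Rows in cells that do not enter the horizon-l estimand
    (a_{g,t} = 0) receive zero IF mass, so moving them across PSUs or
    strata cannot change the horizon-specific SE."""
    _, inv = np.unique(cell_ids, return_inverse=True)
    W_gt = np.bincount(inv, weights=w)       # total weight per cell
    return a_gt[inv] * w / W_gt[inv]

# Two cells of one group; only the first enters the horizon-l estimand.
cells = np.array(["g1_t0", "g1_t0", "g1_t1"])
a = np.array([1.0, 0.0])                     # per-cell coefficients, sorted-cell order
w = np.array([2.0, 2.0, 5.0])
psi = expand_cell_if(a, cells, w)
assert np.allclose(psi, [0.5, 0.5, 0.0])     # off-horizon row gets no IF mass
```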

Code Quality

  • Severity P1 | Impact: Zero-weight excluded rows are correctly skipped during cell validation, but they are not fully sanitized out of the survey SE path. _obs_survey_info still captures raw data[group] in chaisemartin_dhaultfoeuille.py:704, and _survey_se_from_group_if() immediately runs np.unique(group_ids, return_inverse=True) on that raw array in chaisemartin_dhaultfoeuille.py:4772. With string-keyed groups, a zero-weight row whose group is NaN or otherwise non-comparable can still fail inside SE factorization even though the row is supposed to be out-of-sample under SurveyDesign.subpopulation(). The new zero-weight tests cover bad outcome, control, and heterogeneity values, but not bad group IDs in test_survey_dcdh.py:813, test_survey_dcdh.py:845, test_survey_dcdh.py:864, and test_survey_dcdh.py:886. Concrete fix: build the group-total factorization from a positive-weight mask, or use a dict/groupby accumulation that ignores zero-weight rows and then zero-pads psi back to full survey length; add a regression with string group IDs plus a zero-weight row whose group is NaN.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. I did not see a new TODO.md entry that would mitigate the unaddressed P1 issues above.

Security

No findings.

Documentation/Tests

  • Severity P2 | Impact: The implementation explicitly claims the proportional IF expansion “handles observation-level variation in weights, strata, and PSU within groups” in chaisemartin_dhaultfoeuille.py:669, but the new fixture still assigns strata and PSU constant within group in test_survey_dcdh.py:32, and the only within-group variation that is exercised is weights in test_survey_dcdh.py:267. That leaves the most methodology-sensitive new branch untested. Concrete fix: add a survey fixture with within-group varying PSU/strata and assert that off-horizon rows do not change the reported SE once the cell-level IF expansion is corrected.

Path to Approval

  1. Replace the current group-wide survey IF allocator with a cell-period allocator for DID_l / DID_l^{pl} / top-level overall_* surfaces on L_max >= 1, so only the (g,t) cells that actually enter the estimator receive IF mass.
  2. Apply the same cell-aware score allocation to the survey heterogeneity path, so heterogeneity_effects[l]["se"] is driven only by the reference and outcome cells used in the horizon-l WLS regression.
  3. Sanitize zero-weight excluded rows before group-ID factorization in _survey_se_from_group_if() and add a regression with string group IDs plus a zero-weight NaN group row.
  4. Add tests with within-group varying PSU/strata and with off-horizon rows moved across PSUs to pin the corrected survey-SE contract.

- P1 #1/#2: Add _validate_group_constant_strata_psu() helper and call
  it from fit() after the weight_type/replicate-weights checks. The
  dCDH IF expansion psi_i = U[g] * (w_i / W_g) treats each group as
  the effective sampling unit; when strata or PSU vary within group it
  silently spreads horizon-specific IF mass across observations in
  different PSUs, contaminating the stratified-PSU variance. Walk back
  the overstated claim at the old line 669 comment to match. Within-
  group-varying weights remain supported.
- P1 #3: _survey_se_from_group_if now filters zero-weight rows before
  np.unique/np.bincount so NaN / non-comparable group IDs on excluded
  subpopulation rows cannot crash SE factorization. psi stays full-
  length with zeros in excluded positions to preserve alignment with
  resolved.strata / resolved.psu inside compute_survey_if_variance.
- REGISTRY.md line 652 Note updated: explicitly states the
  within-group-constant strata/PSU requirement and the
  within-group-varying weights support.
- Tests: new TestSurveyWithinGroupValidation class (4 tests — rejects
  varying PSU, rejects varying strata, accepts varying weights, and
  ignores zero-weight rows during the constancy check) plus
  TestZeroWeightSubpopulation.test_zero_weight_row_with_nan_group_id.

All 268 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
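The within-group constancy check this commit adds can be sketched with a groupby-nunique pass; the helper name and error text here are illustrative, not the library's exact implementation:

```python
import pandas as pd

def validate_group_constant(df, group, cols):
    """Reject survey designs where any of `cols` (e.g. strata, PSU)
    takes more than one value within a group, since the dCDH IF
    expansion treats each group as the effective sampling unit."""
    for col in cols:
        n_levels = df.groupby(group)[col].nunique(dropna=False)
        bad = n_levels[n_levels > 1]
        if not bad.empty:
            raise ValueError(
                f"{col!r} must be constant within each {group!r}; "
                f"varies in groups: {list(bad.index)}"
            )

df = pd.DataFrame({"g": [1, 1, 2], "psu": [10, 11, 20], "stratum": [1, 1, 2]})
validate_group_constant(df, "g", ["stratum"])     # passes: constant within group
try:
    validate_group_constant(df, "g", ["psu"])     # group 1 spans PSUs 10 and 11
except ValueError as e:
    msg = str(e)
assert "psu" in msg
```

Restricting the check to the zero-weight-filtered effective sample (as the test list above describes) keeps it consistent with the subpopulation contract.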
@igerber
Owner Author

igerber commented Apr 17, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 0266d59ac66cb2e947fee3929121353a830aad38


Overall Assessment

⚠️ Needs changes

Static review only. I could not run pytest here because the local Python environment is missing numpy.

Executive Summary

  • The prior zero-weight SE-factorization issue looks fixed: _survey_se_from_group_if() now removes zero-weight rows before group-ID factorization, so excluded NaN / mixed-type group IDs no longer reach np.unique diff_diff/chaisemartin_dhaultfoeuille.py:4826-4852.
  • The prior HonestDiD df_survey propagation issue also looks fixed: dCDH results now pass survey_metadata.df_survey into HonestDiD extraction diff_diff/honest_did.py:970-977.
  • The previous within-group PSU/strata contamination case is partially addressed: mixed PSU/strata within a group are now rejected, and that restriction is documented diff_diff/chaisemartin_dhaultfoeuille.py:669-677, docs/methodology/REGISTRY.md:651-653.
  • [Newly identified] weights-only and stratified-no-PSU dCDH survey fits still contradict that same methodology note: the implementation says the group is the effective sampling unit, but the shared survey variance code falls back to implicit per-observation PSUs when psu is omitted. That affects SEs, df_survey, and all t-based inference surfaces on a supported input class.
  • The new tests materially improve coverage, but they currently codify the conflicting no-PSU behavior instead of catching it.
  • The high-level survey IF extension and the survey+bootstrap fallback are documented in REGISTRY.md, so I did not count those deviations themselves as defects.

Methodology

  • Severity P1 | Impact: [Newly identified] Supported dCDH survey fits with SurveyDesign(weights=...) and no explicit PSU still use implicit per-observation PSUs, even though the new dCDH methodology note and fit-time validation say the survey IF expansion treats each group as the effective sampling unit. fit() stores the resolved design unchanged diff_diff/chaisemartin_dhaultfoeuille.py:705-713; _survey_se_from_group_if() and the heterogeneity survey path expand group IFs to observation level and pass that unchanged design to compute_survey_if_variance() diff_diff/chaisemartin_dhaultfoeuille.py:4796-4858, diff_diff/chaisemartin_dhaultfoeuille.py:3383-3433; and the shared survey helper explicitly treats psu=None as per-observation PSUs diff_diff/survey.py:1201-1218. That directly conflicts with the estimator-specific registry note docs/methodology/REGISTRY.md:651-652, and it changes overall_se, event-study/placebo SEs, heterogeneity SEs, survey_metadata.df_survey, and downstream t-based inference/HonestDiD critical values on no-PSU designs. Concrete fix: create a dCDH-specific effective survey design that injects group as the PSU whenever survey_design.psu is None (preserving strata/FPC), use that effective design for every compute_survey_if_variance() call, and compute survey_metadata.df_survey from the same effective design; otherwise reject no-PSU dCDH survey designs explicitly.
  • Severity P3 | Impact: The survey IF expansion itself (psi_i = U[g] * (w_i / W_g)) and the group-level bootstrap fallback are explicitly documented dCDH library extensions, so those deviations are informational rather than blockers docs/methodology/REGISTRY.md:651-653. Concrete fix: none.

Code Quality

No additional findings beyond the methodology issue above.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No mitigating TODO.md entry covers the P1 above; the existing dCDH TODO items are unrelated TODO.md:55-60.

Security

No findings.

Documentation/Tests

  • Severity P2 | Impact: The new tests currently lock in the conflicting no-PSU contract instead of guarding the documented dCDH one. test_uniform_survey_se_matches_plugin explicitly states that, without psu="group", the survey path should treat each observation as independent tests/test_survey_dcdh.py:388-395, and many new survey-path smoke tests exercise SurveyDesign(weights="pw") with no oracle for unused-period row-count invariance tests/test_survey_dcdh.py:347-358, tests/test_survey_dcdh.py:937-944, tests/test_survey_dcdh.py:1015-1026. Concrete fix: replace/add regressions for weights-only and stratified-no-PSU fits that change unused-period row counts within a group while holding cell means fixed, then assert unchanged SEs/df_survey; if no-PSU designs are made unsupported instead, add explicit rejection tests and update docs accordingly.

Path to Approval

  1. Make dCDH build an effective survey design with psu=group whenever survey_design.psu is None, and derive survey_metadata.df_survey from that effective design.
  2. Use that same effective design in every dCDH survey inference path: _survey_se_from_group_if(), the heterogeneity survey branch, and any downstream consumer of survey_metadata.df_survey such as HonestDiD.
  3. Add regression tests for no-PSU survey fits showing that unused-period row duplication within a group does not change overall_se, event-study/placebo SEs, heterogeneity SEs, or df_survey once cell means and group IFs are unchanged; if you choose to reject no-PSU dCDH designs, replace those with explicit ValueError tests and align the docs.
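Step 1 above amounts to a small wrapper around the design; a minimal sketch, using a stand-in dataclass since the real SurveyDesign fields are not shown in this diff:

```python
from dataclasses import dataclass, replace
from typing import Optional
import numpy as np

@dataclass(frozen=True)
class Design:
    """Stand-in for the library's SurveyDesign; field names are assumptions."""
    weights: np.ndarray
    psu: Optional[np.ndarray] = None
    strata: Optional[np.ndarray] = None

def effective_design(design, group_ids):
    """When no PSU is supplied, inject the dCDH group as the PSU so the
    shared variance helper treats the group as the sampling unit instead
    of falling back to implicit per-observation PSUs. Explicit PSU and
    strata/FPC settings pass through unchanged."""
    if design.psu is None:
        return replace(design, psu=np.asarray(group_ids))
    return design

d = Design(weights=np.ones(4))
eff = effective_design(d, np.array([1, 1, 2, 2]))
assert list(eff.psu) == [1, 1, 2, 2]   # group became the effective PSU
```

Deriving survey_metadata.df_survey from this effective design (rather than the raw one) then keeps fit(), the heterogeneity branch, and HonestDiD consistent.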
