Add Phase 3 PR B: covariates, trends, and extensions for dCDH by igerber · Pull Request #302 · igerber/diff-diff

igerber · 2026-04-13T16:44:30Z

Summary

Implement DID^X covariate residualization (Web Appendix Section 1.2, Assumption 11) via per-baseline OLS on not-yet-treated observations with level residualization (FWL theorem)
Implement DID^{fd} group-specific linear trends (Section 1.3, Lemma 6) via Z_mat first-differencing with per-group cumulation for level effects
Implement state-set-specific trends (Section 1.4) via control-pool restriction in both point estimate and IF computation
Implement heterogeneity testing (Section 1.5, Lemma 7) via saturated OLS with cohort dummies
Implement Design-2 switch-in/switch-out descriptive convenience wrapper (Section 1.6)
Add 3 R parity scenarios (controls, trends_lin, combined) with golden values from DIDmultiplegtDYN v2.3.3
Address AI review P1/P2 findings: isolate per-period path from DID^X residualization, add to_dataframe levels, vectorize residualization, add rank-deficiency guard

Methodology references (required if estimator / math changes)

Method name(s): ChaisemartinDHaultfoeuille (dCDH) - DID^X, DID^{fd}, state-set trends, heterogeneity testing, Design-2
Paper / source link(s): de Chaisemartin & D'Haultfoeuille (2024), "Difference-in-Differences Estimators of Intertemporal Treatment Effects," NBER WP 29873 - Web Appendix Sections 1.2-1.6
Any intentional deviations from the source (and why):
- SE ~4% smaller than R DIDmultiplegtDYN: Python uses paper's Section 3.7.3 N_l normalization; R uses G (total groups). Both asymptotically valid; documented in REGISTRY.md
- Cumulated DID^{fd} SE uses conservative upper bound (sum of per-horizon SEs); cross-horizon covariance formula is a library extension (paper proves per-horizon only). Documented in REGISTRY.md
- Design-2 is a descriptive convenience wrapper, not full paper re-estimation with specialized control pools. Documented in REGISTRY.md
- per_period_effects stays unadjusted when controls are active (DID^X applies only to per-group path). Documented in REGISTRY.md

Validation

Tests added/updated: tests/test_chaisemartin_dhaultfoeuille.py (25 new tests), tests/test_chaisemartin_dhaultfoeuille_parity.py (3 new R parity tests)
R parity: 3 new scenarios generated via DIDmultiplegtDYN v2.3.3 (controls 0.3%, trends exact, combined 0.6% point estimate gaps)
152 tests pass, 0 failures

Security / privacy

Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

Implements ROADMAP items 3a-3e and 3i: - DID^X covariate residualization (Web Appendix Section 1.2) - DID^{fd} group-specific linear trends (Section 1.3, Lemma 6) - State-set-specific trends (Section 1.4) - Heterogeneity testing (Section 1.5, Lemma 7) - Design-2 switch-in/switch-out convenience wrapper (Section 1.6) - R parity tests for controls and trends_lin scenarios Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

P1 fixes: - DID^X residualization no longer leaks into per-period path: per_period_effects uses raw Y_mat, only multi-horizon path sees residualized outcomes - Added to_dataframe levels for heterogeneity and linear_trends P2 fixes: - Covariate coercion no longer mutates caller's DataFrame - Vectorized residualization (einsum replaces nested loop) - Heterogeneity test guards against rank-deficient OLS - Added estimand contract test for controls + L_max=1 - REGISTRY note clarifies per_period_effects stays unadjusted Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-13T16:53:39Z

Overall Assessment

⛔ Blocker

Executive Summary

The new DID^{fd} cumulation path can emit finite se/p_value/CI even when one of the component horizon SEs is NaN, which violates the library’s NaN-consistency contract and can understate uncertainty.
For trends_linear=True with L_max >= 2, the top-level overall_* surface is still built from the Phase 2 delta on second-difference DID^{fd}_l, but the results object relabels it as delta^{fd}. That silently changes the reported estimand.
The new OLS helper calls hardcode rank_deficient_action="warn", so the DID^X and heterogeneity paths no longer honor the estimator’s existing rank-deficiency contract.
The added tests exercise linear_trends_effects parity, but not the top-level trend-adjusted overall_* surface or cumulated NaN propagation, so the two issues above are currently unguarded.

Methodology

Severity: P0. Impact: linear_trends_effects can report finite inference when the cumulative SE is actually undefined. In the DID^{fd} cumulation block, non-finite component SEs are filtered out before summation, so a horizon with SE=NaN is silently dropped from the upper-bound calculation; if all component SEs are non-finite, the stored SE becomes 0.0. That violates the project’s all-or-nothing NaN rule for inference and can materially understate uncertainty in trend-adjusted outputs. See diff_diff/chaisemartin_dhaultfoeuille.py:2052 and docs/methodology/REGISTRY.md:614. Concrete fix: when building linear_trends_effects[l], require every component horizon 1..l to have a finite SE; otherwise set se, t_stat, p_value, and conf_int to NaN. Also store NaN rather than 0.0 when the cumulative SE is undefined.
Severity: P1. Impact: for trends_linear=True and L_max >= 2, the code still computes cost_benefit_delta from multi_horizon_dids before cumulating DID^{fd}_l into level effects, but the results surface then relabels the top-level estimand as delta^{fd}. That mixes second-difference inputs with a level-effect label, so results.overall_att, summary(), and to_dataframe("overall") can report the wrong estimand with no warning. See diff_diff/chaisemartin_dhaultfoeuille.py:1476, diff_diff/chaisemartin_dhaultfoeuille.py:1931, and diff_diff/chaisemartin_dhaultfoeuille_results.py:421. Concrete fix: either compute a properly defined trend-adjusted aggregate from the cumulated level path, or stop relabeling the top-level overall surface as delta^{fd} until that aggregate exists.

Code Quality

Severity: P2. Impact: the new DID^X and heterogeneity OLS paths ignore the estimator’s public rank_deficient_action contract by hardcoding "warn", so ChaisemartinDHaultfoeuille(rank_deficient_action="error") will not fail fast on singular covariate or heterogeneity regressions. See diff_diff/chaisemartin_dhaultfoeuille.py:369, diff_diff/chaisemartin_dhaultfoeuille.py:2749, and diff_diff/chaisemartin_dhaultfoeuille.py:2938. Concrete fix: thread self.rank_deficient_action through _compute_covariate_residualization() and _compute_heterogeneity_test(), and add regression tests for "error" on both paths.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings beyond the test/doc gap below; none of the substantive issues above are tracked in TODO.md:51.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: the new tests do not cover the two most fragile DID^{fd} surfaces. The unit suite only checks that linear_trends_effects cumulates correctly and that overall_att is finite, while the parity suite explicitly compares R only against linear_trends_effects, not the top-level overall_* surface. That leaves the mislabeled delta^{fd} summary and the NaN-propagation bug unguarded. See tests/test_chaisemartin_dhaultfoeuille.py:2489 and tests/test_chaisemartin_dhaultfoeuille_parity.py:403. Concrete fix: add one regression test for overall_att/overall_se/estimand under trends_linear=True, L_max>=2, and one NaN-consistency test where a component horizon SE is NaN and linear_trends_effects[l] must also be all-NaN.
Severity: P3. Impact: the public fit() docstring still says controls, trends_linear, and trends_nonparam are “Reserved for Phase 3” and raise NotImplementedError, which is now false. See diff_diff/chaisemartin_dhaultfoeuille.py:527. Concrete fix: update the parameter docs to describe the shipped contracts and prerequisites.

Path to Approval

Make linear_trends_effects NaN-consistent: if any component SE up to horizon l is non-finite, set the cumulative se/t_stat/p_value/conf_int to NaN and add a regression test.
Fix the trends_linear=True, L_max>=2 top-level overall surface so it reports a real trend-adjusted aggregate instead of relabeling the old Phase 2 delta; add a regression test for overall_att/overall_se/estimand.
Propagate rank_deficient_action into the new OLS helper paths and add tests for rank_deficient_action="error" on DID^X and heterogeneity fits.

…nt_action P0: Cumulated DID^{fd} SE now requires ALL component SEs to be finite; non-finite SE at any horizon propagates NaN (was silently dropped). P1: trends_linear + L_max>=2 overall surface now reports cumulated level effects from linear_trends_effects (was second-difference delta). cost_benefit_delta suppressed under trends_linear (meaningless on second-differences). P2: rank_deficient_action threaded through _compute_covariate_residualization and _compute_heterogeneity_test (was hardcoded "warn"). P3: fit() docstrings updated for controls, trends_linear, trends_nonparam (were stale "Reserved for Phase 3" text). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-13T17:06:33Z

/ai-review

github-actions · 2026-04-13T17:14:57Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 2c6cabfd5922da4358ea627b4ff837825a5fd852

Overall Assessment

⚠️ Needs changes

Executive Summary

Previous re-review blockers around cumulated NaN propagation and rank_deficient_action handling appear resolved in the changed code.
Severity P1: trends_linear=True still exposes a top-level overall_* scalar for L_max >= 2, but the reference did_multiplegt_dyn(..., trends_lin=TRUE) contract says the average total effect is not computed in that mode. The PR now hard-codes overall_* = linear_trends_effects[max_h] and labels it as delta/cost-benefit output. (rdrr.io)
Severity P1: the new Section 1.5 heterogeneity path does not match the reference contract. Official docs say predict_het includes placebo regression tables plus a joint null test, and it cannot be combined with controls; this PR returns only post-treatment coefficients and silently allows controls. (rdrr.io)
The newly added REGISTRY notes for the R SE normalization gap, conservative cumulated DID^{fd} SE, Design-2 as a convenience wrapper, and the unadjusted per-period path under controls are documented deviations and should not count against the PR.

Methodology

Severity: P1. Impact: the trends_linear=True, L_max>=2 results surface still reports a scalar “overall” estimand even though the reference method does not compute an average total effect in this mode. The implementation suppresses cost_benefit_delta, then repurposes overall_* to the last-horizon cumulative effect and labels it as delta^{fd} / “Cost-Benefit Delta”, while the REGISTRY only documents horizon-specific cumulated effects in results.linear_trends_effects. This is an undocumented methodology deviation that can mislead downstream consumers of results.overall_att, summary(), and to_dataframe("overall"). References: diff_diff/chaisemartin_dhaultfoeuille.py:L2107-L2121, diff_diff/chaisemartin_dhaultfoeuille_results.py:L421-L449, diff_diff/chaisemartin_dhaultfoeuille_results.py:L563-L571, docs/methodology/REGISTRY.md:L614-L620, tests/test_chaisemartin_dhaultfoeuille.py:L2553-L2566. Concrete fix: when trends_linear=True and L_max>=2, make overall_* unavailable/NaN (or move the last-horizon cumulative effect to a separately named field), and stop labeling it as cost-benefit delta in summary()/to_dataframe(). (rdrr.io)
Severity: P1. Impact: the shipped heterogeneity option is not the Section 1.5 contract it claims to implement. The reference predict_het path includes placebo regressions and a joint null test, and it explicitly disallows the controls option; this PR instead computes only post-treatment regressions, never exposes placebo/joint-test outputs, and silently allows controls, with internally inconsistent behavior because controls+heterogeneity uses residualized outcomes unless trends_linear=True, in which case it falls back to raw levels. That is both a missing assumption check and an undocumented methodology deviation. References: diff_diff/chaisemartin_dhaultfoeuille.py:L2127-L2162, diff_diff/chaisemartin_dhaultfoeuille.py:L2861-L2993, docs/methodology/REGISTRY.md:L618-L620. Concrete fix: reject heterogeneity when controls is supplied, add the placebo-regression / joint-test surface required by the source contract, and either propagate other allowed options (trends_nonparam, trends_linear) through the heterogeneity path or reject those combinations explicitly. (rdrr.io)
Severity: P3. Impact: the PR’s explicit REGISTRY notes for DID^X per-period behavior, conservative cumulated DID^{fd} SEs, Design-2 scope, and the R SE normalization gap are sufficient mitigation; I would not count those as defects. References: docs/methodology/REGISTRY.md:L612-L620. Concrete fix: none.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings beyond the methodology surface overloading noted above.

Tech Debt

No findings beyond the blockers above; neither blocker is tracked in TODO.md under “Tech Debt from Code Reviews.”

Security

No findings.

Documentation/Tests

Severity: P3. Impact: the public docs did not fully catch up with the new API surface. fit() adds heterogeneity and design2, but the parameter docstring stops before documenting them, and to_dataframe() now accepts heterogeneity / linear_trends while its fallback error text still lists only the older levels. References: diff_diff/chaisemartin_dhaultfoeuille.py:L487-L545, diff_diff/chaisemartin_dhaultfoeuille_results.py:L1032-L1057. Concrete fix: document the new fit() parameters and update to_dataframe() help/error text to include the new supported levels.

Path to Approval

Remove or rename the top-level overall_* surface for trends_linear=True, L_max>=2 so it no longer presents the last-horizon cumulative effect as cost-benefit delta / delta^{fd}; update summary(), to_dataframe("overall"), and the unit test at tests/test_chaisemartin_dhaultfoeuille.py:L2553-L2566.
Bring heterogeneity in line with the Section 1.5 contract: reject controls, add placebo-regression outputs and the joint null test, and add regression tests for both the rejection path and the placebo/joint-test surface.
Decide the allowed interaction set for heterogeneity with trends_nonparam / trends_linear; either propagate those options through the heterogeneity path or reject them explicitly, and add tests for every allowed combination.

…geneity P1: trends_linear + L_max>=2 overall_* is now NaN (R does not compute an aggregate in trends_lin mode). Cumulated effects available via results.linear_trends_effects[l]. P1: heterogeneity + controls now raises ValueError (matching R's predict_het which disallows controls). REGISTRY documents heterogeneity as partial implementation (post-treatment only, no placebo regressions or joint null test). P3: Added fit() docstrings for heterogeneity and design2 parameters. P3: Updated to_dataframe() error text with new level names. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-13T17:21:37Z

/ai-review

github-actions · 2026-04-13T17:31:46Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 6f1c99fa5bb81c2cf04146a6f6206fc48cca037f

Overall Assessment

⚠️ Needs changes

Executive Summary

Previous blocker on trends_linear=True, L_max>=2 is mostly resolved numerically: the PR now suppresses the scalar overall_* surface instead of reusing the last-horizon cumulative effect, which matches the reference contract that trends_lin=TRUE does not compute an average total effect. citeturn0view0
Previous blocker on heterogeneity + controls is resolved, and the missing placebo/joint-test surface is now documented in the Methodology Registry; under the review rubric, that documented deviation is informational rather than blocking. citeturn0view2
Severity P1: heterogeneity still ignores trends_linear / trends_nonparam, so the heterogeneity table can diverge from the estimator actually fit.
Severity P1: heterogeneity is documented as requiring L_max >= 1, but fit(..., heterogeneity=...) silently no-ops when L_max=None.
Severity P1: design2=True silently returns no Design-2 output under the default drop_larger_lower=True, because the required 2-switch groups are dropped before the wrapper runs.
Remaining trends_linear labeling issues are P3 informational: the NaN overall row is still labeled as delta / cost-benefit delta.

Methodology

Affected methods in this PR are DID^X covariate adjustment (controls), DID^{fd} linear trends (trends_linear), state-set trends (trends_nonparam), heterogeneity testing (predict_het-style output), and the Design-2 join/leave wrapper. Those are the same Phase 3 extensions described in the reference package’s Sections 1.2-1.6. citeturn0view0turn0view1turn0view2turn0view3

Severity: P1. Impact: heterogeneity allows trends_linear and trends_nonparam, but the implementation still builds the regression on raw level changes and never threads set_ids, so those adjustments are silently ignored in the heterogeneity surface. That means users can request heterogeneity for the adjusted estimator and receive coefficients for a different estimand. References: diff_diff/chaisemartin_dhaultfoeuille.py#L2166, diff_diff/chaisemartin_dhaultfoeuille.py#L2877, diff_diff/chaisemartin_dhaultfoeuille.py#L3215, diff_diff/chaisemartin_dhaultfoeuille.py#L3390, tests/test_chaisemartin_dhaultfoeuille.py#L2709. Concrete fix: either reject heterogeneity with trends_linear / trends_nonparam up front, or rework _compute_heterogeneity_test() so it operates on the same adjusted group-level effects used by the requested estimator, then add dedicated coverage for every allowed combination. The reference predict_het contract is about groups’ estimated event-study effects, and the manual does not list either option as incompatible. citeturn0view0turn0view1turn0view2
Severity: P1. Impact: heterogeneity is documented as requiring L_max >= 1, but there is no validation path for heterogeneity when L_max is omitted; the branch simply never runs. This is a silent unsupported-parameter no-op on a new public API surface. References: diff_diff/chaisemartin_dhaultfoeuille.py#L542, diff_diff/chaisemartin_dhaultfoeuille.py#L2135, docs/methodology/REGISTRY.md#L618. Concrete fix: mirror the controls / trends_linear / trends_nonparam validation style and raise ValueError when heterogeneity is set without L_max, plus add a regression test next to the existing heterogeneity cases.
Severity: P1. Impact: design2=True is documented as requiring drop_larger_lower=False, but the implementation never enforces that precondition. Under the default drop_larger_lower=True, Step 6 drops exactly the >1-change groups Design-2 needs, and _compute_design2_effects() then returns None as if no Design-2 groups existed. References: diff_diff/chaisemartin_dhaultfoeuille.py#L548, diff_diff/chaisemartin_dhaultfoeuille.py#L792, diff_diff/chaisemartin_dhaultfoeuille.py#L2297, diff_diff/chaisemartin_dhaultfoeuille.py#L3012, docs/methodology/REGISTRY.md#L620, tests/test_chaisemartin_dhaultfoeuille.py#L2800. Concrete fix: hard-error or at least warn when design2=True and self.drop_larger_lower is still True, or compute Design-2 on a pre-Step-6 side path so the wrapper cannot silently disappear.

Code Quality

No separate findings beyond the missing parameter-validation/propagation issues above.

Performance

No findings.

Maintainability

No separate findings beyond the API-contract drift noted below.

Tech Debt

No mitigating TODO.md entry exists for the P1 issues above, so they remain unmitigated. TODO.md only tracks older dCDH Phase 1 items in this area. Reference: TODO.md#L51.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: the docs/tests still lag the new surface. to_dataframe()’s docstring does not list heterogeneity or linear_trends, the results dataclass docs still omit heterogeneity_effects / design2_effects, and fit() still says non-default forward-compat params raise NotImplementedError. The tests also miss the failing interaction cases above (heterogeneity without L_max, heterogeneity x trends_*, design2 under default drop_larger_lower=True). References: diff_diff/chaisemartin_dhaultfoeuille.py#L567, diff_diff/chaisemartin_dhaultfoeuille_results.py#L322, diff_diff/chaisemartin_dhaultfoeuille_results.py#L825, tests/test_chaisemartin_dhaultfoeuille.py#L2709, tests/test_chaisemartin_dhaultfoeuille.py#L2800. Concrete fix: update the docstrings and add the missing interaction tests as part of the P1 fixes.
Severity: P3. Impact: summary() / to_dataframe("overall") still label the trends_linear NaN overall row as delta / “Cost-Benefit Delta”, even though the reference contract says no average total effect is computed with trends_lin=TRUE. References: diff_diff/chaisemartin_dhaultfoeuille.py#L2118, diff_diff/chaisemartin_dhaultfoeuille_results.py#L563, diff_diff/chaisemartin_dhaultfoeuille_results.py#L849. Concrete fix: relabel that row as unavailable / not computed, or omit it entirely when linear_trends_effects is active and L_max >= 2. citeturn0view0

Path to Approval

Add an upfront ValueError when heterogeneity is passed without L_max, and add a regression test using the existing heterogeneity fixture that asserts the failure.
Either reject heterogeneity with trends_linear / trends_nonparam, or plumb those adjustments into _compute_heterogeneity_test() so it operates on the same adjusted estimand as the fitted event study; add tests for every allowed combination.
Enforce the design2 precondition by erroring/warning when design2=True and drop_larger_lower=True, or compute Design-2 from a pre-Step-6 side path; add a regression test on _make_join_then_leave_panel() under default constructor settings.

P1: heterogeneity without L_max now raises ValueError (was silent no-op). P1: heterogeneity rejects trends_linear and trends_nonparam (would produce coefficients for wrong estimand since het test uses raw level changes). P1: design2=True with drop_larger_lower=True now raises ValueError (the 2-switch groups Design-2 needs are dropped by the default filter). P3: NaN overall row under trends_linear+L_max>=2 now labeled as "DID^{fd}_l (see linear_trends_effects)" instead of "Cost-Benefit Delta". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-13T17:35:22Z

/ai-review

github-actions · 2026-04-13T17:43:19Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 3f7e72c0fac612bdb456e2506372930acb1599f2

Overall Assessment

⛔ Blocker

Executive Summary

Re-review status: the prior P1s around heterogeneity validation and the design2=True / drop_larger_lower precondition appear fixed.
Affected methodology in this PR: DID^X covariate residualization, DID^{fd} linear trends, state-set trends, heterogeneity testing, and the Design-2 wrapper.
Severity P0 [Newly identified]: the DID^X first-stage OLS drops a reference time dummy without adding an intercept, so theta_hat is estimated from the wrong model.
Severity P0 [Newly identified]: the heterogeneity regression drops a reference cohort dummy without adding an intercept, so beta^{het}_l is not the saturated OLS described in the registry/docstring.
Severity P1 [Newly identified]: the claimed DID^X rank-deficiency guard is incomplete; solve_ols() can return NaN coefficients for dropped controls, and the residualization path then feeds those NaNs straight into np.einsum, poisoning adjusted outcomes.

Methodology

Severity: P0 [Newly identified]. Impact: DID^X residualization is not fitting the documented “first-differenced outcomes on first-differenced covariates with time FEs” model. The helper drops one time dummy as a reference category but never adds the required intercept, and solve_ols() only includes an intercept if the caller supplies one. That forces the omitted time effect to zero and silently biases theta_hat, which then biases every downstream DID^X-adjusted estimate and SE. Concrete fix: build the first-stage design as either [intercept, dX, dropped_time_dummies] or [dX, full_time_dummy_set], then add a regression test where the omitted-period FE is nonzero. References: REGISTRY.md#L612, linalg.py#L446, chaisemartin_dhaultfoeuille.py#L2813.
Severity: P0 [Newly identified]. Impact: _compute_heterogeneity_test() is not the saturated Lemma 7 regression it claims to be. It drops one cohort dummy as a reference category but again omits the intercept, so the omitted cohort mean is constrained to zero. That makes the reported beta, se, and inference for heterogeneity_effects methodologically wrong with no warning. Concrete fix: include an intercept when dropping one cohort dummy, or keep the full cohort-dummy set with no intercept; add a hand-checkable test with two cohorts that have different means. References: REGISTRY.md#L618, linalg.py#L446, chaisemartin_dhaultfoeuille.py#L2990.
Severity: P1 [Newly identified]. Impact: the DID^X rank-deficiency path is unsafe. solve_ols() explicitly returns NaN for dropped coefficients under the default "warn" mode, but _compute_covariate_residualization() immediately uses theta_hat in np.einsum. A duplicated or collinear control can therefore turn every residualized outcome in that baseline cohort into NaN, which then propagates into the estimator. Concrete fix: if any control coefficient is NaN, either refit on the finite subset of retained controls or raise a clear ValueError before residualizing; add a regression test with duplicated controls. References: linalg.py#L460, chaisemartin_dhaultfoeuille.py#L2832, chaisemartin_dhaultfoeuille.py#L2852.

Code Quality

No separate findings beyond the estimator-spec bugs above.

Performance

No findings.

Maintainability

No separate findings beyond the estimator-spec bugs above.

Tech Debt

No matching TODO.md entry mitigates the P0/P1 issues above. Reference: TODO.md#L51.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: the public docs/results surface still lag the shipped Phase 3 API. fit() still says forward-compat params raise NotImplementedError, the results docstring still calls Phase 3 outputs “placeholder,” and adjusted event-study rows are still labeled as plain DID_l. Concrete fix: refresh the docstrings and adjusted labels after the methodological fixes land. References: chaisemartin_dhaultfoeuille.py#L567, chaisemartin_dhaultfoeuille_results.py#L322, chaisemartin_dhaultfoeuille_results.py#L717, chaisemartin_dhaultfoeuille_results.py#L994.

Path to Approval

Fix the DID^X first-stage regression so the time-FE model is identified correctly: add an intercept with dropped time dummies, or keep the full time-dummy set with no intercept.
Fix the heterogeneity regression so the cohort-FE model is saturated correctly: add an intercept with dropped cohort dummies, or keep the full cohort-dummy set with no intercept.
Add a real rank-deficiency guard in DID^X before residualization, so dropped control coefficients cannot propagate NaN through Y_resid.
Add targeted tests for all three cases above: one DID^X regression with nonzero omitted-period FE, one heterogeneity regression with nonzero omitted-cohort mean, and one duplicated-control DID^X case.

P0: DID^X first-stage OLS now includes intercept when dropping one time dummy as reference (was forcing omitted period FE to zero). theta_hat extraction updated to skip intercept at index 0. P0: Heterogeneity regression now includes intercept when dropping one cohort dummy as reference (was forcing omitted cohort mean to zero). beta_het extraction updated to read index 1 instead of 0. P1: DID^X now guards against NaN control coefficients from rank-deficient OLS. If any theta_hat entry is NaN, residualization is skipped for that baseline with a UserWarning. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-13T17:49:53Z

/ai-review

github-actions · 2026-04-13T18:01:47Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 357a551a5a3df0478512a40beb887efe5d88e48e

Overall Assessment

⚠️ Needs changes

Executive Summary

Re-review status: the prior omitted-intercept P0s appear fixed; both the DID^X first stage and the heterogeneity regression now include an intercept before dropping a reference dummy. References: diff_diff/chaisemartin_dhaultfoeuille.py#L2813, diff_diff/chaisemartin_dhaultfoeuille.py#L3008
The prior DID^X rank-deficiency finding is only partially resolved: NaN propagation is gone, but the fallback now disables covariate adjustment for an entire baseline whenever any one control is dropped.
Severity P1 [Newly identified]: trends_nonparam does not enforce the source-material requirement that the set definition be coarser than group; over-granular inputs fail late through empty control pools instead of a targeted validation error.
Severity P1 [Newly identified]: heterogeneity now hard-rejects trends_linear and trends_nonparam, but that extra restriction is not documented in REGISTRY.md or in the public fit() docstring.
No separate performance or security concerns stood out in the changed files.

Methodology

Severity P1. Impact: _compute_covariate_residualization() aborts DID^X residualization for the whole baseline as soon as any control coefficient is NaN, even though local solve_ols() already preserves the identified coefficients and computes residuals from the reduced design. That means one collinear or duplicated control turns a baseline-specific DID^X path back into a raw, unadjusted path while the public surface is still labeled as DID^X. REGISTRY.md currently says the downstream DID path uses residualized outcomes, so this is an undocumented estimator mismatch. Concrete fix: when some controls are dropped, refit on the finite subset or apply residualization with the finite theta_hat entries instead of continue; add a regression test with duplicated controls where one control remains estimable. References: diff_diff/chaisemartin_dhaultfoeuille.py#L2851, diff_diff/linalg.py#L485, docs/methodology/REGISTRY.md#L612. (cran.r-universe.dev)
Severity P1 [Newly identified]. Impact: trends_nonparam validates only time-invariance, but the reference contract also requires the set/interacted-set definition to be coarser than group. With a group-level or otherwise over-granular set ID, the implementation falls through to empty same-set control pools and either excludes switchers or raises a generic horizon-1 “no switching groups” error, rather than a targeted assumption-check failure. Concrete fix: add an explicit pre-fit check that the trends_nonparam partition is coarser than group and add a targeted test using trends_nonparam=group. References: diff_diff/chaisemartin_dhaultfoeuille.py#L1055, diff_diff/chaisemartin_dhaultfoeuille.py#L3387, tests/test_chaisemartin_dhaultfoeuille.py#L2656. (cran.r-universe.dev)
Severity P1 [Newly identified]. Impact: the new heterogeneity path rejects trends_linear and trends_nonparam outright, but the reference interface documents predict_het as incompatible with normalized and controls, not with trend-adjusted or state-set specifications. Because REGISTRY.md and the fit() docstring mention only the controls restriction, this is an undocumented methodology/API deviation rather than a documented limitation. Concrete fix: either implement heterogeneity on the trends_linear and trends_nonparam paths, or document those extra restrictions in REGISTRY.md and the public docstring, then align the tests to that documented contract. References: diff_diff/chaisemartin_dhaultfoeuille.py#L2145, docs/methodology/REGISTRY.md#L618, tests/test_chaisemartin_dhaultfoeuille.py#L2784, tests/test_chaisemartin_dhaultfoeuille.py#L2793. (cran.r-universe.dev)

Code Quality

No separate findings beyond the methodology issues above.

Performance

No findings.

Maintainability

No separate findings beyond the methodology/contract issues above.

Tech Debt

No matching TODO.md entry mitigates the P1 findings above; the existing dCDH TODO items cover only Phase 1 placebo SE and mixed-direction parity. Reference: TODO.md#L51

Security

No findings.

Documentation/Tests

Severity P3. Impact: adjusted event-study outputs are still labeled as plain DID_l in the public summary/dataframe surfaces, so users cannot tell whether they are looking at raw, DID^X, or DID^{fd} horizons. Concrete fix: thread adjusted horizon labels through summary() and to_dataframe("event_study"). References: diff_diff/chaisemartin_dhaultfoeuille_results.py#L713, diff_diff/chaisemartin_dhaultfoeuille_results.py#L987

Path to Approval

In _compute_covariate_residualization(), stop disabling DID^X for an entire baseline when only some controls are dropped; residualize with the retained controls and add a duplicated-control regression test.
Add a targeted trends_nonparam validation that rejects set definitions that are not coarser than group, and cover it with a trends_nonparam=group test.
Either implement heterogeneity for trends_linear and trends_nonparam, or explicitly document those extra restrictions in REGISTRY.md and the fit() docstring and keep the rejection tests as the documented contract.

P1: DID^X rank-deficiency now residualizes with finite subset of theta_hat (zeroing NaN coefficients) instead of skipping entirely. P1: trends_nonparam now rejects set definitions that are not coarser than group (singleton sets have no within-set controls). P1: heterogeneity restrictions on trends_linear and trends_nonparam now documented in REGISTRY.md and fit() docstring. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-13T18:28:50Z

/ai-review

github-actions · 2026-04-13T18:41:15Z

🔁 AI review rerun (requested by @igerber)

Head SHA: bc1dab76d3175c18152776658a22a427d5b4a98d

Overall Assessment

⛔ Blocker

Executive Summary

Prior re-review items are mostly addressed: the DID^X collinearity fallback now keeps identified controls, trends_nonparam=group now errors, and the heterogeneity restrictions are documented in both fit() and the Methodology Registry. References: diff_diff/chaisemartin_dhaultfoeuille.py:2866, diff_diff/chaisemartin_dhaultfoeuille.py:1082, docs/methodology/REGISTRY.md:618
P0: trends_nonparam is propagated into the main multi-horizon DID and IF paths, but not into the dynamic placebo point-estimate or placebo IF paths, so placebo diagnostics use the wrong control pool under state-set trends.
P1: DID^X first-stage regressions still hard-fail when a baseline-specific control sample has fewer observations than regressors; this new code path does not handle the small-sample edge case gracefully.
No separate performance or security concerns stood out in the changed files.

Methodology

Severity P0. Impact: under trends_nonparam, the PR correctly restricts same-set controls in _compute_multi_horizon_dids() and _compute_per_group_if_multi_horizon(), but the placebo paths still use unrestricted baseline-matched controls. That means results.placebo_event_study, placebo SEs, and placebo p-values are silently computed for the wrong estimator whenever state-set trends are requested. The CRAN manual defines trends_nonparam as comparing switchers only to not-yet-switched controls with the same baseline treatment and the same set value, and it defines placebos as pre-treatment comparisons between switchers and “their controls,” so the same restriction has to carry into the placebo contract. Concrete fix: thread set_ids through _compute_multi_horizon_placebos() and _compute_per_group_if_placebo_horizon(), apply the same same-set mask used in the positive-horizon DID/IF helpers, and add a regression test where cross-set controls change the placebo estimate and placebo SE under trends_nonparam. References: diff_diff/chaisemartin_dhaultfoeuille.py:1279, diff_diff/chaisemartin_dhaultfoeuille.py:1416, diff_diff/chaisemartin_dhaultfoeuille.py:3406, diff_diff/chaisemartin_dhaultfoeuille.py:3541, diff_diff/chaisemartin_dhaultfoeuille.py:3566, diff_diff/chaisemartin_dhaultfoeuille.py:3661. (search.r-project.org)
Severity P1. Impact: _compute_covariate_residualization() still aborts the fit when any baseline-specific DID^X first-stage has fewer observations than regressors. The reference contract says those regressions are estimated separately by baseline treatment and, if one stratum has fewer observations than variables, the command continues running while dropping some controls and warning the user. In this implementation, the design includes an intercept, all controls, and non-reference time FEs, then solve_ols() raises immediately on n < k, so a legitimate small-stratum panel crashes instead of degrading gracefully. Concrete fix: detect n_obs < n_params before calling solve_ols() and either emulate the reference behavior by dropping unidentified control/time-FE columns with a warning, or skip residualization for that baseline with a documented warning; then add a small-sample regression test with multiple controls and too few not-yet-treated control cells. References: diff_diff/chaisemartin_dhaultfoeuille.py:2828, diff_diff/chaisemartin_dhaultfoeuille.py:2843, diff_diff/linalg.py:545. (search.r-project.org)

Code Quality

No separate findings beyond the methodology issues above.

Performance

No findings.

Maintainability

No separate findings beyond the propagation issue above.

Tech Debt

No matching TODO.md entry mitigates the P0/P1 issues above; the existing dCDH TODO items only track Phase 1 placebo SE and mixed-direction parity. Reference: TODO.md:51

Security

No findings.

Documentation/Tests

Severity P3. Impact: adjusted event-study outputs are still labeled as plain DID_l in summary() and to_dataframe("event_study"), so users cannot tell raw, DID^X, and DID^{fd}` horizons apart from the public surface. Concrete fix: thread feature-specific horizon labels into both renderers. References: diff_diff/chaisemartin_dhaultfoeuille_results.py:717, diff_diff/chaisemartin_dhaultfoeuille_results.py:994
Verification note: I did not execute the test suite in this environment; the available Python interpreter is missing numpy.

Path to Approval

Propagate set_ids into _compute_multi_horizon_placebos() and _compute_per_group_if_placebo_horizon(), apply the same same-set control restriction there, and add a trends_nonparam + placebo regression test where cross-set controls would otherwise change the placebo output.
Make _compute_covariate_residualization() handle n_obs < n_params per baseline without aborting the fit; add a test with multiple controls/time FE and too few not-yet-treated control cells.
Rename the public event-study labels in summary() and to_dataframe("event_study") so adjusted horizons are surfaced as adjusted estimands.

P0: trends_nonparam set_ids now threaded into both placebo functions (_compute_multi_horizon_placebos and _compute_per_group_if_placebo_horizon) so placebo diagnostics use same-set controls when state-set trends active. P1: DID^X first-stage now guards n_obs < n_params before calling solve_ols. Skips residualization for that baseline with warning instead of crashing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-13T18:57:06Z

/ai-review

github-actions · 2026-04-13T19:07:01Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 000dc46f63ade3e2cc74e6a5b659504f0ba34ac3

Overall Assessment

⚠️ Needs changes

Executive Summary

The prior P0 on trends_nonparam placebo control pools appears fixed: set_ids now reaches both placebo point-estimate and placebo IF helpers.
P1: DID^X still diverges from the source contract when a baseline-specific first stage is empty or underidentified; the code now warns and leaves that baseline unresidualized instead of handling it the way the reference implementation does.
P1 [Newly identified]: the new DID^X first-stage regressions ignore collapsed cell weights on disaggregated inputs, so results can silently diverge from the documented/reference estimator when (g,t) cell sizes differ.
P3: adjusted event-study outputs are still labeled as plain DID_l in the public renderers.
No separate performance or security concerns stood out in the changed files.

Methodology

Severity P1. Impact: method affected is DID^X (Web Appendix Section 1.2). In _compute_covariate_residualization(), baseline strata with n_obs == 0 or n_obs < n_params are kept in-sample and left completely unresidualized, so the fitted event-study/overall ATT can silently mix adjusted and raw outcomes for different baseline-treatment levels. The official did_multiplegt_dyn docs say the controls regressions are estimated separately by baseline treatment on not-yet-treated controls, and the reference implementation warns about singular designs by dropping unidentified controls and drops only baseline-treatment levels where residualization actually fails; it does not retain those levels unadjusted. Concrete fix: make failed first stages source-consistent by either estimating with dropped controls/time-FE columns or dropping the failed baseline-treatment level from the estimation sample with an explicit warning and REGISTRY note; add regression tests for both the n_obs == 0 and n_obs < k branches. References: diff_diff/chaisemartin_dhaultfoeuille.py:2793, diff_diff/chaisemartin_dhaultfoeuille.py:2847, REGISTRY.md:612. (search.r-project.org)
Severity P1 [Newly identified]. Impact: method affected is again DID^X. The new first-stage regressions are run via unweighted solve_ols() even though the official contract says estimation is weighted by the number of observations (or summed weights) in each group × time cell when the input is more disaggregated than group × time, and the reference residualization step passes weight_XX into the FE regression. On individual-level panels with unequal cell sizes, this will silently move the residualization away from the reference estimator. Concrete fix: pass the aggregated cell weights for the valid first-difference observations into solve_ols(weights=...), and add a regression test with unequal numbers of rows per (g,t) cell. References: diff_diff/chaisemartin_dhaultfoeuille.py:2864, REGISTRY.md:612. (search.r-project.org)

Code Quality

No separate findings beyond the methodology issues above.

Performance

No findings.

Maintainability

No separate findings beyond the public-surface labeling issue below.

Tech Debt

No matching entry in TODO.md:51 mitigates the P1s above; the existing dCDH TODOs only cover Phase 1 per-period placebo SE and mixed-direction parity.

Security

No findings.

Documentation/Tests

Severity P3. Impact: adjusted event-study outputs are still rendered as plain DID_l in both summary() and to_dataframe("event_study"), so users cannot tell raw DID_l from DID^X, DID^{fd}, or DID^{X,fd} horizons from the public surface. Concrete fix: reuse _estimand_label() or thread feature-specific horizon labels into both renderers. References: diff_diff/chaisemartin_dhaultfoeuille_results.py:717, diff_diff/chaisemartin_dhaultfoeuille_results.py:994.
Severity P3. Impact: the new trends_nonparam tests stop at basic L_max=1 point-estimate coverage and do not directly exercise the modified placebo helpers, so the prior placebo-path fix is still unguarded by a targeted regression test. Concrete fix: add a trends_nonparam + placebo test where cross-set controls change both the placebo effect and the placebo SE. References: tests/test_chaisemartin_dhaultfoeuille.py:2642, diff_diff/chaisemartin_dhaultfoeuille.py:3586, diff_diff/chaisemartin_dhaultfoeuille.py:3685.

Path to Approval

Make DID^X first-stage failure handling source-consistent: do not keep baseline-treatment strata unadjusted when residualization is empty or underidentified; either estimate with dropped controls/time FE or drop those strata from the estimation sample, and add regression tests for n_obs == 0 and n_obs < k.
Pass collapsed cell weights into the DID^X first-stage regressions, and add a regression test on disaggregated input with unequal (g,t) cell sizes.

P1: Failed DID^X first-stage strata (n_obs=0 or n_obs < n_params) now have outcomes set to NaN so they're excluded from downstream DID computation. Previously left unadjusted, mixing raw + adjusted. P1: Cell-weight deviation documented in REGISTRY (equal cell weights vs R's N_gt observation-count weights - same Phase 1 convention). P3: Event study rows in to_dataframe("event_study") now labeled as DID^X_h, DID^{fd}_h, DID^{X,fd}_h when adjustments active. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-13T19:09:44Z

/ai-review

github-actions · 2026-04-13T19:22:45Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 8fd2e604a92e930bd5cfed3243a7c844811c6c66

Overall Assessment

Needs changes

Executive Summary

The prior trends_nonparam placebo control-pool issue appears resolved: set_ids now reaches both the positive-horizon and placebo point-estimate / IF helpers.
The prior DID^X cell-weighting concern is now explicitly documented in docs/methodology/REGISTRY.md, so under the review rules it is informational only, not a blocker.
One prior P1 remains unresolved: failed DID^X first-stage strata are set to NaN during residualization, but they are not actually excluded from downstream horizon eligibility. That can poison adjusted event-study and placebo outputs instead of dropping only the failed strata.
to_dataframe("event_study") now labels adjusted horizons correctly, but summary() still prints plain DID_l, and the combined controls + trends_linear overall label still drops the X.
No separate performance or security problems stood out in the changed files.

Methodology

Severity P1. Impact: This is the prior DID^X failed-first-stage finding, and it is not fully resolved. The new REGISTRY note says failed baseline-specific first stages are “excluded from the estimation,” but the implementation only sets those strata’s residualized outcomes to NaN; it does not carry an exclusion mask into the adjusted horizon builders. fit() swaps in the residualized Y_mat, then _compute_multi_horizon_dids(), _compute_per_group_if_multi_horizon(), _compute_multi_horizon_placebos(), and _compute_per_group_if_placebos() rebuild eligibility only from observability / control-pool rules. A switcher in a failed baseline stratum can therefore remain in eligible, contribute NaN outcome differences, and make otherwise estimable adjusted DID_l, placebo horizons, and their SEs collapse to NaN instead of being removed from N_l / N_pl_l. That is an undocumented mismatch with the Section 1.2 / REGISTRY contract and a missing failed-sample edge-case guard. Concrete fix: return a failed-stratum mask or estimable-baseline set from _compute_covariate_residualization() and intersect it with the eligible sample in all adjusted positive-horizon and placebo point-estimate / IF helpers before denominators are formed. References: diff_diff/chaisemartin_dhaultfoeuille.py:L982-L995, diff_diff/chaisemartin_dhaultfoeuille.py:L2793-L2808, diff_diff/chaisemartin_dhaultfoeuille.py:L2851-L2867, diff_diff/chaisemartin_dhaultfoeuille.py:L3383-L3471, diff_diff/chaisemartin_dhaultfoeuille.py:L3544-L3586, diff_diff/chaisemartin_dhaultfoeuille.py:L3648-L3685, diff_diff/chaisemartin_dhaultfoeuille.py:L3739-L3819, docs/methodology/REGISTRY.md:L612-L620.
Severity P3. Impact: The previously raised DID^X weighting deviation and the Phase 3 extension-scope deviations are now explicitly documented in the Methodology Registry, so they are mitigated under the review rules and do not block approval. This includes the equal-cell-weight DID^X first stage, the conservative cumulated DID^{fd} SE, the partial heterogeneity implementation, and the descriptive Design-2 wrapper. Concrete fix: none required. References: docs/methodology/REGISTRY.md:L612-L620.

Code Quality

No separate findings. The changed inference paths continue to use safe_inference(), and I did not see a new inline t-stat / partial-NaN anti-pattern in the modified files.

Performance

No findings.

Maintainability

No separate findings beyond the public-surface labeling issue below.

Tech Debt

No separate findings. TODO.md does not mitigate the P1 above; the existing dCDH review-debt entries cover different items. References: TODO.md:L51-L60.

Security

No findings.

Documentation/Tests

Severity P3. Impact: Phase 3 adjusted estimands are still labeled inconsistently on the public surface. to_dataframe("event_study") now uses _horizon_label(), but summary() still hardcodes DID_l, and _estimand_label() still returns DID^{fd}_l (see linear_trends_effects) when both controls and trends_linear are active with L_max >= 2, dropping the X. That makes it harder to distinguish raw DID_l from DID^X_l, DID^{fd}_l, and DID^{X,fd}_l. Concrete fix: use _horizon_label() in the summary event-study table and preserve the combined ^{X,fd} label in _estimand_label(). References: diff_diff/chaisemartin_dhaultfoeuille_results.py:L421-L430, diff_diff/chaisemartin_dhaultfoeuille_results.py:L433-L466, diff_diff/chaisemartin_dhaultfoeuille_results.py:L724-L739, diff_diff/chaisemartin_dhaultfoeuille_results.py:L998-L1007.
Severity P3. Impact: The new controls / trends tests cover happy-path behavior, but they still do not directly exercise the DID^X failed-first-stage branches (n_obs == 0 and n_obs < n_params) with another baseline still estimable, so the remaining exclusion bug is unguarded. Concrete fix: add targeted regression tests asserting failed strata are removed from N_l / N_pl_l while valid adjusted horizons remain finite. References: tests/test_chaisemartin_dhaultfoeuille.py:L2290-L2555, docs/methodology/REGISTRY.md:L612-L612.
I did not re-run the full test suite in this review environment.

Path to Approval

Thread a failed-stratum exclusion mask out of _compute_covariate_residualization() and apply it consistently to adjusted positive-horizon and placebo point-estimate / IF helpers before eligible, N_l, and N_pl_l are finalized.
Add regression tests for both DID^X failure branches (n_obs == 0 and n_obs < n_params) where another baseline remains estimable, and assert the valid adjusted event-study / placebo outputs stay finite while the failed stratum is excluded.
Fix the remaining Phase 3 results labels in summary() and _estimand_label() so DID^X, DID^{fd}, and DID^{X,fd} render consistently.

…bels P1: Failed DID^X first-stage strata now have N_mat zeroed out (not just Y_mat NaN'd). The downstream eligibility checks (N_mat[g,idx] > 0) in all DID/IF/placebo functions naturally exclude these groups from N_l/N_pl_l, preventing NaN poisoning of otherwise estimable horizons. P3: _estimand_label() now returns DID^{X,fd}_l (not DID^{fd}_l) when both controls and trends_linear active with L_max>=2. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-13T19:29:10Z

/ai-review

github-actions · 2026-04-13T19:43:16Z

🔁 AI review rerun (requested by @igerber)

Head SHA: d087f2152706b0d2e14718b5ccaab50d21640f31

Overall Assessment

⛔ Blocker

Executive Summary

The prior DID^X failed-first-stage exclusion issue looks materially improved: the explicit n_obs==0 / n_obs<n_params branches now feed failed_baselines and zero N_mat before the multi-horizon and placebo paths.
One blocker remains in the linear-trends extension: trends_linear=True is not propagated correctly through the existing normalization / aggregation surfaces.
The new DID^X validation still accepts non-finite controls; an Inf control can collapse the first stage after finite-row filtering and silently leave a baseline unadjusted.
The registry now documents the intended R deviations (cell-weight DID^X first stage, conservative cumulated DID^{fd} SE, partial heterogeneity, descriptive Design-2), so those items are informational only.
Public reporting is still only partially fixed: summary() keeps plain DID_l labels and a Cost-Benefit Delta header on adjusted trend summaries.

Methodology

Severity P0. Impact: trends_linear=True is not threaded through all existing aggregation outputs. The fit path first-differences Y_mat, so multi_horizon_dids[l]["did_l"] becomes the Section 1.3 second-difference object DID^{fd}_l, but _compute_normalized_effects() still normalizes that un-cumulated did_l, and the PR explicitly blanks cost_benefit_delta / overall_* for L_max>=2. Section 1.3/Lemma 6 says the level effect is recovered by cumulation and that the normalized event-study effects and cost-benefit ratio can be generalized similarly. So horizon-2+ results.normalized_effects are silently the wrong estimand, and the missing delta surface is an undocumented deviation from the source method. Concrete fix: compute both surfaces from the cumulated δ^{fd} path (or from per-group running sums before aggregation). If those aggregations are not shipping now, hard-disable both surfaces under trends_linear and add an explicit REGISTRY.md deviation note instead of returning a mislabeled normalized object. References: diff_diff/chaisemartin_dhaultfoeuille.py:L1022, diff_diff/chaisemartin_dhaultfoeuille.py:L2075, diff_diff/chaisemartin_dhaultfoeuille.py:L2154, diff_diff/chaisemartin_dhaultfoeuille.py:L3845, tests/test_chaisemartin_dhaultfoeuille.py:L2554.
Severity P3. Impact: the cell-weight DID^X first stage, the conservative cumulated DID^{fd} SE, the partial heterogeneity implementation, the descriptive Design-2 wrapper, and the “per-period path stays raw under controls” behavior are now explicitly documented in the Methodology Registry, so those deviations are mitigated under the review rules. Concrete fix: none. References: docs/methodology/REGISTRY.md:L612, docs/methodology/REGISTRY.md:L614, docs/methodology/REGISTRY.md:L618, docs/methodology/REGISTRY.md:L620.

Code Quality

Severity P1. Impact: the DID^X input guard rejects NaN controls but not ±Inf. After pd.to_numeric, Inf survives validation, _compute_covariate_residualization() drops non-finite first-stage rows via finite_mask, and the n_obs==0-after-filter branch just continues without warning or failed_baselines bookkeeping. Cells with non-finite covariates are also left raw in the level residualization step. That means a non-finite control can silently mix adjusted and unadjusted cells or fall back to raw DID for an affected baseline, contrary to the registry note that failed first stages are excluded from estimation. Concrete fix: require np.isfinite on control columns up front, and route the n_obs==0-after-finite_mask branch through the same warning / failed_baselines exclusion path used for the explicit failure branches. References: diff_diff/chaisemartin_dhaultfoeuille.py:L648, diff_diff/chaisemartin_dhaultfoeuille.py:L2828, diff_diff/chaisemartin_dhaultfoeuille.py:L2929, docs/methodology/REGISTRY.md:L612.

Performance
No findings.

Maintainability

Severity P3. Impact: the public summary surface is still inconsistent for adjusted estimands. to_dataframe("event_study") now uses adjusted labels, but summary() still prints Event Study (DID_l, …) and DID_l rows for adjusted horizons, and its L_max>=2 overall block header remains Cost-Benefit Delta... even when the trend-adjusted path intentionally surfaces NaN there. Concrete fix: reuse _horizon_label() in the summary event-study block and make the overall header follow _estimand_label() / the actual overall surface. References: diff_diff/chaisemartin_dhaultfoeuille_results.py:L421, diff_diff/chaisemartin_dhaultfoeuille_results.py:L582, diff_diff/chaisemartin_dhaultfoeuille_results.py:L726.

Tech Debt
No findings. The current TODO.md entries do not track the P0/P1 above, so they remain unmitigated.

Security
No findings.

Documentation/Tests

Severity P3. Impact: the new tests cover parity for point effects under controls / trends, but they do not exercise the normalized_effects / cost_benefit_delta interaction under trends_linear or the non-finite-control DID^X failure path. The trend tests currently lock in cost_benefit_delta is None instead. Concrete fix: add regression tests for trends_linear normalized / delta outputs and for controls with np.inf (or another finite-filter-to-zero first-stage case). References: tests/test_chaisemartin_dhaultfoeuille.py:L2554, tests/test_chaisemartin_dhaultfoeuille_parity.py:L385.
I did not rerun the suite in this environment; python3 here does not have the project dependencies installed (numpy is missing).

Path to Approval

For trends_linear=True, stop computing normalized_effects from the raw DID^{fd}_l second differences. Either implement the generalized normalized / delta aggregations from the cumulated trend-adjusted path, or gate both surfaces off and document that limitation in REGISTRY.md.
Reject non-finite control values during DID^X validation, and make the finite_mask -> n_obs==0 branch warn and exclude the failed baseline exactly like the explicit n_obs==0 / n_obs<n_params branches.
Add regression coverage for both issues: one trends_linear case that checks normalized / delta behavior at l>=2, and one DID^X case with non-finite controls that must fail loudly or exclude the baseline.
Finish the remaining adjusted-label fixes in summary() so the printed surface matches the already-fixed to_dataframe() labels.

…guard P0: normalized_effects suppressed under trends_linear (was normalizing second-differences DID^{fd}_l instead of level effects). REGISTRY documents that normalized_effects and cost_benefit_delta are both unavailable under trends_linear. P1: Non-finite (Inf) control values now rejected with ValueError during DID^X validation (was silently collapsing first-stage OLS). P3: summary() event study header and row labels now use _horizon_label() (DID^X_l, DID^{fd}_l, DID^{X,fd}_l matching to_dataframe). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-13T19:49:11Z

/ai-review

github-actions · 2026-04-13T20:01:10Z

🔁 AI review rerun (requested by @igerber)

Head SHA: e3d51db3b187db18be0176a69ce3f2180df78942

Overall Assessment

⚠️ Needs changes

Executive Summary

The prior blocker on trends_linear normalization / cost-benefit output is resolved: those surfaces are now suppressed and documented in the registry.
The prior DID^X non-finite-control issue is also resolved: Inf controls are now rejected before residualization.
Severity P1: trends_nonparam does not reject missing set assignments; all-NA groups can pass validation and then get silently treated as having no same-set controls.
Severity P1: design2_effects does not propagate active adjustments consistently; controls are respected in the plain Design-2 path but silently dropped once trends_linear=True.
Severity P2: controls=[] falls through to an internal np.stack([]) failure instead of a clear API-level validation error.
Severity P3: trend-adjusted public reporting is still partially inconsistent; summary() still presents a Cost-Benefit Delta block even when that surface is intentionally suppressed.

Methodology

Severity P1. Impact: the new state-set trend path does not validate missing set membership. trends_nonparam only checks column existence, time-invariance, and that the partition is coarser than group; it never rejects NA-only groups. Those groups then hit the same-set filter with NaN == NaN -> False, so they are silently treated as having no within-set controls and get excluded under a generic “no baseline-matched controls” warning. That is a missing assumption/input check on the Section 1.4 control-pool definition, and it changes control-group composition without a targeted error. Concrete fix: explicitly reject missing set assignments before building set_ids_arr, with counts/examples, and add a regression test for all-NA and mixed-NA set membership. References: diff_diff/chaisemartin_dhaultfoeuille.py:L1071-L1111, diff_diff/chaisemartin_dhaultfoeuille.py:L3454-L3456, diff_diff/chaisemartin_dhaultfoeuille.py:L3589-L3591, docs/methodology/REGISTRY.md:L616-L616
Severity P1. Impact: the new Design-2 surface has inconsistent parameter propagation. In the results builder, design2_effects uses Y_mat when trends_linear=False, so controls flow through as residualized levels; but the same code path switches to raw y_pivot whenever trends_linear=True, which silently drops the requested covariate adjustment in the combined design2 + controls + trends_linear case. The registry note only documents that Design-2 is a descriptive convenience wrapper, not that active adjustments are conditionally ignored. This is an incomplete parameter-propagation / semantic-contract bug on a new public surface. Concrete fix: either carry the adjusted level-outcome surface through to Design-2 even when trends are enabled, or explicitly reject unsupported design2 + adjustment combinations with a clear error and document that contract. References: diff_diff/chaisemartin_dhaultfoeuille.py:L2354-L2372, docs/methodology/REGISTRY.md:L620-L620
Severity P3. Impact: the previous trends_linear blocker is now mitigated correctly. normalized_effects and cost_benefit_delta are no longer exposed under trends_linear, and the registry documents that suppression as an intentional deviation. Concrete fix: none. References: diff_diff/chaisemartin_dhaultfoeuille.py:L1523-L1537, diff_diff/chaisemartin_dhaultfoeuille.py:L2163-L2174, docs/methodology/REGISTRY.md:L614-L614

Code Quality

Severity P2. Impact: controls=[] is not handled explicitly. The empty list passes the early controls is not None gate, then the fit path reaches np.stack(X_pivots, axis=2) with no arrays and raises an opaque internal error instead of a user-facing validation message. This is a new edge-case failure path on the public API. Concrete fix: treat an empty list as None or raise an early ValueError("controls must contain at least one column name"), and add a regression test. References: diff_diff/chaisemartin_dhaultfoeuille.py:L633-L673, diff_diff/chaisemartin_dhaultfoeuille.py:L972-L980

Performance

No findings.

Maintainability

Severity P3. Impact: the reporting surface is still partially inconsistent for trends_linear with L_max >= 2. summary() still centers an overall Cost-Benefit Delta block even though delta is intentionally suppressed in that mode, and it never prints the actual cumulated level effects from linear_trends_effects. The underlying estimator contract is now safe, but the summary surface still nudges users toward the wrong top-level object. Concrete fix: make the overall block explicitly say that no aggregate delta is reported under trends_linear, and add a separate linear_trends_effects table to summary(). References: diff_diff/chaisemartin_dhaultfoeuille_results.py:L582-L590, diff_diff/chaisemartin_dhaultfoeuille_results.py:L726-L748

Tech Debt

No findings. TODO.md does not already track the P1 items above, so they remain unmitigated. Reference: TODO.md:L51-L79

Security

No findings.

Documentation/Tests

Severity P3. Impact: the new tests cover the happy-path state-set, trend, and Design-2 scenarios, but they do not cover the two blocking interaction cases above: missing trends_nonparam set IDs, and design2 combined with active adjustments (controls, trends_linear, or both). That leaves the new P1 paths unguarded. Concrete fix: add regression tests for an NA set-membership column, and for design2=True with controls, with trends_linear, and with both together. References: tests/test_chaisemartin_dhaultfoeuille.py:L2634-L2694, tests/test_chaisemartin_dhaultfoeuille.py:L2538-L2552, tests/test_chaisemartin_dhaultfoeuille.py:L2838-L2851

I did not rerun the test suite in this environment.

Path to Approval

Reject missing / NA set assignments in trends_nonparam before constructing set_ids_arr, and add a regression test that proves the path now fails loudly instead of silently excluding those groups.
Define a single contract for design2 under active adjustments: either compute it from the adjusted level-outcome surface when controls / trends_linear are active, or reject unsupported combinations with a clear ValueError. Add regression coverage for design2=True with controls, with trends_linear, and with both together.

P1: trends_nonparam now rejects NaN/missing set assignments with ValueError. P1: design2_effects always uses raw level outcomes from y_pivot (not residualized or first-differenced Y_mat). P2: controls=[] now raises ValueError instead of crashing on np.stack([]). P3: summary() overall block labeled "N/A under trends_linear" when trends + L_max>=2 (was "Cost-Benefit Delta" with NaN value). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-13T20:38:02Z

/ai-review

github-actions · 2026-04-13T20:45:14Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 8a57c5df34a85cce09c71863187801dac8c6a028

Overall Assessment

⚠️ Needs changes

Executive Summary

Static review indicates the prior trends_nonparam missing-set blocker is addressed: missing set IDs are now rejected before set_ids_arr is built.
The prior Design-2 propagation blocker remains unresolved, and it is broader now: design2_effects still ignores controls, trends_linear, and trends_nonparam.
The DID^X cell-weighted first stage, the DID^{fd} conservative cumulated SE, and the general “Design-2 is a convenience wrapper” deviations are documented in REGISTRY.md, so I did not count those documented deviations as defects.
I did not find new P0 NaN/inference inconsistencies in the added DID^X / DID^{fd} / state-set / heterogeneity code paths.
This was a static review only; I could not run the test suite in this shell because pytest/numpy are unavailable.

Methodology

Severity P1. Impact: design2_effects is still computed from raw level outcomes and does not propagate active adjustment options or same-set restrictions. fit() always passes raw y_pivot.to_numpy() into _compute_design2_effects(), and the helper itself has no way to consume residualized outcomes, first-differenced outcomes, or set IDs. That means a fit with design2=True plus controls, trends_linear, or trends_nonparam produces a mixed results object: the main estimator is adjusted, but the Design-2 block is not. The current registry note documents Design-2 as a descriptive convenience wrapper, but it does not explicitly document that these active options are ignored. This is an incomplete parameter-propagation / semantic-contract defect on a new public surface. References: diff_diff/chaisemartin_dhaultfoeuille.py:L2375-L2387, diff_diff/chaisemartin_dhaultfoeuille.py:L3147-L3222, docs/methodology/REGISTRY.md:L620-L620. Concrete fix: either reject design2=True whenever any of controls, trends_linear, or trends_nonparam is active, with an explicit ValueError plus a registry note, or recompute the Design-2 block from the same adjusted outcome surface and thread the same-set restriction through the helper.

Code Quality

No findings.

Performance

No findings.

Maintainability

Severity P3. Impact: summary() now suppresses the misleading aggregate under trends_linear, but it still does not surface the new Phase 3 outputs (linear_trends_effects, heterogeneity_effects, design2_effects). Those are only discoverable by inspecting attributes or calling to_dataframe(), which leaves the primary reporting path incomplete for a large feature addition. References: diff_diff/chaisemartin_dhaultfoeuille_results.py:L582-L790, diff_diff/chaisemartin_dhaultfoeuille_results.py:L1054-L1074. Concrete fix: add dedicated summary blocks, or at minimum an explicit summary note pointing users to to_dataframe(level="linear_trends"), to_dataframe(level="heterogeneity"), and results.design2_effects.

Tech Debt

No findings. TODO.md does not already track the P1 above, so it remains unmitigated. Reference: TODO.md:L51-L79.

Security

No findings.

Documentation/Tests

Severity P3. Impact: the added tests cover the happy paths and the new parity scenarios, but they still do not cover design2=True with controls, trends_linear, or trends_nonparam, and there is no regression test for the new missing-set validation branch. That leaves both the unresolved Design-2 contract issue and the new set-ID guard unprotected against regression. References: tests/test_chaisemartin_dhaultfoeuille.py:L2634-L2694, tests/test_chaisemartin_dhaultfoeuille.py:L2814-L2885, tests/test_chaisemartin_dhaultfoeuille_parity.py:L365-L440. Concrete fix: add explicit regression tests for each design2 interaction path; if the intended contract is rejection, assert the exact ValueError. Add one test with missing trends_nonparam assignments.

Path to Approval

Define a single supported contract for design2 under active adjustments. Either make design2 consume the adjusted estimator surface and same-set restriction, or reject design2=True when controls, trends_linear, or trends_nonparam is active.
Add regression coverage for design2=True with controls, trends_linear, and trends_nonparam, plus one missing-set-membership test for trends_nonparam.

P1: REGISTRY now explicitly documents that design2_effects always uses raw (unadjusted) outcomes regardless of active controls/trends/nonparam. For full adjusted Design-2 estimation, the paper recommends running the standard estimator on a restricted subsample with trends_nonparam. P3: Added test_nan_set_membership_rejected for trends_nonparam NaN guard. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-13T20:56:47Z

/ai-review

github-actions · 2026-04-13T21:05:04Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 4da0841a6f00352035465b657be78784a2cbbf2d

Overall Assessment

⚠️ Needs changes

Executive Summary

The previous Design-2 propagation blocker is resolved: this PR now explicitly documents design2 as a raw descriptive convenience wrapper, so the remaining mismatch is no longer undocumented.
The previous trends_nonparam missing-set-membership blocker is also resolved: NaN set assignments are now rejected and covered by a regression test.
New blocker: trends_nonparam still lacks the paper’s set-support check, so it can silently change the estimand by dropping unsupported switcher/horizon pairs instead of rejecting or documenting the unsupported design.
I did not find new P0 NaN/inference inconsistencies in the added DID^X, DID^{fd}, heterogeneity, or Design-2 code paths.
Static review only: I could not execute the test suite in this shell because importing the package fails without numpy.

Methodology

Severity P1. Impact: the state-set-trends implementation (trends_nonparam) only validates that the set column exists, is time-invariant, and is coarser than group, then relies on the same-set control restriction to silently drop switcher/horizon pairs when no within-set controls remain. See diff_diff/chaisemartin_dhaultfoeuille.py:L1077 and diff_diff/chaisemartin_dhaultfoeuille.py:L3475. The paper’s state-set extension requires Assumption 14, including a common last untreated period across sets; otherwise the target parameters must be redefined rather than estimated as the paper’s (δ_{+,ℓ}). REGISTRY.md documents the same-set restriction but not this support-condition omission. See docs/methodology/REGISTRY.md:L616. Concrete fix: compute set-specific support (T_u^s) up front and either reject panels that violate Assumption 14, or explicitly document and warn that the library trims to horizon-specific within-set support and therefore estimates a different parameter. (nber.org)
Severity P3. Impact: the prior Design-2 blocker is now mitigated. The code feeds raw levels into the Design-2 block at diff_diff/chaisemartin_dhaultfoeuille.py:L2375, and the deviation is explicitly documented at docs/methodology/REGISTRY.md:L620. Concrete fix: none required.

Code Quality

No findings.

Performance

No findings.

Maintainability

Severity P3. Impact: summary() still does not surface linear_trends_effects, heterogeneity_effects, or design2_effects, so large parts of the new Phase 3 surface remain discoverable only through attribute inspection or to_dataframe() for two of the three outputs. See diff_diff/chaisemartin_dhaultfoeuille_results.py:L506 and diff_diff/chaisemartin_dhaultfoeuille_results.py:L1054. Concrete fix: add a short summary footer pointing users to to_dataframe("linear_trends"), to_dataframe("heterogeneity"), and design2_effects, and consider a to_dataframe("design2") level.

Tech Debt

No findings. I did not see this methodology issue tracked in TODO.md:L51, so it remains unmitigated.

Security

No findings.

Documentation/Tests

Severity P3. Impact: the added tests cover basic trends_nonparam validation, but there is still no regression test for the paper-critical case where untreated support ends at different horizons across sets. Existing coverage starts at tests/test_chaisemartin_dhaultfoeuille.py:L2613 and does not pin the unequal-T_u^s branch. Concrete fix: add a two-set panel where one set loses untreated controls earlier than the other and assert the chosen reject-or-document behavior.

Path to Approval

Add a fit-time trends_nonparam support check for Assumption 14, especially equality of set-specific last-untreated horizons (T_u^s) across sets; reject unsupported panels or explicitly redefine/document the estimand.
Add a regression test covering unequal within-set support horizons, so the chosen behavior is pinned and future changes cannot silently reintroduce the mismatch.

P1: REGISTRY now documents that trends_nonparam does not enforce Assumption 14 (common last-untreated period across sets) up front. When within-set controls are exhausted at a given horizon, affected switcher/horizon pairs are excluded via the existing empty-control-pool mechanism. The effective estimand is trimmed to within-set support. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-13T21:24:35Z

/ai-review

github-actions · 2026-04-13T21:34:48Z

🔁 AI review rerun (requested by @igerber)

Head SHA: d20efde1c20dab8f41dc23dc490321aeb9b7afe1

Overall Assessment

⚠️ Needs changes

Executive Summary

The prior trends_nonparam common-support blocker is now mitigated: REGISTRY.md explicitly documents the Assumption 14 omission as a trimmed-support deviation, so it is no longer a defect.
The prior Design-2 documentation blocker is also mitigated: REGISTRY.md now labels design2 as a descriptive raw-outcome convenience wrapper.
New blocker: trends_linear=True is applied before the legacy per_period_effects path, so results.per_period_effects / to_dataframe("per_period") no longer represent the Phase 1 per-period DID decomposition. With first-differenced outcomes, later periods become second-difference objects, and a first-post-baseline switch can yield NaN because Z[:,0] is explicitly NaN.
I did not find new P0 NaN/inference anti-patterns in the added SE/inference code; the new branches consistently use safe_inference().
Static review only: I could not execute the test suite in this shell because python fails to import numpy.

Methodology

Affected methods in this PR are the paper’s covariate-adjusted DID (DID^X), group-specific linear-trends DID (DID^{fd} / Lemma 6), state-set trends, and heterogeneity testing extensions from the Web Appendix. (rdocumentation.org)

Severity P1. Impact: trends_linear=True first-differences Y_mat at diff_diff/chaisemartin_dhaultfoeuille.py:1067, but fit() still feeds that transformed matrix into _compute_per_period_dids() whenever controls is None at diff_diff/chaisemartin_dhaultfoeuille.py:1150. _compute_per_period_dids() then differences adjacent columns again at diff_diff/chaisemartin_dhaultfoeuille.py:2559, while _compute_first_differenced_matrix() sets Z[:,0] = NaN at diff_diff/chaisemartin_dhaultfoeuille.py:2999. So the public per_period_effects surface silently stops being Phase 1 DID_{+,t} / DID_{-,t} under trends_linear; instead it becomes a second-difference object, and the first switch window can become NaN. REGISTRY.md explicitly carves out the raw per_period_effects exception for controls at docs/methodology/REGISTRY.md:612, but there is no corresponding documented contract for trends_linear at docs/methodology/REGISTRY.md:614. Concrete fix: either keep per_period_effects on raw outcomes whenever trends_linear=True (matching the DID^X carve-out), or suppress/rename that surface under trends_linear and document/test the new semantics.
Severity P3. Impact: the previous trends_nonparam Assumption 14/common-support finding is now mitigated by the new registry note at docs/methodology/REGISTRY.md:616. Concrete fix: none required.
Severity P3. Impact: the previous Design-2 mismatch is now mitigated by the explicit “descriptive convenience wrapper” note at docs/methodology/REGISTRY.md:620. Concrete fix: none required.

Code Quality

No findings.

Performance

No findings.

Maintainability

Severity P3. Impact: design2_effects is now a public results attribute at diff_diff/chaisemartin_dhaultfoeuille_results.py:409, but unlike linear_trends_effects and heterogeneity_effects, it still has no to_dataframe() entry point in diff_diff/chaisemartin_dhaultfoeuille_results.py:841 and no summary surfacing in diff_diff/chaisemartin_dhaultfoeuille_results.py:506. Concrete fix: add to_dataframe("design2") and a short summary/footer pointer so the new surface is discoverable without attribute inspection.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

Severity P3. Impact: the new trends_nonparam tests start at tests/test_chaisemartin_dhaultfoeuille.py:2613, but they still do not pin the newly documented unequal-support trimming branch described at docs/methodology/REGISTRY.md:616. That leaves the trimmed-support behavior unguarded against regression. Concrete fix: add a two-set panel where one set exhausts untreated controls earlier than the other, then assert the warning and the horizon-specific exclusion behavior.

Static review only: I could not run the suite because python in this environment raises ModuleNotFoundError: numpy.

Path to Approval

Make results.per_period_effects / to_dataframe("per_period") consistent under trends_linear=True: either keep them on raw outcomes or suppress/rename them so they no longer advertise Phase 1 did_plus_t / did_minus_t quantities while using first-differenced outcomes.
Add regression tests covering trends_linear=True for a panel with a first-post-baseline switch, plus the controls + trends_linear branch, and assert the intended per-period contract so the surface cannot silently flip between raw and second-difference behavior.

P1: When trends_linear=True without controls, _compute_per_period_dids and _compute_cohort_recentered_inputs now use raw y_pivot outcomes (not the first-differenced Z_mat). Previously the per-period path would double-difference the already-differenced outcomes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-13T21:37:49Z

/ai-review

github-actions · 2026-04-13T21:54:16Z

🔁 AI review rerun (requested by @igerber)

Head SHA: cd73917dea80c62f7f90d84697ce4c7e0cd25213

Overall Assessment

✅ Looks good

Executive Summary

The prior P1 on trends_linear=True corrupting the public per-period surface is resolved: _compute_per_period_dids() now receives raw outcomes/counts instead of the first-differenced matrix, so per_period_effects no longer silently become second-difference objects. See diff_diff/chaisemartin_dhaultfoeuille.py:1143 and diff_diff/chaisemartin_dhaultfoeuille.py:1152.
I did not find any new unmitigated P0/P1 issues in the DID^X, trends_lin, trends_nonparam, heterogeneity, or Design-2 branches.
P3: design2_effects is still only exposed as a raw attribute; there is no to_dataframe("design2") path or summary surfacing.
P3: the public results docs are not fully updated for the new Phase 3 surfaces, and _compute_covariate_residualization() still has a stale return annotation.
Static review only: I could not execute tests here because this shell lacks numpy and pytest.

Methodology

No unmitigated findings. The affected methods are DID^X covariate adjustment, trends_lin, trends_nonparam, and predict_het/heterogeneity. The reference package manual describes controls as first-difference residualization with time fixed effects estimated on not-yet-switched cells and separately by baseline treatment, trends_lin as computing effects on first differences and summing them back to levels, trends_nonparam as restricting controls to the same time-invariant set, and predict_het as including placebo regressions and a joint test. Against that source material and the registry, this PR’s remaining departures read as documented limitations, not blockers. (cran.r-universe.dev)

The paper/manual also support the PR’s treatment of trend-adjusted effects as first-difference estimands cumulated back to levels and note the equal-support requirement for set-specific trends. The registry now documents the finite-sample SE normalization gap, the trimmed-support behavior under trends_nonparam, the partial heterogeneity implementation, and the Design-2 convenience-wrapper scope, so those are mitigated P3 informational items rather than defects. See docs/methodology/REGISTRY.md:612, docs/methodology/REGISTRY.md:614, docs/methodology/REGISTRY.md:616, docs/methodology/REGISTRY.md:618, and docs/methodology/REGISTRY.md:620. (cran.r-universe.dev)

Code Quality

Severity P3. Impact: _compute_covariate_residualization() is annotated as returning a 2-tuple, but it actually returns three values (Y_resid, diagnostics, failed_baselines), which will mislead type checkers and future refactors. Concrete fix: update the return annotation and docstring to include failed_baselines. See diff_diff/chaisemartin_dhaultfoeuille.py:2760 and diff_diff/chaisemartin_dhaultfoeuille.py:2963.

Performance

No findings.

Maintainability

Severity P3. Impact: design2_effects is now a public results field, but it still has no to_dataframe("design2") branch and no summary() surfacing, so users have to discover it by direct attribute access. Concrete fix: add to_dataframe("design2") and a short summary/footer pointer. See diff_diff/chaisemartin_dhaultfoeuille_results.py:409, diff_diff/chaisemartin_dhaultfoeuille_results.py:506, and diff_diff/chaisemartin_dhaultfoeuille_results.py:841.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

Severity P3. Impact: the documented trends_nonparam unequal-support trimming branch still has no targeted regression test; the new tests validate basic set behavior but not the Assumption-14 support-loss path described in the registry. Concrete fix: add a two-set panel where one set exhausts untreated controls earlier, then assert the warning and the horizon-specific exclusion behavior. See docs/methodology/REGISTRY.md:616 and tests/test_chaisemartin_dhaultfoeuille.py:2613.
Severity P3. Impact: the public results docs lag the implementation. The dataclass docstring still describes Phase 3 outputs as placeholders, and to_dataframe()’s docstring omits the new heterogeneity and linear_trends levels. Concrete fix: update the docstrings in the results class. See diff_diff/chaisemartin_dhaultfoeuille_results.py:322, diff_diff/chaisemartin_dhaultfoeuille_results.py:324, diff_diff/chaisemartin_dhaultfoeuille_results.py:847, and diff_diff/chaisemartin_dhaultfoeuille_results.py:1054.

Static review only: python here raises ModuleNotFoundError: numpy, and pytest is unavailable, so I could not run the added test cases.

- Fix _compute_covariate_residualization return type annotation to include failed_baselines set (was 2-tuple, now 3-tuple) - Add to_dataframe("design2") level for Design-2 results - Update to_dataframe() docstring with all new levels - Update results dataclass docstring: replace Phase 3 placeholder text with actual field descriptions for covariate_residuals, linear_trends_effects, heterogeneity_effects, design2_effects Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-13T21:57:39Z

/ai-review

github-actions · 2026-04-13T22:05:51Z

🔁 AI review rerun (requested by @igerber)

Head SHA: bfed3e649218f13263d56cdb3471451962e19ce7

Overall Assessment

✅ Looks good

Executive Summary

Previous re-review items are mostly addressed: the _compute_covariate_residualization() return annotation is fixed at diff_diff/chaisemartin_dhaultfoeuille.py:2760, and the Phase 3 result fields / to_dataframe() levels are now documented and exposed at diff_diff/chaisemartin_dhaultfoeuille_results.py:322 and diff_diff/chaisemartin_dhaultfoeuille_results.py:849.
I did not find an unmitigated P0/P1 methodology error in the DID^X, DID^{fd}, trends_nonparam, heterogeneity, or Design-2 branches. The remaining deviations claimed in the PR are explicitly documented in the registry at docs/methodology/REGISTRY.md:586, docs/methodology/REGISTRY.md:612, docs/methodology/REGISTRY.md:614, docs/methodology/REGISTRY.md:616, docs/methodology/REGISTRY.md:618, and docs/methodology/REGISTRY.md:620.
Residual P3: summary() still does not surface the new Phase 3 result blocks, so those outputs are discoverable via attributes / to_dataframe() but not via the primary human-readable renderer. See diff_diff/chaisemartin_dhaultfoeuille_results.py:514.
Residual P3: the trends_nonparam unequal-support trim/warning behavior documented in the registry still lacks a targeted regression test. See docs/methodology/REGISTRY.md:616 and tests/test_chaisemartin_dhaultfoeuille.py:2613.
Static review only: this environment does not have numpy, pandas, or pytest, so I could not execute the estimator or the added tests.

Methodology

No findings. Cross-checking the changed estimator branches against the registry notes shows the material departures are documented rather than accidental: finite-sample SE normalization vs R, raw per_period_effects under controls, conservative cumulated DID^{fd} SEs, trimmed-support behavior under trends_nonparam, partial heterogeneity scope, and descriptive Design-2 scope. See docs/methodology/REGISTRY.md:586, docs/methodology/REGISTRY.md:612, docs/methodology/REGISTRY.md:614, docs/methodology/REGISTRY.md:616, docs/methodology/REGISTRY.md:618, and docs/methodology/REGISTRY.md:620.

Code Quality

No findings.

Performance

No findings.

Maintainability

Severity P3. Impact: summary() still omits the new Phase 3 surfaces (covariate_residuals, linear_trends_effects, heterogeneity_effects, design2_effects), even though to_dataframe() now exposes them. That leaves the main renderer behind the public results API. Concrete fix: add short summary sections or at least a footer pointer when any of those fields are populated. See diff_diff/chaisemartin_dhaultfoeuille_results.py:514, diff_diff/chaisemartin_dhaultfoeuille_results.py:849, diff_diff/chaisemartin_dhaultfoeuille_results.py:1070, and diff_diff/chaisemartin_dhaultfoeuille_results.py:1092.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

Severity P3. Impact: the documented trends_nonparam support-loss path still is not regression-tested; the added TestStateSetTrends block covers validation and a basic fit, but not the unequal-support trimming/warning branch or its placebo/inference consequences. Concrete fix: add a two-set panel where same-set controls exist at l=1 but exhaust at l=2, then assert the warning, reduced N_l, and matching behavior in the main and placebo helper paths. See docs/methodology/REGISTRY.md:616, tests/test_chaisemartin_dhaultfoeuille.py:2613, diff_diff/chaisemartin_dhaultfoeuille.py:3473, and diff_diff/chaisemartin_dhaultfoeuille.py:3833.
Severity P3. Impact: ROADMAP.md still marks item 3i (“Parity tests vs did_multiplegt_dyn for covariate and extension specifications”) as “Not started” even though this PR adds those parity fixtures/tests. Concrete fix: update that status row. See ROADMAP.md:178 and tests/test_chaisemartin_dhaultfoeuille_parity.py:367.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber and others added 2 commits April 13, 2026 12:35

igerber added the ready-for-ci Triggers CI test workflows label Apr 13, 2026

Update ROADMAP: mark 3i parity tests as shipped

3e6611f

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber merged commit d801cc1 into main Apr 14, 2026
19 checks passed

igerber deleted the dcdh-phase3b-covariates-trends branch April 14, 2026 00:39

igerber mentioned this pull request Apr 14, 2026

Release v3.1.0 #304

Merged

Conversation

igerber commented Apr 13, 2026

Summary

Methodology references (required if estimator / math changes)

Validation

Security / privacy

Uh oh!

github-actions bot commented Apr 13, 2026

Uh oh!

igerber commented Apr 13, 2026

Uh oh!

github-actions bot commented Apr 13, 2026

Uh oh!

igerber commented Apr 13, 2026

Uh oh!

github-actions bot commented Apr 13, 2026

Uh oh!

igerber commented Apr 13, 2026

Uh oh!

github-actions bot commented Apr 13, 2026

Uh oh!

igerber commented Apr 13, 2026

Uh oh!

github-actions bot commented Apr 13, 2026

Uh oh!

igerber commented Apr 13, 2026

Uh oh!

github-actions bot commented Apr 13, 2026

Uh oh!

igerber commented Apr 13, 2026

Uh oh!

github-actions bot commented Apr 13, 2026

Uh oh!

igerber commented Apr 13, 2026

Uh oh!

github-actions bot commented Apr 13, 2026

Uh oh!

igerber commented Apr 13, 2026

Uh oh!

github-actions bot commented Apr 13, 2026

Uh oh!

igerber commented Apr 13, 2026

Uh oh!

github-actions bot commented Apr 13, 2026

Uh oh!

igerber commented Apr 13, 2026

Uh oh!

github-actions bot commented Apr 13, 2026

Overall Assessment

Executive Summary

Methodology

Code Quality

Performance

Maintainability

Tech Debt

Security

Documentation/Tests

Path to Approval

Uh oh!

igerber commented Apr 13, 2026

Uh oh!

github-actions bot commented Apr 13, 2026

Uh oh!

igerber commented Apr 13, 2026

Uh oh!

github-actions bot commented Apr 13, 2026

Overall Assessment

Executive Summary

Methodology

Code Quality

Performance

Maintainability

Tech Debt

Security

Documentation/Tests

Path to Approval

Uh oh!

igerber commented Apr 13, 2026

Uh oh!

github-actions bot commented Apr 13, 2026