
Add Phase 3 PR B: covariates, trends, and extensions for dCDH#302

Merged
igerber merged 17 commits into main from dcdh-phase3b-covariates-trends
Apr 14, 2026

Conversation

@igerber
Owner

@igerber igerber commented Apr 13, 2026

Summary

  • Implement DID^X covariate residualization (Web Appendix Section 1.2, Assumption 11) via per-baseline OLS on not-yet-treated observations, with level residualization justified by the Frisch-Waugh-Lovell (FWL) theorem
  • Implement DID^{fd} group-specific linear trends (Section 1.3, Lemma 6) via Z_mat first-differencing with per-group cumulation for level effects
  • Implement state-set-specific trends (Section 1.4) via control-pool restriction in both point estimate and IF computation
  • Implement heterogeneity testing (Section 1.5, Lemma 7) via saturated OLS with cohort dummies
  • Implement Design-2 switch-in/switch-out descriptive convenience wrapper (Section 1.6)
  • Add 3 R parity scenarios (controls, trends_lin, combined) with golden values from DIDmultiplegtDYN v2.3.3
  • Address AI review P1/P2 findings: isolate per-period path from DID^X residualization, add to_dataframe levels, vectorize residualization, add rank-deficiency guard
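The residualization approach in the first bullet can be sketched as below. This is a minimal FWL-style illustration, not the library's actual code; `residualize_outcomes`, the shapes, and the toy data are all hypothetical:

```python
import numpy as np

def residualize_outcomes(Y, X, not_yet_treated):
    """FWL-style residualization: estimate theta by OLS on the
    not-yet-treated observations only, then subtract the fitted
    covariate component from ALL observations."""
    theta, *_ = np.linalg.lstsq(X[not_yet_treated], Y[not_yet_treated], rcond=None)
    return Y - X @ theta

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one covariate
Y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=n)
not_yet_treated = rng.random(n) < 0.5                  # toy control mask
resid = residualize_outcomes(Y, X, not_yet_treated)
print(resid.std() < Y.std())  # True: covariate variation removed
```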

Methodology references (required if estimator / math changes)

  • Method name(s): ChaisemartinDHaultfoeuille (dCDH) - DID^X, DID^{fd}, state-set trends, heterogeneity testing, Design-2
  • Paper / source link(s): de Chaisemartin & D'Haultfoeuille (2024), "Difference-in-Differences Estimators of Intertemporal Treatment Effects," NBER WP 29873 - Web Appendix Sections 1.2-1.6
  • Any intentional deviations from the source (and why):
    • SE ~4% smaller than R DIDmultiplegtDYN: Python uses the paper's Section 3.7.3 N_l normalization; R uses G (total groups). Both are asymptotically valid; documented in REGISTRY.md
    • Cumulated DID^{fd} SE uses conservative upper bound (sum of per-horizon SEs); cross-horizon covariance formula is a library extension (paper proves per-horizon only). Documented in REGISTRY.md
    • Design-2 is a descriptive convenience wrapper, not full paper re-estimation with specialized control pools. Documented in REGISTRY.md
    • per_period_effects stays unadjusted when controls are active (DID^X applies only to per-group path). Documented in REGISTRY.md
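The conservative upper bound mentioned above can be sketched as follows. The helper name and numbers are hypothetical; the bound is valid because the SE of a sum is at most the sum of the SEs, with equality under perfect positive correlation:

```python
import numpy as np

def cumulate_with_conservative_se(effects, ses):
    """Cumulate per-horizon DID^{fd} effects into level effects and
    bound each cumulative SE by the sum of per-horizon SEs (exact
    under perfect positive correlation, conservative otherwise)."""
    effects = np.asarray(effects, dtype=float)
    ses = np.asarray(ses, dtype=float)
    return effects.cumsum(), ses.cumsum()

# toy per-horizon first-difference effects and SEs
level_effects, se_bounds = cumulate_with_conservative_se([0.5, 0.25], [0.125, 0.25])
print(level_effects.tolist())  # [0.5, 0.75]
print(se_bounds.tolist())      # [0.125, 0.375]
```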

Validation

  • Tests added/updated: tests/test_chaisemartin_dhaultfoeuille.py (25 new tests), tests/test_chaisemartin_dhaultfoeuille_parity.py (3 new R parity tests)
  • R parity: 3 new scenarios generated via DIDmultiplegtDYN v2.3.3 (point-estimate gaps: controls 0.3%, trends exact, combined 0.6%)
  • 152 tests pass, 0 failures

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

igerber and others added 2 commits April 13, 2026 12:35
Implements ROADMAP items 3a-3e and 3i:
- DID^X covariate residualization (Web Appendix Section 1.2)
- DID^{fd} group-specific linear trends (Section 1.3, Lemma 6)
- State-set-specific trends (Section 1.4)
- Heterogeneity testing (Section 1.5, Lemma 7)
- Design-2 switch-in/switch-out convenience wrapper (Section 1.6)
- R parity tests for controls and trends_lin scenarios

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
P1 fixes:
- DID^X residualization no longer leaks into per-period path:
  per_period_effects uses raw Y_mat, only multi-horizon path
  sees residualized outcomes
- Added to_dataframe levels for heterogeneity and linear_trends

P2 fixes:
- Covariate coercion no longer mutates caller's DataFrame
- Vectorized residualization (einsum replaces nested loop)
- Heterogeneity test guards against rank-deficient OLS
- Added estimand contract test for controls + L_max=1
- REGISTRY note clarifies per_period_effects stays unadjusted
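The einsum vectorization mentioned in the P2 fixes can be illustrated like this; the shapes and array names are made up for the sketch:

```python
import numpy as np

# Vectorized residualization: instead of looping over (group, period)
# cells, contract the covariate array against theta in one einsum call.
# Illustrative shapes: X is (groups, periods, k), theta is (k,).
rng = np.random.default_rng(1)
G, T, k = 50, 10, 3
X = rng.normal(size=(G, T, k))
Y = rng.normal(size=(G, T))
theta = rng.normal(size=k)

# loop version (what a nested-loop implementation effectively does)
Y_loop = Y.copy()
for g in range(G):
    for t in range(T):
        Y_loop[g, t] -= X[g, t] @ theta

# einsum version: one contraction over the covariate axis
Y_vec = Y - np.einsum("gtk,k->gt", X, theta)
print(np.allclose(Y_loop, Y_vec))  # True
```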

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

Overall Assessment

⛔ Blocker

Executive Summary

  • The new DID^{fd} cumulation path can emit finite se/p_value/CI even when one of the component horizon SEs is NaN, which violates the library’s NaN-consistency contract and can understate uncertainty.
  • For trends_linear=True with L_max >= 2, the top-level overall_* surface is still built from the Phase 2 delta on second-difference DID^{fd}_l, but the results object relabels it as delta^{fd}. That silently changes the reported estimand.
  • The new OLS helper calls hardcode rank_deficient_action="warn", so the DID^X and heterogeneity paths no longer honor the estimator’s existing rank-deficiency contract.
  • The added tests exercise linear_trends_effects parity, but not the top-level trend-adjusted overall_* surface or cumulated NaN propagation, so the two issues above are currently unguarded.

Methodology

  • Severity: P0. Impact: linear_trends_effects can report finite inference when the cumulative SE is actually undefined. In the DID^{fd} cumulation block, non-finite component SEs are filtered out before summation, so a horizon with SE=NaN is silently dropped from the upper-bound calculation; if all component SEs are non-finite, the stored SE becomes 0.0. That violates the project’s all-or-nothing NaN rule for inference and can materially understate uncertainty in trend-adjusted outputs. See diff_diff/chaisemartin_dhaultfoeuille.py:2052 and docs/methodology/REGISTRY.md:614. Concrete fix: when building linear_trends_effects[l], require every component horizon 1..l to have a finite SE; otherwise set se, t_stat, p_value, and conf_int to NaN. Also store NaN rather than 0.0 when the cumulative SE is undefined.
  • Severity: P1. Impact: for trends_linear=True and L_max >= 2, the code still computes cost_benefit_delta from multi_horizon_dids before cumulating DID^{fd}_l into level effects, but the results surface then relabels the top-level estimand as delta^{fd}. That mixes second-difference inputs with a level-effect label, so results.overall_att, summary(), and to_dataframe("overall") can report the wrong estimand with no warning. See diff_diff/chaisemartin_dhaultfoeuille.py:1476, diff_diff/chaisemartin_dhaultfoeuille.py:1931, and diff_diff/chaisemartin_dhaultfoeuille_results.py:421. Concrete fix: either compute a properly defined trend-adjusted aggregate from the cumulated level path, or stop relabeling the top-level overall surface as delta^{fd} until that aggregate exists.
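The all-or-nothing NaN rule the P0 finding asks for could look roughly like this (hypothetical sketch, not the library's helper):

```python
import math
import numpy as np

def cumulate_se_nan_consistent(component_ses):
    """Upper-bound SE for a cumulated effect: sum of component SEs,
    but NaN if ANY component is non-finite -- never silently drop a
    horizon, and never report 0.0 when nothing is finite."""
    ses = np.asarray(component_ses, dtype=float)
    if not np.all(np.isfinite(ses)):
        return float("nan")
    return float(ses.sum())

print(cumulate_se_nan_consistent([0.5, 0.25]))                      # 0.75
print(math.isnan(cumulate_se_nan_consistent([0.5, float("nan")])))  # True
```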

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings beyond the test/doc gap below; none of the substantive issues above are tracked in TODO.md:51.

Security

No findings.

Documentation/Tests

  • Severity: P3. Impact: the new tests do not cover the two most fragile DID^{fd} surfaces. The unit suite only checks that linear_trends_effects cumulates correctly and that overall_att is finite, while the parity suite explicitly compares R only against linear_trends_effects, not the top-level overall_* surface. That leaves the mislabeled delta^{fd} summary and the NaN-propagation bug unguarded. See tests/test_chaisemartin_dhaultfoeuille.py:2489 and tests/test_chaisemartin_dhaultfoeuille_parity.py:403. Concrete fix: add one regression test for overall_att/overall_se/estimand under trends_linear=True, L_max>=2, and one NaN-consistency test where a component horizon SE is NaN and linear_trends_effects[l] must also be all-NaN.
  • Severity: P3. Impact: the public fit() docstring still says controls, trends_linear, and trends_nonparam are “Reserved for Phase 3” and raise NotImplementedError, which is now false. See diff_diff/chaisemartin_dhaultfoeuille.py:527. Concrete fix: update the parameter docs to describe the shipped contracts and prerequisites.

Path to Approval

  1. Make linear_trends_effects NaN-consistent: if any component SE up to horizon l is non-finite, set the cumulative se/t_stat/p_value/conf_int to NaN and add a regression test.
  2. Fix the trends_linear=True, L_max>=2 top-level overall surface so it reports a real trend-adjusted aggregate instead of relabeling the old Phase 2 delta; add a regression test for overall_att/overall_se/estimand.
  3. Propagate rank_deficient_action into the new OLS helper paths and add tests for rank_deficient_action="error" on DID^X and heterogeneity fits.

…nt_action

P0: Cumulated DID^{fd} SE now requires ALL component SEs to be finite;
    non-finite SE at any horizon propagates NaN (was silently dropped).

P1: trends_linear + L_max>=2 overall surface now reports cumulated level
    effects from linear_trends_effects (was second-difference delta).
    cost_benefit_delta suppressed under trends_linear (meaningless on
    second-differences).

P2: rank_deficient_action threaded through _compute_covariate_residualization
    and _compute_heterogeneity_test (was hardcoded "warn").

P3: fit() docstrings updated for controls, trends_linear, trends_nonparam
    (were stale "Reserved for Phase 3" text).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 13, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 2c6cabfd5922da4358ea627b4ff837825a5fd852


Overall Assessment

⚠️ Needs changes

Executive Summary

  • Previous re-review blockers around cumulated NaN propagation and rank_deficient_action handling appear resolved in the changed code.
  • Severity P1: trends_linear=True still exposes a top-level overall_* scalar for L_max >= 2, but the reference did_multiplegt_dyn(..., trends_lin=TRUE) contract says the average total effect is not computed in that mode. The PR now hard-codes overall_* = linear_trends_effects[max_h] and labels it as delta/cost-benefit output. (rdrr.io)
  • Severity P1: the new Section 1.5 heterogeneity path does not match the reference contract. Official docs say predict_het includes placebo regression tables plus a joint null test, and it cannot be combined with controls; this PR returns only post-treatment coefficients and silently allows controls. (rdrr.io)
  • The newly added REGISTRY notes for the R SE normalization gap, conservative cumulated DID^{fd} SE, Design-2 as a convenience wrapper, and the unadjusted per-period path under controls are documented deviations and should not count against the PR.

Methodology

  • Severity: P1. Impact: the trends_linear=True, L_max>=2 results surface still reports a scalar “overall” estimand even though the reference method does not compute an average total effect in this mode. The implementation suppresses cost_benefit_delta, then repurposes overall_* to the last-horizon cumulative effect and labels it as delta^{fd} / “Cost-Benefit Delta”, while the REGISTRY only documents horizon-specific cumulated effects in results.linear_trends_effects. This is an undocumented methodology deviation that can mislead downstream consumers of results.overall_att, summary(), and to_dataframe("overall"). References: diff_diff/chaisemartin_dhaultfoeuille.py:L2107-L2121, diff_diff/chaisemartin_dhaultfoeuille_results.py:L421-L449, diff_diff/chaisemartin_dhaultfoeuille_results.py:L563-L571, docs/methodology/REGISTRY.md:L614-L620, tests/test_chaisemartin_dhaultfoeuille.py:L2553-L2566. Concrete fix: when trends_linear=True and L_max>=2, make overall_* unavailable/NaN (or move the last-horizon cumulative effect to a separately named field), and stop labeling it as cost-benefit delta in summary()/to_dataframe(). (rdrr.io)
  • Severity: P1. Impact: the shipped heterogeneity option is not the Section 1.5 contract it claims to implement. The reference predict_het path includes placebo regressions and a joint null test, and it explicitly disallows the controls option; this PR instead computes only post-treatment regressions, never exposes placebo/joint-test outputs, and silently allows controls, with internally inconsistent behavior because controls+heterogeneity uses residualized outcomes unless trends_linear=True, in which case it falls back to raw levels. That is both a missing assumption check and an undocumented methodology deviation. References: diff_diff/chaisemartin_dhaultfoeuille.py:L2127-L2162, diff_diff/chaisemartin_dhaultfoeuille.py:L2861-L2993, docs/methodology/REGISTRY.md:L618-L620. Concrete fix: reject heterogeneity when controls is supplied, add the placebo-regression / joint-test surface required by the source contract, and either propagate other allowed options (trends_nonparam, trends_linear) through the heterogeneity path or reject those combinations explicitly. (rdrr.io)
  • Severity: P3. Impact: the PR’s explicit REGISTRY notes for DID^X per-period behavior, conservative cumulated DID^{fd} SEs, Design-2 scope, and the R SE normalization gap are sufficient mitigation; I would not count those as defects. References: docs/methodology/REGISTRY.md:L612-L620. Concrete fix: none.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings beyond the methodology surface overloading noted above.

Tech Debt

  • No findings beyond the blockers above; neither blocker is tracked in TODO.md under “Tech Debt from Code Reviews.”

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: the public docs did not fully catch up with the new API surface. fit() adds heterogeneity and design2, but the parameter docstring stops before documenting them, and to_dataframe() now accepts heterogeneity / linear_trends while its fallback error text still lists only the older levels. References: diff_diff/chaisemartin_dhaultfoeuille.py:L487-L545, diff_diff/chaisemartin_dhaultfoeuille_results.py:L1032-L1057. Concrete fix: document the new fit() parameters and update to_dataframe() help/error text to include the new supported levels.

Path to Approval

  1. Remove or rename the top-level overall_* surface for trends_linear=True, L_max>=2 so it no longer presents the last-horizon cumulative effect as cost-benefit delta / delta^{fd}; update summary(), to_dataframe("overall"), and the unit test at tests/test_chaisemartin_dhaultfoeuille.py:L2553-L2566.
  2. Bring heterogeneity in line with the Section 1.5 contract: reject controls, add placebo-regression outputs and the joint null test, and add regression tests for both the rejection path and the placebo/joint-test surface.
  3. Decide the allowed interaction set for heterogeneity with trends_nonparam / trends_linear; either propagate those options through the heterogeneity path or reject them explicitly, and add tests for every allowed combination.

…geneity

P1: trends_linear + L_max>=2 overall_* is now NaN (R does not compute
    an aggregate in trends_lin mode). Cumulated effects available via
    results.linear_trends_effects[l].

P1: heterogeneity + controls now raises ValueError (matching R's
    predict_het which disallows controls). REGISTRY documents
    heterogeneity as partial implementation (post-treatment only,
    no placebo regressions or joint null test).

P3: Added fit() docstrings for heterogeneity and design2 parameters.
P3: Updated to_dataframe() error text with new level names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 13, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 6f1c99fa5bb81c2cf04146a6f6206fc48cca037f


Overall Assessment

⚠️ Needs changes

Executive Summary

  • Previous blocker on trends_linear=True, L_max>=2 is mostly resolved numerically: the PR now suppresses the scalar overall_* surface instead of reusing the last-horizon cumulative effect, which matches the reference contract that trends_lin=TRUE does not compute an average total effect.
  • Previous blocker on heterogeneity + controls is resolved, and the missing placebo/joint-test surface is now documented in the Methodology Registry; under the review rubric, that documented deviation is informational rather than blocking.
  • Severity P1: heterogeneity still ignores trends_linear / trends_nonparam, so the heterogeneity table can diverge from the estimator actually fit.
  • Severity P1: heterogeneity is documented as requiring L_max >= 1, but fit(..., heterogeneity=...) silently no-ops when L_max=None.
  • Severity P1: design2=True silently returns no Design-2 output under the default drop_larger_lower=True, because the required 2-switch groups are dropped before the wrapper runs.
  • Remaining trends_linear labeling issues are P3 informational: the NaN overall row is still labeled as delta / cost-benefit delta.

Methodology

Affected methods in this PR are DID^X covariate adjustment (controls), DID^{fd} linear trends (trends_linear), state-set trends (trends_nonparam), heterogeneity testing (predict_het-style output), and the Design-2 join/leave wrapper. Those are the same Phase 3 extensions described in the reference package’s Sections 1.2-1.6.

Code Quality

No separate findings beyond the missing parameter-validation/propagation issues above.

Performance

No findings.

Maintainability

No separate findings beyond the API-contract drift noted below.

Tech Debt

No mitigating TODO.md entry exists for the P1 issues above, so they remain unmitigated. TODO.md only tracks older dCDH Phase 1 items in this area. Reference: TODO.md#L51.

Security

No findings.

Documentation/Tests

No findings.

Path to Approval

  1. Add an upfront ValueError when heterogeneity is passed without L_max, and add a regression test using the existing heterogeneity fixture that asserts the failure.
  2. Either reject heterogeneity with trends_linear / trends_nonparam, or plumb those adjustments into _compute_heterogeneity_test() so it operates on the same adjusted estimand as the fitted event study; add tests for every allowed combination.
  3. Enforce the design2 precondition by erroring/warning when design2=True and drop_larger_lower=True, or compute Design-2 from a pre-Step-6 side path; add a regression test on _make_join_then_leave_panel() under default constructor settings.

P1: heterogeneity without L_max now raises ValueError (was silent no-op).
P1: heterogeneity rejects trends_linear and trends_nonparam (would produce
    coefficients for wrong estimand since het test uses raw level changes).
P1: design2=True with drop_larger_lower=True now raises ValueError (the
    2-switch groups Design-2 needs are dropped by the default filter).
P3: NaN overall row under trends_linear+L_max>=2 now labeled as
    "DID^{fd}_l (see linear_trends_effects)" instead of "Cost-Benefit Delta".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 13, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 3f7e72c0fac612bdb456e2506372930acb1599f2


Overall Assessment

⛔ Blocker

Executive Summary

  • Re-review status: the prior P1s around heterogeneity validation and the design2=True / drop_larger_lower precondition appear fixed.
  • Affected methodology in this PR: DID^X covariate residualization, DID^{fd} linear trends, state-set trends, heterogeneity testing, and the Design-2 wrapper.
  • Severity P0 [Newly identified]: the DID^X first-stage OLS drops a reference time dummy without adding an intercept, so theta_hat is estimated from the wrong model.
  • Severity P0 [Newly identified]: the heterogeneity regression drops a reference cohort dummy without adding an intercept, so beta^{het}_l is not the saturated OLS described in the registry/docstring.
  • Severity P1 [Newly identified]: the claimed DID^X rank-deficiency guard is incomplete; solve_ols() can return NaN coefficients for dropped controls, and the residualization path then feeds those NaNs straight into np.einsum, poisoning adjusted outcomes.

Methodology

  • Severity: P0 [Newly identified]. Impact: DID^X residualization is not fitting the documented “first-differenced outcomes on first-differenced covariates with time FEs” model. The helper drops one time dummy as a reference category but never adds the required intercept, and solve_ols() only includes an intercept if the caller supplies one. That forces the omitted time effect to zero and silently biases theta_hat, which then biases every downstream DID^X-adjusted estimate and SE. Concrete fix: build the first-stage design as either [intercept, dX, dropped_time_dummies] or [dX, full_time_dummy_set], then add a regression test where the omitted-period FE is nonzero. References: REGISTRY.md#L612, linalg.py#L446, chaisemartin_dhaultfoeuille.py#L2813.
  • Severity: P0 [Newly identified]. Impact: _compute_heterogeneity_test() is not the saturated Lemma 7 regression it claims to be. It drops one cohort dummy as a reference category but again omits the intercept, so the omitted cohort mean is constrained to zero. That makes the reported beta, se, and inference for heterogeneity_effects methodologically wrong with no warning. Concrete fix: include an intercept when dropping one cohort dummy, or keep the full cohort-dummy set with no intercept; add a hand-checkable test with two cohorts that have different means. References: REGISTRY.md#L618, linalg.py#L446, chaisemartin_dhaultfoeuille.py#L2990.
  • Severity: P1 [Newly identified]. Impact: the DID^X rank-deficiency path is unsafe. solve_ols() explicitly returns NaN for dropped coefficients under the default "warn" mode, but _compute_covariate_residualization() immediately uses theta_hat in np.einsum. A duplicated or collinear control can therefore turn every residualized outcome in that baseline cohort into NaN, which then propagates into the estimator. Concrete fix: if any control coefficient is NaN, either refit on the finite subset of retained controls or raise a clear ValueError before residualizing; add a regression test with duplicated controls. References: linalg.py#L460, chaisemartin_dhaultfoeuille.py#L2832, chaisemartin_dhaultfoeuille.py#L2852.
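The omitted-intercept problem behind both P0 findings can be demonstrated in a few lines. The data are toy numbers, not the library's design-matrix code:

```python
import numpy as np

# Two time periods with different fixed effects; dropping the period-0
# dummy WITHOUT adding an intercept forces that period's FE to zero.
y = np.array([1.0, 1.0, 3.0, 3.0])   # period-0 mean = 1, period-1 mean = 3
d1 = np.array([0.0, 0.0, 1.0, 1.0])  # dummy for period 1 (period 0 dropped)

# Misspecified: no intercept, so the fitted period-0 mean is forced to 0
X_bad = d1.reshape(-1, 1)
beta_bad, *_ = np.linalg.lstsq(X_bad, y, rcond=None)

# Correct: the intercept absorbs the omitted period's mean
X_ok = np.column_stack([np.ones(4), d1])
beta_ok, *_ = np.linalg.lstsq(X_ok, y, rcond=None)

print(beta_bad)  # ~[3.]   wrong: the "effect" absorbs the period-0 mean
print(beta_ok)   # ~[1. 2.] intercept 1, period-1 effect 2
```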

Code Quality

  • No separate findings beyond the estimator-spec bugs above.

Performance

  • No findings.

Maintainability

  • No separate findings beyond the estimator-spec bugs above.

Tech Debt

  • No matching TODO.md entry mitigates the P0/P1 issues above. Reference: TODO.md#L51.

Security

  • No findings.

Documentation/Tests

No findings.

Path to Approval

  1. Fix the DID^X first-stage regression so the time-FE model is identified correctly: add an intercept with dropped time dummies, or keep the full time-dummy set with no intercept.
  2. Fix the heterogeneity regression so the cohort-FE model is saturated correctly: add an intercept with dropped cohort dummies, or keep the full cohort-dummy set with no intercept.
  3. Add a real rank-deficiency guard in DID^X before residualization, so dropped control coefficients cannot propagate NaN through Y_resid.
  4. Add targeted tests for all three cases above: one DID^X regression with nonzero omitted-period FE, one heterogeneity regression with nonzero omitted-cohort mean, and one duplicated-control DID^X case.
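Item 3's guard could be sketched as below; `safe_residualize` is hypothetical and shows the skip-with-warning option (refitting on the retained controls is the other option the review mentions):

```python
import warnings
import numpy as np

def safe_residualize(Y, X, theta):
    """Guard against NaN coefficients from a rank-deficient first
    stage: skip residualization with a warning rather than letting
    NaN theta entries poison every residualized outcome."""
    theta = np.asarray(theta, dtype=float)
    if not np.all(np.isfinite(theta)):
        warnings.warn("rank-deficient first stage: skipping residualization")
        return Y.copy()
    return Y - X @ theta

Y = np.array([1.0, 2.0])
X = np.array([[1.0, 1.0], [1.0, 1.0]])  # duplicated column -> collinear
theta_nan = np.array([0.5, np.nan])     # dropped coefficient marked NaN
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    out = safe_residualize(Y, X, theta_nan)
print(out)  # [1. 2.] -- outcomes left unadjusted, instead of all-NaN
```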

P0: DID^X first-stage OLS now includes intercept when dropping one
    time dummy as reference (was forcing omitted period FE to zero).
    theta_hat extraction updated to skip intercept at index 0.

P0: Heterogeneity regression now includes intercept when dropping one
    cohort dummy as reference (was forcing omitted cohort mean to zero).
    beta_het extraction updated to read index 1 instead of 0.

P1: DID^X now guards against NaN control coefficients from rank-deficient
    OLS. If any theta_hat entry is NaN, residualization is skipped for
    that baseline with a UserWarning.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 13, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 357a551a5a3df0478512a40beb887efe5d88e48e


Overall Assessment

⚠️ Needs changes

Executive Summary

  • Re-review status: the prior omitted-intercept P0s appear fixed; both the DID^X first stage and the heterogeneity regression now include an intercept before dropping a reference dummy. References: diff_diff/chaisemartin_dhaultfoeuille.py#L2813, diff_diff/chaisemartin_dhaultfoeuille.py#L3008
  • The prior DID^X rank-deficiency finding is only partially resolved: NaN propagation is gone, but the fallback now disables covariate adjustment for an entire baseline whenever any one control is dropped.
  • Severity P1 [Newly identified]: trends_nonparam does not enforce the source-material requirement that the set definition be coarser than group; over-granular inputs fail late through empty control pools instead of a targeted validation error.
  • Severity P1 [Newly identified]: heterogeneity now hard-rejects trends_linear and trends_nonparam, but that extra restriction is not documented in REGISTRY.md or in the public fit() docstring.
  • No separate performance or security concerns stood out in the changed files.

Methodology

  • Severity P1. Impact: _compute_covariate_residualization() aborts DID^X residualization for the whole baseline as soon as any control coefficient is NaN, even though local solve_ols() already preserves the identified coefficients and computes residuals from the reduced design. That means one collinear or duplicated control turns a baseline-specific DID^X path back into a raw, unadjusted path while the public surface is still labeled as DID^X. REGISTRY.md currently says the downstream DID path uses residualized outcomes, so this is an undocumented estimator mismatch. Concrete fix: when some controls are dropped, refit on the finite subset or apply residualization with the finite theta_hat entries instead of skipping the baseline with `continue`; add a regression test with duplicated controls where one control remains estimable. References: diff_diff/chaisemartin_dhaultfoeuille.py#L2851, diff_diff/linalg.py#L485, docs/methodology/REGISTRY.md#L612. (cran.r-universe.dev)
  • Severity P1 [Newly identified]. Impact: trends_nonparam validates only time-invariance, but the reference contract also requires the set/interacted-set definition to be coarser than group. With a group-level or otherwise over-granular set ID, the implementation falls through to empty same-set control pools and either excludes switchers or raises a generic horizon-1 “no switching groups” error, rather than a targeted assumption-check failure. Concrete fix: add an explicit pre-fit check that the trends_nonparam partition is coarser than group and add a targeted test using trends_nonparam=group. References: diff_diff/chaisemartin_dhaultfoeuille.py#L1055, diff_diff/chaisemartin_dhaultfoeuille.py#L3387, tests/test_chaisemartin_dhaultfoeuille.py#L2656. (cran.r-universe.dev)
  • Severity P1 [Newly identified]. Impact: the new heterogeneity path rejects trends_linear and trends_nonparam outright, but the reference interface documents predict_het as incompatible with normalized and controls, not with trend-adjusted or state-set specifications. Because REGISTRY.md and the fit() docstring mention only the controls restriction, this is an undocumented methodology/API deviation rather than a documented limitation. Concrete fix: either implement heterogeneity on the trends_linear and trends_nonparam paths, or document those extra restrictions in REGISTRY.md and the public docstring, then align the tests to that documented contract. References: diff_diff/chaisemartin_dhaultfoeuille.py#L2145, docs/methodology/REGISTRY.md#L618, tests/test_chaisemartin_dhaultfoeuille.py#L2784, tests/test_chaisemartin_dhaultfoeuille.py#L2793. (cran.r-universe.dev)
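The coarser-than-group pre-fit check from the second finding could look roughly like this (hypothetical names; the point is that a singleton set leaves a switcher with no same-set controls):

```python
import pandas as pd

def validate_trends_nonparam(df, group_col, set_col):
    """Pre-fit check (hypothetical sketch): the state-set partition
    must be coarser than group, i.e. every set must contain at least
    two groups; otherwise a switcher has no same-set control pool."""
    groups_per_set = df.groupby(set_col)[group_col].nunique()
    singleton = groups_per_set[groups_per_set < 2]
    if len(singleton):
        raise ValueError(
            f"trends_nonparam sets with a single group: {list(singleton.index)}; "
            "the set definition must be coarser than group"
        )

df = pd.DataFrame({"group": [1, 2, 3], "state": ["a", "a", "b"]})
try:
    validate_trends_nonparam(df, "group", "state")  # set "b" is a singleton
except ValueError as e:
    print("rejected:", e)
```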

Code Quality

No separate findings beyond the methodology issues above.

Performance

No findings.

Maintainability

No separate findings beyond the methodology/contract issues above.

Tech Debt

No matching TODO.md entry mitigates the P1 findings above; the existing dCDH TODO items cover only Phase 1 placebo SE and mixed-direction parity. Reference: TODO.md#L51

Security

No findings.

Documentation/Tests

No findings.

Path to Approval

  1. In _compute_covariate_residualization(), stop disabling DID^X for an entire baseline when only some controls are dropped; residualize with the retained controls and add a duplicated-control regression test.
  2. Add a targeted trends_nonparam validation that rejects set definitions that are not coarser than group, and cover it with a trends_nonparam=group test.
  3. Either implement heterogeneity for trends_linear and trends_nonparam, or explicitly document those extra restrictions in REGISTRY.md and the fit() docstring and keep the rejection tests as the documented contract.

P1: DID^X rank-deficiency now residualizes with finite subset of
    theta_hat (zeroing NaN coefficients) instead of skipping entirely.

P1: trends_nonparam now rejects set definitions that are not coarser
    than group (singleton sets have no within-set controls).

P1: heterogeneity restrictions on trends_linear and trends_nonparam
    now documented in REGISTRY.md and fit() docstring.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 13, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: bc1dab76d3175c18152776658a22a427d5b4a98d


Overall Assessment

⛔ Blocker

Executive Summary

  • Prior re-review items are mostly addressed: the DID^X collinearity fallback now keeps identified controls, trends_nonparam=group now errors, and the heterogeneity restrictions are documented in both fit() and the Methodology Registry. References: diff_diff/chaisemartin_dhaultfoeuille.py:2866, diff_diff/chaisemartin_dhaultfoeuille.py:1082, docs/methodology/REGISTRY.md:618
  • P0: trends_nonparam is propagated into the main multi-horizon DID and IF paths, but not into the dynamic placebo point-estimate or placebo IF paths, so placebo diagnostics use the wrong control pool under state-set trends.
  • P1: DID^X first-stage regressions still hard-fail when a baseline-specific control sample has fewer observations than regressors; this new code path does not handle the small-sample edge case gracefully.
  • No separate performance or security concerns stood out in the changed files.

Methodology

  • Severity P0. Impact: under trends_nonparam, the PR correctly restricts same-set controls in _compute_multi_horizon_dids() and _compute_per_group_if_multi_horizon(), but the placebo paths still use unrestricted baseline-matched controls. That means results.placebo_event_study, placebo SEs, and placebo p-values are silently computed for the wrong estimator whenever state-set trends are requested. The CRAN manual defines trends_nonparam as comparing switchers only to not-yet-switched controls with the same baseline treatment and the same set value, and it defines placebos as pre-treatment comparisons between switchers and “their controls,” so the same restriction has to carry into the placebo contract. Concrete fix: thread set_ids through _compute_multi_horizon_placebos() and _compute_per_group_if_placebo_horizon(), apply the same same-set mask used in the positive-horizon DID/IF helpers, and add a regression test where cross-set controls change the placebo estimate and placebo SE under trends_nonparam. References: diff_diff/chaisemartin_dhaultfoeuille.py:1279, diff_diff/chaisemartin_dhaultfoeuille.py:1416, diff_diff/chaisemartin_dhaultfoeuille.py:3406, diff_diff/chaisemartin_dhaultfoeuille.py:3541, diff_diff/chaisemartin_dhaultfoeuille.py:3566, diff_diff/chaisemartin_dhaultfoeuille.py:3661. (search.r-project.org)
  • Severity P1. Impact: _compute_covariate_residualization() still aborts the fit when any baseline-specific DID^X first-stage has fewer observations than regressors. The reference contract says those regressions are estimated separately by baseline treatment and, if one stratum has fewer observations than variables, the command continues running while dropping some controls and warning the user. In this implementation, the design includes an intercept, all controls, and non-reference time FEs, then solve_ols() raises immediately on n < k, so a legitimate small-stratum panel crashes instead of degrading gracefully. Concrete fix: detect n_obs < n_params before calling solve_ols() and either emulate the reference behavior by dropping unidentified control/time-FE columns with a warning, or skip residualization for that baseline with a documented warning; then add a small-sample regression test with multiple controls and too few not-yet-treated control cells. References: diff_diff/chaisemartin_dhaultfoeuille.py:2828, diff_diff/chaisemartin_dhaultfoeuille.py:2843, diff_diff/linalg.py:545. (search.r-project.org)
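The small-stratum guard from the P1 finding can be sketched as follows. The helper is hypothetical and shows the skip-with-warning variant rather than the column-dropping variant the reference package uses:

```python
import warnings
import numpy as np

def first_stage_theta(Y, X):
    """Per-baseline first stage (hypothetical sketch): if the stratum
    has fewer observations than regressors, skip residualization for
    this baseline with a warning instead of crashing the whole fit."""
    n_obs, n_params = X.shape
    if n_obs < n_params:
        warnings.warn(
            f"first stage has {n_obs} obs for {n_params} params; "
            "skipping residualization for this baseline"
        )
        return None
    theta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return theta

X_small = np.ones((2, 5))  # 2 observations, 5 regressors
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    theta = first_stage_theta(np.zeros(2), X_small)
print(theta is None)  # True: degraded gracefully, no crash
```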

Code Quality

No separate findings beyond the methodology issues above.

Performance

No findings.

Maintainability

No separate findings beyond the propagation issue above.

Tech Debt

No matching TODO.md entry mitigates the P0/P1 issues above; the existing dCDH TODO items only track Phase 1 placebo SE and mixed-direction parity. Reference: TODO.md:51

Security

No findings.

Documentation/Tests

  • Severity P3. Impact: adjusted event-study outputs are still labeled as plain DID_l in summary() and to_dataframe("event_study"), so users cannot tell raw, DID^X, and DID^{fd} horizons apart on the public surface. Concrete fix: thread feature-specific horizon labels into both renderers. References: diff_diff/chaisemartin_dhaultfoeuille_results.py:717, diff_diff/chaisemartin_dhaultfoeuille_results.py:994

  • Verification note: I did not execute the test suite in this environment; the available Python interpreter is missing numpy.

Path to Approval

  1. Propagate set_ids into _compute_multi_horizon_placebos() and _compute_per_group_if_placebo_horizon(), apply the same same-set control restriction there, and add a trends_nonparam + placebo regression test where cross-set controls would otherwise change the placebo output.
  2. Make _compute_covariate_residualization() handle n_obs < n_params per baseline without aborting the fit; add a test with multiple controls/time FE and too few not-yet-treated control cells.
  3. Rename the public event-study labels in summary() and to_dataframe("event_study") so adjusted horizons are surfaced as adjusted estimands.

P0: trends_nonparam set_ids now threaded into both placebo functions
    (_compute_multi_horizon_placebos and _compute_per_group_if_placebo_horizon)
    so placebo diagnostics use same-set controls when state-set trends active.

P1: DID^X first-stage now guards n_obs < n_params before calling solve_ols.
    Skips residualization for that baseline with warning instead of crashing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 13, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 000dc46f63ade3e2cc74e6a5b659504f0ba34ac3


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior P0 on trends_nonparam placebo control pools appears fixed: set_ids now reaches both placebo point-estimate and placebo IF helpers.
  • P1: DID^X still diverges from the source contract when a baseline-specific first stage is empty or underidentified; the code now warns and leaves that baseline unresidualized instead of handling it the way the reference implementation does.
  • P1 [Newly identified]: the new DID^X first-stage regressions ignore collapsed cell weights on disaggregated inputs, so results can silently diverge from the documented/reference estimator when (g,t) cell sizes differ.
  • P3: adjusted event-study outputs are still labeled as plain DID_l in the public renderers.
  • No separate performance or security concerns stood out in the changed files.

Methodology

  • Severity P1. Impact: method affected is DID^X (Web Appendix Section 1.2). In _compute_covariate_residualization(), baseline strata with n_obs == 0 or n_obs < n_params are kept in-sample and left completely unresidualized, so the fitted event-study/overall ATT can silently mix adjusted and raw outcomes for different baseline-treatment levels. The official did_multiplegt_dyn docs say the controls regressions are estimated separately by baseline treatment on not-yet-treated controls, and the reference implementation warns about singular designs by dropping unidentified controls and drops only baseline-treatment levels where residualization actually fails; it does not retain those levels unadjusted. Concrete fix: make failed first stages source-consistent by either estimating with dropped controls/time-FE columns or dropping the failed baseline-treatment level from the estimation sample with an explicit warning and REGISTRY note; add regression tests for both the n_obs == 0 and n_obs < k branches. References: diff_diff/chaisemartin_dhaultfoeuille.py:2793, diff_diff/chaisemartin_dhaultfoeuille.py:2847, REGISTRY.md:612. (search.r-project.org)

  • Severity P1 [Newly identified]. Impact: method affected is again DID^X. The new first-stage regressions are run via unweighted solve_ols() even though the official contract says estimation is weighted by the number of observations (or summed weights) in each group × time cell when the input is more disaggregated than group × time, and the reference residualization step passes weight_XX into the FE regression. On individual-level panels with unequal cell sizes, this will silently move the residualization away from the reference estimator. Concrete fix: pass the aggregated cell weights for the valid first-difference observations into solve_ols(weights=...), and add a regression test with unequal numbers of rows per (g,t) cell. References: diff_diff/chaisemartin_dhaultfoeuille.py:2864, REGISTRY.md:612. (search.r-project.org)
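The cell-weighting fix in the second finding amounts to weighted least squares, which can be implemented by sqrt-weight rescaling. A minimal sketch, assuming integer cell counts as weights; `weighted_ols` is a hypothetical stand-in for passing `weights=` into the library's solver, and the check below uses the standard identity that integer-weighted WLS equals OLS on row-repeated data:

```python
import numpy as np

def weighted_ols(X, y, weights):
    """Weighted least squares via sqrt-weight rescaling.

    Each row (here: a (g, t) cell) contributes proportionally to its
    weight, e.g. the number of underlying observations in the cell.
    """
    w = np.sqrt(np.asarray(weights, dtype=float))
    beta, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
    return beta

# A cell with weight 3 should pull the fit exactly like 3 repeated rows.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 1.0, 5.0])
w = np.array([3.0, 1.0, 1.0])

X_rep = np.vstack([X[[0]]] * 3 + [X[[1]], X[[2]]])
y_rep = np.concatenate([[y[0]] * 3, y[1:]])
beta_w = weighted_ols(X, y, w)
beta_rep, *_ = np.linalg.lstsq(X_rep, y_rep, rcond=None)
```

The two coefficient vectors agree exactly, which is also the property a disaggregated-input regression test could pin down.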

Code Quality

No separate findings beyond the methodology issues above.

Performance

No findings.

Maintainability

No separate findings beyond the public-surface labeling issue below.

Tech Debt

No matching entry in TODO.md:51 mitigates the P1s above; the existing dCDH TODOs only cover Phase 1 per-period placebo SE and mixed-direction parity.

Security

No findings.

Documentation/Tests

Path to Approval

  1. Make DID^X first-stage failure handling source-consistent: do not keep baseline-treatment strata unadjusted when residualization is empty or underidentified; either estimate with dropped controls/time FE or drop those strata from the estimation sample, and add regression tests for n_obs == 0 and n_obs < k.
  2. Pass collapsed cell weights into the DID^X first-stage regressions, and add a regression test on disaggregated input with unequal (g,t) cell sizes.

P1: Failed DID^X first-stage strata (n_obs=0 or n_obs < n_params) now
    have outcomes set to NaN so they're excluded from downstream DID
    computation. Previously left unadjusted, mixing raw + adjusted.

P1: Cell-weight deviation documented in REGISTRY (equal cell weights
    vs R's N_gt observation-count weights - same Phase 1 convention).

P3: Event study rows in to_dataframe("event_study") now labeled as
    DID^X_h, DID^{fd}_h, DID^{X,fd}_h when adjustments active.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 13, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 8fd2e604a92e930bd5cfed3243a7c844811c6c66


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior trends_nonparam placebo control-pool issue appears resolved: set_ids now reaches both the positive-horizon and placebo point-estimate / IF helpers.
  • The prior DID^X cell-weighting concern is now explicitly documented in docs/methodology/REGISTRY.md, so under the review rules it is informational only, not a blocker.
  • One prior P1 remains unresolved: failed DID^X first-stage strata are set to NaN during residualization, but they are not actually excluded from downstream horizon eligibility. That can poison adjusted event-study and placebo outputs instead of dropping only the failed strata.
  • to_dataframe("event_study") now labels adjusted horizons correctly, but summary() still prints plain DID_l, and the combined controls + trends_linear overall label still drops the X.
  • No separate performance or security problems stood out in the changed files.

Methodology

Code Quality

  • No separate findings. The changed inference paths continue to use safe_inference(), and I did not see a new inline t-stat / partial-NaN anti-pattern in the modified files.

Performance

  • No findings.

Maintainability

  • No separate findings beyond the public-surface labeling issue below.

Tech Debt

  • No separate findings. TODO.md does not mitigate the P1 above; the existing dCDH review-debt entries cover different items. References: TODO.md:L51-L60.

Security

  • No findings.

Documentation/Tests

Path to Approval

  1. Thread a failed-stratum exclusion mask out of _compute_covariate_residualization() and apply it consistently to adjusted positive-horizon and placebo point-estimate / IF helpers before eligible, N_l, and N_pl_l are finalized.
  2. Add regression tests for both DID^X failure branches (n_obs == 0 and n_obs < n_params) where another baseline remains estimable, and assert the valid adjusted event-study / placebo outputs stay finite while the failed stratum is excluded.
  3. Fix the remaining Phase 3 results labels in summary() and _estimand_label() so DID^X, DID^{fd}, and DID^{X,fd} render consistently.

…bels

P1: Failed DID^X first-stage strata now have N_mat zeroed out (not just
    Y_mat NaN'd). The downstream eligibility checks (N_mat[g,idx] > 0)
    in all DID/IF/placebo functions naturally exclude these groups from
    N_l/N_pl_l, preventing NaN poisoning of otherwise estimable horizons.
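The commit's rationale for zeroing N_mat rather than only NaN-ing Y_mat can be seen on a toy weighted aggregate. The array names mirror the commit message, not the library internals, and the aggregation below is a deliberately simplified stand-in for the real DID computation:

```python
import numpy as np

Y_mat = np.array([[1.0, 2.0], [np.nan, np.nan], [3.0, 4.0]])  # row 1: failed stratum
N_mat = np.array([[10.0, 10.0], [5.0, 5.0], [20.0, 20.0]])

# NaN-ing Y_mat alone: the failed row's NaN products poison the sums.
naive = np.sum(N_mat * Y_mat, axis=0) / np.sum(N_mat, axis=0)

# Zeroing the failed row's weights: an N_mat > 0 eligibility mask drops
# the row entirely, keeping the remaining strata finite.
N_mat[1, :] = 0.0
cols = []
for j in range(Y_mat.shape[1]):
    m = N_mat[:, j] > 0
    cols.append(np.sum(N_mat[m, j] * Y_mat[m, j]) / np.sum(N_mat[m, j]))
safe = np.array(cols)
```

The naive aggregate is NaN in every column, while the masked one stays finite with only the surviving strata contributing.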

P3: _estimand_label() now returns DID^{X,fd}_l (not DID^{fd}_l) when
    both controls and trends_linear active with L_max>=2.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 13, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: d087f2152706b0d2e14718b5ccaab50d21640f31


Overall Assessment

⛔ Blocker

Executive Summary

  • The prior DID^X failed-first-stage exclusion issue looks materially improved: the explicit n_obs==0 / n_obs<n_params branches now feed failed_baselines and zero N_mat before the multi-horizon and placebo paths.
  • One blocker remains in the linear-trends extension: trends_linear=True is not propagated correctly through the existing normalization / aggregation surfaces.
  • The new DID^X validation still accepts non-finite controls; an Inf control can collapse the first stage after finite-row filtering and silently leave a baseline unadjusted.
  • The registry now documents the intended R deviations (cell-weight DID^X first stage, conservative cumulated DID^{fd} SE, partial heterogeneity, descriptive Design-2), so those items are informational only.
  • Public reporting is still only partially fixed: summary() keeps plain DID_l labels and a Cost-Benefit Delta header on adjusted trend summaries.

Methodology

Code Quality

  • Severity P1. Impact: the DID^X input guard rejects NaN controls but not ±Inf. After pd.to_numeric, Inf survives validation, _compute_covariate_residualization() drops non-finite first-stage rows via finite_mask, and the n_obs==0-after-filter branch just continues without warning or failed_baselines bookkeeping. Cells with non-finite covariates are also left raw in the level residualization step. That means a non-finite control can silently mix adjusted and unadjusted cells or fall back to raw DID for an affected baseline, contrary to the registry note that failed first stages are excluded from estimation. Concrete fix: require np.isfinite on control columns up front, and route the n_obs==0-after-finite_mask branch through the same warning / failed_baselines exclusion path used for the explicit failure branches. References: diff_diff/chaisemartin_dhaultfoeuille.py:L648, diff_diff/chaisemartin_dhaultfoeuille.py:L2828, diff_diff/chaisemartin_dhaultfoeuille.py:L2929, docs/methodology/REGISTRY.md:L612.
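The up-front finiteness check proposed in the fix can be sketched as follows. Names are illustrative only (`validate_controls_finite` is not the library's validation hook); the point is that `pd.to_numeric` lets ±Inf survive, so an explicit `np.isfinite` check is required:

```python
import numpy as np
import pandas as pd

def validate_controls_finite(df, controls):
    """Reject NaN and +/-Inf control values before residualization.

    Sketch of the proposed guard: flag non-finite values loudly with a
    count and example rows instead of silently filtering them later.
    """
    for col in controls:
        vals = pd.to_numeric(df[col], errors="coerce").to_numpy()
        bad = ~np.isfinite(vals)
        if bad.any():
            rows = df.index[bad][:3].tolist()
            raise ValueError(
                f"control '{col}' has {int(bad.sum())} non-finite "
                f"value(s), e.g. at rows {rows}"
            )

df = pd.DataFrame({"x1": [1.0, 2.0, np.inf], "x2": [0.5, 0.6, 0.7]})
```

With this guard, the Inf in `x1` fails loudly at validation time instead of collapsing the first stage after finite-row filtering.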

Performance

No findings.

Maintainability

Tech Debt

No findings. The current TODO.md entries do not track the P0/P1 above, so they remain unmitigated.

Security

No findings.

Documentation/Tests

  • Severity P3. Impact: the new tests cover parity for point effects under controls / trends, but they do not exercise the normalized_effects / cost_benefit_delta interaction under trends_linear or the non-finite-control DID^X failure path. The trend tests currently lock in cost_benefit_delta is None instead. Concrete fix: add regression tests for trends_linear normalized / delta outputs and for controls with np.inf (or another finite-filter-to-zero first-stage case). References: tests/test_chaisemartin_dhaultfoeuille.py:L2554, tests/test_chaisemartin_dhaultfoeuille_parity.py:L385.
  • I did not rerun the suite in this environment; python3 here does not have the project dependencies installed (numpy is missing).

Path to Approval

  1. For trends_linear=True, stop computing normalized_effects from the raw DID^{fd}_l second differences. Either implement the generalized normalized / delta aggregations from the cumulated trend-adjusted path, or gate both surfaces off and document that limitation in REGISTRY.md.
  2. Reject non-finite control values during DID^X validation, and make the finite_mask -> n_obs==0 branch warn and exclude the failed baseline exactly like the explicit n_obs==0 / n_obs<n_params branches.
  3. Add regression coverage for both issues: one trends_linear case that checks normalized / delta behavior at l>=2, and one DID^X case with non-finite controls that must fail loudly or exclude the baseline.
  4. Finish the remaining adjusted-label fixes in summary() so the printed surface matches the already-fixed to_dataframe() labels.

…guard

P0: normalized_effects suppressed under trends_linear (was normalizing
    second-differences DID^{fd}_l instead of level effects). REGISTRY
    documents that normalized_effects and cost_benefit_delta are both
    unavailable under trends_linear.

P1: Non-finite (Inf) control values now rejected with ValueError during
    DID^X validation (was silently collapsing first-stage OLS).

P3: summary() event study header and row labels now use _horizon_label()
    (DID^X_l, DID^{fd}_l, DID^{X,fd}_l matching to_dataframe).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 13, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: e3d51db3b187db18be0176a69ce3f2180df78942


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior blocker on trends_linear normalization / cost-benefit output is resolved: those surfaces are now suppressed and documented in the registry.
  • The prior DID^X non-finite-control issue is also resolved: Inf controls are now rejected before residualization.
  • Severity P1: trends_nonparam does not reject missing set assignments; all-NA groups can pass validation and then get silently treated as having no same-set controls.
  • Severity P1: design2_effects does not propagate active adjustments consistently; controls are respected in the plain Design-2 path but silently dropped once trends_linear=True.
  • Severity P2: controls=[] falls through to an internal np.stack([]) failure instead of a clear API-level validation error.
  • Severity P3: trend-adjusted public reporting is still partially inconsistent; summary() still presents a Cost-Benefit Delta block even when that surface is intentionally suppressed.

Methodology

  • Severity P1. Impact: the new state-set trend path does not validate missing set membership. trends_nonparam only checks column existence, time-invariance, and that the partition is coarser than group; it never rejects NA-only groups. Those groups then hit the same-set filter with NaN == NaN -> False, so they are silently treated as having no within-set controls and get excluded under a generic “no baseline-matched controls” warning. That is a missing assumption/input check on the Section 1.4 control-pool definition, and it changes control-group composition without a targeted error. Concrete fix: explicitly reject missing set assignments before building set_ids_arr, with counts/examples, and add a regression test for all-NA and mixed-NA set membership. References: diff_diff/chaisemartin_dhaultfoeuille.py:L1071-L1111, diff_diff/chaisemartin_dhaultfoeuille.py:L3454-L3456, diff_diff/chaisemartin_dhaultfoeuille.py:L3589-L3591, docs/methodology/REGISTRY.md:L616-L616

  • Severity P1. Impact: the new Design-2 surface has inconsistent parameter propagation. In the results builder, design2_effects uses Y_mat when trends_linear=False, so controls flow through as residualized levels; but the same code path switches to raw y_pivot whenever trends_linear=True, which silently drops the requested covariate adjustment in the combined design2 + controls + trends_linear case. The registry note only documents that Design-2 is a descriptive convenience wrapper, not that active adjustments are conditionally ignored. This is an incomplete parameter-propagation / semantic-contract bug on a new public surface. Concrete fix: either carry the adjusted level-outcome surface through to Design-2 even when trends are enabled, or explicitly reject unsupported design2 + adjustment combinations with a clear error and document that contract. References: diff_diff/chaisemartin_dhaultfoeuille.py:L2354-L2372, docs/methodology/REGISTRY.md:L620-L620

  • Severity P3. Impact: the previous trends_linear blocker is now mitigated correctly. normalized_effects and cost_benefit_delta are no longer exposed under trends_linear, and the registry documents that suppression as an intentional deviation. Concrete fix: none. References: diff_diff/chaisemartin_dhaultfoeuille.py:L1523-L1537, diff_diff/chaisemartin_dhaultfoeuille.py:L2163-L2174, docs/methodology/REGISTRY.md:L614-L614
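The NaN-membership failure mode in the first bullet is easy to reproduce: NaN never compares equal to itself, so a same-set equality mask built from a missing set assignment silently matches nothing. A minimal sketch of the behavior plus the proposed guard (`check_set_ids` is a hypothetical name for the fit-time validation):

```python
import numpy as np
import pandas as pd

# Group 2 has a missing set id; the equality mask is all False, so the
# group looks like it has "no within-set controls" rather than erroring.
set_ids = np.array([1.0, 1.0, np.nan, 2.0])
same_set = set_ids == set_ids[2]   # NaN == NaN -> False everywhere

def check_set_ids(values):
    """Proposed up-front guard: reject missing set assignments with a
    count before any same-set mask is built."""
    s = pd.Series(values)
    if s.isna().any():
        raise ValueError(
            f"{int(s.isna().sum())} group(s) have missing set assignments"
        )
```

With the guard in place, the all-NA and mixed-NA cases fail loudly at validation instead of being folded into the generic empty-control-pool warning.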

Code Quality

  • Severity P2. Impact: controls=[] is not handled explicitly. The empty list passes the early controls is not None gate, then the fit path reaches np.stack(X_pivots, axis=2) with no arrays and raises an opaque internal error instead of a user-facing validation message. This is a new edge-case failure path on the public API. Concrete fix: treat an empty list as None or raise an early ValueError("controls must contain at least one column name"), and add a regression test. References: diff_diff/chaisemartin_dhaultfoeuille.py:L633-L673, diff_diff/chaisemartin_dhaultfoeuille.py:L972-L980
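The early guard proposed above might look like this (hypothetical `normalize_controls` helper; the real fit path may choose to validate elsewhere, or to treat an empty list as None instead of raising):

```python
def normalize_controls(controls):
    """Fail fast on controls=[] instead of letting it reach
    np.stack([]) deep inside the fit path. Illustrative helper only."""
    if controls is None:
        return None
    controls = list(controls)
    if not controls:
        raise ValueError("controls must contain at least one column name")
    return controls
```

This keeps the public contract simple: None means no covariate adjustment, a non-empty list means DID^X, and the empty list is an explicit user error.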

Performance

No findings.

Maintainability

  • Severity P3. Impact: the reporting surface is still partially inconsistent for trends_linear with L_max >= 2. summary() still centers an overall Cost-Benefit Delta block even though delta is intentionally suppressed in that mode, and it never prints the actual cumulated level effects from linear_trends_effects. The underlying estimator contract is now safe, but the summary surface still nudges users toward the wrong top-level object. Concrete fix: make the overall block explicitly say that no aggregate delta is reported under trends_linear, and add a separate linear_trends_effects table to summary(). References: diff_diff/chaisemartin_dhaultfoeuille_results.py:L582-L590, diff_diff/chaisemartin_dhaultfoeuille_results.py:L726-L748

Tech Debt

No findings. TODO.md does not already track the P1 items above, so they remain unmitigated. Reference: TODO.md:L51-L79

Security

No findings.

Documentation/Tests

I did not rerun the test suite in this environment.

Path to Approval

  1. Reject missing / NA set assignments in trends_nonparam before constructing set_ids_arr, and add a regression test that proves the path now fails loudly instead of silently excluding those groups.
  2. Define a single contract for design2 under active adjustments: either compute it from the adjusted level-outcome surface when controls / trends_linear are active, or reject unsupported combinations with a clear ValueError. Add regression coverage for design2=True with controls, with trends_linear, and with both together.

P1: trends_nonparam now rejects NaN/missing set assignments with ValueError.
P1: design2_effects always uses raw level outcomes from y_pivot (not
    residualized or first-differenced Y_mat).
P2: controls=[] now raises ValueError instead of crashing on np.stack([]).
P3: summary() overall block labeled "N/A under trends_linear" when
    trends + L_max>=2 (was "Cost-Benefit Delta" with NaN value).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 13, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 8a57c5df34a85cce09c71863187801dac8c6a028


Overall Assessment

⚠️ Needs changes

Executive Summary

  • Static review indicates the prior trends_nonparam missing-set blocker is addressed: missing set IDs are now rejected before set_ids_arr is built.
  • The prior Design-2 propagation blocker remains unresolved, and it is broader now: design2_effects still ignores controls, trends_linear, and trends_nonparam.
  • The DID^X cell-weighted first stage, the DID^{fd} conservative cumulated SE, and the general “Design-2 is a convenience wrapper” deviations are documented in REGISTRY.md, so I did not count those documented deviations as defects.
  • I did not find new P0 NaN/inference inconsistencies in the added DID^X / DID^{fd} / state-set / heterogeneity code paths.
  • This was a static review only; I could not run the test suite in this shell because pytest/numpy are unavailable.

Methodology

  • Severity P1. Impact: design2_effects is still computed from raw level outcomes and does not propagate active adjustment options or same-set restrictions. fit() always passes raw y_pivot.to_numpy() into _compute_design2_effects(), and the helper itself has no way to consume residualized outcomes, first-differenced outcomes, or set IDs. That means a fit with design2=True plus controls, trends_linear, or trends_nonparam produces a mixed results object: the main estimator is adjusted, but the Design-2 block is not. The current registry note documents Design-2 as a descriptive convenience wrapper, but it does not explicitly document that these active options are ignored. This is an incomplete parameter-propagation / semantic-contract defect on a new public surface. References: diff_diff/chaisemartin_dhaultfoeuille.py:L2375-L2387, diff_diff/chaisemartin_dhaultfoeuille.py:L3147-L3222, docs/methodology/REGISTRY.md:L620-L620. Concrete fix: either reject design2=True whenever any of controls, trends_linear, or trends_nonparam is active, with an explicit ValueError plus a registry note, or recompute the Design-2 block from the same adjusted outcome surface and thread the same-set restriction through the helper.
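The reject-unsupported-combinations half of the concrete fix could be sketched as follows. The helper and parameter names are hypothetical (they mirror the fit() options discussed in the review, not the library's actual signature), and the alternative fix of propagating the adjusted outcome surface is not shown:

```python
def check_design2_compat(design2, controls, trends_linear, trends_nonparam):
    """Reject design2=True whenever an adjustment option is active,
    instead of silently computing Design-2 on raw outcomes."""
    if not design2:
        return
    active = [name for name, on in (
        ("controls", bool(controls)),
        ("trends_linear", bool(trends_linear)),
        ("trends_nonparam", trends_nonparam is not None),
    ) if on]
    if active:
        raise ValueError(
            "design2=True is not supported with active adjustments: "
            + ", ".join(active)
        )
```

The error message names every active adjustment, which makes the three regression cases (controls, trends_linear, both) easy to pin down in tests.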

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity P3. Impact: summary() now suppresses the misleading aggregate under trends_linear, but it still does not surface the new Phase 3 outputs (linear_trends_effects, heterogeneity_effects, design2_effects). Those are only discoverable by inspecting attributes or calling to_dataframe(), which leaves the primary reporting path incomplete for a large feature addition. References: diff_diff/chaisemartin_dhaultfoeuille_results.py:L582-L790, diff_diff/chaisemartin_dhaultfoeuille_results.py:L1054-L1074. Concrete fix: add dedicated summary blocks, or at minimum an explicit summary note pointing users to to_dataframe(level="linear_trends"), to_dataframe(level="heterogeneity"), and results.design2_effects.

Tech Debt

  • No findings. TODO.md does not already track the P1 above, so it remains unmitigated. Reference: TODO.md:L51-L79.

Security

  • No findings.

Documentation/Tests

Path to Approval

  1. Define a single supported contract for design2 under active adjustments. Either make design2 consume the adjusted estimator surface and same-set restriction, or reject design2=True when controls, trends_linear, or trends_nonparam is active.
  2. Add regression coverage for design2=True with controls, trends_linear, and trends_nonparam, plus one missing-set-membership test for trends_nonparam.

P1: REGISTRY now explicitly documents that design2_effects always uses
    raw (unadjusted) outcomes regardless of active controls/trends/nonparam.
    For full adjusted Design-2 estimation, the paper recommends running
    the standard estimator on a restricted subsample with trends_nonparam.

P3: Added test_nan_set_membership_rejected for trends_nonparam NaN guard.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 13, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 4da0841a6f00352035465b657be78784a2cbbf2d


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The previous Design-2 propagation blocker is resolved: this PR now explicitly documents design2 as a raw descriptive convenience wrapper, so the remaining mismatch is no longer undocumented.
  • The previous trends_nonparam missing-set-membership blocker is also resolved: NaN set assignments are now rejected and covered by a regression test.
  • New blocker: trends_nonparam still lacks the paper’s set-support check, so it can silently change the estimand by dropping unsupported switcher/horizon pairs instead of rejecting or documenting the unsupported design.
  • I did not find new P0 NaN/inference inconsistencies in the added DID^X, DID^{fd}, heterogeneity, or Design-2 code paths.
  • Static review only: I could not execute the test suite in this shell because importing the package fails without numpy.

Methodology

  • Severity P1. Impact: the state-set-trends implementation (trends_nonparam) only validates that the set column exists, is time-invariant, and is coarser than group, then relies on the same-set control restriction to silently drop switcher/horizon pairs when no within-set controls remain. See diff_diff/chaisemartin_dhaultfoeuille.py:L1077 and diff_diff/chaisemartin_dhaultfoeuille.py:L3475. The paper’s state-set extension requires Assumption 14, including a common last untreated period across sets; otherwise the target parameters must be redefined rather than estimated as the paper’s (δ_{+,ℓ}). REGISTRY.md documents the same-set restriction but not this support-condition omission. See docs/methodology/REGISTRY.md:L616. Concrete fix: compute set-specific support (T_u^s) up front and either reject panels that violate Assumption 14, or explicitly document and warn that the library trims to horizon-specific within-set support and therefore estimates a different parameter. (nber.org)

  • Severity P3. Impact: the prior Design-2 blocker is now mitigated. The code feeds raw levels into the Design-2 block at diff_diff/chaisemartin_dhaultfoeuille.py:L2375, and the deviation is explicitly documented at docs/methodology/REGISTRY.md:L620. Concrete fix: none required.
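The set-specific support object T_u^s from the first bullet can be computed with a simple groupby. A sketch under assumed column names (`set_id`, `t`, `d` are illustrative, not the library's schema); the flag it returns is what a fit-time Assumption 14 check would branch on:

```python
import numpy as np
import pandas as pd

def last_untreated_by_set(df):
    """For each set, find the last period with at least one untreated
    group (a stand-in for T_u^s) and flag unequal support across sets."""
    untreated = df[df["d"] == 0]
    t_u = untreated.groupby("set_id")["t"].max()
    return t_u, bool(t_u.nunique() > 1)

df = pd.DataFrame({
    "set_id": [1, 1, 1, 2, 2, 2],
    "t":      [1, 2, 3, 1, 2, 3],
    "d":      [0, 0, 1, 0, 1, 1],   # set 2 loses untreated support earlier
})
t_u, unequal = last_untreated_by_set(df)
```

Here set 1 keeps an untreated group through period 2 while set 2 loses support after period 1, so the check would either reject the panel or trigger the documented trimmed-support warning.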

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity P3. Impact: summary() still does not surface linear_trends_effects, heterogeneity_effects, or design2_effects, so large parts of the new Phase 3 surface remain discoverable only through attribute inspection or to_dataframe() for two of the three outputs. See diff_diff/chaisemartin_dhaultfoeuille_results.py:L506 and diff_diff/chaisemartin_dhaultfoeuille_results.py:L1054. Concrete fix: add a short summary footer pointing users to to_dataframe("linear_trends"), to_dataframe("heterogeneity"), and design2_effects, and consider a to_dataframe("design2") level.

Tech Debt

  • No findings. I did not see this methodology issue tracked in TODO.md:L51, so it remains unmitigated.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: the added tests cover basic trends_nonparam validation, but there is still no regression test for the paper-critical case where untreated support ends at different horizons across sets. Existing coverage starts at tests/test_chaisemartin_dhaultfoeuille.py:L2613 and does not pin the unequal-T_u^s branch. Concrete fix: add a two-set panel where one set loses untreated controls earlier than the other and assert the chosen reject-or-document behavior.

Path to Approval

  1. Add a fit-time trends_nonparam support check for Assumption 14, especially equality of set-specific last-untreated horizons (T_u^s) across sets; reject unsupported panels or explicitly redefine/document the estimand.
  2. Add a regression test covering unequal within-set support horizons, so the chosen behavior is pinned and future changes cannot silently reintroduce the mismatch.

P1: REGISTRY now documents that trends_nonparam does not enforce
    Assumption 14 (common last-untreated period across sets) up front.
    When within-set controls are exhausted at a given horizon, affected
    switcher/horizon pairs are excluded via the existing empty-control-pool
    mechanism. The effective estimand is trimmed to within-set support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 13, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: d20efde1c20dab8f41dc23dc490321aeb9b7afe1


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior trends_nonparam common-support blocker is now mitigated: REGISTRY.md explicitly documents the Assumption 14 omission as a trimmed-support deviation, so it is no longer a defect.
  • The prior Design-2 documentation blocker is also mitigated: REGISTRY.md now labels design2 as a descriptive raw-outcome convenience wrapper.
  • New blocker: trends_linear=True is applied before the legacy per_period_effects path, so results.per_period_effects / to_dataframe("per_period") no longer represent the Phase 1 per-period DID decomposition. With first-differenced outcomes, later periods become second-difference objects, and a first-post-baseline switch can yield NaN because Z[:,0] is explicitly NaN.
  • I did not find new P0 NaN/inference anti-patterns in the added SE/inference code; the new branches consistently use safe_inference().
  • Static review only: I could not execute the test suite in this shell because python fails to import numpy.

Methodology

Affected methods in this PR are the paper’s covariate-adjusted DID (DID^X), group-specific linear-trends DID (DID^{fd} / Lemma 6), state-set trends, and heterogeneity testing extensions from the Web Appendix. (rdocumentation.org)

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: the new trends_nonparam tests start at tests/test_chaisemartin_dhaultfoeuille.py:2613, but they still do not pin the newly documented unequal-support trimming branch described at docs/methodology/REGISTRY.md:616. That leaves the trimmed-support behavior unguarded against regression. Concrete fix: add a two-set panel where one set exhausts untreated controls earlier than the other, then assert the warning and the horizon-specific exclusion behavior.

Static review only: I could not run the suite because python in this environment raises ModuleNotFoundError: numpy.

Path to Approval

  1. Make results.per_period_effects / to_dataframe("per_period") consistent under trends_linear=True: either keep them on raw outcomes or suppress/rename them so they no longer advertise Phase 1 did_plus_t / did_minus_t quantities while using first-differenced outcomes.
  2. Add regression tests covering trends_linear=True for a panel with a first-post-baseline switch, plus the controls + trends_linear branch, and assert the intended per-period contract so the surface cannot silently flip between raw and second-difference behavior.

P1: When trends_linear=True without controls, _compute_per_period_dids
    and _compute_cohort_recentered_inputs now use raw y_pivot outcomes
    (not the first-differenced Z_mat). Previously the per-period path
    would double-difference the already-differenced outcomes.
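The double-differencing hazard this commit fixes can be seen on a toy outcome path: cumulating first differences recovers levels exactly, so a per-period path fed the already-differenced matrix observes second differences instead of the intended per-period objects (a simplified illustration; the real Z_mat bookkeeping also handles the NaN baseline column):

```python
import numpy as np

y = np.array([0.0, 1.0, 3.0, 6.0])   # raw outcome levels
z = np.diff(y)                       # first differences

# Cumulating the differences recovers the level path exactly,
# which is what the trends_lin level effects rely on.
levels_back = np.concatenate(([y[0]], y[0] + np.cumsum(z)))

# A per-period computation run on z instead of y effectively
# differences the data a second time.
second_diff = np.diff(z)
```

The level path round-trips, while the second differences are a different object entirely, which is why the per-period surface must start from raw outcomes.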

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 13, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: cd73917dea80c62f7f90d84697ce4c7e0cd25213


Overall Assessment

✅ Looks good

Executive Summary

  • The prior P1 on trends_linear=True corrupting the public per-period surface is resolved: _compute_per_period_dids() now receives raw outcomes/counts instead of the first-differenced matrix, so per_period_effects no longer silently become second-difference objects. See diff_diff/chaisemartin_dhaultfoeuille.py:1143 and diff_diff/chaisemartin_dhaultfoeuille.py:1152.
  • I did not find any new unmitigated P0/P1 issues in the DID^X, trends_lin, trends_nonparam, heterogeneity, or Design-2 branches.
  • P3: design2_effects is still only exposed as a raw attribute; there is no to_dataframe("design2") path or summary surfacing.
  • P3: the public results docs are not fully updated for the new Phase 3 surfaces, and _compute_covariate_residualization() still has a stale return annotation.
  • Static review only: I could not execute tests here because this shell lacks numpy and pytest.

Methodology

No unmitigated findings. The affected methods are DID^X covariate adjustment, trends_lin, trends_nonparam, and predict_het/heterogeneity. The reference package manual describes:

  • controls as first-difference residualization with time fixed effects, estimated on not-yet-switched cells and separately by baseline treatment;
  • trends_lin as computing effects on first differences and summing them back to levels;
  • trends_nonparam as restricting controls to groups with the same time-invariant set;
  • predict_het as including placebo regressions and a joint test.

Against that source material and the registry, this PR’s remaining departures read as documented limitations, not blockers. (cran.r-universe.dev)
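The trends_lin cumulation, together with the conservative cumulated-SE bound documented in the registry, reduces to a two-line sketch (illustrative helper, not the library's API):

```python
import numpy as np

def cumulate_fd(fd_effects, fd_ses):
    """Sum per-horizon first-difference effects back to level effects.

    The SE uses the conservative upper bound noted in REGISTRY.md:
    treating horizons as perfectly correlated gives
    SE(level_l) <= sum over h <= l of SE(fd_h).
    """
    return np.cumsum(fd_effects), np.cumsum(fd_ses)
```

An exact cumulated SE would need the cross-horizon covariances, which is precisely the library extension the registry flags as going beyond the paper's per-horizon proofs.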

The paper/manual also support the PR’s treatment of trend-adjusted effects as first-difference estimands cumulated back to levels and note the equal-support requirement for set-specific trends. The registry now documents the finite-sample SE normalization gap, the trimmed-support behavior under trends_nonparam, the partial heterogeneity implementation, and the Design-2 convenience-wrapper scope, so those are mitigated P3 informational items rather than defects. See docs/methodology/REGISTRY.md:612, docs/methodology/REGISTRY.md:614, docs/methodology/REGISTRY.md:616, docs/methodology/REGISTRY.md:618, and docs/methodology/REGISTRY.md:620. (cran.r-universe.dev)
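The heterogeneity test's core contrast (saturated OLS with cohort dummies, then a joint equality test across cohorts) can be sketched as follows. This is an illustrative simplification: it uses homoskedastic OLS variances for brevity, whereas the paper and library rely on influence-function-based variances, and all names here are hypothetical.

```python
import numpy as np

def het_joint_test(effects, cohorts):
    """Sketch: regress estimated group-level effects on cohort dummies
    (saturated OLS) and Wald-test equality of all cohort means."""
    levels, inv = np.unique(cohorts, return_inverse=True)
    D = np.eye(len(levels))[inv]                        # cohort dummies
    beta = np.linalg.lstsq(D, effects, rcond=None)[0]   # cohort means
    resid = effects - D @ beta
    dof = len(effects) - len(levels)
    sigma2 = resid @ resid / dof                        # homoskedastic
    V = sigma2 * np.linalg.inv(D.T @ D)
    # Contrast matrix: mean(cohort j) - mean(cohort 0) = 0 for j > 0
    R = np.hstack([-np.ones((len(levels) - 1, 1)), np.eye(len(levels) - 1)])
    c = R @ beta
    W = c @ np.linalg.solve(R @ V @ R.T, c)  # ~ chi-squared, J-1 dof
    return beta, W
```

Equal cohort means drive the Wald statistic to zero; any cross-cohort spread in the effects pushes it up.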

Code Quality

Performance

No findings.

Maintainability

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

Static review only: python here raises ModuleNotFoundError: numpy, and pytest is unavailable, so I could not run the added test cases.

- Fix _compute_covariate_residualization return type annotation to
  include failed_baselines set (was 2-tuple, now 3-tuple)
- Add to_dataframe("design2") level for Design-2 results
- Update to_dataframe() docstring with all new levels
- Update results dataclass docstring: replace Phase 3 placeholder
  text with actual field descriptions for covariate_residuals,
  linear_trends_effects, heterogeneity_effects, design2_effects

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 13, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: bfed3e649218f13263d56cdb3471451962e19ce7


Overall Assessment

✅ Looks good

Executive Summary

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 13, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber igerber merged commit d801cc1 into main Apr 14, 2026
19 checks passed
@igerber igerber deleted the dcdh-phase3b-covariates-trends branch April 14, 2026 00:39
@igerber igerber mentioned this pull request Apr 14, 2026