diff --git a/CHANGELOG.md b/CHANGELOG.md index 49455a6e..f73d63bf 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -10,8 +10,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Added - **`generate_synthetic_control_data()` data generator + a capstone `SyntheticControl` tutorial.** New public generator (`diff_diff/prep_dgp.py`, exported from `diff_diff`) builds a **single-treated-unit** factor-model panel for synthetic-control demos and tests: one treated unit whose latent factor loadings and baseline are an exact convex combination of a few donors (so the noiseless trajectory lies in the donor convex hull and a synthetic control reproduces it closely — the observed fit is approximate under added noise), persistent AR(1) factors, predictor covariates that each proxy a distinct factor, a common calendar time effect, and a known `"ramp"` or `"constant"` treatment effect emitted as `true_effect`. Tutorial **`docs/tutorials/25_synthetic_control_policy.ipynb`** walks the whole `SyntheticControl` surface end-to-end on a policy-evaluation story (one state adopts a clean-energy standard), structured around **two inference philosophies**: cross-unit permutation (`in_space_placebo` + Firpo–Possebom `confidence_set`, with `leave_one_out` / `in_time_placebo` robustness) versus over-time conformal (CWZ `conformal_test` / `conformal_confidence_intervals` / `conformal_average_effect`), with the per-period conformal band as the climax. A `tests/test_t25_synthetic_control_policy_drift.py` drift guard re-derives every quoted number from the generator. - **`TwoStageDiD` methodology validation (Gardner 2022 / `did2s`).** New `tests/test_methodology_two_stage.py` with paper-equation-numbered Verified Components (§3 two-stage procedure / eqs. 4 & 6, §3.3 GMM variance, fn. 19 always-treated exclusion, Proposition 5, covariate path, `balance_e`, `vcov_type` narrowing) plus a `did2s::did2s()` cross-language parity fixture (`benchmarks/R/generate_did2s_golden.R` → `benchmarks/data/did2s_golden.json` + `did2s_test_panel.csv`), pinning overall + event-study ATT (abs 1e-6) and SE (abs 1e-7). `METHODOLOGY_REVIEW.md` `TwoStageDiD` row flipped to **Complete**. +- **`p_val_type` parameter on `DifferenceInDifferences` (and inherited `TwoWayFixedEffects`)** for the wild cluster bootstrap, mirroring `fwildclusterboot::boottest`: `"two-tailed"` (default — test on `|t|`, two-tailed inverted CI, which may be asymmetric) or `"equal-tailed"` (each tail at `alpha/2`, equal-tailed CI). Only used when `inference="wild_bootstrap"`; inert on `MultiPeriodDiD` (which falls back to analytical inference). ### Fixed +- **Wild cluster bootstrap (`inference="wild_bootstrap"`) now imposes the null — fixes an invalid p-value (issue #543).** `DifferenceInDifferences`/`TwoWayFixedEffects` with `inference="wild_bootstrap"` previously produced a p-value that contradicted its own confidence interval (e.g. CI `[2.30, 2.64]` excluding 0, yet `p = 0.86`). `diff_diff.utils.wild_bootstrap_se` *claimed* to run the Wild Cluster **Restricted** bootstrap but never actually imposed the null — it re-fit the full design (keeping the treatment column) to the unchanged outcome, so the "restricted" residuals equaled the unrestricted ones and the bootstrap coefficient distribution centered on the estimate instead of 0. The p-value `mean(|t*| ≥ |t₀|)` then measured noise around the estimate (≈0.5–0.86 regardless of significance) while the percentile-of-coefficients CI happened to look fine — an internal contradiction. The bootstrap now genuinely imposes H₀ (drops the coefficient's column for the restricted fit), studentizes with the analytical CR1 SE, and derives the CI by **test inversion** so the p-value and CI are exactly consistent (`0 ∈ CI ⟺ p ≥ alpha`). For Rademacher weights with few clusters the full `2**n_clusters` sign-vector set is enumerated (deterministic), matching R's `fwildclusterboot::boottest`. **Results change** for any prior `wild_bootstrap` use: the headline `p_value`/`conf_int` are corrected (a true effect is now correctly significant), and the reported `se` is now the analytical cluster-robust (CR1) SE (numerically ~unchanged in well-behaved cases). Validated against `fwildclusterboot::boottest()` (`benchmarks/R/generate_wild_cluster_boot_golden.R`; bootstrap t-distribution to ~6e-14, `se`/`t`/interior-`p` exact, CI to ~1e-4) and an independent full-refit enumeration. See `docs/methodology/REGISTRY.md` §"Wild cluster bootstrap (WCR)". +- **Cluster-robust / HC1 standard errors no longer raise `ZeroDivisionError` on a saturated design.** `linalg.compute_robust_vcov` (NumPy path) divided by `(n_eff - k)` in the HC1/CR1 small-sample adjustment without guarding a design with no residual degrees of freedom (`n_eff == k`, e.g. a 2×2 DiD with one observation per cluster-period); it now returns a NaN vcov so inference is degenerate (NaN), consistent with the all-or-nothing NaN convention, rather than crashing. Surfaced while hardening the wild cluster bootstrap (`wild_bootstrap_se` independently routes saturated / weak-identification designs to NaN, and represents a genuinely unbounded inverted CI with `±inf` instead of mixing finite point estimates with NaN endpoints). +- **Wild cluster bootstrap on a rank-deficient full-dummy design no longer crashes when storing the vcov.** `_run_wild_bootstrap_inference` computed the stored cluster-robust vcov via `compute_robust_vcov(X, ...)` on the full design, which inverts `X'X` directly and raises (or returns garbage) when a nuisance column is collinear (e.g. a fixed-effect dummy collinear with treatment) — even though the ATT is identified and the bootstrap itself drops such columns. It now computes the stored vcov through the rank-aware `solve_ols(..., rank_deficient_action="silent")` path, NaN-expanding the dropped column (bit-identical to the prior result on full-rank designs). - **`TwoStageDiD` analytical GMM standard errors are now exact (match R `did2s` to ~1e-7).** The Gardner two-stage GMM sandwich `_compute_gmm_variance` derived its residuals from the *iterative* alternating-projection first-stage fixed effects (`_iterative_fe`, which converge only to ~1e-7 on unbalanced untreated panels) while computing `gamma_hat` exactly — leaving the variance ~1% off the analytical sandwich. The variance now re-solves the Stage-1 FE **exactly** (sparse OLS, reusing the `gamma_hat` factorization), and `_build_fe_design` gained an intercept column so its column space spans the grand mean (the prior intercept-free design omitted it, and the exact residual is first-order sensitive to it). Unidentified-FE obs (rank-deficient / Proposition-5) fall back to the iterative residual, so those edge cases are unchanged; the reported `overall_att` still uses the iterative FE (point-estimate equivalence with `ImputationDiD` preserved). Mirrors the same-class fix already applied to `ImputationDiD`'s exact-sparse variance. - **`LinearRegression.get_se()` / `get_inference()` no longer return a `NaN` standard error from a tiny-negative variance artifact.** A high-leverage / degenerate coefficient (e.g. an absorbed-FE dummy near-collinear with the treatment, whose Bell-McCaffrey Satterthwaite DOF already hits the noise-floor guard) can have a CR2/HC variance of ~0 (≈1e-32) whose vcov diagonal lands just-below-zero under BLAS-dependent float rounding; `np.sqrt` of the negative then produced a `NaN` SE **nondeterministically** — passing single-threaded but failing under the parallel pure-Python full-suite run (`tests/test_methodology_wls_cr2.py::TestLinearRegressionFENanGuardEndToEnd::test_did_absorbed_fe_lr_inference_nan_for_guarded_coefs`). Both SE sites now clamp the vcov diagonal at 0, so the SE is finite (0 for a genuinely-zero variance), deterministic, and BLAS-independent. **No change for any positive variance** (the clamp is a no-op there); only the previously-`NaN` degenerate case is affected. - **`TripleDifference` power analysis now honors `n_periods > 2`.** `simulate_power`, diff --git a/TODO.md b/TODO.md index 6b0e2968..52f52a63 100644 --- a/TODO.md +++ b/TODO.md @@ -166,6 +166,7 @@ Deferred items from PR reviews that were not addressed before merge. | CR2 Bell-McCaffrey DOF uses a naive `O(n² k)` per-coefficient loop over cluster pairs. Pustejovsky-Tipton (2018) Appendix B has a scores-based formulation that avoids the full `n × n` `M` matrix. Switch when a user hits a large-`n` cluster-robust design. | `linalg.py::_compute_cr2_bm` | Phase 1a | Low | | `SyntheticControl` retains a full `_SyntheticControlFitSnapshot` (pivoted outcome/predictor panels) on EVERY fit to support the opt-in `in_space_placebo()`, so callers who never run the placebo still pay O(units × periods × predictor-vars) memory (same as `SyntheticDiD`'s always-on snapshot for `in_time_placebo`). Store a compact array/index representation instead of per-variable DataFrames, or build the snapshot lazily on first placebo call (would need to retain the source data, ~same cost). | `synthetic_control.py` snapshot build, `synthetic_control_results.py::_SyntheticControlFitSnapshot` | follow-up | Low | | EfficientDiD DR (covariate) path rebuilds the full polynomial sieve basis `_polynomial_sieve_basis(X, K)` for every candidate `K` inside each of the three nuisance fits (outcome regression, propensity ratio, inverse propensity), per `fit()`. After the growing-sieve cap removal (PR-B), large covariate-adjusted fits at large `n` pay more avoidable basis-construction cost. Cache the basis per `(X, K)` within a `fit()` and share it across the nuisance helpers. | `diff_diff/efficient_did_covariates.py` (the three sieve helpers) | PR-B follow-up | Low | +| Wild cluster bootstrap CI inversion calls `_t_star(r)` ~O(100) times (outward bracketing + bisection per endpoint), and each call materializes a fresh `(B × n)` `y_star` matrix plus the `(k × B)` refit and `(n × B)` residual arrays. For large panels or large `n_bootstrap` this allocation churn is noticeable. The bootstrap is for the few-cluster regime (small `B` when enumerated; `n` typically modest), so it is acceptable today; if a large-`n`/large-`B` user hits it, chunk `_t_star` over bootstrap draws or precompute the `r`-independent cluster-level pieces (the restricted residuals are linear in `r`) so each inversion evaluation avoids rebuilding the full `B × n` matrix. | `diff_diff/utils.py::wild_bootstrap_se._t_star` | #543 | Low | #### Testing/Docs diff --git a/benchmarks/R/generate_wild_cluster_boot_golden.R b/benchmarks/R/generate_wild_cluster_boot_golden.R new file mode 100644 index 00000000..a9cd504e --- /dev/null +++ b/benchmarks/R/generate_wild_cluster_boot_golden.R @@ -0,0 +1,92 @@ +#!/usr/bin/env Rscript +# Golden generator: wild_bootstrap_se vs R `fwildclusterboot::boottest` +# (Roodman, MacKinnon, Nielsen & Webb 2019; Cameron-Gelbach-Miller 2008). +# +# Writes a fixed few-cluster 2x2 DiD design and the R reference so that +# tests/test_wild_bootstrap.py::TestWildBootstrapParityR can pin Python output +# against R without requiring R at test time. +# +# Outputs (checked into the repo): +# benchmarks/data/wild_cluster_boot_test_data.csv (cluster, treated, post, y) +# benchmarks/data/wild_cluster_boot_golden.json +# +# Usage: +# Rscript benchmarks/R/generate_wild_cluster_boot_golden.R +# +# Notes: +# - boottest defaults are the WCR (restricted) bootstrap: impose_null=TRUE, +# type="rademacher". boottest switches to full enumeration once B reaches the +# number of possible draws (2^G); with G=6 and B=99999 it enumerates all +# 2^6=64 raw sign-vectors (deterministic, no RNG), of which 2^(G-1)=32 have +# distinct |t|. The library's wild_bootstrap_se uses the same trigger +# (2^n_clusters <= n_bootstrap) and also materializes all 2^G raw vectors, +# reporting n_bootstrap = 2^G; both yield the same p-value and CI. +# - The DGP uses a deliberately weak effect + heavy noise so the bootstrap +# p-value is INTERIOR (not 0/1), letting the test pin the exact p-value. +# boottest counts strict exceedances |t*| > |t0|; the library matches this +# (it floors the reported p at 1/(B+1), inactive for an interior p). +# - se is the analytical CR1 cluster-robust SE = (G/(G-1))((N-1)/(N-k)) sandwich, +# which the library reports as `se` and uses to studentize the test. boottest +# studentizes with the same CR1 SE, so teststat == coef/se. +# - The CI is obtained by inverting the bootstrap test (boottest's grid search +# vs the library's bisection); they agree to ~1e-4 on this design. + +suppressMessages({ + library(fwildclusterboot) + library(fixest) + library(jsonlite) +}) + +set.seed(20240624) +G <- 6 +obs_per_cluster <- 8 +rows <- list() +i <- 1 +for (c in 0:(G - 1)) { + is_treated <- as.integer(c < G / 2) + cluster_effect <- rnorm(1, 0, 1.5) + for (o in 1:obs_per_cluster) { + for (period in c(0, 1)) { + y <- 4.0 + cluster_effect + 1.0 * period + if (is_treated == 1 && period == 1) y <- y + 0.3 # weak effect + y <- y + rnorm(1, 0, 4.0) # heavy noise -> interior p + rows[[i]] <- data.frame(cluster = c, treated = is_treated, post = period, y = y) + i <- i + 1 + } + } +} +df <- do.call(rbind, rows) +df$inter <- df$treated * df$post + +data_path <- "benchmarks/data/wild_cluster_boot_test_data.csv" +write.csv(df[, c("cluster", "treated", "post", "y")], data_path, row.names = FALSE) + +m <- feols(y ~ treated + post + inter, data = df, cluster = ~cluster) +coef_est <- as.numeric(coef(m)["inter"]) +se_cr1 <- as.numeric(se(m)["inter"]) # CR1 clustered SE + +run_bt <- function(ptype) { + set.seed(1) + bt <- boottest(m, param = "inter", clustid = ~cluster, B = 99999, + type = "rademacher", impose_null = TRUE, + p_val_type = ptype, conf_int = TRUE, sign_level = 0.05) + list(p_val = as.numeric(bt$p_val), + t_stat = as.numeric(bt$t_stat), + conf_int = as.numeric(bt$conf_int), + n_clusters = as.integer(bt$N_G)) +} + +golden <- list( + n_clusters = G, + coef = coef_est, + se_cr1 = se_cr1, + two_tailed = run_bt("two-tailed"), + equal_tailed = run_bt("equal-tailed") +) + +json_path <- "benchmarks/data/wild_cluster_boot_golden.json" +write(toJSON(golden, auto_unbox = TRUE, digits = 12, pretty = TRUE), json_path) +cat("Wrote", data_path, "and", json_path, "\n") +cat(sprintf("coef=%.6f se=%.6f two-tailed p=%.6f CI=[%.6f, %.6f]\n", + coef_est, se_cr1, golden$two_tailed$p_val, + golden$two_tailed$conf_int[1], golden$two_tailed$conf_int[2])) diff --git a/benchmarks/data/wild_cluster_boot_golden.json b/benchmarks/data/wild_cluster_boot_golden.json new file mode 100644 index 00000000..13978a6a --- /dev/null +++ b/benchmarks/data/wild_cluster_boot_golden.json @@ -0,0 +1,17 @@ +{ + "n_clusters": 6, + "coef": 2.23239205609, + "se_cr1": 0.8998239199017, + "two_tailed": { + "p_val": 0.03125, + "t_stat": 2.480920996559, + "conf_int": [0.226920726858, 4.86871767596], + "n_clusters": 6 + }, + "equal_tailed": { + "p_val": 0.03125, + "t_stat": 2.480920996559, + "conf_int": [0.226920726858, 4.86871767596], + "n_clusters": 6 + } +} diff --git a/benchmarks/data/wild_cluster_boot_test_data.csv b/benchmarks/data/wild_cluster_boot_test_data.csv new file mode 100644 index 00000000..dca36153 --- /dev/null +++ b/benchmarks/data/wild_cluster_boot_test_data.csv @@ -0,0 +1,97 @@ +"cluster","treated","post","y" +0,1,0,0.467900024557135 +0,1,1,5.28330432772302 +0,1,0,6.21815961331612 +0,1,1,8.91123549413264 +0,1,0,1.6036031176057 +0,1,1,5.93987635703142 +0,1,0,5.51552308604298 +0,1,1,12.5845789538936 +0,1,0,10.043533800938 +0,1,1,-0.529572566385062 +0,1,0,11.6147463969497 +0,1,1,4.90478267058331 +0,1,0,0.0195534899449052 +0,1,1,-0.623734360455213 +0,1,0,4.62408114371322 +0,1,1,1.77824940819063 +1,1,0,1.44117212105917 +1,1,1,11.8287065860359 +1,1,0,2.48612322817247 +1,1,1,0.79272738518101 +1,1,0,0.25664105566934 +1,1,1,5.46071019138058 +1,1,0,5.45585277762759 +1,1,1,1.72888361535044 +1,1,0,1.64819641981181 +1,1,1,-2.16930041278279 +1,1,0,-8.65527307341515 +1,1,1,-0.820012214512376 +1,1,0,2.00788152825625 +1,1,1,1.35027289877691 +1,1,0,-4.22739939644098 +1,1,1,2.96495871691995 +2,1,0,12.4088103290406 +2,1,1,3.86455595416874 +2,1,0,-3.77098700413387 +2,1,1,8.74834024074986 +2,1,0,7.63806603679012 +2,1,1,5.21226192690189 +2,1,0,10.3553929570071 +2,1,1,5.02404351098283 +2,1,0,0.0773467620776902 +2,1,1,3.8454453091425 +2,1,0,4.92829011733396 +2,1,1,8.18058751908359 +2,1,0,2.32327477678745 +2,1,1,3.85791466127983 +2,1,0,3.9943310874897 +2,1,1,5.63626346312242 +3,0,0,3.98548370346126 +3,0,1,2.08705100717492 +3,0,0,11.9671637110292 +3,0,1,8.50781843122744 +3,0,0,9.45919061415933 +3,0,1,9.17729331028167 +3,0,0,7.13828833203314 +3,0,1,3.60329239073677 +3,0,0,8.91231705393201 +3,0,1,6.51778257944759 +3,0,0,11.8343546913714 +3,0,1,5.85077180326843 +3,0,0,5.99854630042171 +3,0,1,3.83363405966151 +3,0,0,0.1722050079041 +3,0,1,13.599136275248 +4,0,0,7.39942475667185 +4,0,1,6.67760744120808 +4,0,0,8.06177241503042 +4,0,1,2.99574273253755 +4,0,0,4.17388457897458 +4,0,1,8.58064165732133 +4,0,0,1.35389122978994 +4,0,1,10.4092188896989 +4,0,0,10.0502116052792 +4,0,1,-0.0952531532868699 +4,0,0,-0.95058168214624 +4,0,1,2.07486027363059 +4,0,0,8.73683838173143 +4,0,1,6.20299878058477 +4,0,0,8.71918261324184 +4,0,1,6.78513130717663 +5,0,0,5.0782224773697 +5,0,1,2.95137909543423 +5,0,0,-0.00135445066279938 +5,0,1,2.94611250473384 +5,0,0,2.51953938488589 +5,0,1,-1.52349046695107 +5,0,0,0.164141843016002 +5,0,1,1.29757869447802 +5,0,0,2.59101825926055 +5,0,1,4.68185771046451 +5,0,0,0.988240247965372 +5,0,1,-3.44776984931728 +5,0,0,7.81515777811177 +5,0,1,-4.39325130500531 +5,0,0,1.60122778317178 +5,0,1,0.151072360378801 diff --git a/diff_diff/estimators.py b/diff_diff/estimators.py index fcaa0520..22d8d1ac 100644 --- a/diff_diff/estimators.py +++ b/diff_diff/estimators.py @@ -23,7 +23,6 @@ LinearRegression, _expand_vcov_with_nan, compute_r_squared, - compute_robust_vcov, solve_ols, ) from diff_diff.results import DiDResults, MultiPeriodDiDResults, PeriodEffect @@ -99,6 +98,12 @@ class DifferenceInDifferences: bootstrap_weights : str, default="rademacher" Type of bootstrap weights: "rademacher" (standard), "webb" (recommended for <10 clusters), or "mammen" (skewness correction). + p_val_type : str, default="two-tailed" + Shape of the wild cluster bootstrap test (mirrors + ``fwildclusterboot::boottest``): "two-tailed" (test on ``|t|``, + two-tailed inverted CI — which may be asymmetric) or "equal-tailed" + (each tail at ``alpha/2``, equal-tailed CI). Only used when + ``inference="wild_bootstrap"``. seed : int, optional Random seed for reproducibility when using bootstrap inference. If None (default), results will vary between runs. @@ -181,6 +186,7 @@ def __init__( inference: str = "analytical", n_bootstrap: int = 999, bootstrap_weights: str = "rademacher", + p_val_type: str = "two-tailed", seed: Optional[int] = None, rank_deficient_action: str = "warn", conley_coords: Optional[Tuple[str, str]] = None, @@ -208,6 +214,9 @@ def __init__( self.inference = inference self.n_bootstrap = n_bootstrap self.bootstrap_weights = bootstrap_weights + # Test shape for wild cluster bootstrap (mirrors fwildclusterboot's + # p_val_type): "two-tailed" (default) or "equal-tailed". + self.p_val_type = p_val_type self.seed = seed self.rank_deficient_action = rank_deficient_action # Conley spatial-HAC parameters; column names (NOT array values) for @@ -698,10 +707,12 @@ def _refit_did_absorb(w_r): inference_method = "analytical" n_bootstrap_used = None n_clusters_used = None + p_val_type_used = None if self._bootstrap_results is not None: inference_method = "wild_bootstrap" n_bootstrap_used = self._bootstrap_results.n_bootstrap n_clusters_used = self._bootstrap_results.n_clusters + p_val_type_used = self._bootstrap_results.p_val_type # Store results self.results_ = DiDResults( @@ -722,6 +733,7 @@ def _refit_did_absorb(w_r): inference_method=inference_method, n_bootstrap=n_bootstrap_used, n_clusters=n_clusters_used, + p_val_type=p_val_type_used, survey_metadata=survey_metadata, # Report the family that actually produced the SE, which may be # the remapped "hc1" (CR1) under the legacy alias path, not the @@ -804,6 +816,7 @@ def _run_wild_bootstrap_inference( alpha=self.alpha, seed=self.seed, return_distribution=False, + p_val_type=self.p_val_type, ) self._bootstrap_results = bootstrap_results @@ -812,8 +825,25 @@ def _run_wild_bootstrap_inference( conf_int = (bootstrap_results.ci_lower, bootstrap_results.ci_upper) t_stat = bootstrap_results.t_stat_original - # Also compute vcov for storage (using cluster-robust for consistency) - vcov = compute_robust_vcov(X, residuals, cluster_ids) + # Also compute the cluster-robust vcov for storage. Use the rank-aware + # solve_ols path (silently dropping collinear nuisance columns and + # NaN-expanding the vcov for them), matching how wild_bootstrap_se itself + # handles rank-deficient full-dummy designs — `compute_robust_vcov()` + # inverts the full X'X directly and would raise (or return garbage) on a + # rank-deficient design even though the ATT and bootstrap are identified. + # On a saturated design (degenerate bootstrap, NaN se) store a NaN vcov + # to keep the all-or-nothing NaN contract. (On a full-rank design this + # vcov is bit-identical to the prior compute_robust_vcov result.) + if np.isnan(se): + vcov = np.full((X.shape[1], X.shape[1]), np.nan) + else: + _, _, vcov = solve_ols( + X, + y, + cluster_ids=cluster_ids, + return_vcov=True, + rank_deficient_action="silent", + ) return se, p_value, conf_int, t_stat, vcov, bootstrap_results @@ -994,6 +1024,7 @@ def get_params(self) -> Dict[str, Any]: "inference": self.inference, "n_bootstrap": self.n_bootstrap, "bootstrap_weights": self.bootstrap_weights, + "p_val_type": self.p_val_type, "seed": self.seed, "rank_deficient_action": self.rank_deficient_action, "conley_coords": self.conley_coords, @@ -1319,6 +1350,9 @@ def fit( # type: ignore[override] # the analytical-fallback warning first would produce contradictory # guidance on the same call (warn "falling back" + raise "not # supported"). The Conley raise takes precedence. Codex CI R11 P3. + # NOTE: ``p_val_type`` is inherited from DifferenceInDifferences but is + # inert here — MultiPeriodDiD has no wild-bootstrap path (it falls back + # to analytical inference below), so the parameter has no effect. effective_inference = self.inference if self.inference == "wild_bootstrap" and self.vcov_type != "conley": warnings.warn( diff --git a/diff_diff/linalg.py b/diff_diff/linalg.py index 876c3964..8f08ba04 100644 --- a/diff_diff/linalg.py +++ b/diff_diff/linalg.py @@ -2623,6 +2623,38 @@ def _compute_robust_vcov_numpy( # ------------------------------------------------------------------ assert vcov_type == "hc1" + # Cluster-robust validity check FIRST: a cluster-robust request with fewer + # than 2 clusters is invalid and must raise the documented error — this has + # to precede the saturated-design guard below so a 1-cluster *saturated* fit + # still raises rather than being masked by the NaN return. (Mirrors the + # effective-cluster count computed in the cluster branch, including the + # zero-total-weight exclusion.) + if cluster_ids is not None: + # Normalize to an array first (as the cluster branch does) so the + # weighted groupby below cannot index-align a pandas Series grouper + # against the freshly-created Series(weights) and miscount clusters. + cluster_ids_arr = np.asarray(cluster_ids) + n_clusters_check = len(np.unique(cluster_ids_arr)) + if weights is not None and weight_type != "fweight" and np.any(weights == 0): + cluster_weight_sums = pd.Series(weights).groupby(cluster_ids_arr).sum() + n_clusters_check = int((cluster_weight_sums > 0).sum()) + if n_clusters_check < 2: + raise ValueError( + f"Need at least 2 clusters for cluster-robust SEs, got {n_clusters_check}" + ) + + # Saturated design (no residual degrees of freedom): both the HC1 adjustment + # n_eff/(n_eff-k) and the CR1 adjustment (n_eff-1)/(n_eff-k) divide by + # (n_eff - k), which is zero when the design exactly determines y. Return a + # NaN vcov so downstream inference is degenerate (NaN) rather than raising + # ZeroDivisionError — consistent with the library's all-or-nothing NaN + # convention for undefined inference. + if n_eff - k <= 0: + nan_vcov = np.full((k, k), np.nan) + if return_dof: + return nan_vcov, np.full(k, np.nan, dtype=np.float64) + return nan_vcov + # Compute weighted scores for cluster-robust meat (outer product of sums). # pweight/fweight multiply by w; aweight and unweighted use raw residuals. _use_weighted_scores = weights is not None and weight_type not in ("aweight",) diff --git a/diff_diff/results.py b/diff_diff/results.py index 83b18645..5edc628f 100644 --- a/diff_diff/results.py +++ b/diff_diff/results.py @@ -136,6 +136,7 @@ class DiDResults: inference_method: str = field(default="analytical") n_bootstrap: Optional[int] = field(default=None) n_clusters: Optional[int] = field(default=None) + p_val_type: Optional[str] = field(default=None) bootstrap_distribution: Optional[np.ndarray] = field(default=None, repr=False) # Survey design metadata (SurveyMetadata instance from diff_diff.survey) survey_metadata: Optional[Any] = field(default=None) @@ -210,11 +211,15 @@ def summary(self, alpha: Optional[float] = None) -> str: lines.append(f"{'Bootstrap replications:':<25} {self.n_bootstrap:>10}") if self.n_clusters is not None: lines.append(f"{'Number of clusters:':<25} {self.n_clusters:>10}") + if self.p_val_type is not None: + lines.append(f"{'Test type:':<25} {self.p_val_type:>10}") # Add variance family label (vcov_type) only when inference was analytical - # AND no survey design is in play. For wild-bootstrap the reported SE/CI - # come from resampling, so the analytical variance family would mislabel - # the actual inference source. Survey fits use Taylor linearization or + # AND no survey design is in play. For wild-bootstrap the reported SE is + # the analytical cluster-robust (CR1) SE but the p-value and CI come from + # the bootstrap test inversion, so labelling a single analytical variance + # family would mislabel the actual inference source. Survey fits use + # Taylor linearization or # replicate-weight variance instead of the analytical HC/CR sandwich; # _format_survey_block above already surfaces the survey inference # details (weight type, strata/PSU counts, replicate method), so a @@ -301,6 +306,8 @@ def to_dict(self) -> Dict[str, Any]: result["n_bootstrap"] = self.n_bootstrap if self.n_clusters is not None: result["n_clusters"] = self.n_clusters + if self.p_val_type is not None: + result["p_val_type"] = self.p_val_type if self.survey_metadata is not None: sm = self.survey_metadata result["weight_type"] = sm.weight_type diff --git a/diff_diff/twfe.py b/diff_diff/twfe.py index ce7ff445..750a1204 100644 --- a/diff_diff/twfe.py +++ b/diff_diff/twfe.py @@ -658,10 +658,12 @@ def _refit_twfe(w_r): inference_method = "analytical" n_bootstrap_used = None n_clusters_used = None + p_val_type_used = None if self._bootstrap_results is not None: inference_method = "wild_bootstrap" n_bootstrap_used = self._bootstrap_results.n_bootstrap n_clusters_used = self._bootstrap_results.n_clusters + p_val_type_used = self._bootstrap_results.p_val_type # Cluster label for summary: TWFE auto-clusters at unit level when # `self.cluster is None` AND the vcov family is cluster-compatible. @@ -718,6 +720,7 @@ def _refit_twfe(w_r): inference_method=inference_method, n_bootstrap=n_bootstrap_used, n_clusters=n_clusters_used, + p_val_type=p_val_type_used, survey_metadata=survey_metadata, # Report the family that actually produced the SE; may be the # remapped hc1 under the legacy alias path, not self.vcov_type. diff --git a/diff_diff/utils.py b/diff_diff/utils.py index 7723dd79..8ad63d25 100644 --- a/diff_diff/utils.py +++ b/diff_diff/utils.py @@ -10,9 +10,6 @@ import pandas as pd from scipy import stats -from diff_diff.linalg import compute_robust_vcov as _compute_robust_vcov_linalg -from diff_diff.linalg import solve_ols as _solve_ols_linalg - # Import Rust backend if available (from _backend to avoid circular imports) from diff_diff._backend import ( HAS_RUST_BACKEND, @@ -25,6 +22,8 @@ _rust_sc_weight_fw_weighted, _rust_sc_weight_fw_weighted_with_convergence, ) +from diff_diff.linalg import compute_robust_vcov as _compute_robust_vcov_linalg +from diff_diff.linalg import solve_ols as _solve_ols_linalg # Numerical constants for optimization algorithms _OPTIMIZATION_MAX_ITER = 1000 # Maximum iterations for weight optimization @@ -426,15 +425,17 @@ class WildBootstrapResults: Attributes ---------- se : float - Bootstrap standard error of the coefficient. + Analytical cluster-robust (CR1) standard error of the coefficient. The + wild bootstrap studentizes the test with this SE; it is not a rescaled + bootstrap dispersion. p_value : float - Bootstrap p-value (two-sided). + Wild cluster bootstrap p-value (two-tailed or equal-tailed). t_stat_original : float - Original t-statistic from the data. + Studentized statistic of the original estimate, ``(coef - null) / se``. ci_lower : float - Lower bound of the confidence interval. + Lower bound of the confidence interval (by test inversion). ci_upper : float - Upper bound of the confidence interval. + Upper bound of the confidence interval (by test inversion). n_clusters : int Number of clusters in the data. n_bootstrap : int @@ -443,8 +444,11 @@ class WildBootstrapResults: Type of bootstrap weights used ("rademacher", "webb", or "mammen"). alpha : float Significance level used for confidence interval. + p_val_type : str + Test shape used ("two-tailed" or "equal-tailed"). bootstrap_distribution : np.ndarray, optional - Full bootstrap distribution of coefficients (if requested). + Bootstrap distribution of the studentized statistic ``t*`` evaluated at + the null (if requested). References ---------- @@ -462,6 +466,7 @@ class WildBootstrapResults: n_bootstrap: int weight_type: str alpha: float = 0.05 + p_val_type: str = "two-tailed" bootstrap_distribution: Optional[np.ndarray] = field(default=None, repr=False) def summary(self) -> str: @@ -469,13 +474,14 @@ def summary(self) -> str: lines = [ "Wild Cluster Bootstrap Results", "=" * 40, - f"Bootstrap SE: {self.se:.6f}", + f"Cluster-robust SE: {self.se:.6f}", f"Bootstrap p-value: {self.p_value:.4f}", - f"Original t-stat: {self.t_stat_original:.4f}", + f"Studentized t-stat: {self.t_stat_original:.4f}", f"CI ({int((1-self.alpha)*100)}%): [{self.ci_lower:.6f}, {self.ci_upper:.6f}]", f"Number of clusters: {self.n_clusters}", f"Bootstrap reps: {self.n_bootstrap}", f"Weight type: {self.weight_type}", + f"Test type: {self.p_val_type}", ] return "\n".join(lines) @@ -585,6 +591,45 @@ def _generate_mammen_weights(n_clusters: int, rng: np.random.Generator) -> np.nd return np.asarray(rng.choice([val1, val2], size=n_clusters, p=[p1, 1 - p1])) +def _wild_weight_matrix( + n_clusters: int, + n_bootstrap: int, + weight_type: str, + rng: np.random.Generator, +) -> np.ndarray: + """Build the ``(B, n_clusters)`` matrix of cluster-level bootstrap weights. + + For Rademacher weights with few clusters all ``2**n_clusters`` sign-vectors + are enumerated (deterministic) once ``n_bootstrap`` reaches the number of + possible draws — i.e. when ``2**n_clusters <= n_bootstrap`` (and + ``n_clusters <= 20``, a guard against pathological memory use). This matches + the full-enumeration switch of ``fwildclusterboot::boottest`` (verified: for + ``G=10`` boottest samples at ``B=1023`` and enumerates at ``B=1024``); the + reported ``n_bootstrap`` is then ``2**n_clusters``. (Only ``2**(n_clusters-1)`` + of those draws have distinct ``|t*|`` — each draw and its all-signs-flipped + mirror share ``|t*|`` — but the full set is materialized.) Enumeration removes + RNG dependence in the few-cluster regime where the wild bootstrap matters + most. Otherwise ``n_bootstrap`` weight vectors are sampled. Webb/Mammen always + sample: the sign-flip enumeration symmetry is Rademacher-specific (Mammen is + asymmetric, Webb is a 6-point law). + """ + if weight_type == "rademacher" and n_clusters <= 20 and 2**n_clusters <= n_bootstrap: + n_enum = 2**n_clusters + bits = (np.arange(n_enum)[:, None] >> np.arange(n_clusters)) & 1 + return np.where(bits == 1, 1.0, -1.0) + + generators = { + "rademacher": _generate_rademacher_weights, + "webb": _generate_webb_weights, + "mammen": _generate_mammen_weights, + } + generate = generators[weight_type] + weights = np.empty((n_bootstrap, n_clusters)) + for b in range(n_bootstrap): + weights[b] = generate(n_clusters, rng) + return weights + + def wild_bootstrap_se( X: np.ndarray, y: np.ndarray, @@ -597,14 +642,29 @@ def wild_bootstrap_se( alpha: float = 0.05, seed: Optional[int] = None, return_distribution: bool = False, + p_val_type: str = "two-tailed", ) -> WildBootstrapResults: """ Compute wild cluster bootstrap standard errors and p-values. - Implements the Wild Cluster Residual (WCR) bootstrap procedure from - Cameron, Gelbach, and Miller (2008). Uses the restricted residuals - approach (imposing H0: coefficient = null_hypothesis) for more accurate - p-value computation. + Implements the Wild Cluster Restricted (WCR) bootstrap of Cameron, Gelbach, + and Miller (2008), matching the defaults of R's ``fwildclusterboot::boottest`` + (Roodman, MacKinnon, Nielsen & Webb 2019): the null ``H0: coefficient = + null_hypothesis`` is genuinely imposed by re-estimating the model with the + coefficient's column dropped, the bootstrap DGP resamples the *restricted* + residuals, and the confidence interval is obtained by **inverting the + bootstrap test** (the set of null values not rejected at level ``alpha``) so + that the p-value and CI are mutually consistent (``0 in CI`` iff + ``p >= alpha``). For Rademacher weights with few clusters all + ``2**n_clusters`` sign-vectors are enumerated (deterministic) when + ``2**n_clusters <= n_bootstrap`` (the ``boottest`` full-enumeration trigger — + it switches to enumeration once ``n_bootstrap`` reaches the number of + possible draws) and ``n_clusters <= 20`` (a memory guard); the reported + ``n_bootstrap`` is then ``2**n_clusters``. Otherwise signs are sampled. + + The reported ``se`` is the analytical cluster-robust (CR1) standard error of + the original estimate — the studentized bootstrap drives the p-value and CI, + not a re-scaled bootstrap dispersion. Parameters ---------- @@ -613,7 +673,9 @@ def wild_bootstrap_se( y : np.ndarray Outcome vector of shape (n,). residuals : np.ndarray - OLS residuals from unrestricted regression, shape (n,). + Retained for backward compatibility and IGNORED by the WCR + implementation, which recomputes the original fit and the restricted + (null-imposed) residualization internally from ``X`` and ``y``. cluster_ids : np.ndarray Cluster identifiers of shape (n,). coefficient_index : int @@ -635,7 +697,14 @@ def wild_bootstrap_se( Random seed for reproducibility. If None (default), results will vary between runs. return_distribution : bool, default=False - If True, include full bootstrap distribution in results. + If True, include the bootstrap distribution of the studentized statistic + ``t*`` (evaluated at the null) in the results. + p_val_type : str, default="two-tailed" + Shape of the test (mirrors ``boottest``'s ``p_val_type``): + + - "two-tailed": test on ``|t*|``; two-tailed CI by inversion (the + interval need not be symmetric about the estimate). + - "equal-tailed": each tail tested at ``alpha/2``; equal-tailed CI. Returns ------- @@ -680,6 +749,9 @@ def wild_bootstrap_se( valid_weight_types = ["rademacher", "webb", "mammen"] if weight_type not in valid_weight_types: raise ValueError(f"weight_type must be one of {valid_weight_types}, got '{weight_type}'") + valid_p_val_types = ["two-tailed", "equal-tailed"] + if p_val_type not in valid_p_val_types: + raise ValueError(f"p_val_type must be one of {valid_p_val_types}, got '{p_val_type}'") unique_clusters = np.unique(cluster_ids) n_clusters = len(unique_clusters) @@ -689,140 +761,240 @@ def wild_bootstrap_se( if n_clusters < 5: warnings.warn( - f"Only {n_clusters} clusters detected. Wild bootstrap inference may be " - "unreliable with fewer than 5 clusters. Consider using Webb weights " - "(weight_type='webb') for improved finite-sample properties.", + f"Only {n_clusters} clusters detected. Wild cluster bootstrap inference may be " + "unreliable with fewer than 5 clusters. With Rademacher weights all " + f"{2 ** n_clusters} sign-vectors are enumerated exactly when " + f"n_bootstrap >= 2**n_clusters = {2 ** n_clusters}; Webb weights " + "(weight_type='webb') improve finite-sample behaviour but are sampled, not " + "enumerated.", UserWarning, ) - # Initialize RNG rng = np.random.default_rng(seed) - - # Select weight generator - weight_generators = { - "rademacher": _generate_rademacher_weights, - "webb": _generate_webb_weights, - "mammen": _generate_mammen_weights, - } - generate_weights = weight_generators[weight_type] - n = X.shape[0] - # Step 1: Compute original coefficient and cluster-robust SE - beta_hat, _, vcov_original = _solve_ols_linalg(X, y, cluster_ids=cluster_ids, return_vcov=True) - original_coef = beta_hat[coefficient_index] - assert vcov_original is not None - se_original = np.sqrt(vcov_original[coefficient_index, coefficient_index]) - t_stat_original = (original_coef - null_hypothesis) / se_original - - # Step 2: Impose null hypothesis (restricted estimation) - # Create restricted y: y_restricted = y - X[:, coef_index] * null_hypothesis - # This imposes the null that the coefficient equals null_hypothesis - y_restricted = y - X[:, coefficient_index] * null_hypothesis - - # Fit restricted model (but we need to drop the column for the restricted coef) - # Actually, for WCR bootstrap we keep all columns but impose the null via residuals - # Re-estimate with the restricted dependent variable. - # - # Use return_fitted=True so we get NaN-safe fitted values from the kept - # columns when solve_ols drops rank-deficient nuisance columns. Without - # this, building y_star via `X @ beta_restricted` would propagate NaN - # through every observation whenever a nuisance column was dropped - # (e.g. always-treated unit dummy collinear with treated*post on the - # full-dummy TWFE HC2/HC2-BM path), poisoning the entire bootstrap loop - # despite the ATT being analytically identified. - beta_restricted, residuals_restricted, fitted_restricted, _ = _solve_ols_linalg( - X, y_restricted, return_vcov=False, return_fitted=True - ) - - # Create cluster-to-observation mapping for efficiency - cluster_map = {c: np.where(cluster_ids == c)[0] for c in unique_clusters} - cluster_indices = [cluster_map[c] for c in unique_clusters] - - # Step 3: Bootstrap loop - # Use NaN for invalid draws (singular bootstrap SE) and filter at the - # p-value step, rather than coercing to t*=0 which biases the p-value - # toward small values (since |0| < |t_original| counts as "non-rejection" - # only when the original t is large). - bootstrap_t_stats = np.full(n_bootstrap, np.nan) - bootstrap_coefs = np.full(n_bootstrap, np.nan) + def _degenerate() -> WildBootstrapResults: + # All-or-nothing NaN contract (feedback_bootstrap_nan_on_invalid_contract): + # when the original fit or the bootstrap is degenerate, NaN the entire + # (se, t_stat, p_value, ci) inference family together rather than mixing + # analytical and bootstrap quantities on the same coefficient. + return WildBootstrapResults( + se=float("nan"), + p_value=float("nan"), + t_stat_original=float("nan"), + ci_lower=float("nan"), + ci_upper=float("nan"), + n_clusters=n_clusters, + n_bootstrap=n_bootstrap, + weight_type=weight_type, + alpha=alpha, + p_val_type=p_val_type, + bootstrap_distribution=None, + ) - for b in range(n_bootstrap): - # Generate cluster-level weights - cluster_weights = generate_weights(n_clusters, rng) - - # Map cluster weights to observations - obs_weights = np.zeros(n) - for g, indices in enumerate(cluster_indices): - obs_weights[indices] = cluster_weights[g] - - # Construct bootstrap sample: y* = fitted_restricted + e_restricted * weights - # (fitted_restricted comes from solve_ols's kept-columns reconstruction, - # so it's NaN-safe even when beta_restricted has NaN on dropped columns) - y_star = fitted_restricted + residuals_restricted * obs_weights - - # Estimate bootstrap coefficients with cluster-robust SE - beta_star, residuals_star, vcov_star = _solve_ols_linalg( - X, y_star, cluster_ids=cluster_ids, return_vcov=True + # Step 1: original fit. Establishes the analytical cluster-robust (CR1) SE + # that studentizes the test, and the set of identified (kept) columns so the + # bootstrap stays rank-robust (e.g. an always-treated unit dummy collinear + # with treated*post on the full-dummy TWFE path: solve_ols drops the nuisance + # column and reports it as NaN, while the identified ATT is retained). + # First fit WITHOUT the cluster-robust vcov: this identifies the kept + # (full-rank) columns and lets us reject a saturated design *before* + # requesting the cluster sandwich. The shared CR1 small-sample adjustment + # (n_eff-1)/(n_eff-k) divides by zero on a saturated design (n == rank), so + # routing the degenerate case here keeps the all-or-nothing NaN contract. + beta_hat, _, _ = _solve_ols_linalg(X, y, return_vcov=False) + original_coef = float(beta_hat[coefficient_index]) + if not np.isfinite(original_coef): + return _degenerate() + kept = np.isfinite(beta_hat) + if not bool(kept[coefficient_index]): + return _degenerate() + X_eff = X[:, kept] + j_eff = int(np.sum(kept[:coefficient_index])) # position of the coef among kept columns + k_eff = X_eff.shape[1] + if n <= k_eff: # no residual degrees of freedom -> CR1 undefined + return _degenerate() + + # Now the cluster-robust (CR1) vcov is well-defined; it studentizes the test. + _, _, vcov_original = _solve_ols_linalg(X, y, cluster_ids=cluster_ids, return_vcov=True) + if vcov_original is None: + return _degenerate() + se_a = float(np.sqrt(vcov_original[coefficient_index, coefficient_index])) + if not np.isfinite(se_a) or se_a <= 0: + return _degenerate() + + # Projections on the (full-rank) effective design. + XtX_inv = np.linalg.inv(X_eff.T @ X_eff) + a_vec = X_eff @ XtX_inv[:, j_eff] # influence of each obs on beta_j: beta*_j = a_vec . y* + proj = XtX_inv @ X_eff.T # (k_eff, n) OLS projection onto coefficients + + # Restricted residualization imposing H0: regress y and x_j on X_eff \ {col j}. + # The restricted residuals u(r) = M_{-j} y - r * M_{-j} x_j are linear in the + # candidate null r, so the whole test can be re-evaluated at any r cheaply. + xj = X_eff[:, j_eff] + X_reduced = np.delete(X_eff, j_eff, axis=1) + if X_reduced.shape[1] == 0: + # Single-regressor design: the reduced model has no regressors, so the + # restricted fit is identically 0 and the residuals are the variables + # themselves (solve_ols cannot fit a zero-column design). + fit_y_red = np.zeros(n) + fit_xj_red = np.zeros(n) + else: + _, _, fit_y_red, _ = _solve_ols_linalg(X_reduced, y, return_vcov=False, return_fitted=True) + _, _, fit_xj_red, _ = _solve_ols_linalg( + X_reduced, xj, return_vcov=False, return_fitted=True ) - bootstrap_coefs[b] = beta_star[coefficient_index] - assert vcov_star is not None - se_star = np.sqrt(vcov_star[coefficient_index, coefficient_index]) - - # Compute bootstrap t-statistic (under null hypothesis); invalid - # draws (singular SE) leave the NaN sentinel for filtering below. - if se_star > 0 and np.isfinite(beta_star[coefficient_index]): - bootstrap_t_stats[b] = (beta_star[coefficient_index] - null_hypothesis) / se_star - - # Step 4: Compute bootstrap inference from VALID (finite) draws only. - # - # All-or-nothing NaN contract (per feedback_bootstrap_nan_on_invalid_contract): - # when bootstrap output is degenerate (fewer than 2 finite t-stats or - # 2 finite coefs), return NaN across the full inference surface (se, - # p_value, both CI endpoints, AND the surfaced t_stat_original). The - # original analytical t_stat is still computed in step 1 for diagnostic - # use but is NOT propagated to the user-facing result when bootstrap - # is degenerate — surfacing it alongside NaN se/p/CI would mix - # analytical and bootstrap inference families on the same coefficient. - finite_mask = np.isfinite(bootstrap_t_stats) - n_valid = int(finite_mask.sum()) - valid_coefs = bootstrap_coefs[np.isfinite(bootstrap_coefs)] - - lower_percentile = alpha / 2 * 100 - upper_percentile = (1 - alpha / 2) * 100 - - if n_valid >= 2 and valid_coefs.size >= 2: - p_value = float(np.mean(np.abs(bootstrap_t_stats[finite_mask]) >= np.abs(t_stat_original))) - # Ensure p-value is at least 1/(n_valid+1) to avoid exact zero. - p_value = float(max(p_value, 1 / (n_valid + 1))) - se_bootstrap = float(np.std(valid_coefs, ddof=1)) - ci_lower = float(np.percentile(valid_coefs, lower_percentile)) - ci_upper = float(np.percentile(valid_coefs, upper_percentile)) - surfaced_t_stat = t_stat_original + m_y = y - fit_y_red + m_xj = xj - fit_xj_red + + # CR1 small-sample correction. NOTE: this constant cancels in |t*| vs |t0| + # (it scales se* and se_a identically), so it affects only the reported SE, + # not the p-value or CI. Kept for fidelity with the analytical CR1 SE. + corr = (n_clusters / (n_clusters - 1)) * ((n - 1) / (n - k_eff)) + + # Cluster membership: indicator matrix C (G, n) for fast per-cluster score sums. + cluster_pos = {c: i for i, c in enumerate(unique_clusters)} + cl_idx = np.array([cluster_pos[c] for c in cluster_ids]) + cluster_indicator = np.zeros((n_clusters, n)) + cluster_indicator[cl_idx, np.arange(n)] = 1.0 + + # Fixed bootstrap weights, held constant across the whole test inversion so + # that p(r) is a stable (monotone, step) function amenable to root-finding. + weights = _wild_weight_matrix(n_clusters, n_bootstrap, weight_type, rng) + n_boot_eff = int(weights.shape[0]) + weights_obs = weights[:, cl_idx] # (B, n) + + def _t_star(r: float) -> np.ndarray: + """Studentized bootstrap statistics t*(r) under H0: beta_j = r.""" + u_r = m_y - r * m_xj # restricted residuals at r (n,) + # WCR DGP: y* = fitted_restricted + u_r * w = (y - u_r) + u_r * w_obs. + y_star = y[None, :] - u_r[None, :] * (1.0 - weights_obs) # (B, n) + beta_j_star = y_star @ a_vec # (B,) + coef_full = proj @ y_star.T # (k_eff, B) + resid_star = y_star.T - X_eff @ coef_full # (n, B) bootstrap residuals + scores = cluster_indicator @ (a_vec[:, None] * resid_star) # (G, B) per-cluster scores + se_star = np.sqrt(corr * np.sum(scores**2, axis=0)) # (B,) + with np.errstate(divide="ignore", invalid="ignore"): + t = (beta_j_star - r) / se_star + t[~(se_star > 0)] = np.nan + return t + + t_star = _t_star(null_hypothesis) + finite = np.isfinite(t_star) + n_valid = int(finite.sum()) + if n_valid < 2: + return _degenerate() + t_star_valid = t_star[finite] + t0 = (original_coef - null_hypothesis) / se_a + + # Strict-inequality tail counts, matching fwildclusterboot/boottest: a + # bootstrap statistic is counted only if it *exceeds* the observed one. In + # the fully-enumerated few-cluster case the all-(+1) / all-(-1) sign-vectors + # reproduce t* = +/- t0 exactly (the observed draw and its mirror); strict + # ">" excludes those boundary ties, as boottest does. The small relative + # guard (~1e-9) makes the exclusion robust to floating-point noise from the + # fast-form path so a true tie never sneaks in as a strict exceedance. + def _frac_gt(vals: np.ndarray, thresh: float) -> float: + return float(np.mean(vals > thresh + 1e-9 * max(1.0, abs(thresh)))) + + def _frac_lt(vals: np.ndarray, thresh: float) -> float: + return float(np.mean(vals < thresh - 1e-9 * max(1.0, abs(thresh)))) + + # p-value at the test null (two-tailed on |t*|, or equal-tailed). + if p_val_type == "two-tailed": + raw_p = _frac_gt(np.abs(t_star_valid), abs(t0)) + else: + p_low = _frac_lt(t_star_valid, t0) + p_up = _frac_gt(t_star_valid, t0) + raw_p = 2.0 * min(p_low, p_up) + # Floor the reported p-value to avoid an exact zero (a documented departure + # from boottest, which can report p == 0) — but NEVER let the floor reach + # the significance level. With very few valid draws 1/(n_valid+1) can exceed + # alpha, and flooring there would flip a bootstrap-significant result (0 + # outside the inverted CI) to "non-significant", re-creating the very + # p-vs-CI contradiction this estimator fixes. When the floor would cross + # alpha we report the raw p-value (which is < alpha in exactly those cases), + # so the significance verdict always agrees with the inverted CI. + floor = 1.0 / (n_valid + 1) + p_value = max(raw_p, floor) if floor < alpha else raw_p + p_value = float(min(1.0, p_value)) + + # ---- Confidence interval by test inversion ------------------------------ + # The CI is the set of nulls r not rejected at level alpha. The relevant + # rejection frequency is monotonically decreasing as r moves away from the + # point estimate, so each endpoint is found by outward bracketing + plain + # bisection — robust to the step-function nature of a finite bootstrap + # (unlike brentq, which assumes a continuous sign change). + def _reject_two_tailed(r: float) -> float: + t = _t_star(r) + t = t[np.isfinite(t)] + if t.size < 2: + return 0.0 + t0_r = (original_coef - r) / se_a + return _frac_gt(np.abs(t), abs(t0_r)) + + def _tail_freq(r: float, upper: bool) -> float: + t = _t_star(r) + t = t[np.isfinite(t)] + if t.size < 2: + return 0.0 + t0_r = (original_coef - r) / se_a + return _frac_gt(t, t0_r) if upper else _frac_lt(t, t0_r) + + def _bisect(f: Any, level: float, direction: int) -> float: + # f(center) >= level; search outward (direction -1 lower, +1 upper) for + # the crossing f(r) = level (f decreasing in |r - center|), then bisect. + center = original_coef + scale = se_a if se_a > 0 else 1.0 + step = scale + hi = center + direction * step + bracketed = False + for _ in range(64): + if f(hi) < level: + bracketed = True + break + step *= 2.0 + hi = center + direction * step + if not bracketed: + # The test never rejects arbitrarily far out: the inverted CI is + # genuinely unbounded on this side. Represent it with a signed + # infinity (NOT NaN) so the (se, t, p, CI) inference family stays + # internally consistent — 0 still lies inside an unbounded interval + # exactly when the test fails to reject it, preserving + # 0 ∈ CI ⟺ p ≥ alpha. + return float(direction) * np.inf + lo = center # f(lo) >= level, f(hi) < level + for _ in range(100): + mid = 0.5 * (lo + hi) + if f(mid) >= level: + lo = mid + else: + hi = mid + if abs(hi - lo) <= 1e-10 * max(1.0, abs(center)): + break + return 0.5 * (lo + hi) + + if p_val_type == "two-tailed": + ci_lower = _bisect(_reject_two_tailed, alpha, -1) + ci_upper = _bisect(_reject_two_tailed, alpha, +1) else: - # Degenerate bootstrap (insufficient valid draws): NaN-out the - # entire inference tuple. Downstream consumers (estimator-level - # `_run_wild_bootstrap_inference`) map these fields directly onto - # the result object; this guarantees the (se, t_stat, p_value, ci) - # quadruple moves together rather than reporting analytical t_stat - # with NaN se. - p_value = float("nan") - se_bootstrap = float("nan") - ci_lower = float("nan") - ci_upper = float("nan") - surfaced_t_stat = float("nan") + # equal-tailed: lower endpoint where the upper-tail frequency hits + # alpha/2; upper endpoint where the lower-tail frequency hits alpha/2. + ci_lower = _bisect(lambda r: _tail_freq(r, True), alpha / 2.0, -1) + ci_upper = _bisect(lambda r: _tail_freq(r, False), alpha / 2.0, +1) return WildBootstrapResults( - se=se_bootstrap, + se=se_a, p_value=p_value, - t_stat_original=surfaced_t_stat, - ci_lower=ci_lower, - ci_upper=ci_upper, + t_stat_original=t0, + ci_lower=float(ci_lower), + ci_upper=float(ci_upper), n_clusters=n_clusters, - n_bootstrap=n_bootstrap, + n_bootstrap=n_boot_eff, weight_type=weight_type, alpha=alpha, - bootstrap_distribution=bootstrap_coefs if return_distribution else None, + p_val_type=p_val_type, + bootstrap_distribution=t_star_valid if return_distribution else None, ) diff --git a/docs/doc-deps.yaml b/docs/doc-deps.yaml index b21c3259..1ed50392 100644 --- a/docs/doc-deps.yaml +++ b/docs/doc-deps.yaml @@ -865,6 +865,10 @@ sources: - path: docs/methodology/REGISTRY.md section: "Inference, safe_inference NaN gating" type: methodology + - path: docs/methodology/REGISTRY.md + section: "Wild cluster bootstrap (WCR)" + type: methodology + note: "wild_bootstrap_se: WCR null-imposition, enumeration, test-inversion CI, fwildclusterboot parity" - path: docs/practitioner_getting_started.rst section: "Step 4: Check Whether the Result Is Trustworthy" type: user_guide diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index 0b61d9ee..ad48999d 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -90,6 +90,60 @@ where τ is the ATT. - [x] Wild bootstrap supports Rademacher, Mammen, Webb weight distributions - [x] Formula interface parses `y ~ treated * post` correctly +### Wild cluster bootstrap (WCR) + +`inference="wild_bootstrap"` (with `cluster=`) runs the **Wild Cluster Restricted (WCR)** +bootstrap of Cameron, Gelbach & Miller (2008), matching the defaults of R's +`fwildclusterboot::boottest` (Roodman, MacKinnon, Nielsen & Webb 2019). Implemented in +`diff_diff.utils.wild_bootstrap_se`; used by `DifferenceInDifferences` and (inherited) +`TwoWayFixedEffects`. (`MultiPeriodDiD` does **not** support it — it falls back to analytical +inference and the inherited `p_val_type` is inert there.) + +*Algorithm (test of H₀: τ = r, default r = 0):* +1. **Impose the null** by dropping the interaction column and re-fitting the reduced model; + the restricted residuals are `ũ(r) = M₋ⱼ y − r·M₋ⱼ xⱼ` (linear in `r`, where `M₋ⱼ` is the + annihilator of the design without column `j`). Rank-deficient nuisance columns are dropped + via `solve_ols` so the identified ATT (and the bootstrap) stay finite. +2. **Cluster sign-vectors** `w_g`: for Rademacher weights with few clusters the full set of + `2**n_clusters` sign-vectors is **enumerated** (deterministic) when `2**n_clusters ≤ + n_bootstrap` **and `n_clusters ≤ 20`** (a guard against pathological memory use) — the same + full-enumeration trigger `boottest` uses (it switches to enumeration once `n_bootstrap` + reaches the number of possible draws `2**n_clusters`; verified empirically — for `G=10` it + samples at `B=1023` and enumerates at `B=1024`). Only `2**(n_clusters-1)` of those draws have + distinct `|t*|` (each draw and its all-signs-flipped mirror share `|t*|`), but the full set is + materialized and `n_bootstrap` is reported as `2**n_clusters`. Otherwise signs are sampled + (`seed`-reproducible). Webb/Mammen always sample (the sign-flip symmetry is Rademacher-specific). +3. **Bootstrap statistic** `t*(r) = (β*ⱼ − r) / se*`, where each draw refits the full design on + `y* = (y − ũ(r)) + ũ(r)·w` and `se*` is the CR1 cluster-robust SE of `β*ⱼ`. The observed + statistic is `t₀ = (τ̂ − r)/se_a` with the analytical CR1 SE `se_a`. +4. **p-value:** two-tailed `mean(|t*| > |t₀|)` or equal-tailed `2·min(mean(t*t₀))` + — strict `>`, matching boottest (the all-(+1)/all-(−1) enumerated draws reproduce ±t₀ and are + excluded as boundary ties). Floored at `1/(n_valid+1)` (see Deviation below). +5. **Confidence interval by test inversion:** the set of nulls `r` not rejected at `alpha`, + located by outward bracketing + bisection on the (monotone, step) rejection frequency. The CI + is therefore exactly consistent with the p-value (`0 ∈ CI ⟺ p ≥ alpha`) and may be asymmetric. +6. The reported `se` is `se_a` (analytical CR1); `p_val_type ∈ {"two-tailed" (default), + "equal-tailed"}`. CR1 uses the standard `(G/(G−1))((N−1)/(N−k))` correction, which cancels in + `|t*|` vs `|t₀|` so it affects only the reported SE, not the p-value or CI. + +*Verification — R parity:* validated against `fwildclusterboot::boottest()` defaults on a fixed +few-cluster golden (`benchmarks/R/generate_wild_cluster_boot_golden.R` → +`benchmarks/data/wild_cluster_boot_golden.json`), enumerated and deterministic on both sides. The +bootstrap t-distribution matches R to ~6e-14; `se`, `t₀`, and the (interior) p-value match exactly; +the inverted CI matches to ~1e-4 (bisection vs boottest's grid search). Also pinned against an +independent full-refit enumeration in `tests/test_wild_bootstrap.py::test_wcr_matches_independent_bruteforce`. + +- **Note:** The reported quantities mix inference families *by design*: `t_stat_original` is the + analytical Wald statistic `τ̂/se_a`, while `p_value` and `conf_int` come from the bootstrap test + inversion. This is intentional (it is exactly the boottest convention) and is not a deviation. +- **Deviation from R:** the p-value is floored at `1/(n_valid+1)` to avoid reporting an exact + `0` (which boottest can return under full enumeration of a strong effect) — but the floor is + applied **only when `1/(n_valid+1) < alpha`**. With very few valid draws the floor can exceed + `alpha`; applying it there would lift a bootstrap-significant p (0 outside the inverted CI) to + "non-significant", contradicting the CI, so in that regime the raw p (which boottest also + reports, possibly 0) is returned. The significance verdict therefore always matches the inverted + CI (`0 ∈ CI ⟺ p ≥ alpha`). + --- ## MultiPeriodDiD diff --git a/docs/tutorials/01_basic_did.ipynb b/docs/tutorials/01_basic_did.ipynb index 7ec7d1b2..2bb78583 100644 --- a/docs/tutorials/01_basic_did.ipynb +++ b/docs/tutorials/01_basic_did.ipynb @@ -410,11 +410,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "### Wild Cluster Bootstrap\n", - "\n", - "For better inference with few clusters (<50), use the wild cluster bootstrap." - ] + "source": "### Wild Cluster Bootstrap\n\nFor better inference with few clusters (<50), use the wild cluster bootstrap.\n\n`diff-diff` runs the **Wild Cluster Restricted (WCR)** bootstrap (Cameron–Gelbach–Miller 2008), matching the defaults of R's `fwildclusterboot::boottest`: it imposes the null, studentizes with the cluster-robust SE, and builds the confidence interval by **inverting the bootstrap test** — so the reported p-value and CI are always mutually consistent (0 lies outside the CI exactly when p < α). With few clusters and Rademacher weights, once `n_bootstrap` reaches the number of possible draws (i.e. when `2**n_clusters <= n_bootstrap` and `n_clusters <= 20`) it enumerates the full set of sign-vectors, so the result is deterministic (independent of `seed`) — the same enumeration switch `boottest` uses. Use `p_val_type=\"equal-tailed\"` for an equal-tailed CI instead of the default two-tailed one." }, { "cell_type": "code", @@ -521,4 +517,4 @@ }, "nbformat": 4, "nbformat_minor": 4 -} +} \ No newline at end of file diff --git a/tests/test_linalg.py b/tests/test_linalg.py index 2f1eabe2..301e91d0 100644 --- a/tests/test_linalg.py +++ b/tests/test_linalg.py @@ -531,6 +531,42 @@ def test_cluster_robust_symmetric(self, ols_data): np.testing.assert_array_almost_equal(vcov, vcov.T) + def test_cluster_count_check_precedes_saturated_guard(self): + """A 1-cluster cluster-robust request raises 'need >= 2 clusters' even on + a saturated design — the cluster-count validation must precede the + saturated (no residual DOF) NaN guard, not be masked by it.""" + # 1 cluster AND saturated (n == k): must still raise, not return NaN. + with pytest.raises(ValueError, match="at least 2 clusters"): + compute_robust_vcov(np.eye(2), np.zeros(2), np.zeros(2)) + + def test_saturated_multi_cluster_returns_nan(self): + """A saturated design (no residual DOF) with >= 2 clusters returns a NaN + vcov rather than raising ZeroDivisionError from the CR1 adjustment.""" + # 2 clusters, n == k == 4 (saturated 2x2 with one obs per cluster-period). + X = np.column_stack([np.ones(4), [0, 0, 1, 1], [0.0, 1.0, 0.0, 1.0], [0, 0, 0, 1.0]]) + vcov = compute_robust_vcov(X, np.zeros(4), np.array([0, 0, 1, 1])) + assert np.all(np.isnan(vcov)) + + def test_cluster_count_check_normalizes_series_cluster_ids(self): + """The early cluster-count validation must normalize `cluster_ids` to an + array before the zero-weight groupby, so a non-default-index pandas + Series grouper is not index-aligned against Series(weights) and + miscounted (which would wrongly raise 'need >= 2 clusters' on a valid + multi-cluster fit).""" + rng = np.random.default_rng(0) + n = 18 + cl_arr = np.repeat([0, 1, 2], 6) + cl = pd.Series(cl_arr, index=np.arange(500, 500 + n)) # non-default index + X = np.column_stack([np.ones(n), (cl_arr < 1).astype(float), np.tile([0.0, 1.0], 9)]) + y = X @ np.array([1.0, 0.5, 0.3]) + rng.normal(scale=0.4, size=n) + residuals = y - X @ np.linalg.lstsq(X, y, rcond=None)[0] + weights = np.ones(n) + weights[0] = 0.0 # one zero-weight obs; cluster 0 still has positive-weight obs + # All 3 clusters retain positive weight -> must not raise; vcov finite. + vcov = compute_robust_vcov(X, residuals, cl, weights=weights, weight_type="aweight") + assert vcov.shape == (3, 3) + assert np.all(np.isfinite(vcov)) + def test_numerical_instability_fallback_warns(self, ols_data): """Test that numerical instability in Rust backend triggers warning and fallback.""" from unittest.mock import patch @@ -2140,9 +2176,7 @@ def test_tracker_appended_once_on_truncation_only(self): rng = np.random.RandomState(3) n = 50 # Deficient: constant column collinear with intercept. - A_def = self._gram( - np.column_stack([np.ones(n), np.full(n, 2.0), rng.standard_normal(n)]) - ) + A_def = self._gram(np.column_stack([np.ones(n), np.full(n, 2.0), rng.standard_normal(n)])) tracker = [] _rank_guarded_inv(A_def, tracker=tracker) assert len(tracker) == 1 # exactly one condition-number sample diff --git a/tests/test_wild_bootstrap.py b/tests/test_wild_bootstrap.py index bcf6ae6b..ba8499a2 100644 --- a/tests/test_wild_bootstrap.py +++ b/tests/test_wild_bootstrap.py @@ -4,6 +4,9 @@ Tests the wild_bootstrap_se() function and its integration with DiD estimators. """ +import json +from pathlib import Path + import numpy as np import pandas as pd import pytest @@ -260,9 +263,13 @@ def test_reproducibility_with_seed(self, ols_components, ci_params): X, y, residuals, cluster_ids, coefficient_index=3, n_bootstrap=n_boot, seed=42 ) - assert results1.se == results2.se + # The reported `se` is the analytical cluster-robust (CR1) SE and the CI + # is from test inversion; both are reproducible only up to the + # bit-reproducibility of the underlying (possibly threaded BLAS / Rust) + # cluster-vcov solve (~1e-13). The p-value is count-based and exact. + assert results1.se == pytest.approx(results2.se, rel=1e-9) assert results1.p_value == results2.p_value - assert results1.ci_lower == results2.ci_lower + assert results1.ci_lower == pytest.approx(results2.ci_lower, rel=1e-9) def test_different_seeds_different_results(self, ols_components, ci_params): """Test different seeds give different results.""" @@ -277,8 +284,11 @@ def test_different_seeds_different_results(self, ols_components, ci_params): X, y, residuals, cluster_ids, coefficient_index=3, n_bootstrap=n_boot, seed=123 ) - # Should be different (not exactly equal) - assert results1.se != results2.se + # With 10 clusters and n_boot < 2**9, signs are sampled (not enumerated), + # so different seeds draw different bootstrap samples. The reported `se` + # is the analytical CR1 SE (seed-independent by construction); it is the + # inverted CI that reflects the random draws, so assert the CI differs. + assert (results1.ci_lower, results1.ci_upper) != (results2.ci_lower, results2.ci_upper) def test_different_weight_types(self, ols_components, ci_params): """Test all weight types produce valid results.""" @@ -400,7 +410,9 @@ def test_did_wild_bootstrap_reproducibility(self, clustered_did_data, ci_params) results2 = did2.fit(clustered_did_data, outcome="outcome", treatment="treated", time="post") - assert results1.se == results2.se + # se = analytical CR1 SE (reproducible up to the cluster-vcov solve's + # bit-reproducibility, ~1e-13 on threaded BLAS / Rust); p-value is exact. + assert results1.se == pytest.approx(results2.se, rel=1e-9) assert results1.p_value == results2.p_value def test_did_analytical_vs_bootstrap_att_same(self, clustered_did_data, ci_params): @@ -524,7 +536,7 @@ def test_summary_format(self, ols_components, ci_params): summary = results.summary() assert "Wild Cluster Bootstrap Results" in summary - assert "Bootstrap SE:" in summary + assert "Cluster-robust SE:" in summary assert "Bootstrap p-value:" in summary assert "Number of clusters:" in summary @@ -707,8 +719,14 @@ def test_few_clusters_confidence_intervals_valid(self, few_cluster_data, ci_para lower, upper = results.conf_int assert lower < upper - # CI should contain the point estimate - assert lower < results.att < upper + # The CI is obtained by inverting the bootstrap test, so it need NOT be + # centered on (or even contain) the point estimate under inversion with + # very few clusters. The load-bearing guarantee is internal consistency + # between the test and the interval: 0 lies outside the CI iff the + # bootstrap p-value rejects at alpha. + zero_in_ci = lower <= 0.0 <= upper + rejects = results.p_value < did.alpha + assert zero_in_ci != rejects # ============================================================================= @@ -739,128 +757,726 @@ def _make_ols_components(self, n: int = 40): y = X @ np.array([1.0, 0.5]) + rng.normal(scale=0.1, size=n) return X, y, cluster_ids - def test_degenerate_n_valid_zero_returns_all_nan(self, monkeypatch): - """When every bootstrap draw is singular, se / t_stat / p_value / CI - are all NaN (full inference quadruple moves together). + def test_degenerate_nan_restricted_fit_returns_all_nan(self, monkeypatch): + """When the restricted residualization yields non-finite values (so + every bootstrap statistic is NaN, n_valid == 0), se / t_stat / p_value / + CI are all NaN together — the full inference quadruple moves as a unit + (feedback_bootstrap_nan_on_invalid_contract). """ from diff_diff import utils as utils_mod + from tests.conftest import assert_nan_inference X, y, cluster_ids = self._make_ols_components() orig_solve = utils_mod._solve_ols_linalg - call_count = {"n": 0} def fake_solve(X_, y_, cluster_ids=None, return_vcov=True, return_fitted=False, **kw): - call_count["n"] += 1 - # Calls 1 (original) and 2 (restricted) succeed normally. - if call_count["n"] <= 2: - return orig_solve( - X_, - y_, - cluster_ids=cluster_ids, - return_vcov=return_vcov, - return_fitted=return_fitted, - **kw, - ) - # Bootstrap draws: force a singular vcov so se_star == 0. - coefs, residuals, _ = orig_solve( - X_, - y_, - cluster_ids=cluster_ids, - return_vcov=True, - **kw, - ) - singular_vcov = np.zeros((X_.shape[1], X_.shape[1])) + # The original cluster-robust fit (return_fitted=False) passes + # through. The two restricted residualization fits ask for fitted + # values (return_fitted=True) — poison those with NaN so the + # restricted residuals, and hence every bootstrap t*, become + # non-finite (n_valid == 0). if return_fitted: - return coefs, residuals, X_ @ coefs, singular_vcov - return coefs, residuals, singular_vcov + n_, k_ = X_.shape[0], X_.shape[1] + return np.zeros(k_), np.zeros(n_), np.full(n_, np.nan), None + return orig_solve(X_, y_, cluster_ids=cluster_ids, return_vcov=return_vcov, **kw) monkeypatch.setattr(utils_mod, "_solve_ols_linalg", fake_solve) - # Compute residuals on the original design (needed for the helper signature). - from diff_diff.linalg import solve_ols as _solve_orig - - coefs0, residuals0, _ = _solve_orig(X, y, cluster_ids=cluster_ids) results = utils_mod.wild_bootstrap_se( X=X, y=y, - residuals=residuals0, + residuals=y - y.mean(), cluster_ids=cluster_ids, coefficient_index=1, n_bootstrap=20, seed=1, ) - # All five user-surface fields must be NaN together. - assert np.isnan(results.se), f"se should be NaN, got {results.se}" - assert np.isnan(results.t_stat_original), ( - f"t_stat_original should be NaN under degenerate bootstrap " - f"(analytical t-stat must not surface alongside NaN se), " - f"got {results.t_stat_original}" - ) - assert np.isnan(results.p_value), f"p_value should be NaN, got {results.p_value}" - assert np.isnan(results.ci_lower), f"ci_lower should be NaN, got {results.ci_lower}" - assert np.isnan(results.ci_upper), f"ci_upper should be NaN, got {results.ci_upper}" - - def test_degenerate_single_valid_draw_returns_all_nan(self, monkeypatch): - """When exactly one bootstrap draw is finite (insufficient for - ddof=1 std), the full inference tuple is NaN — we don't return a - finite percentile CI on a single-point sample with NaN se. + assert_nan_inference( + { + "se": results.se, + "t_stat": results.t_stat_original, + "p_value": results.p_value, + "conf_int": (results.ci_lower, results.ci_upper), + } + ) + + def test_degenerate_nonfinite_analytical_se_returns_all_nan(self, monkeypatch): + """When the analytical cluster-robust SE of the coefficient is + non-finite (e.g. an unidentified / rank-deficient estimand), no + bootstrap is attempted and the full inference quadruple is NaN. """ from diff_diff import utils as utils_mod + from tests.conftest import assert_nan_inference X, y, cluster_ids = self._make_ols_components() orig_solve = utils_mod._solve_ols_linalg - call_count = {"n": 0} def fake_solve(X_, y_, cluster_ids=None, return_vcov=True, return_fitted=False, **kw): - call_count["n"] += 1 - if call_count["n"] <= 2: - return orig_solve( - X_, - y_, - cluster_ids=cluster_ids, - return_vcov=return_vcov, - return_fitted=return_fitted, - **kw, - ) - # Bootstrap calls start at index 3. Let the FIRST bootstrap draw - # (call_count == 3) succeed; force every subsequent draw to be - # singular. n_valid ends at exactly 1. - if call_count["n"] == 3: - return orig_solve( - X_, - y_, - cluster_ids=cluster_ids, - return_vcov=return_vcov, - return_fitted=return_fitted, - **kw, + # Poison the variance of the coefficient of interest on the original + # cluster-robust fit (cluster_ids set, vcov requested, no fitted). + if cluster_ids is not None and return_vcov and not return_fitted: + coefs, residuals, vcov = orig_solve( + X_, y_, cluster_ids=cluster_ids, return_vcov=True, **kw ) - coefs, residuals, _ = orig_solve( + vcov = np.array(vcov, dtype=float) + vcov[1, 1] = np.nan + return coefs, residuals, vcov + return orig_solve( X_, y_, cluster_ids=cluster_ids, - return_vcov=True, + return_vcov=return_vcov, + return_fitted=return_fitted, **kw, ) - singular_vcov = np.zeros((X_.shape[1], X_.shape[1])) - if return_fitted: - return coefs, residuals, X_ @ coefs, singular_vcov - return coefs, residuals, singular_vcov monkeypatch.setattr(utils_mod, "_solve_ols_linalg", fake_solve) - # Compute residuals on the original design (needed for the helper signature). - from diff_diff.linalg import solve_ols as _solve_orig - - coefs0, residuals0, _ = _solve_orig(X, y, cluster_ids=cluster_ids) results = utils_mod.wild_bootstrap_se( X=X, y=y, - residuals=residuals0, + residuals=y - y.mean(), cluster_ids=cluster_ids, coefficient_index=1, n_bootstrap=20, seed=1, ) - assert np.isnan(results.se) - assert np.isnan(results.t_stat_original) - assert np.isnan(results.p_value) - assert np.isnan(results.ci_lower) - assert np.isnan(results.ci_upper) + assert_nan_inference( + { + "se": results.se, + "t_stat": results.t_stat_original, + "p_value": results.p_value, + "conf_int": (results.ci_lower, results.ci_upper), + } + ) + + +# ============================================================================= +# Correctness & consistency (regression tests for issue #543) +# ============================================================================= + + +def _make_clustered(n_clusters, att, seed, obs_per_cluster=10): + """Clustered 2x2 DiD data with a known true ATT (cluster-level treatment).""" + rng = np.random.default_rng(seed) + rows = [] + for c in range(n_clusters): + is_treated = c < n_clusters // 2 + cluster_effect = rng.normal(0, 2) + for _ in range(obs_per_cluster): + for period in (0, 1): + y = 10.0 + cluster_effect + if period == 1: + y += 3.0 + if is_treated and period == 1: + y += att + y += rng.normal(0, 0.5) + rows.append( + {"cluster": c, "treated": int(is_treated), "post": period, "outcome": y} + ) + return pd.DataFrame(rows) + + +class TestWildBootstrapCorrectness: + """Regression tests for the WCR null-imposition fix (issue #543). + + The original bug never imposed the null, so the bootstrap t* distribution + was centered on the estimate instead of 0 and the p-value was ~0.5-0.86 + regardless of significance, contradicting a CI that (coincidentally) + excluded 0. These tests pin the corrected behaviour: a strong true effect + is significant, and the p-value and CI are always mutually consistent. + """ + + def test_strong_effect_is_significant(self, clustered_did_data, ci_params): + """A strong true effect (ATT=3, 10 clusters) must be significant. + + On the buggy implementation this returned p ~= 0.85 (non-significant) + while the CI excluded 0 — the exact #543 contradiction. + """ + n_boot = ci_params.bootstrap(999, min_n=99) + did = DifferenceInDifferences( + cluster="cluster", inference="wild_bootstrap", n_bootstrap=n_boot, seed=42 + ) + res = did.fit(clustered_did_data, outcome="outcome", treatment="treated", time="post") + lower, upper = res.conf_int + assert res.p_value < 0.05, f"strong effect should be significant, got p={res.p_value}" + assert not (lower <= 0.0 <= upper), "CI should exclude 0 for a strong effect" + + def test_p_value_ci_consistency(self, ci_params): + """0 outside the CI iff the bootstrap p-value rejects at alpha, across a + range of effect sizes / seeds (the property the bug violated).""" + n_boot = ci_params.bootstrap(999, min_n=99) + for att, seed in [(0.0, 1), (0.2, 2), (0.5, 3), (2.5, 4)]: + df = _make_clustered(20, att, seed) + did = DifferenceInDifferences( + cluster="cluster", inference="wild_bootstrap", n_bootstrap=n_boot, seed=seed + ) + res = did.fit(df, outcome="outcome", treatment="treated", time="post") + lower, upper = res.conf_int + zero_in_ci = lower <= 0.0 <= upper + rejects = res.p_value < did.alpha + assert zero_in_ci != rejects, ( + f"inconsistent p/CI at att={att}, seed={seed}: " + f"p={res.p_value}, CI=[{lower}, {upper}]" + ) + + def test_true_null_not_significant(self, ci_params): + """Under a true null (ATT=0) the test should not reject (p not tiny, + 0 inside the CI).""" + n_boot = ci_params.bootstrap(999, min_n=99) + df = _make_clustered(20, 0.0, seed=7) + did = DifferenceInDifferences( + cluster="cluster", inference="wild_bootstrap", n_bootstrap=n_boot, seed=7 + ) + res = did.fit(df, outcome="outcome", treatment="treated", time="post") + lower, upper = res.conf_int + assert res.p_value > 0.05 + assert lower <= 0.0 <= upper + + def test_enumeration_is_deterministic(self): + """With few clusters (Rademacher), the full sign-vector set is + enumerated, so results are independent of the seed and n_bootstrap is + reported as 2**n_clusters.""" + df = _make_clustered(6, 2.5, seed=1) # 6 clusters -> 2**5 = 32 <= 999 -> enumerate + r1 = DifferenceInDifferences( + cluster="cluster", inference="wild_bootstrap", n_bootstrap=999, seed=1 + ).fit(df, outcome="outcome", treatment="treated", time="post") + r2 = DifferenceInDifferences( + cluster="cluster", inference="wild_bootstrap", n_bootstrap=999, seed=999 + ).fit(df, outcome="outcome", treatment="treated", time="post") + assert r1.n_bootstrap == 2**6 + assert r1.p_value == r2.p_value + # CI is reproducible up to the cluster-vcov solve's bit-reproducibility + # (~1e-13 on threaded BLAS / Rust); seed-independence is the point. + assert r1.conf_int == pytest.approx(r2.conf_int, rel=1e-9) + + def test_se_matches_analytical_cluster_robust(self, clustered_did_data, ci_params): + """The reported wild-bootstrap SE is the analytical cluster-robust (CR1) + SE — identical to the analytical cluster-robust fit.""" + n_boot = ci_params.bootstrap(199, min_n=49) + analytical = DifferenceInDifferences(cluster="cluster").fit( + clustered_did_data, outcome="outcome", treatment="treated", time="post" + ) + boot = DifferenceInDifferences( + cluster="cluster", inference="wild_bootstrap", n_bootstrap=n_boot, seed=42 + ).fit(clustered_did_data, outcome="outcome", treatment="treated", time="post") + assert boot.se == pytest.approx(analytical.se, rel=1e-10) + + def test_many_clusters_ci_comparable_to_analytical(self, ci_params): + """With many clusters the inverted wild CI is comparable to the + analytical cluster-robust CI (not wildly different, as the old + percentile-of-coefficients CI could be).""" + n_boot = ci_params.bootstrap(999, min_n=199) + df = _make_clustered(30, 1.0, seed=11) + analytical = DifferenceInDifferences(cluster="cluster").fit( + df, outcome="outcome", treatment="treated", time="post" + ) + boot = DifferenceInDifferences( + cluster="cluster", inference="wild_bootstrap", n_bootstrap=n_boot, seed=11 + ).fit(df, outcome="outcome", treatment="treated", time="post") + a_half = (analytical.conf_int[1] - analytical.conf_int[0]) / 2 + b_half = (boot.conf_int[1] - boot.conf_int[0]) / 2 + assert 0.7 < (b_half / a_half) < 1.5 + + def test_equal_tailed_consistent(self, ci_params): + """Equal-tailed p_val_type yields a valid, internally consistent CI.""" + n_boot = ci_params.bootstrap(999, min_n=99) + df = _make_clustered(20, 0.6, seed=5) + res = DifferenceInDifferences( + cluster="cluster", + inference="wild_bootstrap", + n_bootstrap=n_boot, + seed=5, + p_val_type="equal-tailed", + ).fit(df, outcome="outcome", treatment="treated", time="post") + lower, upper = res.conf_int + assert lower < upper + assert (not (lower <= 0.0 <= upper)) == (res.p_value < 0.05) + + def test_p_val_type_round_trips_in_params(self): + """p_val_type is exposed via get_params / set_params and round-trips.""" + est = DifferenceInDifferences( + cluster="cluster", inference="wild_bootstrap", p_val_type="equal-tailed" + ) + params = est.get_params() + assert params["p_val_type"] == "equal-tailed" + clone = DifferenceInDifferences(**params) + assert clone.p_val_type == "equal-tailed" + clone.set_params(p_val_type="two-tailed") + assert clone.p_val_type == "two-tailed" + + def test_invalid_p_val_type_raises(self, ols_components): + """An unrecognized p_val_type raises ValueError.""" + X, y, residuals, cluster_ids = ols_components + with pytest.raises(ValueError, match="p_val_type must be one of"): + wild_bootstrap_se(X, y, residuals, cluster_ids, coefficient_index=3, p_val_type="bogus") + + +# ============================================================================= +# Cross-check: independent brute-force WCR == production (== fwildclusterboot) +# ============================================================================= + + +def _brute_force_wcr_refit(X, y, cluster_ids, coef_index, null=0.0): + """Independent textbook WCR bootstrap, computed separately from the + production fast-form path. + + Imposes the null by dropping the coefficient's column, enumerates ALL + ``2**G`` Rademacher sign-vectors, does a FULL OLS refit per draw, forms the + CR1 cluster-robust SE, and counts strict exceedances ``|t*| > |t0|``. This + is the same WCR statistic ``fwildclusterboot::boottest`` computes — the two + were verified to agree on the bootstrap t-distribution to ~6e-14 — so a + match here certifies R-parity without requiring R at test time. + + Returns ``(t0, se_a, p_raw)`` where ``p_raw(r)`` is the two-tailed bootstrap + p-value at null ``r`` (unfloored). + """ + from itertools import product + + uniq = np.unique(cluster_ids) + G = len(uniq) + n, k = X.shape + cidx = [np.where(cluster_ids == c)[0] for c in uniq] + corr = (G / (G - 1)) * ((n - 1) / (n - k)) + XtX_inv = np.linalg.inv(X.T @ X) + a = X @ XtX_inv[:, coef_index] + + def cr1(resid): + return np.sqrt(corr * sum((a[idx] @ resid[idx]) ** 2 for idx in cidx)) + + b = XtX_inv @ X.T @ y + se_a = cr1(y - X @ b) + t0 = (b[coef_index] - null) / se_a + Xr = np.delete(X, coef_index, axis=1) + Pr = Xr @ np.linalg.inv(Xr.T @ Xr) @ Xr.T + cl_pos = {c: i for i, c in enumerate(uniq)} + cl_of = np.array([cl_pos[c] for c in cluster_ids]) + signs = np.array(list(product([-1.0, 1.0], repeat=G))) + + def p_raw(r): + yr = y - X[:, coef_index] * r + u = yr - Pr @ yr + fit = (Pr @ yr) + X[:, coef_index] * r # restricted fitted in original-y space + tb = [] + for w in signs: + ystar = fit + u * w[cl_of] + bs = XtX_inv @ X.T @ ystar + tb.append((bs[coef_index] - r) / cr1(ystar - X @ bs)) + tb = np.array(tb) + t0r = (b[coef_index] - r) / se_a + return float(np.mean(np.abs(tb) > abs(t0r) + 1e-9 * max(1.0, abs(t0r)))) + + return t0, se_a, p_raw + + +def test_wcr_matches_independent_bruteforce(): + """Production WCR == independent full-refit enumeration (== boottest). + + A 6-cluster, noisy design gives an interior (non-floored) bootstrap p-value + so the exact p, SE, t-stat, and inverted CI can all be checked against the + independent reference. + """ + rng = np.random.default_rng(3) + rows = [] + for c in range(6): + is_treated = c < 3 + cluster_effect = rng.normal(0, 1.5) + for _ in range(8): + for period in (0, 1): + y = 4.0 + cluster_effect + 1.0 * period + if is_treated and period == 1: + y += 0.3 # weak effect + heavy noise -> interior p + y += rng.normal(0, 4.0) + rows.append( + {"cluster": c, "treated": int(is_treated), "post": period, "outcome": y} + ) + df = pd.DataFrame(rows) + X = np.column_stack([np.ones(len(df)), df.treated, df.post, df.treated * df.post]) + y = df.outcome.to_numpy() + cl = df.cluster.to_numpy() + j = 3 + beta = np.linalg.lstsq(X, y, rcond=None)[0] + + res = wild_bootstrap_se(X, y, y - X @ beta, cl, coefficient_index=j, n_bootstrap=999, seed=3) + t0, se_a, p_raw = _brute_force_wcr_refit(X, y, cl, j) + + # Core statistic: exact agreement with the independent reference. + assert res.t_stat_original == pytest.approx(t0, rel=1e-9) + assert res.se == pytest.approx(se_a, rel=1e-9) + assert res.p_value == pytest.approx(p_raw(0.0), abs=1e-9) # interior p -> floor inactive + + # CI by test inversion: the brute-force p-value at the production endpoints + # sits at the alpha crossing (granular to ~1/2**G with full enumeration). + assert abs(p_raw(res.ci_lower) - 0.05) <= 2.0 / 2**6 + assert abs(p_raw(res.ci_upper) - 0.05) <= 2.0 / 2**6 + + +# ============================================================================= +# R parity — fwildclusterboot::boottest (skip-guarded golden) +# ============================================================================= +# +# Pins wild_bootstrap_se against R `fwildclusterboot::boottest()` on a fixed +# few-cluster golden (G=6, fully enumerated -> deterministic both sides). +# Regenerate with `Rscript benchmarks/R/generate_wild_cluster_boot_golden.R`. +# R is NOT required at test time; the golden JSON is checked in. + +_WCB_GOLDEN = Path(__file__).parent.parent / "benchmarks" / "data" / "wild_cluster_boot_golden.json" +_WCB_DATA = Path(__file__).parent.parent / "benchmarks" / "data" / "wild_cluster_boot_test_data.csv" +_WCB_AVAILABLE = _WCB_GOLDEN.is_file() and _WCB_DATA.is_file() + + +@pytest.mark.skipif(not _WCB_AVAILABLE, reason="fwildclusterboot golden fixture not present") +class TestWildBootstrapParityR: + """Cross-language parity with R fwildclusterboot::boottest (WCR defaults).""" + + @pytest.fixture + def golden(self): + with _WCB_GOLDEN.open() as f: + return json.load(f) + + @pytest.fixture + def design(self): + df = pd.read_csv(_WCB_DATA) + X = np.column_stack([np.ones(len(df)), df.treated, df.post, df.treated * df.post]) + return X, df.y.to_numpy(), df.cluster.to_numpy() + + def _fit(self, design, p_val_type): + X, y, cl = design + beta = np.linalg.lstsq(X, y, rcond=None)[0] + return wild_bootstrap_se( + X, + y, + y - X @ beta, + cl, + coefficient_index=3, + n_bootstrap=99999, + weight_type="rademacher", + seed=1, + p_val_type=p_val_type, + ) + + def test_se_matches_r(self, golden, design): + # Reported SE is the analytical CR1 clustered SE (== feols se()). + assert self._fit(design, "two-tailed").se == pytest.approx(golden["se_cr1"], abs=1e-6) + + def test_tstat_matches_r(self, golden, design): + assert self._fit(design, "two-tailed").t_stat_original == pytest.approx( + golden["two_tailed"]["t_stat"], abs=1e-6 + ) + + def test_two_tailed_p_value_matches_r(self, golden, design): + # Interior p (by construction) -> the 1/(B+1) floor is inactive, so the + # WCR p-value matches boottest's strict-exceedance count exactly. + assert self._fit(design, "two-tailed").p_value == pytest.approx( + golden["two_tailed"]["p_val"], abs=1e-9 + ) + + def test_two_tailed_ci_matches_r(self, golden, design): + res = self._fit(design, "two-tailed") + lo, hi = golden["two_tailed"]["conf_int"] + # Inversion convention differs (bisection vs boottest's grid); agree ~1e-4. + assert res.ci_lower == pytest.approx(lo, abs=5e-4) + assert res.ci_upper == pytest.approx(hi, abs=5e-4) + + def test_equal_tailed_matches_r(self, golden, design): + res = self._fit(design, "equal-tailed") + lo, hi = golden["equal_tailed"]["conf_int"] + assert res.p_value == pytest.approx(golden["equal_tailed"]["p_val"], abs=1e-9) + assert res.ci_lower == pytest.approx(lo, abs=5e-4) + assert res.ci_upper == pytest.approx(hi, abs=5e-4) + + +# ============================================================================= +# Floor / low-effective-draw consistency + p_val_type result propagation +# ============================================================================= + + +@pytest.mark.parametrize( + "n_clusters, n_bootstrap, att, seed", + [ + (6, 9, 3.0, 1), # RNG path, 9 draws -> floor 1/10 = 0.10 > alpha + (4, 999, 4.0, 2), # enumerated 2**4 = 16 draws -> floor 1/17 ~= 0.059 > alpha + ], +) +def test_low_draw_floor_preserves_consistency(n_clusters, n_bootstrap, att, seed): + """With very few effective draws the p-value floor 1/(n_valid+1) can exceed + alpha; the floor must NOT then flip a bootstrap-significant result (0 outside + the inverted CI) to non-significant. The verdict must always match the CI. + """ + import warnings + + df = _make_clustered(n_clusters, att, seed) + did = DifferenceInDifferences( + cluster="cluster", inference="wild_bootstrap", n_bootstrap=n_bootstrap, seed=seed + ) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") # few-cluster warning + res = did.fit(df, outcome="outcome", treatment="treated", time="post") + lower, upper = res.conf_int + zero_in_ci = lower <= 0.0 <= upper + rejects = res.p_value < did.alpha + # The exact bug the reviewer flagged: p >= alpha while the CI excludes 0. + assert zero_in_ci != rejects, ( + f"p/CI inconsistent at G={n_clusters}, B={n_bootstrap}: " + f"p={res.p_value}, CI=[{lower}, {upper}]" + ) + + +def test_p_val_type_surfaced_on_results(): + """p_val_type is carried on WildBootstrapResults and the high-level DiD + result (None for analytical inference), and appears in summary()/to_dict().""" + df = _make_clustered(20, 0.6, seed=5) + for ptype in ("two-tailed", "equal-tailed"): + res = DifferenceInDifferences( + cluster="cluster", inference="wild_bootstrap", n_bootstrap=999, seed=5, p_val_type=ptype + ).fit(df, outcome="outcome", treatment="treated", time="post") + assert res.p_val_type == ptype + assert res.to_dict()["p_val_type"] == ptype + assert "Test type:" in res.summary() + # Analytical inference does not set p_val_type. + res_a = DifferenceInDifferences(cluster="cluster").fit( + df, outcome="outcome", treatment="treated", time="post" + ) + assert res_a.p_val_type is None + + +def test_wild_bootstrap_results_carries_p_val_type(): + """The low-level WildBootstrapResults dataclass exposes p_val_type.""" + df = _make_clustered(20, 0.6, seed=5) + X = np.column_stack([np.ones(len(df)), df.treated, df.post, df.treated * df.post]) + y = df.outcome.to_numpy() + beta = np.linalg.lstsq(X, y, rcond=None)[0] + res = wild_bootstrap_se( + X, + y, + y - X @ beta, + df.cluster.to_numpy(), + coefficient_index=3, + n_bootstrap=999, + seed=5, + p_val_type="equal-tailed", + ) + assert res.p_val_type == "equal-tailed" + assert "equal-tailed" in res.summary() + + +def test_twfe_wild_bootstrap_p_val_type_propagates(): + """TwoWayFixedEffects (inherits the WCR path) surfaces p_val_type on its + result and stays p/CI consistent for the equal-tailed test.""" + rng = np.random.default_rng(7) + rows = [] + for c in range(20): + is_treated = c < 10 + ce = rng.normal(0, 2) + for o in range(10): + for period in (0, 1): + y = 10 + ce + 1.0 * period + (0.6 if (is_treated and period == 1) else 0) + y += rng.normal(0, 0.5) + rows.append( + { + "cluster": c, + "unit": c * 10 + o, + "treated": int(is_treated), + "post": period, + "outcome": y, + } + ) + df = pd.DataFrame(rows) + res = TwoWayFixedEffects( + cluster="cluster", + inference="wild_bootstrap", + n_bootstrap=999, + seed=7, + p_val_type="equal-tailed", + ).fit(df, outcome="outcome", treatment="treated", time="post", unit="unit") + assert res.p_val_type == "equal-tailed" + assert res.to_dict()["p_val_type"] == "equal-tailed" + lower, upper = res.conf_int + assert lower < upper + assert (not (lower <= 0.0 <= upper)) == (res.p_value < 0.05) + + +# ============================================================================= +# Degenerate-design robustness: no crash, no mixed finite/NaN inference +# ============================================================================= + + +def test_saturated_design_returns_degenerate_not_crash(): + """A saturated 2x2 DiD (G=2, one obs per cluster-period -> n == k, no + residual DOF) must NOT raise ZeroDivisionError from the CR1 small-sample + adjustment; it returns the all-or-nothing NaN inference contract.""" + import warnings + + df = pd.DataFrame( + [ + { + "cluster": c, + "treated": c, + "post": p, + "outcome": 5.0 + 2 * p + (1.0 if (c and p) else 0), + } + for c in range(2) + for p in (0, 1) + ] + ) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res = DifferenceInDifferences( + cluster="cluster", inference="wild_bootstrap", n_bootstrap=99, seed=1 + ).fit(df, outcome="outcome", treatment="treated", time="post") + # Full inference family is NaN together (no raw exception, no mixed output). + assert np.isnan(res.se) + assert np.isnan(res.p_value) + assert np.isnan(res.conf_int[0]) and np.isnan(res.conf_int[1]) + + +def test_near_zero_se_no_mixed_finite_nan_ci(): + """A near-degenerate two-cluster design (tiny CR1 SE) must not return finite + se/p with NaN CI endpoints. An unbounded inverted CI is represented with + +/-inf (NOT NaN), keeping 0 in CI <=> p >= alpha.""" + import warnings + + rng = np.random.default_rng(0) + rows_X, y, cl = [], [], [] + for c in range(2): + tr = 1.0 if c < 1 else 0.0 + ce = rng.normal(0, 1) + for _ in range(2): + for p in (0.0, 1.0): + rows_X.append([1.0, tr, p, tr * p]) + y.append(5 + ce + 2 * p + (1.0 if (tr and p) else 0) + rng.normal(0, 0.5)) + cl.append(c) + X = np.array(rows_X) + y = np.array(y) + cl = np.array(cl) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res = wild_bootstrap_se( + X, + y, + y - X @ np.linalg.lstsq(X, y, rcond=None)[0], + cl, + coefficient_index=3, + n_bootstrap=999, + seed=1, + ) + finite_point = np.isfinite(res.se) or np.isfinite(res.p_value) + nan_ci = np.isnan(res.ci_lower) or np.isnan(res.ci_upper) + assert not (finite_point and nan_ci), "mixed finite point estimate with NaN CI endpoints" + # Consistency holds whether the result is degenerate or an unbounded interval. + if np.isfinite(res.p_value): + zero_in_ci = res.ci_lower <= 0.0 <= res.ci_upper + assert zero_in_ci == (res.p_value >= 0.05) + + +# ============================================================================= +# Enumeration trigger parity with fwildclusterboot (B >= 2**G) +# ============================================================================= + + +def _make_clustered_g10(n_bootstrap, seed): + """G=10 design + a wild-bootstrap fit, for the enumeration-boundary tests.""" + rng = np.random.default_rng(99) # data fixed; only the bootstrap seed varies + rows = [] + for c in range(10): + is_treated = c < 5 + ce = rng.normal(0, 1.5) + for _ in range(6): + for p in (0, 1): + y = 4 + ce + 1.0 * p + (0.5 if (is_treated and p == 1) else 0) + rng.normal(0, 2.0) + rows.append({"cluster": c, "treated": int(is_treated), "post": p, "outcome": y}) + df = pd.DataFrame(rows) + return DifferenceInDifferences( + cluster="cluster", inference="wild_bootstrap", n_bootstrap=n_bootstrap, seed=seed + ).fit(df, outcome="outcome", treatment="treated", time="post") + + +def test_enumeration_trigger_matches_boottest_boundary(): + """Enumeration fires only when n_bootstrap >= 2**n_clusters, matching + fwildclusterboot::boottest (verified: G=10 samples at B=1023, enumerates at + B=1024). Below the threshold the result is seed-dependent (sampled); at/above + it is deterministic with n_bootstrap == 2**n_clusters.""" + # Below 2**10 = 1024: sampled -> seed-dependent, reported n_bootstrap == B. + below_a = _make_clustered_g10(999, seed=1) + below_b = _make_clustered_g10(999, seed=7) + assert below_a.n_bootstrap == 999 + assert below_a.conf_int != below_b.conf_int # different seeds -> different draws + + # At/above 2**10: enumerated -> deterministic, reported n_bootstrap == 1024. + at_a = _make_clustered_g10(1024, seed=1) + at_b = _make_clustered_g10(1024, seed=7) + assert at_a.n_bootstrap == 2**10 + assert at_a.p_value == at_b.p_value + # Seed-independent up to the cluster-vcov solve's bit-reproducibility + # (~1e-13 on threaded BLAS / Rust); far below the sampled-draw scale (~1e-2). + assert at_a.conf_int == pytest.approx(at_b.conf_int, rel=1e-9) + + +def test_single_regressor_design_does_not_crash(): + """A degenerate single-regressor design (the reduced model has zero columns) + must not raise IndexError; the restricted residuals are the variables + themselves.""" + import warnings + + rng = np.random.default_rng(3) + cluster_ids = np.repeat(np.arange(6), 8) + X = rng.normal(size=(48, 1)) # a single regressor, no intercept + y = X[:, 0] * 0.5 + rng.normal(scale=0.5, size=48) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res = wild_bootstrap_se( + X, + y, + y - X @ np.linalg.lstsq(X, y, rcond=None)[0], + cluster_ids, + coefficient_index=0, + n_bootstrap=99, + seed=1, + ) + # No exception; whatever inference is returned is self-consistent. + assert isinstance(res, WildBootstrapResults) + if np.isfinite(res.p_value): + assert (res.ci_lower <= 0.0 <= res.ci_upper) == (res.p_value >= 0.05) + + +def test_wild_bootstrap_rank_deficient_storage_vcov_does_not_crash(): + """The estimator's stored cluster-robust vcov is computed through the + rank-aware solve_ols path, so a wild-bootstrap fit on a rank-deficient + full-dummy design (here a fixed-effect dummy that EXACTLY duplicates the + treatment indicator) does not crash, and the stored vcov is NaN-expanded for + the dropped column rather than raising on the singular X'X. Regression for + the storage-vcov gap in `_run_wild_bootstrap_inference` (the bootstrap helper + already handled rank deficiency internally). + """ + import warnings + + rng = np.random.default_rng(0) + rows = [] + for u in range(16): + treated = int(u < 8) + fe = "T" if treated else "C" # the 'T' dummy == treated exactly -> singular X'X + for period in (0, 1): + y = 5 + 2 * period + (1.5 if (treated and period) else 0) + rng.normal(0, 0.5) + rows.append( + { + "unit": u, + "fe": fe, + "cluster": u % 8, + "treated": treated, + "post": period, + "outcome": y, + } + ) + df = pd.DataFrame(rows) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") # expected rank-deficient drop warning + res = DifferenceInDifferences( + cluster="cluster", inference="wild_bootstrap", n_bootstrap=99, seed=1 + ).fit(df, outcome="outcome", treatment="treated", time="post", fixed_effects=["fe"]) + # ATT identified, bootstrap inference finite, no exception. + assert np.isfinite(res.att) + assert np.isfinite(res.se) and res.se > 0 + assert np.isfinite(res.p_value) + assert np.isfinite(res.conf_int[0]) and np.isfinite(res.conf_int[1]) + # Stored vcov is rank-aware (NaN-expanded for the dropped column), not +/-inf. + assert res.vcov is not None + assert np.any(np.isnan(res.vcov)) + assert not np.any(np.isinf(res.vcov))