diff --git a/METHODOLOGY_REVIEW.md b/METHODOLOGY_REVIEW.md index 74ce9f7c..82bbb9e9 100644 --- a/METHODOLOGY_REVIEW.md +++ b/METHODOLOGY_REVIEW.md @@ -82,7 +82,7 @@ The catalog grew incrementally over several quarters, so formats vary across the | HonestDiD | `honest_did.py` | `HonestDiD` package | **Complete** | 2026-04-01 | | PreTrendsPower | `pretrends.py` | `pretrends` package | **Complete** | 2026-05-19 | | PowerAnalysis | `power.py` | `pwr` / `DeclareDesign` | **Complete** | 2026-05-31 | -| PlaceboTests | `diagnostics.py` | (no canonical reference) | **In Progress** | — | +| PlaceboTests | `diagnostics.py` | Bertrand-Duflo-Mullainathan (2004) (placebo laws); no canonical R | **In Progress** | — | ### Cross-Cutting Inference Features @@ -1314,18 +1314,20 @@ CI and extending covariate-adjusted R parity are tracked follow-ups in `TODO.md` | Field | Value | |-------|-------| | Module | `diagnostics.py` | -| Primary Reference | None canonical (general permutation / leave-one-out diagnostic) | -| R Reference | None canonical | +| Primary Reference | Bertrand, Duflo & Mullainathan (2004), QJE 119(1):249-275 (placebo laws / randomization inference). Paper review on file: `docs/methodology/papers/bertrand-duflo-mullainathan-2004-review.md`. 
| +| R Reference | None canonical (no R package ships a generic placebo battery) | | Status | **In Progress** | | Last Review | — | **Documentation in place:** - REGISTRY.md section: `## PlaceboTests` (NaN-inference edge cases for `permutation_test` and `leave_one_out_test`) +- Paper review: `docs/methodology/papers/bertrand-duflo-mullainathan-2004-review.md` (BDM 2004 placebo-law / serial-correlation grounding; proposes a `## PlaceboTests` REGISTRY entry, not yet integrated) - Implementation: tests embedded in `tests/test_diagnostics.py` **Outstanding for promotion:** -- Decide whether this surface warrants a standalone methodology review or whether the brief Verified Components walk-through + NaN-inference deviation log should live as a sub-section under each per-estimator diagnostic block instead -- If kept standalone: brief Verified Components block + Deviations block for the NaN-inference convention +- Standalone-vs-absorb decision: **resolved — standalone** (`diagnostics.py` is an exported public surface distinct from per-estimator placebo/LOO) +- Integrate the proposed `## PlaceboTests` entry into REGISTRY.md (cite BDM 2004 + scope) and flip this row to Complete +- Dedicated `tests/test_methodology_placebo.py` with BDM-anchored Verified Components (empirical permutation p-value per fn 12; p-value floor; LOO; fake-timing/fake-group) + Deviations block (permutation path's deliberate non-`safe_inference` + percentile CI; the NaN-inference convention). R parity is N/A (no canonical R placebo battery) → self-consistency / analytic anchors --- @@ -1463,7 +1465,7 @@ Promotion priority for the **In Progress** entries, ordered by what's blocked on **Substantive-review-blocked (each still missing one or more of: a methodology test file, R parity, or a paper review):** -1. **PlaceboTests** — decide first whether to keep standalone or absorb into per-estimator diagnostic sections; methodologically lightweight either way. +1. 
**PlaceboTests** — standalone-vs-absorb decision resolved (standalone) and the BDM (2004) paper review is now on file (`docs/methodology/papers/bertrand-duflo-mullainathan-2004-review.md`). Remaining for promotion: the dedicated methodology test file + REGISTRY integration (R parity N/A). Methodologically lightweight. **Consolidation-pass-blocked (already has paper review or methodology file or R parity; mostly Verified Components walk-through):** diff --git a/docs/methodology/papers/bertrand-duflo-mullainathan-2004-review.md b/docs/methodology/papers/bertrand-duflo-mullainathan-2004-review.md new file mode 100644 index 00000000..14639233 --- /dev/null +++ b/docs/methodology/papers/bertrand-duflo-mullainathan-2004-review.md @@ -0,0 +1,288 @@ +# Paper Review: How Much Should We Trust Differences-in-Differences Estimates? + +**Authors:** Marianne Bertrand, Esther Duflo, Sendhil Mullainathan +**Citation:** Bertrand, M., Duflo, E., & Mullainathan, S. (2004). How Much Should We Trust Differences-in-Differences Estimates? *The Quarterly Journal of Economics*, 119(1), 249-275. https://doi.org/10.1162/003355304772839588 +**PDF reviewed:** QJE published version (DOI: [10.1162/003355304772839588](https://doi.org/10.1162/003355304772839588)), 28 PDF pages = journal pp. 249-275 (content pp. 249-274 incl. the full Conclusion; References pp. 274-275). Local PDFs are gitignored under `/papers/` (`papers/119-1-249.pdf`); the journal/DOI version is the authoritative source. +**Review date:** 2026-06-26 + +--- + +> **Scope of this paper (important).** BDM (2004) is an *inference* paper, not an +> estimator. It explicitly "assume[s] away biases in estimating the intervention's effect +> and instead focus[es] on issues relating to the *standard error* of the estimate" +> (p. 250). 
Its two enduring contributions to a DiD library are (a) the **placebo-law / +> randomization-inference diagnostic** that grounds diff-diff's `PlaceboTests` surface, and +> (b) the demonstration that **serial correlation** makes conventional DD standard errors +> grossly understate sampling variability — motivating serial-correlation-robust inference +> over the conventional OLS/HC1 default: unit-level cluster-robust SEs and block/percentile-t +> bootstrap when the number of groups is large, and time-series aggregation for few groups. + +--- + +## Methodology Registry Entry + +*Formatted to match `docs/methodology/REGISTRY.md`. This is a **proposed** entry — it is not yet integrated. Folding it into REGISTRY.md (replacing the current edge-case-only `## PlaceboTests` stub) and promoting the `PlaceboTests` row in `METHODOLOGY_REVIEW.md` from In Progress to Complete are the work of the separate source-validation pass tracked there; this file adds the paper review only.* + +## PlaceboTests + +**Primary source:** [Bertrand, M., Duflo, E., & Mullainathan, S. (2004). How Much Should We Trust Differences-in-Differences Estimates? *The Quarterly Journal of Economics*, 119(1), 249-275.](https://doi.org/10.1162/003355304772839588) Paper review on file: `docs/methodology/papers/bertrand-duflo-mullainathan-2004-review.md`. + +**Scope:** BDM (2004) introduces the **placebo-law experiment** — randomly assign a fake +treatment date and a fake set of treated groups, estimate the DD regression, and ask how +often a (necessarily spurious) "effect" is significant. Because the laws are fictitious, +the true effect is zero, so a correctly-sized 5% test should reject ~5% of the time +(p. 251). The fraction of placebo runs that reject is the test's empirical size. Footnote 11 +(p. 
256) makes the link to **randomization inference** explicit: "if true laws were +randomly assigned, the distribution of the parameter estimates obtained using these placebo +laws could be used to form a randomization inference test of the significance of the DD +estimate [Rosenbaum 1996]." diff-diff's `PlaceboTests` operationalizes this idea as four +generic diagnostics (`fake_timing`, `fake_group`, `permutation`, `leave_one_out`). + +**Key implementation requirements:** + +*The placebo-law design (the empirical basis of the diagnostic):* +- **Two-step random assignment per replication (Section III, pp. 255-256):** + 1. Draw a fake treatment year from a **uniform distribution over 1985-1995** (footnote 9: + bounded to guarantee pre- and post-intervention observations). + 2. Randomly designate **exactly half the groups (25 of 50 states)** as "affected." + Then `I_st = 1` for units in an affected group after the fake date, else 0. +- **At least 200, typically 400 replications** per cell; the fraction with `|t| > 1.96` is + the empirical rejection rate (Table II note). +- **Permutation / randomization-inference variant:** randomly reassign treatment over a + fixed set of outcomes; the empirical distribution of placebo estimates is the reference + distribution for a randomization test (footnote 11). +- **Serially-uncorrelated placebo (Table II, row 5):** to isolate serial correlation of the + treatment indicator, set `I_st = 1` only in **ten randomly selected years (1979-1999)** for + treatment-group states, 0 otherwise (instead of one permanent post-date switch); rejection + collapses to **5%** (correct size), confirming the serial correlation of `I_st` is the + driver of over-rejection. +- **Power injection:** to test power against a known alternative, replace the outcome by + `outcome + I_st × δ` (e.g. `δ = 0.02` for a uniform 2% effect; footnote 16). + +*Critical caveat on the reference distribution (footnote 12, p. 
256):* +- "we are randomizing the treatment variable while keeping the set of outcomes fixed. In + general, the distribution of the test statistic induced by such randomization is **not a + standard normal distribution** and, therefore, the exact rejection rate we should expect + is not known." **Implication for the diagnostic:** the placebo/permutation distribution + *is* the correct reference; comparing a permutation estimate to normal critical values is + not generally valid. This is why a permutation p-value should be read off the empirical + distribution (proportion of placebo `|estimate|` at least as large as observed), not from + a t/normal table. + +*Estimator equation under test (equation (1), p. 250):* + + Y_ist = A_s + B_t + c·X_ist + β·I_st + ε_ist + +where: +- `Y_ist` = outcome for individual `i` in group `s` at time `t` +- `A_s`, `B_t` = group (state) and time (year) fixed effects +- `X_ist` = individual covariates; `c` = their coefficients +- `I_st` = intervention dummy (1 for affected group post-date); `β` = the DD estimate (β̂ via OLS) +- `ε_ist` = error term + +*Aggregated (group-time cell) analogue (footnote 14, p. 258):* + + Ȳ_st = α_s + γ_t + β·I_st + ε_st + +(residualize the individual outcome on covariates, average to group-year cells `Ȳ_st`, then +run the two-way FE DD on the cell means). + +*Edge cases:* +- NaN inference for undefined statistics (diff-diff defensive convention; not from BDM): + - `permutation_test`: t_stat is NaN when permutation SE is zero (all permutations produce identical estimates). + - `leave_one_out_test`: t_stat, p_value, CI are NaN when LOO SE is zero (all LOO effects identical). + - **Note:** Defensive enhancement matching CallawaySantAnna NaN convention. + +**Reference implementation(s):** +- No canonical R/Stata package implements BDM's *placebo battery* as a single command; the + paper's own machinery is custom (block-bootstrap "codes... available upon request", + footnote 21). 
BDM's two inference *corrections* that did become standard tooling are: + - Stata `cluster` (arbitrary/state-level cluster-robust VCV, Section IV.E). + - Stata `xtgls` (parametric AR(k) corrections, Section IV.A — shown *not* to work). + +**Requirements checklist:** + +*Directly grounded in BDM (2004):* +- [ ] Placebo-law size experiment — randomly assign a fake treatment date and a fake treated group, estimate the DD, and measure the empirical rejection rate (≈5% under the null when correctly sized) (Section III; `permutation` / `fake_group` operationalize the random-assignment exercise). +- [ ] Permutation/placebo p-value read from the empirical placebo distribution, not from normal/t critical values (footnote 12). +- [ ] Power can be probed by injecting a known additive effect `outcome + I_st × δ` (BDM's `δ = 0.02` exercise, p. 261). +- [ ] The inference lesson holds — serial-correlation-robust SEs (unit-level cluster / block bootstrap for large N; aggregation for small N) rather than OLS/HC1 (Section IV, Conclusion). + +*Library extensions of the placebo idea (not prescribed by BDM):* +- [ ] `placebo_timing_test` against a *chosen* pre-treatment period, reframed as a pre-trends diagnostic (BDM draw the fake date at random for a size experiment, not to test a specific pre-period). +- [ ] `leave_one_out` single-unit sensitivity diagnostic. +- [ ] Permutation p-value floor at `1/(n_valid+1)` (finite-draw randomization-inference convention; not stated by BDM). + +--- + +## Implementation Notes + +### Data Structure Requirements +- Individual-level micro data pooled across groups (states) and time (years), as repeated + cross-sections or a panel; or group-time cell means after residualizing on covariates + (footnote 14). +- Required: an outcome, a group identifier, a time identifier, and a treatment indicator + `I_st`. Covariates optional (partialled out before aggregation in the cell-mean variant). 
+- BDM's reference dataset: CPS MORG women aged 25-50, 1979-1999, ~900k observations (~540k + with positive earnings), 50 × 21 = 1050 state-year cells, outcome `log(weekly earnings)`. + +### Computational Considerations +- The placebo / permutation diagnostic re-fits the DD regression once per replication + (200-400 fits). Cost scales linearly in the number of replications × cost of one DD fit. +- The block bootstrap (Section IV.B) re-fits per bootstrap draw (200 in text / 400 in + Table V note — see Gaps) and is "not immediate to implement" (one block = one group's + full time series). + +### Tuning Parameters + +| Parameter | Type | Default (BDM) | Selection Method | +|-----------|------|---------------|------------------| +| Replications per placebo cell | int | typically 400, ≥200 | More draws → tighter Monte Carlo estimate of size | +| Fake-date window | range | uniform 1985-1995 | Bounded to keep pre/post observations (footnote 9) | +| Fraction of groups treated | float | 0.5 (25 of 50) | Author choice; alternatives tried (footnote 10) | +| Power-test effect size `δ` | float | 0.02 (2%) | Set to the alternative of interest (footnote 16) | +| Block-bootstrap replications | int | 200 / 400 | See Gaps (text/table discrepancy) | +| State resampling probability | float | 1/50 (whole state vectors, w/ replacement) | Preserves within-group autocorrelation (p. 258) | + +### Relation to Existing diff-diff Estimators +- **`PlaceboTests` (`diff_diff/diagnostics.py`)** is the direct descendant: `fake_timing` + ↔ placebo-timing law, `fake_group` ↔ random fake-treated groups, `permutation` ↔ the + randomization-inference test of footnote 11, plus `leave_one_out` (single-unit + sensitivity). The module already cites BDM (2004) in its docstrings. 
+- **Serial-correlation inference** motivates diff-diff's **unit-level cluster-robust SE** + (the modern form of BDM's "arbitrary variance-covariance" / state-level `cluster` + correction, Section IV.E) and its **bootstrap** inference paths (BDM's block / percentile-t + bootstrap, Section IV.B). BDM's headline warning — conventional DD SEs over-reject when + the outcome and the treatment indicator are both serially correlated — is the reason + unit-level clustering is the recommended default for panel DD. +- **Distinct from per-estimator placebo/LOO** already documented elsewhere in the registry: + SyntheticDiD's donor-pool `leave_one_out()` (ADH 2015 §4) and in-time placebo, HAD + pretests, and DCDH placebo are separate implementations on their own estimators; the + BDM-grounded `PlaceboTests` is the generic 2×2-DD battery. + +--- + +## Detailed Findings + +### 1. The diagnosed problem: serial correlation inflates DD significance + +Three reinforcing features of the DD setting make serial correlation severe (p. 251): +1. DD usually uses **long time series** (BDM's survey of 92 papers finds an average of + 16.5 periods, median 11; Table I, p. 253). +2. Common DD outcomes (employment, wages) are **highly positively serially correlated**. +3. The treatment indicator `I_st` is itself **highly serially correlated** (it changes + rarely within a group). + +OLS assumes a diagonal error VCV; clustering on the group-year cell fixes only the +*within-cell* (Moulton [1990]) correlation and leaves the *across-year within-group* serial +correlation untouched — which is why clustering at the state-year level still over-rejects +(pp. 257-258). + +### 2. Over-rejection magnitudes (Table II, p. 257) + +Rejection rate of the (true) null of no effect at the 5% level over placebo laws; correct +value is 0.05. 
+ +*Panel A — CPS data:* + +| Row | Specification | Reject (no effect) | Reject (2% effect) | +|-----|---------------|--------------------|--------------------| +| 1 | CPS micro, log wage, OLS | **.675** | .855 | +| 2 | + cluster at state-year | **.44** | .74 | +| 3 | Aggregated (cell means), OLS | .435 | .72 | +| 4 | Sampling states w/ replacement (prob 1/50) | .49 | .663 | +| 5 | **Serially-uncorrelated placebo laws** | **.05** | .988 | +| 6 | Employment (ρ̂₁=.470) | .46 | .88 | +| 7 | Hours worked (ρ̂₁=.151) | .265 | .280 | +| 8 | Changes in log wage (ρ̂₁=-.046) | **0** | .978 | + +*Panel B — AR(1) Monte Carlo (rejection rises monotonically with ρ):* + +| ρ | -.4 | 0 | .2 | .4 | .6 | .8 | +|---|-----|---|----|----|----|----| +| Reject (no effect) | .008 | .053 | .123 | .19 | .333 | .373 | + +Takeaways: (a) uncorrected DD rejects the true null of no effect **67.5%** of the time (abstract headline +"up to 45 percent" refers to the aggregated/cluster figures); (b) over-rejection tracks the +autocorrelation of the outcome (row 8 with negative autocorrelation *under*-rejects, 0%); +(c) the serially-uncorrelated placebo (row 5) restores correct 5% size, pinpointing the +serial correlation of `I_st` as the mechanism. Magnitude of spurious effects: mean +`|effect|` ≈ .02; ~60% in 1-2%, ~30% in 2-3%, ~10% above 3% (p. 260). + +**Scaling with N and T (Table III):** over-rejection is essentially **invariant to the +number of groups N** but **falls slowly as the number of periods T shrinks** — still ~15% +at 7 years and ~8% (CPS) to 17% (AR(1), ρ=.8) at 5 years. + +### 3. The four corrections evaluated (Section IV) + +| # | Correction | Section / Table | Verdict | +|---|-----------|-----------------|---------| +| 1 | Parametric AR(k) (Stata `xtgls`) | IV.A / Table IV | **Fails** — short-panel downward bias of ρ̂ (CPS ρ̂≈.4 vs true .51; AR(1) true .8 estimated .62) + AR misspecification leave rejection at 16-39%. Works only if true ρ is known and imposed. 
| +| 2 | Block (percentile-t) bootstrap | IV.B / Table V | **Works at large N only** — correctly sized at N≈50 (6.5% CPS, 5% AR(1)); degrades to 13% (N=20), 23% (N=10), 43.5% (N=6). Validity is asymptotic in N. | +| 3 | Arbitrary / state-level cluster VCV (Stata `cluster`) | IV.E / Table VIII | **Works at large N, best power** — 6.3% at N=50; over-rejects at small N (8% at N=10, 11.5% at N=6). Near oracle power (74% vs 78%; 27.5% vs 32%). | +| 3′ | Empirical VCV | IV.D / Table VII | Similar to cluster but assumes cross-sectional homoskedasticity; 5.5% at N=50, 15.3% at N=6. | +| 4 | Time-series aggregation (collapse to 2 periods) | IV.C / Table VI | **Best small-N size, lowest power** — 5.3% at N=50 and N=10; residual-aggregation variant handles staggered timing (~9% at N=10); needs the Donald-Lang [2001] small-sample t adjustment; power as low as 6.5% at N=10. | + +**Block-bootstrap percentile-t procedure (Section IV.B, p. 265):** +1. Compute the observed absolute t-statistic `t = |β̂ / SE(β̂)|`. +2. Resample **50 whole-group blocks with replacement** `(Ȳ_s, V_s)` (state time-series + + its design rows), preserving within-group autocorrelation. +3. Refit OLS → `β̂_r`; form the **recentered** statistic `t_r = |(β̂_r − β̂) / SE(β̂_r)|`. +4. Reject `β = 0` at the 95% confidence level if **95% of the `t_r` are smaller than `t`** (compare `t` to the + 95th percentile of the bootstrap `t_r` distribution). + +**Arbitrary / cluster VCV (Section IV.E, p. 271):** a White-type sandwich clustering on +**entire groups** (states), not group-year cells (footnote 24): + + W = (V'V)^(-1) ( Σ_{j=1}^{N} u'_j u_j ) (V'V)^(-1), with u_j = Σ_{t=1}^{T} e_{jt} v_{jt} + +where `V` is the matrix of independent variables (state dummies, year dummies, treatment), +`e_{jt}` the residual, and `v_{jt}` the row of independent variables for group `j` at time +`t`. 
Consistent for fixed T as N → ∞ [White 1984; Arellano 1987; Kezdi 2002]; "analogous to +applying the Newey-West correction in the panel context where we allow for all lags" +(footnote 23). VCV is rank `TN − N` (footnote 25). + +### 4. Practical recommendations (Section IV.F / Conclusion) + +- **Large number of groups (≈50):** arbitrary/state-level cluster-robust SE — good size and + near-oracle power. Block bootstrap is a reliable alternative. +- **Small number of groups:** collapse the time series (aggregate to before/after) with the + Donald-Lang [2001] t-adjustment — correct size at the cost of low power. Use residual + aggregation when laws are staggered. +- **Do not** rely on parametric AR(k) corrections; misspecification yields inconsistent SEs. +- **Core warning:** "conventional DD standard errors may grossly understate the standard + deviation of the estimated treatment effects, leading to serious overestimation of + t-statistics... too many false rejections of the null hypothesis of no effect have taken + place" (Conclusion, p. 273). +- **Make robust inference standard practice (Conclusion, p. 274):** BDM urge practitioners + to "more carefully examine residuals as well as perform simple tests of serial + correlation," and — because serial-correlation-robust standard errors are "relatively easy + to implement in most cases" — argue this "should become standard practice in applied work." + They flag **GLS** and **GMM estimation of dynamic panel data models** as promising + directions for inference that is more efficient under serial correlation (consistent with + footnote 19's deferral of IV/GMM). + +--- + +## Gaps and Uncertainties + +- **Bootstrap-replications discrepancy.** The text (p. 265) says "a large number **(200)**" + of bootstrap samples; the Table V note says "The bootstraps involve **400** repetitions." + Both appear in the paper; treat 200 as a documented floor and 400 as the typical value. 
+- **Footnote 12 reference-distribution caveat is load-bearing for the permutation test.** + Because treatment is randomized over fixed outcomes, the induced statistic is *not* + standard normal, so the *correct* p-value comes from the empirical placebo distribution, + not a t/normal table. Any implementation that derives a permutation p-value from normal + critical values would misstate significance — a property the source-validation pass should + confirm for `permutation_test` when promoting `PlaceboTests`. +- **No closed-form small-sample correction is proposed.** Wild cluster bootstrap, CR2/CR3, + and Satterthwaite dof all post-date this paper; BDM endorse only the Donald-Lang (2001) + threshold adjustment for the aggregation route. Modern small-N fixes are out of scope here. +- **Transcription note (p. 271).** The paper describes `v_{jt}` as "a row vector of + dependent variables (including the constant)"; in context (`V` is the *independent*-variable + matrix) this is almost certainly a typo for *independent* variables. The paper's wording is quoted verbatim here; Detailed Findings §3 uses the corrected reading. +- **No numbered theorems/propositions.** Contributions are empirical/Monte-Carlo (Tables + I-VIII) plus equation (1); cited asymptotic results (Nickell 1981, Kezdi 2002, Kiefer + 1980, Donald-Lang 2001) are invoked, not derived. +- **Coverage:** the full article text (pp. 249-274), including the complete Conclusion + (Section V), is reflected above. Only the bibliography (pp. 274-275) is not summarized.
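
The footnote 12 caveat can be made concrete with a small sketch. It assumes a balanced unit-by-period panel and a plain 2×2 DD on cell means. `dd_estimate` and `permutation_p_value` are hypothetical helper names for illustration, not the `diagnostics.py` API. The floor is applied as `max(share, 1/(n_perm + 1))`, which is only one possible reading of the `1/(n_valid+1)` convention listed in the checklist. The data are synthetic.

```python
import numpy as np


def dd_estimate(y, treated, post):
    """2x2 DD: mean post-minus-pre change of treated units minus that of controls."""
    change = y[:, post].mean(axis=1) - y[:, ~post].mean(axis=1)
    return change[treated].mean() - change[~treated].mean()


def permutation_p_value(y, treated, post, n_perm=400, seed=0):
    """Empirical two-sided p-value from reshuffled group labels (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    observed = dd_estimate(y, treated, post)
    # Reassign treatment across units while holding the outcomes fixed.
    placebo = np.array(
        [dd_estimate(y, rng.permutation(treated), post) for _ in range(n_perm)]
    )
    # Share of placebo |estimates| at least as large as the observed |estimate|.
    share = np.mean(np.abs(placebo) >= abs(observed))
    # Assumed floor: never report a p-value below 1 / (n_perm + 1).
    return observed, max(float(share), 1.0 / (n_perm + 1))


# Synthetic panel: 50 units x 10 periods, 25 treated, true effect 0.5 from period 5 on.
rng = np.random.default_rng(1)
treated = np.arange(50) < 25
post = np.arange(10) >= 5
y = rng.normal(size=(50, 10)) + 0.5 * np.outer(treated, post)
estimate, p_value = permutation_p_value(y, treated, post)
print(f"DD estimate = {estimate:.3f}, permutation p-value = {p_value:.4f}")
```

The p-value is read off the placebo draws themselves, so no normal or t critical value enters the calculation. The default of 400 draws mirrors BDM's typical replication count.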