Four new DiD studies + a textbook update
In celebration of 500 subs :) Happy to have you all here!
Hi there! It’s been a while :) It turns out some interesting new developments hit the web in the meantime, and they’re the ones we will be talking about today.
I highlight five, in no particular order:
Conditional Triple Difference-in-Differences, by Dor Leventer
Triple Instrumented Difference-in-Differences by Sho Miyaji
Covariate Balancing Estimation and Model Selection For Difference-in-Differences Approach, by Takamichi Baba and Yoshiyuki Ninomiya
Spatial Synthetic Difference-in-Differences, by Renan Serenini and Frantisek Masek (it was posted last year but some of you might not be aware)
Credible Answers to Hard Questions: Differences-in-Differences for Natural Experiments, by Clément de Chaisemartin and Xavier D'Haultfœuille (this is a textbook originally published two years ago, but it has just been updated.)
Conditional Triple Difference-in-Differences
(Dor is a PhD candidate, advised by Prof. Saporta-Eksten, and apparently a very prolific one!)
TL;DR: your trusty triple-difference (TDID) regression stuffed with dozens of controls can still feed you biased estimates whenever the treatment and comparison groups carry different mixes of those controls. This pre-print pins down the bias mathematically, then fixes it with a re-weighted, doubly-robust TDID estimator: one set of weights forces covariate balance, another piece mops up any remaining outcome misspecification. Dor gives us an R package, tdid, so you can swap the new estimator into your code and get triple differences that really reflect the causal effect you care about.
What is this paper about?
Researchers often extend DiD to a triple difference (TDID) when an extra contrast is needed (e.g., time × treated × group). In practice, many studies add covariate controls and then difference the two DiD estimates, assuming the residual bias cancels. Dor shows this fails whenever the groups have different covariate distributions (because the group-specific bias in conditional parallel trends generally does not wash out). Merging the unconditional TDID framework of Olden & Møen (2022) with the conditional-DiD framework of Callaway-Sant'Anna (2021), he formalises a conditional TDID setting, proves the conventional estimator is biased, and derives an alternative estimand that is recoverable: ATT for group A minus a covariate-reweighted conditional ATT for group B (using group A's X-distribution). A double-robust/weighted-double-robust (DR/WDR) estimator with influence-function theory delivers consistent inference, and the accompanying R package tdid makes implementation and Monte-Carlo replication easy.
What does the author do?
Dor develops a conditional TDID framework that unifies two literatures: the unconditional triple‐difference set-up of Olden & Møen (2022) and the covariate-adjusted DiD framework of Callaway-Sant’Anna (2021). Within this framework he:
Diagnoses the problem: Theorem 1 shows that the familiar recipe (e.g., run DiD with controls separately in each group and difference the two estimates) does not deliver the desired causal contrast whenever the groups' covariate distributions differ. The bias stems from group-specific deviations in conditional parallel trends that fail to cancel. (A sketch of this conventional recipe appears just after this list.)
Defines an estimand that is identifiable: He shifts attention to \(ATT_A - E[CATT_B(X) \mid G = A]\), i.e., the treatment effect for group A minus a covariate-reweighted conditional ATT for group B, where the weights replicate group A's covariate mix. Theorem 2 proves this quantity is identified under the same conditional-parallel-trends assumption.
Builds a doubly-robust estimator: He pairs the usual DR estimator for group A with a weighted DR (WDR) estimator for group B, then differences the two. Because either the outcome model or the generalized propensity score can be misspecified (but not both), the estimator remains consistent and asymptotically normal; an influence-function derivation yields valid standard errors.
Validates it: A toy analytical example and a Monte-Carlo exercise show that the conventional estimator is biased (direction and magnitude match the theory), while the DR/WDR estimator is centered on the true parameter.
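To make that concrete, here is a minimal sketch of the conventional "DiD-with-controls, then difference" workflow that Theorem 1 targets. It is my own illustration, not code from the paper: I assume a long-format data frame df with outcome y, 0/1 indicators treat and post, a group column taking values "A" and "B", two stand-in controls x1 and x2, and I use the fixest package purely for convenience.

# Illustration only: the conventional recipe the paper shows can be biased
library(fixest)

# DiD with controls, run separately within each group
did_A <- feols(y ~ treat * post + x1 + x2, data = subset(df, group == "A"))
did_B <- feols(y ~ treat * post + x1 + x2, data = subset(df, group == "B"))

# Conventional TDID estimate: difference of the two DiD coefficients
tdid_conventional <- coef(did_A)["treat:post"] - coef(did_B)["treat:post"]

The paper's DR/WDR estimator replaces that final difference; the tdid package (installation and call further below) is the practical way to run it.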
Why is this important?
Triple-difference designs are everywhere in applied work. Dor reviews 66 highly cited TDID papers in top journals and finds that 73% add controls and 70% rely on a fully interacted three-way specification. Yet the paper proves that this widespread practice is generically biased whenever the covariate mix differs across groups, which is exactly the scenario that motivates adding controls in the first place. The contribution therefore closes a serious identification gap: it shows when and why the usual estimator fails and replaces it with a re-weighted, double-robust alternative whose consistency can survive misspecification in either the outcome model or the generalized propensity score. That safeguard is valuable for empirical researchers who increasingly pair DiD methods with high-dimensional or machine-learning covariates. Beyond theory, the paper is practice-ready. The accompanying R package tdid implements the weighted DR estimator and reproduces the simulation evidence, lowering the cost of adoption for graduate students, replication teams, and journal reviewers alike.
Who should care?
Applied micro-economists in labour, public, development, health, education, or environmental economics who already lean on triple differences to sharpen identification.
Empirical researchers and journal referees who worry that adding controls may quietly re-introduce bias they thought the third “D” removed.
Methodologists extending the DiD toolkit to high-dimensional controls, ML adjustments, or heterogeneous treatment effects.
Graduate students and replication teams tasked with validating published TDID results that rely on the conventional “DiD-with-controls, then difference” workflow.
Do we have code?
We have a package, tdid, on GitHub. With it we can implement the double-robust (DR) and weighted DR estimators laid out in the paper. It provides a vignette that walks through the toy example and Monte-Carlo simulation, and exposes helper functions for influence-function standard errors and covariate-reweighting diagnostics.
Installation is one line:
devtools::install_github("dorleventer/tdid")
After that, estimating a bias-corrected TDID is essentially:
fit <- tdid(y, time, treat, group, x_covariates, data = df)
summary(fit)
The README and vignette reproduce every figure in the paper, so users can trace each step from identification to inference.
When your empirical design leans on a triple difference and covariate adjustment, those estimates are only as good as the identification behind them. This paper (and the accompanying tdid package) supplies a theoretically sound and implementation-ready fix, ensuring the third "D" does what you intended: isolate causality, not compound bias.
Triple Instrumented Difference-in-Differences
(Sho will be starting their PhD at Yale this fall, which is really impressive! Keep an eye out for them :) Being familiar with econometrics notation will help you understand their work better.)
TL;DR: standard DID-IV compares two groups over two periods; Sho shows how to add a third dimension (e.g., season × state × time) and still recover a local ATE. The key is a triple Wald-DID ratio (DDD in outcomes divided by DDD in treatment) that is valid under monotonicity plus “common acceleration” (a triple-difference analogue to parallel trends). The framework extends to staggered roll-outs and ordered (non-binary) treatments, and comes with clear guidance on estimation and asymptotic inference.
What is this paper about?
Triple-difference designs sometimes face endogeneity in the treatment itself; researchers therefore instrument and difference, but the identification conditions had not been formally spelled out. In this paper Sho defines a triple DID-IV set-up, pins down the target estimand (LATET for the subgroup affected by the instrument in the third dimension), and states the assumptions (monotonicity and common acceleration in both treatment and outcome) that make the triple Wald-DID ratio valid. The analysis covers both two-period and staggered-adoption panels.
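Schematically (my notation, not necessarily the paper's), the triple Wald-DID estimand is the triple difference in mean outcomes divided by the triple difference in treatment take-up:
\[
W_{DDD} \;=\; \frac{\Delta^{DDD}\, \mathrm{E}[Y]}{\Delta^{DDD}\, \mathrm{E}[D]},
\]
where \(\Delta^{DDD}\) differences across time, instrument exposure, and the third dimension (e.g., the demographic group). Under monotonicity and common acceleration, this ratio recovers the LATET for the subgroup whose treatment the instrument shifts.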
What does the author do?
Sho introduces notation for a two-period, three-group setting (exposed/unexposed × instrumented/non-instrumented × demographic group) and defines the triple Wald-DID estimand.
They prove that the ratio of DDDs equals the LATET under monotonicity and common-acceleration assumptions (Theorem 1), and generalise to ordered treatments (Theorem 2).
Then they partition units into cohorts by instrument-start date, define cohort-specific LATETs (CLATTs), and show how to estimate them with either a never-exposed or last-exposed control cohort (Theorems 3–5).
Finally, they provide influence-function formulas and show that an IV regression with triple-interaction dummies in both first stage and reduced form delivers the estimator and its standard error.
Why is this important?
Instrumented DiD is already popular when no clean control group exists; empirical work is increasingly layering a third difference (season, demographic subgroup, geography) on top of an instrument without a clear blueprint for identification. Sho supplies that blueprint, shows the precise conditions under which the familiar “DDD-IV” regression isolates causal effects, and offers a robust alternative to two-way-fixed-effects IV estimators that can be badly biased under staggered timing.
Who should care?
Applied researchers using instruments that switch on only for a subset within a treated group (e.g., environmental regulations applied only in summer months in participating states).
Econometricians extending DiD/IV methods to heterogeneous treatments, staggered timing, or multi-dimensional policy variation.
Reviewers assessing DDD-IV studies who need a clear checklist of assumptions and estimators.
Do we have code?
No public package yet. Estimation boils down to a standard IV regression with fully interacted group × time × instrument dummies; the paper's appendix gives ready-to-copy equations for both two-period and staggered panels. (If a replication package appears, it will likely be linked from Sho's homepage or the arXiv record.)
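In the meantime, here is a minimal sketch of the kind of fully interacted IV regression described above. It is my own rendering under assumptions, not the paper's replication code: a two-period data frame df with outcome y, endogenous treatment d, instrument z, and 0/1 indicators post, exposed, and grp for the three differences, estimated with AER::ivreg.

# Illustration only: DDD-IV via a fully interacted IV regression
library(AER)  # provides ivreg()

fit <- ivreg(
  y ~ d + post * exposed * grp |                  # structural equation: treatment plus all group/time dummies
      post * exposed * grp + z:post:exposed:grp,  # instruments: same dummies plus the triple-interacted instrument
  data = df
)
summary(fit)  # the coefficient on d plays the role of the triple Wald-DID estimate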
Covariate Balancing Estimation and Model Selection for Difference-in-Differences
TL;DR: when you run a DiD you often weight observations by a propensity score, and if that score is even a little wrong your estimate can drift. Baba & Ninomiya show how to choose weights that force the weighted treated and control samples to match on the chosen covariate moments (even quadratic ones), so the bias disappears even when the propensity model is misspecified, as long as the before-minus-after outcome change is linear in those covariates. They also derive a smarter information criterion that tells you which covariates are really worth including.
What is this paper about?
Abadie's semiparametric difference-in-differences (SDID) estimator is unbiased only when the propensity-score model is correctly specified; misspecification introduces bias. Baba and Ninomiya propose an alternative SDID procedure, covariate balancing for DID (CBD), that chooses propensity-score weights by solving moment conditions that force the weighted treated and control samples to match on selected covariate moments, including second-order terms such as xxᵀ. The resulting average-treatment-effect-on-the-treated (ATT) estimator is doubly robust: it stays consistent if either (i) the propensity-score model is correct or (ii) the before-minus-after change in outcomes is linear in the covariates. Baba and Ninomiya derive its large-sample distribution and show analytically and via simulations that CBD removes the bias seen with conventional maximum-likelihood weights when the propensity model is misspecified.

Because the weights themselves are estimated, standard information criteria (AIC, GIC) are inappropriate for covariate selection. The authors develop an asymptotically unbiased risk-based information criterion whose penalty depends on the estimated weighting matrix and the heteroskedasticity of the weighted outcomes; in practice this penalty is often much larger than the familiar 2 × (number of parameters). Simulations confirm that the new criterion selects sparser, lower-risk models than an AIC-style adaptation (QICw). A re-analysis of the LaLonde job-training data illustrates the practical gains from both CBD weighting and the new model-selection rule.
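One way to write the balance requirement (my notation; the authors' exact moment conditions may differ in detail): the weights \(\hat{w}_i\) are chosen so that, for a vector of balance functions \(h(x)\) containing the covariates and their second-order terms,
\[
\frac{1}{n_1} \sum_{i:\, D_i = 1} h(x_i)
\;=\;
\frac{\sum_{i:\, D_i = 0} \hat{w}_i \, h(x_i)}{\sum_{i:\, D_i = 0} \hat{w}_i},
\qquad
h(x) = \bigl(x^{\top},\ \mathrm{vech}(x x^{\top})^{\top}\bigr)^{\top},
\]
so the weighted control sample reproduces the treated sample's first and second covariate moments by construction.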
What do the authors do?
Instead of guessing a propensity-score formula and hoping for the best, they choose weights that make the weighted treated and control samples exactly match on selected covariate moments, including quadratic (covariate × covariate) terms. Think of it as forcing the scales to balance before comparing outcomes. With those balanced weights in place, they run the usual before/after comparison, but with a twist: the ATT estimate stays consistent if either the weight recipe is correct or the before-minus-after change in outcomes is linear in those covariates. One correct piece is enough.

Adding covariates can help or just add noise. Classic AIC-style rules under-penalise that noise once weights are estimated, so the authors derive a larger, data-dependent penalty that tells you when an extra variable earns its keep.

In simulations, their method wipes out the bias that appears when the usual propensity model is misspecified, and the new information criterion selects lean, low-risk models. Re-analysing the famous LaLonde job-training data, the CBD criterion selects a noticeably different (much sparser) set of covariates than an AIC-style rule, illustrating the practical impact of both the balancing weights and the new model-selection penalty.
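To fix ideas, here is a small, self-contained sketch of moment-matching weights in this spirit. It is an entropy-balancing-style illustration under my own assumptions (an n × p covariate matrix X, a 0/1 treatment vector D, and a vector dy of post-minus-pre outcome changes), not the authors' CBD estimator or its information criterion:

# Illustration only: weights that match treated and control covariate moments,
# including second-order terms, then a two-period DiD on outcome changes.
second_order <- function(x) {
  m <- tcrossprod(x)                           # x x^T
  m[upper.tri(m, diag = TRUE)]                 # unique second-order terms
}
H  <- cbind(X, t(apply(X, 1, second_order)))   # balance functions h(x)
h1 <- colMeans(H[D == 1, , drop = FALSE])      # treated moments to match
H0 <- H[D == 0, , drop = FALSE]                # control balance functions

# Convex dual of the balancing problem: its minimizer yields control weights
# that reproduce the treated moments exactly (when a feasible solution exists)
dual <- function(theta) log(sum(exp(H0 %*% theta - sum(h1 * theta))))
opt  <- optim(rep(0, ncol(H0)), dual, method = "BFGS")
w    <- as.numeric(exp(H0 %*% opt$par))
w    <- w / sum(w)

max(abs(colSums(w * H0) - h1))                 # balance check: should be ~0

att_hat <- mean(dy[D == 1]) - sum(w * dy[D == 0])  # weighted DiD estimate of the ATT

The authors' actual procedure additionally delivers the asymptotic theory and the risk-based criterion for deciding which components of h(x) to keep.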
Why is this important?
Applied DiD studies often spend pages justifying that treatment and control "look similar." CBD makes them similar by construction, so that diagnostic "headache" disappears. If your usual propensity-score equation is even slightly misspecified, traditional weights can push your estimate off target. CBD cushions that risk; one correct ingredient (either the weights or a simple outcome trend) is enough to keep the estimate on track. Adding controls can just as easily inflate variance as reduce bias. The new information criterion tells you quantitatively when a covariate earns its keep, instead of relying on ad-hoc "p-value fishing" or AIC rules that are too forgiving once weights are estimated. In simulations and the LaLonde re-analysis, CBD removed bias, kept models lean, and produced a noticeably larger and more precise treatment effect than the standard approach. In other words: fewer assumptions, tighter answers.
Who should care?
A scope note first: the method is most directly useful in two-period DiD designs (pre-/post- only); multi-period settings will need the extensions the authors flag in their discussion. With that in mind, it should interest:
Researchers running DiD in economics, public health, education, or policy who rely on propensity-score weights and worry about whether those weights truly balance their samples.
Any analyst who already uses “covariate-balancing” tools (CBPS, kernel balancing, etc.) is also a natural user, because CBD brings that logic into the DiD world.
Replication teams and journal reviewers who want a transparent check that treated and control groups are comparable rather than trusting a black-box propensity model.
Methodologists developing robust causal estimators. CBD adds a practical, doubly-robust tool to the DiD arsenal.
Do we have code?
The article gives formulas, theorems, and simulation/empirical results, but it does not include any replication code, software appendix, or GitHub link. The only software-related remark is that the LaLonde data come from the R package Matching; everything else is presented algebraically and in tables. (The authors do mention that the weighting matrix can be computed with GMM or GEL, but they do not supply code for doing so.)
Spatial Synthetic Difference-in-Differences
(Thanks Prof. Renan for sending me this!)
TL;DR: this paper extends Synthetic Difference-in-Differences (SyDiD) to settings where policies spill across space, violating the no-interference (SUTVA) assumption. By embedding a spatial-weights matrix in SyDiD’s weighted two-way-fixed-effects regression, the authors create Spatial SyDiD (SpSyDiD), which simultaneously estimates direct effects on treated units and indirect (spillover) effects on their neighbours. Monte-Carlo experiments show SpSyDiD outperforms both “vanilla” SyDiD (which ignores spillovers) and Spatial DiD (which uses uniform weights) in bias and precision. A re-analysis of Arizona’s 2007 employer-sanctions law illustrates the method.
What is this paper about?
This paper gives the classic DiD toolkit a badly needed "GPS upgrade" for the real world, where policies rarely stay inside state lines. Think minimum-wage hikes that nudge workers across borders, congestion charges that reroute traffic into the suburbs, a smoking ban in one city that pushes smokers next door, or a jobs program in one region that poaches workers from its neighbours. In standard DiD we basically assume these cross-border ripples don't exist, which can bend our estimates. To tackle this problem, Serenini and Masek fuse two well-known ideas:
Who’s my neighbour? A spatial-weights matrix lists, for every region, which other regions lie close enough to feel its policy shock and how strongly they are connected.
Who’s my best match? Synthetic DiD already re-weights control regions and pre-treatment periods so the treated region’s trend is mimicked as closely as possible.
Merge the two and you get Spatial Synthetic DiD (SpSyDiD), a simple weighted regression that delivers two headline numbers instead of one:
Direct effect (τ): what happens inside the region that actually adopts the policy.
Spillover effect (τₛ): how much of that impact seeps into its neighbours.
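In regression form, the idea looks roughly like a weighted two-way fixed-effects model with a spatially lagged treatment (my schematic rendering, not necessarily the authors' exact notation):
\[
Y_{it} \;=\; \mu + \alpha_i + \beta_t + \tau\, D_{it} + \tau_s\, (W D)_{it} + \varepsilon_{it},
\]
estimated by weighted least squares using SyDiD's unit weights \(\omega_i\) and time weights \(\lambda_t\), where \(W\) is a spatial-weights matrix and \((WD)_{it}\) measures unit \(i\)'s exposure to treated neighbours at time \(t\).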
The authors show, with simulations covering "many treated counties" and the classic "one treated state" setup, that omitting τₛ can induce noticeable bias and extra variance; in their county experiments the relative error reaches a few percentage points and grows with stronger spillovers. SpSyDiD, by contrast, keeps both direct and indirect estimates on target without giving up the nice robustness features of Synthetic DiD. A case study makes it concrete: Arizona's 2007 employer-sanctions law cut the share of working-age non-citizen Hispanics in the state by 2.7 percentage points, while neighbouring Nevada, Colorado and New Mexico saw a +0.9 pp uptick, which is evidence that the law displaced people rather than making them disappear from the data.
What do the authors do?
Serenini and Masek take Synthetic DiD (the re-weighting framework that already softens the parallel-trends assumption) and graft onto it the core machinery of spatial econometrics. They let the treatment indicator bleed through a row-standardised spatial-weights matrix W, so the regression now contains two coefficients: τ, the change experienced by the directly treated region, and τₛ, the change that reaches its neighbours. They preserve SyDiD's unit weights (ω) and time weights (λ), which means the new estimator, Spatial SyDiD, can be implemented with the same weighted least-squares routine once those weights and W are in hand. The paper spells this out as a six-step recipe that needs nothing fancier than ordinary matrix algebra.

After building the estimator, they show why it matters. A short algebraic detour reveals that if a researcher ignores spillovers, the average treatment effect is biased by the factor (1 + ρ W̄), where ρ measures how strongly shocks diffuse and W̄ is the average share of treated neighbours. In other words, the more porous the border, the further your naïve DiD slides from the truth.

The authors then stress-test their method in two simulation worlds. In a county-level setting with many treated units, Spatial SyDiD recovers both the Average Treatment on the Treated and the Average Indirect Treatment Effect with errors under three percent, while plain SyDiD (which has no spillover channel) can only speak to the direct effect and spatial DiD (which uses uniform weights) is less precise. Repeating the exercise at the state level with a single treated unit produces the same ranking: Spatial SyDiD is markedly closer to the data-generating truth and roughly halves the root-mean-squared error relative to its competitors.

To show the estimator at work in real data, they revisit Arizona's 2007 Legal Arizona Workers Act. Using Current Population Survey microdata, Spatial SyDiD attributes a 2.7-percentage-point fall in working-age non-citizen Hispanics to the law inside Arizona, alongside a 0.9-point rise in neighbouring Nevada, Colorado and New Mexico, evidence that the legislation displaced people rather than erasing them from the labour force.

Finally, the paper adapts Synthetic DiD's placebo-based inference to the spatial setting. By repeatedly re-assigning "treated" and "neighbour" labels among control units, the authors build a finite-sample variance for both τ and τₛ without leaning on additional distributional assumptions, giving practitioners an off-the-shelf way to attach standard errors to each component of the effect.
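To make the weighted-least-squares step concrete, here is a minimal sketch under my own assumptions, not the authors' replication code (that lives in the repo linked below): a balanced panel stored as N × T matrices Y and D, a row-standardised N × N matrix W, and unit/time weight vectors omega and lambda already computed by a synthetic-DiD routine.

# Illustration only: weighted TWFE regression with a spatially lagged treatment
S <- W %*% D                                    # neighbours' treatment exposure, N x T

panel <- data.frame(
  unit   = rep(seq_len(nrow(Y)), times = ncol(Y)),
  time   = rep(seq_len(ncol(Y)), each  = nrow(Y)),
  y      = as.vector(Y),
  d      = as.vector(D),
  s      = as.vector(S),
  weight = as.vector(outer(omega, lambda))      # omega_i * lambda_t
)

fit <- lm(y ~ d + s + factor(unit) + factor(time),
          data = panel, weights = weight)
coef(fit)[c("d", "s")]                          # tau (direct) and tau_s (spillover)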
Why is this important?
Most policy studies assume each region is an island; when that fails, estimates can be way off. Spatial SyDiD gives researchers a plug-and-play fix: it keeps SyDiD's relaxed parallel-trend strengths while explicitly measuring how much impact leaks across borders. That means: (i) cleaner causal claims when spillovers exist; (ii) a transparent split between "what happened here" and "what we pushed onto the neighbours," which is exactly what policymakers worry about; and (iii) no need for exotic software: you just add a spatial-weights matrix to workflows people already use. In short, it turns an unchecked threat to validity into something you can quantify, interpret, and report.
Who should care?
Applied micro-economists evaluating policies that plausibly shift jobs, prices, or people across county or state lines.
Urban and regional planners measuring knock-on effects of zoning changes, congestion charges, or transit investments.
Public-health and environmental scientists tracking how smoking bans, pollution controls, or epidemics spill into neighbouring areas.
Political scientists studying policy diffusion and electoral spillovers.
Impact-evaluation teams at governments and NGOs that rely on DiD/SCM workflows but face obvious “border leakage.”
Do we have code?
Yes! We have a repo on GitHub, https://github.com/serenini/spatial_SDID, that you can clone to reproduce the findings of the paper (and maybe adapt the code to your own research).