Heterogeneity, Small Sample Inference, Anticipation, and Local Projections
On uniform confidence bands for conditional effects, a simple approach for small-N inference, refining the "no anticipation" assumption, and the LP-DiD estimator for staggered designs.
Hi there! I am back from China now, and I have a few posts lined up on ML and CI in general, but first we will go through the ones related to DiD.
A bit of housekeeping first. We got two package updates when I was away.


We spoke about these papers here and here :)
And today’s post is about:
Doubly Robust Uniform Confidence Bands for Group-Time Conditional Average Treatment Effects in Difference-in-Differences, by Shunsuke Imai, Lei Qin, and Takahide Yanagi.
Simple Approaches to Inference with Difference-in-Differences Estimators with Small Cross-Sectional Sample Sizes, by Soo Jeong Lee and Jeffrey M. Wooldridge.
Refining the Notion of No Anticipation in Difference-in-Differences Studies, by Marco Piccininni, Eric J. Tchetgen Tchetgen, and Mats J. Stensrud.
A Local Projections Approach to Difference-in-Differences, by Arindrajit Dube, Daniele Girardi, Òscar Jordà, and Alan M. Taylor (this is the JAE published version - congrats!! - of the WP that went around in 2023. Also for Stata users, there’s an update to the locproj package here).
Doubly Robust Uniform Confidence Bands for Group-Time Conditional Average Treatment Effects in Difference-in-Differences
(Shunsuke is a second-year PhD student at Kyoto University)
TL;DR: this paper shows how to study treatment effect heterogeneity in staggered DiD designs when you care about a continuous covariate, like the pre-treatment poverty rate. The authors build on Callaway and Sant'Anna (2021) to estimate group-time Conditional Average Treatment effects on the Treated (CATTs) that vary with that covariate. They construct uniform confidence bands so we can see which parts of the curve are meaningful. The method is doubly robust and works under standard identification conditions. The procedure combines parametric estimation for the nuisance functions with nonparametric methods for the main parameter of interest. The authors provide an R package (didhetero) to help implement the methods.
What is this paper about?
This paper is about how to study treatment effect heterogeneity in staggered DiD designs. The goal is to move beyond group-time averages and see how treatment effects vary depending on the value of a continuous pre-treatment covariate, like the poverty rate. The authors focus on estimating group-time Conditional Average Treatment effects on the Treated (CATTs). This tells us, for each group and period, how the effect changes across the distribution of a pre-treatment variable. Instead of asking whether a policy worked, we can ask for whom it worked, and how strongly.
The second part of the paper is about inference. It is one thing to estimate a curve, but we also want to say which parts of that curve are statistically meaningful. The authors construct uniform confidence bands for the CATT function. These bands help us see where effects are large, where they are noisy, and where they are indistinguishable from zero. The method adapts the doubly robust estimand from Callaway and Sant’Anna (2021) to a more granular, conditional setting. It works under standard assumptions and uses a three-step procedure that combines both parametric and nonparametric estimation. The result is a tool that can handle covariate heterogeneity, treatment timing, and proper inference at the same time.
What do the authors do?
They start with the doubly robust estimand from Callaway and Sant’Anna (2021) and build on it to estimate conditional treatment effects. The main innovation is extending this framework to recover how effects vary with a continuous covariate. To do this, they propose a three-step procedure that combines parametric estimation of nuisance components (the outcome regression and generalized propensity score) in the first stage with nonparametric local polynomial regressions in the second and third stages. This setup captures nonlinearity without forcing a specific shape onto the CATT function.
For inference, they develop two ways to construct uniform confidence bands: one based on an analytical approximation and another using weighted or multiplier bootstrapping. This part is technically demanding. They show how nonparametric smoothing and the presence of estimated nuisance terms affect the distribution of their estimator, and they prove that the bands have the correct asymptotic coverage. The statistical theory extends results from recent work on conditional average treatment effects in unconfoundedness setups to the staggered DiD setting. The paper includes simulations and discusses practical aspects like how to pick bandwidths and estimate standard errors. It also defines several summary measures based on the CATTs, which can help with interpretation when there are many groups and time periods.
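Their actual procedure (staggered timing, local polynomials, uniform bands) is more involved than anything I can show in a snippet, but here is a hedged toy sketch in R of the core idea of localizing a doubly robust estimand in a continuous covariate. It is a simulated two-period example, it uses loess instead of the paper's local polynomial machinery, it produces no confidence bands, and it is not the didhetero implementation:

```r
# Toy illustration (not the authors' estimator): a doubly robust CATT(z) curve
# in a simple two-period DiD, smoothing a DR signal over a continuous covariate z.
set.seed(1)
n   <- 2000
dat <- data.frame(z = runif(n), w = rnorm(n))        # z = covariate of interest, w = other covariate
dat$d  <- rbinom(n, 1, plogis(-0.5 + dat$z))         # treatment indicator
dat$dy <- 1 + 2 * dat$z * dat$d + dat$w + rnorm(n)   # outcome change; true CATT(z) = 2z

# Step 1: parametric nuisance models (propensity score, outcome regression on controls)
ps <- glm(d ~ z + w, family = binomial, data = dat)$fitted.values
m0 <- predict(lm(dy ~ z + w, data = subset(dat, d == 0)), newdata = dat)

# Doubly robust signal: E[s | z] / E[d | z] recovers CATT(z) if either nuisance model is right
dat$s <- (dat$d - ps * (1 - dat$d) / (1 - ps)) * (dat$dy - m0)

# Steps 2-3: nonparametric smoothing in z (loess here; the paper uses local polynomials
# and adds analytical or multiplier-bootstrap uniform confidence bands)
grid   <- data.frame(z = seq(0.05, 0.95, by = 0.05))
catt_z <- predict(loess(s ~ z, data = dat), newdata = grid) /
          predict(loess(d ~ z, data = dat), newdata = grid)
plot(grid$z, catt_z, type = "b", xlab = "z", ylab = "estimated CATT(z)")
lines(grid$z, 2 * grid$z, lty = 2)                   # true curve, for comparison
```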
Why is this important?
When we employ DiD we often end up reporting some average effect for a group or a time period and move on. But averages aren’t always helpful because they smooth over interesting variation, and sometimes that variation is the whole point. This paper shows how we can move beyond average effects. Instead of asking if a policy worked on average, we can ask who it worked for. Did high-poverty counties see a benefit? This paper shows how to estimate treatment effects that change with covariates like the poverty rate and then build confidence bands around them so we know what we’re seeing is real.
It’s useful because this kind of question comes up all the time in applied work. Think about the minimum wage. The extent to which minimum wage increases reduce poverty depends on the structural relationship between wage gains and job losses. A group average can’t tell you that, but this method can help assess it. And what’s nice is that it works with staggered treatment timing, which is the norm in many real-world applications. You don’t have to pretend that everyone was treated at once or that effects are constant across space and time. You get a picture that’s both more honest and more informative. That’s the kind of thing you want when the stakes are high and the outcomes matter.
Who should care?
If you’re working with staggered treatment and you already use group-time ATT methods like Callaway and Sant’Anna (2021), you should check this paper. It adds another layer: instead of just estimating how effects vary by group and time, you can now see how they shift across something continuous, like the poverty rate. It’s helpful if you’re trying to answer questions like: does this policy work better in richer or poorer counties? Is the effect stronger where unemployment was already high? This paper gives you a way to answer those kinds of questions without needing a separate model for every subgroup. And it gives you proper confidence bands so you can see where the heterogeneity is real and where it’s just noise. The authors also mention that even though their main example is minimum wage and poverty, this method is general and you can apply it to any staggered DiD setup where treatment effects might depend on a baseline variable.
Do we have code?
We have a package: didhetero (Treatment Effect Heterogeneity in Staggered Difference-in-Differences). It “provides tools to construct doubly robust uniform confidence bands (UCB) for the group-time conditional average treatment effect (CATT) function given a pre-treatment covariate of interest and a variety of useful summary parameters in the staggered difference-in-differences setup of Callaway and Sant’Anna (2021)”.
In summary, this paper is a nice reminder that treatment effects are more than numbers: they're functions, and once we start thinking about them that way, we need the right tools to estimate and interpret them. The authors take a standard DiD setup and give us a way to ask questions like how effects shift with poverty or unemployment, while still keeping the core assumptions intact. You don't need to change your identification strategy or build a different model for every subgroup; this method fits into what you already do and just makes it more informative.
Simple Approaches to Inference with Difference-in-Differences Estimators with Small Cross-Sectional Sample Sizes
(Soo-jeong is Professor Jeffrey’s soon-to-be-former student. Here’s his thread about this paper)
TL;DR: standard statistical methods for DiD studies are unreliable when you have a small number of treated or control groups (e.g., a single state or a few hospitals). This paper proposes a simple solution: for each group, calculate the average outcome before the policy and the average outcome after. Then, run a basic cross-sectional regression on these averages. This simple trick provides statistically valid confidence intervals and t-tests, even with extremely small samples, such as one treated unit and two control units. It offers a straightforward and easy-to-implement alternative to more complicated methods like Synthetic Control, giving us a reliable tool for policy evaluation when data is limited.
What is this paper about?
This paper looks at a familiar setup (tracking outcomes for treated and control units before and after some intervention) and asks what to do when you don’t have many units on either side. Usually DiD works fine if you have lots of treated and control units since you can lean on large‑sample theory and cluster‑robust errors. Here the authors show a simple trick: collapse each unit’s time series into two numbers (the pre‑intervention average and the post‑intervention average) then run an ordinary cross‑sectional regression of that time‑collapsed outcome on a treatment indicator. Under the usual linear‐model assumptions (normal errors, constant variance) you get exact inference even if you have just one treated unit and two controls. They then extend that to remove unit‑specific trends (fitting a little time trend in the pre‑period, subtracting it off, then averaging), and show we can still do exact t‑tests on that transformed data. Finally, they compare this to synthetic control and synthetic DiD approaches, run simulations and apply the method to California’s 1989 smoking restrictions (one treated state) and to staggered “castle law” rollouts. The upshot is a very easy‐to‐implement alternative when sample sizes in the cross‑section are small.
What do the authors do?
The authors take a hands‑on look at how to get valid confidence intervals when you have very few treated or control units. They start by showing that you can collapse each unit's entire time path into just two numbers (its average outcome before the policy and its average outcome after). You then run an ordinary cross‑sectional regression of those two‑period averages on a simple indicator for whether a unit was treated. Under the familiar assumptions of a linear model with normal, homoskedastic errors, that single regression gives you exact t‑tests and confidence intervals even if you have as few as one treated unit alongside two controls. Next they make the method more flexible by removing linear trends at the unit level. They fit a straight‑line trend to each unit's pre‑treatment history, subtract it from the full series, and then average the residuals before and after treatment. You can then plug those de‑trended averages into the same cross‑sectional regression and still get exact inference. After laying out these core ideas they run through simulations that confirm the small‑sample accuracy of their approach, and then they demonstrate it on two real cases: California's 1989 smoking ban and a staggered rollout of castle‑law enactments. That empirical work shows the method is fast to implement and gives results that line up well with more elaborate techniques.
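Since the basic procedure really is just averaging plus one regression, here is a minimal R sketch of it. The column names (id, treated, post, y, for a long panel df) are my own, not the paper's:

```r
# Collapse each unit's series to a pre- and post-treatment mean, then run one
# cross-sectional OLS of the post-minus-pre difference on the treatment indicator.
avg  <- aggregate(y ~ id + treated + post, data = df, FUN = mean)
wide <- reshape(avg, idvar = c("id", "treated"), timevar = "post", direction = "wide")
wide$dbar <- wide$y.1 - wide$y.0          # post-period average minus pre-period average
fit <- lm(dbar ~ treated, data = wide)    # the t-test on 'treated' is exact under normal,
summary(fit)                              # homoskedastic errors, even with very few units
```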
Why is this important?
Applied researchers often don't have the benefit of working with large samples, simply because the right data aren't there. For example, if you're studying a law change in one state or a corporate rollout in a handful of markets, the usual cluster‑robust approach can give wildly misleading confidence intervals when you only have a few clusters. This paper offers a way to get honest measures of uncertainty in those settings without complicated bootstrap schemes or heavy-duty synthetic control machinery. By boiling each unit's history down to a before‑and‑after average (or its de‑trended version), you end up with a tiny cross‑sectional regression where the standard t‑statistic really follows a t‑distribution. That means you can report confidence intervals you can trust, even with just one treated unit. It gives us a straightforward fallback when we lack large samples or when more elaborate methods are hard to justify in small‑n contexts.
Who should care?
This paper will matter to anyone running a policy evaluation where you don’t have the luxury of dozens of treated and control units. If you’re an applied economist looking at a single state’s law change or a public health researcher tracking a handful of hospitals before and after an intervention, you’ll run into the limits of cluster‑robust inference with small samples. It’s also useful for consultants or corporate analysts who roll out a pilot program in just a few markets and need reliable uncertainty measures without wrestling with heavy bootstraps or complex synthetic controls. In those cases this simple before‑and‑after averaging trick gives you a clear way to get honest confidence intervals even when your cross‑section is tiny.
Do we have code?
For this paper there isn’t a dedicated R or Python library you need to install. Once you’ve collapsed each unit’s time series into a pre‑ and post‑treatment average (or its de‑trended counterpart), you just run an OLS of that two‑period outcome on your treatment indicator. In Stata you could type something like “reg post_pre_diff treated, robust” and the built‑in t‑statistic is exact under the paper’s normal‑error assumptions. If you want an exact p‑value without leaning on normality, you can use the user‑written ritest command for randomization inference. To compare against synthetic DiD you’d load the sdid package in Stata 18, but you don’t need any fancy routines for the core approach. A few lines of code in R or Python (compute the averages, run lm() or statsmodels.OLS) and you’re done.
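And for the de‑trended counterpart mentioned above, a few more lines will do. Same hypothetical column names as the earlier snippet (id, treated, post, y), plus a year variable; this is a sketch of the idea, not code from the paper:

```r
# Unit-specific detrending: estimate a linear pre-period trend per unit, residualize
# the full series, then average the residuals before/after treatment and regress as before.
detrend_one <- function(d) {
  trend <- lm(y ~ year, data = d[d$post == 0, ])   # pre-treatment observations only
  d$y_dt <- d$y - predict(trend, newdata = d)      # remove the trend from all periods
  d
}
df_dt <- do.call(rbind, lapply(split(df, df$id), detrend_one))

avg  <- aggregate(y_dt ~ id + treated + post, data = df_dt, FUN = mean)
wide <- reshape(avg, idvar = c("id", "treated"), timevar = "post", direction = "wide")
wide$dbar <- wide$y_dt.1 - wide$y_dt.0
summary(lm(dbar ~ treated, data = wide))           # same exact t-test logic as before
```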
In summary, this paper serves as a reminder that sometimes the simplest solution is the most effective. In a field that often trends toward more complex estimators, the authors show how a clever data transformation combined with foundational econometric principles can solve a difficult and common inferential problem. Here we can see a “DiD problem” through a different lens. Soo-jeong and Prof Jeffrey provide a method that is transparent, easy to implement and statistically sound even in small samples. It empowers us to make credible claims in the exact settings where evidence is often most needed but hardest to analyze.
Refining the Notion of No Anticipation in Difference-in-Differences Studies
(I really enjoyed reading this paper - it did have more words than the others)
TL;DR: this paper resolves a common point of confusion about the “no anticipation” assumption in studies that employ DiD. We often worry that people might change their behaviour in anticipation of a policy, but the formal assumption simply states that future treatments can’t affect past outcomes, which is a trivial point about the arrow of time. The authors argue this mismatch stems from conflating the policy’s implementation with the decision to implement it. By introducing a separate variable for the “plan” or “decision” (P), they provide a clearer framework. This clarifies what “no anticipation” really means, shows how the standard DiD estimator can be biased if people react to the plan (as expected) and helps us specify whether we are estimating the effect of the plan itself or its ultimate implementation.
What is this paper about?
This paper tackles a subtle but important ambiguity at the heart of DiD methodology. Many DiD guides and papers state that a no anticipation assumption is required for the method to be valid. Formally, this assumption is often written in a way that says an intervention at a future time point does not affect an outcome in the past. The problem is that in any standard causal model the future can’t affect the past “by definition”. This makes the no anticipation assumption seem either trivially true or completely unnecessary, which may be why some foundational DiD papers don’t even mention it. This has led to widespread confusion: is this an important identifying assumption we need to worry about, or a redundant statement about time travel?
The authors argue that this confusion arises because the formal assumption fails to capture what researchers are really concerned about. When we talk about anticipation, we don’t mean that the policy itself reached back in time. We mean that knowledge of a future policy (the plan, the announcement, the waiver application) caused people or firms to change their behaviour before the policy was officially implemented. For example, in a study of Medicaid expansion, insurers might change their premiums as soon as a state announces its plan, well before the expansion actually begins. This paper’s goal is to resolve this ambiguity by providing a new, expanded causal model that formally separates the policy’s implementation from the prior decision to implement it.
What do the authors do?
Ok, let’s go in parts. The authors’ main contribution is to clarify the causal model underlying DiD. They do this by introducing a new variable, P, which represents the plan or decision to implement a policy. This decision P occurs at a specific point in time and is distinct from the policy’s actual implementation, A_2, which it causes. The key is that the decision P can happen before the pre-treatment outcome Y_1 is measured. With this expanded model, they propose a new, more meaningful no anticipation assumption (Assumption 5): the plan P has no average effect on the pre-treatment outcome Y_1 for the group that plans to adopt the policy. Unlike the standard assumption, this is a non-trivial statement about behaviour that could be violated in the real world. This framework has a few consequences.
They show that if a researcher (perhaps implicitly?) assumes parallel trends with respect to the plan (P=0) but there are anticipation effects (their new assumption is violated), then the standard DiD estimator is biased. Proposition 1 demonstrates that the DiD estimator identifies the true Average Treatment Effect on the Treated (ATT_A_2) minus a bias term, ψ, which is exactly the effect of the plan on the pre-treatment outcome. They also argue that in some cases, the researcher might be more interested in the effect of the decision itself (ATT_P) rather than the effect of the implementation (ATT_A_2). For example, an announcement of a car recall may cause people to stop driving the faulty cars, meaning the decision had a huge effect even if the subsequent act of seizing the cars had none. They show that under their new no-anticipation assumption, the standard DiD functional identifies ATT_P.
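To make the decomposition concrete, here is the two-period statement as I read it (a sketch; the exact display and conditioning sets in the paper may differ slightly):

$$
\underbrace{E[Y_2 - Y_1 \mid P = 1] - E[Y_2 - Y_1 \mid P = 0]}_{\text{DiD contrast}} \;=\; ATT_{A_2} \;-\; \psi,
\qquad
\psi = E\big[Y_1(p{=}1) - Y_1(p{=}0) \mid P = 1\big],
$$

so the bias term ψ is exactly the average effect of the plan on the pre-treatment outcome among planners, which is what their Assumption 5 rules out.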
Finally, they extend this logic from the simple two-period case to the more complex staggered adoption setting, showing how their framework can clarify the limited anticipation assumptions used in modern DiD estimators like Callaway and Sant’Anna (2021).
Why is this important?
This paper provides essential clarity on a foundational assumption that has been a source of confusion for decades. It closes the gap between the informal, intuitive meaning of anticipation and its flawed mathematical formulation. This matters for applied work for two main reasons.
First, it makes the potential for bias explicit. If you are studying a policy that was announced well in advance, you must now seriously consider whether that announcement itself changed behaviour. If it did, this paper shows that your standard DiD estimate of the implementation’s effect is likely biased, and it formalizes the source and structure of that bias.
Second, it forces us to be more precise about what causal question we are asking. Is the goal to measure the effect of a law actually being implemented (ATT_A_2)? Or is it to measure the total effect of the policy process, starting from its public announcement (ATT_P)? These are different questions with different answers and the choice has real consequences for interpretation. This paper provides the formal language needed to distinguish between them and to state the assumptions required to identify each one.
Who should care?
This paper is a must-read for any applied researcher who uses or teaches DiD. This includes economists, political scientists, epidemiologists, and other social scientists who evaluate policies or interventions. It is especially important for those studying policies that are announced publicly before they take effect, which covers the vast majority of new laws and regulations. If your research involves a scenario where people could plausibly react to the knowledge of a future treatment (whether it’s a minimum wage increase, a new environmental rule, or a coming tax change) this paper provides the conceptual tools to handle it correctly. It fundamentally clarifies the assumptions discussed in popular practitioner guides and recent econometric surveys, making it essential reading for both students and experts in the field.
Do we have code?
No, this is a conceptual and theoretical paper, not a new estimator. It does not come with a new R or Stata package because its purpose is to refine the thinking that precedes the coding. The paper clarifies the causal model, the estimand of interest, and the identifying assumptions. The tools used to calculate the DiD functional (e.g., packages like did and fixest in R, or csdid and reghdfe in Stata) are unchanged. The contribution of this paper is to help you decide which estimand you are targeting (ATT_A_2 vs ATT_P) and to be explicit about the no anticipation assumption that your identification strategy relies on.
In summary, this paper solves a long-running methodological puzzle: why does the no anticipation assumption in DiD appear to be a self-evident statement about the impossibility of time travel? The authors show that researchers have simply been using the wrong formal language for the right intuition. The real concern is not about the policy itself affecting the past, but about the plan for the policy affecting behaviour in the present. By formally separating the policy’s implementation from the decision to implement it, the paper provides a clear and coherent framework for thinking about anticipation effects. It’s a nice reminder that before we rush to estimate, we must first be precise about what it is we are trying to estimate and under what assumptions. The paper’s contribution is not a new command to type, but a new clarity of thought.
A Local Projections Approach to Difference-in-Differences
TL;DR: this paper offers a solution to the “negative weighting” problem that biases standard DiD estimates in settings with staggered treatment adoption. The authors propose an approach based on local projections (LP), a method common in macroeconomics. The “LP-DiD” estimator runs a separate, simple regression for each post-treatment period. The innovation is to restrict the sample in each regression to only include newly treated units and “clean controls” (units that have not yet been treated). This transparently avoids the problematic comparisons that cause bias, is computationally fast, highly flexible, and can replicate the results of more complex modern DiD estimators.
What is this paper about?
This paper enters the ongoing conversation about how to properly conduct DiD analysis when treatment is “staggered”. A recent wave of econometric research has shown that the traditional two-way fixed-effects (TWFE) regression can fail in this setting when treatment effects are heterogeneous or dynamic. The core issue is often called the “negative weighting” problem. In a staggered design the TWFE estimator implicitly uses already-treated units as controls for more recently treated units. For example, a state that adopted a policy in 2005 might be used as part of the control group for a state that adopted the same policy in 2010. This is a “forbidden comparison” because the 2005 state may still be experiencing its own dynamic treatment effects, making it a contaminated or “unclean” control. This can lead to the TWFE estimate being a weighted average of the true effects where some weights are negative, sometimes producing an average effect that is nonsensical or even has the wrong sign. In response, this paper proposes an intuitive framework called LP-DiD. It leverages the Local Projections (LP) method to estimate dynamic effects and combines it with a straightforward “clean control” condition to sidestep the negative weighting problem entirely. The result is a regression-based tool that is easy to implement and understand, yet powerful enough to stand alongside other recently developed DiD estimators.
What do the authors do?
The authors’ approach breaks the problem down by estimating the treatment effect for each post-treatment period, or horizon, separately. For each horizon, say, two years after treatment, they run a distinct regression. The outcome variable in this regression is the change in the outcome from the period just before treatment up to that two-year mark. The key variable of interest is an indicator that flags units at the exact moment they become treated.
The real innovation lies in how they construct the sample for each of these regressions. To get a clean estimate they only include two groups of units: the newly treated units (at the moment they switch into treatment) and the clean controls. A clean control is a unit that remains completely untreated all the way through the specific horizon being estimated. For the two-year effect, for example, the controls are units that are still untreated two years later. This simple rule is powerful because it guarantees that already-treated units are never used as controls, which is the source of the negative weighting bias in older methods.
The authors show that this basic procedure estimates a variance-weighted average of the effects across different treatment cohorts. They then demonstrate how to easily recover the more standard, equally-weighted average effect using familiar techniques like weighted least squares or a two-step regression adjustment. This flexible framework is also shown to easily accommodate covariates, situations where treatment isn’t permanent and different ways of defining the pre-treatment baseline.
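To make the clean-control logic concrete, here is a bare-bones R sketch for an absorbing treatment. The column names (id, year, y, and first_treat, with first_treat set to NA for never-treated units) are my own; this gives the variance-weighted version with time effects and skips covariates, the reweighting to an equally-weighted average, and clustered inference, for which the authors' lpdid command is the reference:

```r
# Bare-bones LP-DiD: one regression per horizon h, using the long difference
# Y_{t+h} - Y_{t-1}, newly treated units as the treatment group, and "clean controls"
# (units still untreated through t+h) as the comparison group, plus time effects.
lp_did_sketch <- function(df, horizons = 0:4) {
  key <- function(i, t) paste(i, t)
  out <- data.frame(h = horizons, beta = NA_real_)
  for (h in horizons) {
    d <- df
    d$dy <- df$y[match(key(df$id, df$year + h), key(df$id, df$year))] -  # Y_{t+h}
            df$y[match(key(df$id, df$year - 1), key(df$id, df$year))]    # minus Y_{t-1}
    d$switch_t <- !is.na(d$first_treat) & d$year == d$first_treat        # newly treated at t
    clean      <- is.na(d$first_treat) | d$first_treat > d$year + h      # untreated through t+h
    samp <- d[(d$switch_t | clean) & !is.na(d$dy), ]
    fit  <- lm(dy ~ switch_t + factor(year), data = samp)                # horizon-h regression
    out$beta[out$h == h] <- coef(fit)["switch_tTRUE"]
  }
  out   # variance-weighted event-study coefficients by horizon
}
```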
Why is this important?
The clean control condition is highly intuitive. It makes it easy to understand and explain exactly which comparisons are being made to identify the treatment effect, in contrast to more “black-box” estimators. At its core, the method is just a series of OLS regressions on different subsamples of the data. This makes it computationally fast, which is an advantage when working with very large panel datasets where more complex estimators can be slow. The LP-DiD approach helps demystify the new DiD literature. The authors show that their estimator, with different weighting schemes, is numerically equivalent or very similar to other leading methods, such as those proposed by Callaway and Sant’Anna (2021) and Borusyak, Jaravel, and Spiess (2024). This reveals the common logic underlying these different approaches. The framework is easily adapted to different empirical contexts. Researchers can modify the clean control definition for non-absorbing treatments, include covariates, or pool estimates over various horizons, all within a straightforward regression setup.
Who should care?
Any applied researcher using DiD with panel data, especially in settings with staggered treatment adoption, should be aware of this paper. It will be particularly useful for:
Economists, political scientists, and public health researchers looking for a robust, easy-to-implement alternative to traditional TWFE.
Researchers who value transparency and want to be able to clearly articulate the identifying assumptions and comparisons underlying their estimates.
Analysts working with large datasets (e.g., using administrative or worker-level panel data) who would benefit from a computationally efficient estimation method.
Instructors of econometrics courses, as LP-DiD provides a very clear and teachable example of how to solve the negative weighting problem in modern DiD.
Do we have code?
Yes. The authors have released a Stata command, lpdid, which implements the estimators discussed in the paper, and they provide Stata example files on GitHub. While the method is simple enough to be implemented manually with basic regression commands, the dedicated package streamlines the process of estimation, reweighting, and inference. The paper also includes a clear example of how to use the built-in teffects ra command in Stata to obtain the reweighted estimates.
In summary, this paper provides an elegant and intuitive bridge from the world of traditional DiD to the modern literature on robust estimation with staggered treatment. By framing the problem through the lens of local projections, a familiar tool from macroeconomics, the authors show that the notorious “negative weighting” bias that plagues TWFE can be solved with a simple and intuitive sample selection rule. The LP-DiD approach estimates dynamic effects by running a series of simple regressions, each focused on a specific post-treatment horizon and using only clean comparisons between newly treated units and those not yet treated. This approach demystifies what many newer DiD methods are doing under the hood, offering us a tool that is not only robust but also exceptionally flexible, transparent, and computationally fast.
I like the LP-DiD framework since this approach is simpler and cleaner than many alternatives, but I wonder whether it means the outcome variable in LP-DiD should also be detrended (if there is a unit root) or deseasonalized, similar to the original implementation and the issues discussed in the LP literature.