More DiD Drops: Continuous Treatments, Sample Selection, and Inference Tests
What to do when your treatment isn’t binary, your sample leaks, and your SEs collapse
Hi there!
A few updates landed just days after I published the other post.
Here they are:
Difference-in-Differences for Continuous Treatments and Instruments with Stayers, by de Chaisemartin, D’Haultfœuille, Pasquier, Sow, and Vazquez-Bare (an earlier version circulated three years ago, then without Sow as a co-author; de Chaisemartin, D'Haultfœuille, and Vazquez-Bare also have a published AEA P&P paper, Difference-in-Difference Estimators with Continuous Treatments and No Stayers).
Difference-in-Differences and Changes-in-Changes with Sample Selection, by Javier Viviens
Inference with Modern Difference-in-Differences Methods, by Yuji Mizushima and David Powell
Prof. Clément de Chaisemartin’s in-depth training on DiD, a series of videos organized by Prof. Andrea Albanese and hosted by LISER & DSEFM, University of Luxembourg.
Let’s go :)
Difference-in-Differences for Continuous Treatments and Instruments with Stayers
(The 2024 AEA P&P paper laid the groundwork for DiD estimators in settings with continuous treatments and no stayers (i.e., cases where all units change treatment intensity across time). In contrast, this 2025 paper generalises and complements that work by allowing for the presence of stayers (i.e., units whose treatment remains unchanged across periods), while also providing stronger identification results and doubly robust estimation. These stayers serve as the comparison group, in a more “classical” DiD spirit.)
TL;DR: this paper proposes new DiD estimators for settings where treatments are continuous (like prices or taxes) and some units don’t change their treatment over time (i.e. "stayers"). It defines two interpretable slope-based estimands (one for local effects (AS) and one for realised-policy evaluations (WAS)) and shows how to identify them using switchers vs. stayers with the same baseline treatment. The estimators are doubly robust, nonparametric, and come with pre-trends tests and an IV extension. If the 2024 AEA P&P paper gave us DiD for continuous treatments with no stayers, this paper gives us the rest of the picture.
What is this paper about?
In many situations where we are interested in applying DiD methods, we assume binary treatments and rely on having a control group that never receives treatment. But many real-world policies (e.g. tax rates, prices, or even pollution levels) are continuously distributed. And sometimes everyone is “treated,” just to different degrees. This paper tackles exactly that problem. It proposes DiD estimators for continuous treatments in settings where some units change their treatment (“switchers”) and others stay at the same level (“stayers”). Think of a state that raises its gasoline tax while others keep theirs unchanged. The stayers provide the comparison group, just like in a “classic” DiD. Instead of estimating treatment effects in levels, the paper focuses on slopes: how outcomes change as treatment intensity changes.
The authors define two estimands:
AS (Average of Slopes): the average, across switchers, of the effect per unit of treatment change; every switcher gets equal weight, regardless of how much its treatment changed.
WAS (Weighted Average of Slopes): the same kind of slope summary, but weighting each switcher by the absolute value of its treatment change, so bigger changes count more; this is the more natural target for cost-benefit and realised-policy questions. (I sketch both in notation just below.)
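To fix ideas, here is roughly what those two objects look like in a simple two-period setting. The notation is mine, adapted from how the paper describes them rather than their exact definitions: D_1 and D_2 are a unit's treatment doses in periods 1 and 2, Y_2(d) is the period-2 potential outcome at dose d, and S is the event that the unit switches (D_2 ≠ D_1).

```latex
% AS: equal-weighted average slope across switchers.
\[
\mathrm{AS} \;=\; E\!\left[ \frac{Y_2(D_2) - Y_2(D_1)}{D_2 - D_1} \;\middle|\; S \right]
\]
% WAS: the same slopes, weighted by the size of the dose change.
\[
\mathrm{WAS} \;=\;
\frac{E\!\left[ \,|D_2 - D_1|\, \frac{Y_2(D_2) - Y_2(D_1)}{D_2 - D_1} \;\middle|\; S \right]}
     {E\!\left[ \,|D_2 - D_1| \;\middle|\; S \right]}
\]
```

Note that WAS collapses to AS when every switcher changes its dose by the same amount; the weighting only matters when dose changes differ in size across switchers.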
They also extend the method to IV settings (e.g. using tax changes to estimate price elasticity), allow for covariates, dynamic treatment effects, and multi-period panels, and provide doubly-robust, nonparametric estimators with valid inference and pre-trends tests.
What do the authors do?
They propose two estimators: the Average of Slopes (AS), which captures how outcomes change with treatment for switchers on average, and the Weighted Average of Slopes (WAS), which weights those changes by how much treatment actually changed. Both estimators compare switchers to stayers with the same baseline treatment and rely on a new parallel trends assumption that allows for flexible heterogeneity. They also show how to estimate these slopes using doubly-robust, nonparametric estimators with valid inference, even in small samples. The paper extends the framework to handle instrumental variables, multiple time periods, covariates, and even dynamic treatment effects. They wrap it all up with an application on gasoline taxes and a Stata package for implementation.
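If you like seeing the mechanics, here is a deliberately naive plug-in version of the switchers-vs-stayers idea in Python. It coarsely "matches" on the baseline dose by binning it, uses the stayers' average outcome change within each bin as the counterfactual trend, and averages the implied slopes across switchers. This is my own toy sketch for intuition, not the authors' doubly robust, nonparametric estimator, and the column names (y1, y2, d1, d2) are made up.

```python
import pandas as pd

def naive_average_of_slopes(df, n_bins=10):
    """Toy plug-in version of the AS idea for a two-period panel.

    Expects one row per unit with columns y1, y2 (outcomes) and
    d1, d2 (treatment doses) in periods 1 and 2.
    """
    df = df.copy()
    df["dy"] = df["y2"] - df["y1"]           # observed outcome change
    df["dd"] = df["d2"] - df["d1"]           # treatment dose change
    df["switcher"] = df["dd"] != 0

    # Coarsely "match" switchers to stayers by binning the baseline dose.
    df["d1_bin"] = pd.qcut(df["d1"], q=n_bins, duplicates="drop")

    # Counterfactual trend: stayers' mean outcome change within each baseline bin.
    trend = (df.loc[~df["switcher"]]
               .groupby("d1_bin", observed=True)["dy"]
               .mean()
               .rename("stayer_trend"))

    sw = df.loc[df["switcher"]].join(trend, on="d1_bin")
    sw = sw.dropna(subset=["stayer_trend"])  # drop bins with no stayers

    # Slope for each switcher: de-trended outcome change per unit of dose change.
    slopes = (sw["dy"] - sw["stayer_trend"]) / sw["dd"]
    return slopes.mean()
```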
Why is this important?
This is really important because most real-world treatments are not binary. Prices, taxes, subsidies, exposure levels: they vary in intensity, not just in presence. Traditional DiD estimators break down in these cases, especially when there’s no clear control group. This paper offers a practical fix. By comparing switchers and stayers at the same starting point, it builds credible counterfactuals without needing everyone to be untreated. It also delivers interpretable slope-based effects that can answer both “what-if” and “was-it-worth-it” questions. And unlike many continuous-treatment approaches, these estimators are nonparametric, doubly-robust, and testable using pre-trends. That makes them much more usable in applied work.
Who should care?
Basically anyone studying policies where treatment varies by degree and not just presence. That includes researchers working on:
Taxes and subsidies
Prices and price regulation
Pollution exposure or health risk levels
Education or welfare policies with varying intensity
It’s especially useful for applied micro folks frustrated by DiD methods that fall apart when there's no clean control group. If you’ve ever had to drop observations just because “everyone got some treatment,” this paper is for you.
Do we have code?
Yes, partially. The authors have released a Stata package called did_multiplegt_stat (currently available via SSC) that implements some of their estimators. It’s still under development, but enough is there to apply the method to standard two-period and multi-period settings. No R version yet.
In summary, this paper fills a big gap in the DiD toolbox: how to estimate treatment effects when everyone’s treated, but to different degrees. It introduces slope-based estimands that are interpretable, testable, and robust, and opens the door to more credible applied work with continuous policies. If the earlier AEA P&P paper gave us DiD without stayers, this one gives us the stayers and the estimators we need. Just keep in mind some things that the authors themselves noted: while this approach offers clear advances over prior DiD work, it's not without constraints. The AS estimator can't handle "quasi-stayers" (units with tiny treatment changes), which prevents consistent estimation. In applications with many time periods, the method assumes no dynamic effects, meaning outcomes are only influenced by current treatment, not past treatment. Though the authors suggest fixes for this, these workarounds shrink the sample size. And as with all DiD methods, the underlying parallel trends assumption remains fundamentally untestable for the actual treatment period, even if placebo tests look promising.
Difference-in-Differences and Changes-in-Changes with Sample Selection
(Javier is a 3rd-year PhD student at the European University Institute in Italy. This is his WP. I had fun reading it; Javier’s writing is really pleasant and easy to follow)
TL;DR: when treatment affects who stays in your sample, DiD can break down. This paper shows that DiD estimates can become non-causal (or even undefined) under sample selection, especially when outcomes aren't observed or even well-defined for some units. It adapts Lee bounds to both DiD and Changes-in-Changes (CiC) settings and proposes a method to estimate causal effects for the “Always-Observed” stratum (i.e. the people for whom outcomes exist under both treatment and control). The paper also relaxes monotonicity assumptions and offers a partial identification strategy that holds up even when treatment is confounded.
What is this paper about?
This working paper tackles a classic problem DiD designs often can’t account for: what if treatment affects who’s in your sample? Think dropout, death, non-response, or employment status. If treatment changes who shows up in the data, standard DiD estimates can become biased or even meaningless. The paper focuses on units for whom the outcome is always defined (what Javier calls the Always-Observed stratum). For this group, he develops a way to estimate both average and quantile treatment effects, even when the treatment is confounded and sample selection is not ignorable. He also adapts Lee bounds to both DiD and Changes-in-Changes (CiC) frameworks, providing partial identification results under weaker assumptions than existing methods.
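In principal-stratification terms, the target object looks roughly like this (my notation, not necessarily the paper's): let S(1) and S(0) indicate whether a unit's outcome exists and is observed under treatment and under control, so the Always-Observed stratum is the set with S(1) = S(0) = 1, and the parameter of interest is the ATT within that stratum in the post-treatment period.

```latex
% Y_t(d): potential outcome in period t; G = 1 for the treated group;
% S(d): outcome exists / is observed under treatment status d.
\[
\mathrm{ATT}_{AO} \;=\;
E\!\left[ Y_{t=1}(1) - Y_{t=1}(0) \;\middle|\; G = 1,\; S(1) = S(0) = 1 \right]
\]
```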
What does the author do?
Javier reworks DiD for cases where sample selection isn’t ignorable, aka endogenous (e.g. when people drop out of a study because of the treatment, or when the outcome just doesn’t exist, like wages for the unemployed). He focuses on the subset of people for whom the outcome is well-defined regardless of treatment status and builds identification and inference strategies around that group. He generalises Lee bounds to the DiD and CiC settings, allowing for partial identification of:
The average treatment effect for Always-Observed units (ATTAO), and
The quantile treatment effect (QTTAO), to explore distributional impacts.
In the paper he also extends the method to relax monotonicity assumptions by using multiple sources of sample selection (like attrition and dropout), which lets him point-identify stratum proportions in more realistic settings. Finally, he puts everything to work in a job training program evaluation in Colombia, showing how naïve DiD would have overstated the treatment effect. It’s the kind of paper that quietly fills a gap many applied researchers have likely tiptoed around without realising how deep it goes.
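Because Lee bounds do so much of the work here, a quick refresher may help. Below is a minimal sketch of the classic single-cross-section Lee (2009) bounds under monotone selection, the building block the paper adapts to DiD and CiC. This is not Javier's estimator, and the column names (y, treated, observed) are invented for illustration: when treatment raises the selection rate, you trim the treated group's observed outcomes from the top or the bottom by the excess selection share, then compare means.

```python
import pandas as pd

def lee_bounds(df):
    """Classic Lee (2009) bounds on the mean effect for always-observed units.

    Assumes monotone selection: treatment can only bring units *into* the sample.
    df: pandas DataFrame with columns y (outcome, NaN if missing),
    treated (0/1), and observed (0/1).
    """
    sel_t = df.loc[df["treated"] == 1, "observed"].mean()
    sel_c = df.loc[df["treated"] == 0, "observed"].mean()

    y_t = df.loc[(df["treated"] == 1) & (df["observed"] == 1), "y"]
    y_c = df.loc[(df["treated"] == 0) & (df["observed"] == 1), "y"]

    # Share of observed treated outcomes to trim away: the "extra" units
    # that show up in the sample only because of treatment.
    p = (sel_t - sel_c) / sel_t

    # Lower bound: trim the top p-fraction of treated outcomes.
    # Upper bound: trim the bottom p-fraction. Then compare means to controls.
    lower = y_t[y_t <= y_t.quantile(1 - p)].mean() - y_c.mean()
    upper = y_t[y_t >= y_t.quantile(p)].mean() - y_c.mean()
    return lower, upper
```

Javier's contribution, roughly, is to port this trimming logic into the DiD and CiC frameworks and, as noted above, to relax the monotonicity requirement by exploiting multiple sources of selection.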
Why is this important?
In real-world data, who is observed often depends on the treatment. Job training affects employment; education policies affect dropout. And when outcomes don’t exist for the unobserved (like wages for the unemployed), even the causal estimand can fall apart. Most DiD applications quietly assume this isn’t a problem. This paper says otherwise, and gives you tools to deal with it. By focusing on Always-Observed units and adapting Lee bounds, it lets you estimate effects that are well-defined, interpretable, and policy-relevant, even when selection is messy and treatment is confounded. If you’ve ever run DiD on a shrinking sample and just hoped for the best, this might be your way out of it.
Who should care?
Anyone doing DiD with panel or survey data where attrition, dropout, or censoring might be related to treatment. That includes:
Labour economists studying training, unemployment, or job loss
Education researchers tracking dropout or test participation
Health economists dealing with mortality or survey non-response
It’s also relevant for RCTs with imperfect compliance or missing outcomes, especially when researchers try to squeeze DiD or CiC frameworks onto real data that leaks. If your outcomes only exist for a non-random slice of your sample, this is for you.
Do we have code?
An R package is in development (we should give Javier an RA), though not yet public. The paper is very clear on the estimation steps, so if you're comfortable with trimming and quantile bounds, you could replicate the method manually.
In summary, when treatment affects who’s in your sample, your DiD estimates might not mean what you think they do (or anything at all). This WP offers a sharp correction: estimate effects only for units with well-defined outcomes under both treatment arms. It adapts Lee bounds to DiD and CiC, covers both mean and quantile effects, and relaxes classic assumptions like monotonicity. A clean, careful contribution that quietly plugs a major gap in applied DiD work. Javier also identifies many promising extensions to build on this work, such as: extending the method to settings with multiple pre- and post-treatment periods, which would help test identification assumptions more thoroughly; adapting the approach for staggered adoption designs through pairwise comparisons or integration with doubly robust estimators; incorporating covariates to tighten bounds and increase credibility; and even adopting a Bayesian perspective when bounds are too wide to be informative. Something to look forward to.
Inference with Modern Difference-in-Differences Methods
(Yuji is a PhD student at RAND Graduate School and an assistant policy analyst at RAND, while David is a senior economist at RAND and a member of the RAND School of Public Policy faculty)
TL;DR: most new DiD estimators fix bias from staggered adoption and heterogeneity, but what about inference? This paper runs large-scale simulations to test how well these estimators behave when the number of treated units is small. The authors find that most over-reject the null, especially with fewer than 10 treated units. The one approach that consistently stays close to the nominal 5% rejection rate? An imputation-based estimator paired with a wild bootstrap. If you’ve been assuming the standard errors are fine, you might want to read this.
What is this paper about?
In this paper the authors evaluate how well modern DiD estimators perform when it comes to inference, especially in small-sample settings. This isn’t a new methods paper, it’s a stress test for the DiD tools we already use. Over the last few years, a wave of new DiD methods has corrected for issues like staggered adoption and treatment heterogeneity, but most of the focus has been on point estimates. Yuji and David ask: do these methods produce reliable standard errors and valid p-values when the number of treated units is small? To answer that, they simulate thousands of placebo policies (in CPS data and synthetic panels) and compare how often each estimator rejects the null when it shouldn't. The paper covers seven popular estimators (including Callaway and Sant’Anna, Sun and Abraham, Borusyak et al., Gardner’s 2SDID, and DCDH) and evaluates different inference procedures: cluster-robust SEs, wild bootstrap, randomization inference, and loads more.
What do the authors do?
They run a battery of simulations (both in real CPS data and fully synthetic panels) to see how different DiD estimators perform under the null. The key question is: do these methods reject the null too often when the number of treated units is small? Most do. They test seven leading DiD estimators, including Callaway and Sant’Anna (2021), Sun and Abraham (2021), Borusyak et al. (2024), Gardner’s 2SDID, Wooldridge (2024), de Chaisemartin & d’Haultfoeuille (DCDH), and stacked DiD. Each is evaluated with its default inference method (usually asymptotic or cluster-robust), and sometimes also with wild bootstrap or randomization inference. They vary the sample size, the number of treated units, and treatment heterogeneity. Their best-performing combo? Imputation-based estimators paired with wild bootstrap, which is stable even with just 5 treated units. They also call attention to two competing biases at work: small samples lead to over-rejection, while treatment heterogeneity inflates standard errors and reduces rejection rates. By comparing simulations with and without heterogeneity, they show these biases don't really cancel each other out. Surprisingly, they also find that for some estimators, increasing the number of untreated units actually worsens rejection rates rather than improving them, against the usual intuition that more untreated units = better inference.
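The placebo exercise itself is easy to mimic on your own data. Here is a stripped-down sketch of the idea (my illustration, not the authors' code): repeatedly hand a fake policy to a small random set of states, estimate a plain TWFE model with state-clustered standard errors, and record how often the placebo effect comes out "significant" at the 5% level. Swapping your preferred modern estimator into the estimation step is, essentially, the exercise the paper runs at scale across estimators and inference procedures.

```python
import numpy as np
import statsmodels.formula.api as smf

def placebo_rejection_rate(df, n_treated=5, n_sims=500, alpha=0.05, seed=0):
    """Share of placebo 'policies' declared significant at level alpha.

    df: pandas panel with columns y, state, year (no real policy needed);
    assumes a reasonably long panel so a mid-sample adoption year exists.
    """
    rng = np.random.default_rng(seed)
    states = df["state"].unique()
    years = np.sort(df["year"].unique())
    rejections = 0

    for _ in range(n_sims):
        # Fake policy: a random handful of states adopt at a random mid-sample year.
        fake_treated = rng.choice(states, size=n_treated, replace=False)
        adopt_year = rng.choice(years[len(years) // 3 : 2 * len(years) // 3])
        d = df.assign(
            post=(df["year"] >= adopt_year).astype(int),
            treat=df["state"].isin(fake_treated).astype(int),
        )
        d["policy"] = d["post"] * d["treat"]

        # Plain TWFE with state-clustered SEs; substitute your favourite
        # modern DiD estimator here to stress-test its inference instead.
        fit = smf.ols("y ~ policy + C(state) + C(year)", data=d).fit(
            cov_type="cluster", cov_kwds={"groups": d["state"]}
        )
        if fit.pvalues["policy"] < alpha:
            rejections += 1

    return rejections / n_sims  # should sit near alpha if inference is honest
```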
Why is this paper important?
It’s important for a couple of reasons: most of us don’t have the time (or the human resources) to run 1,000 simulations every time we worry about small-sample inference. This paper does that work for us. It shows that many modern DiD estimators (despite fixing bias from staggered timing and treatment heterogeneity) still rely on inference procedures that fall apart when the number of treated units is small. Cluster-robust SEs and asymptotic formulas often lead to massive over-rejection. That means our p-values might be “lying” to us, and that published findings could be reporting spurious significance just because inference was off. This paper also points to safer alternatives, like pairing imputation methods with wild bootstrap, that behave much better in these settings.
Who should care?
Anyone doing DiD with a small number of treated units, which includes cases where researchers are:
Evaluating state- or district-level policies with a few adopters
Analysing corporate or regional rollouts
Working with staggered adoption in small samples
It’s especially relevant if you’ve moved beyond TWFE but still rely on default SEs, or assume that new methods automatically fix inference too. This paper is a reminder that point estimates and p-values live separate lives, and we need to validate both.
Do we have code?
This is less a “new code” paper and more a “here’s how to use the tools you already have properly” paper. The authors use existing Stata packages for all the estimators (like did2s, csdid, did_imputation, etc.) and pair them with inference strategies like wild bootstrap and randomization tests. No new package is released, but if you’re already using these estimators, you can replicate their setup pretty easily.
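And if you want to see what the wild cluster bootstrap is actually doing before reaching for a package like Stata's boottest, here is a bare-bones, hand-rolled sketch of the restricted (null-imposed) Rademacher variant for a single coefficient. It is for intuition only, not a production implementation.

```python
import numpy as np
import statsmodels.api as sm

def wild_cluster_boot_p(y, X, cluster, test_col, n_boot=999, seed=0):
    """Wild cluster bootstrap p-value (Rademacher weights, null imposed)
    for H0: beta[test_col] = 0 in OLS of y on X.

    y: (n,) outcome; X: (n, k) regressors incl. constant (numpy arrays);
    cluster: (n,) cluster ids; test_col: integer column index of the tested coefficient.
    """
    rng = np.random.default_rng(seed)
    clusters = np.unique(cluster)

    def cluster_t(y_):
        fit = sm.OLS(y_, X).fit(cov_type="cluster", cov_kwds={"groups": cluster})
        return fit.tvalues[test_col]

    # Observed t-statistic with cluster-robust SEs.
    t_obs = cluster_t(y)

    # Restricted fit: drop the tested regressor (null imposed) and keep its
    # fitted values and residuals to regenerate outcomes in each bootstrap draw.
    fit_r = sm.OLS(y, np.delete(X, test_col, axis=1)).fit()
    y_hat_r, resid_r = fit_r.fittedvalues, fit_r.resid

    # Bootstrap: flip the sign of each *cluster's* residuals with prob 1/2,
    # rebuild y, and recompute the full-model t-statistic.
    t_boot = np.empty(n_boot)
    for b in range(n_boot):
        flips = rng.choice([-1.0, 1.0], size=len(clusters))
        w = flips[np.searchsorted(clusters, cluster)]
        t_boot[b] = cluster_t(y_hat_r + w * resid_r)

    # Symmetric p-value: how extreme is |t_obs| in the bootstrap distribution?
    return (np.abs(t_boot) >= np.abs(t_obs)).mean()
```

The key move is that residuals get their signs flipped at the cluster level rather than observation by observation, which preserves the within-cluster dependence that standard asymptotics handle poorly when clusters, or treated clusters, are few.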
In summary, Yuji and David take the DiD estimators many of us already use and ask a simple question: do they behave when treated units are few? In most cases, the answer is no. Inference breaks down, p-values overstate significance, and standard errors underperform. But not all hope is lost: pairing imputation-based estimators with a wild bootstrap comes closest to nominal rejection rates, even with just 5 treated units. If you’re working with small samples or rare treatment, this paper is a practical guide to doing inference that actually holds up.