Where we are at

DiD has come a long way since Card and Krueger's minimum wage paper. What started as a simple comparison between two groups over two time periods has evolved into one of the most sophisticated and widespread tools for causal inference, used across economics, political science, public health, and other social sciences. Its intuitive appeal and flexibility have made it a go-to method for researchers studying everything from policy evaluation to corporate governance, from epidemiology to education.
Just to go over a few examples of the method's versatility: Stevenson and Wolfers' (2006) analysis of divorce law reforms revealed that the adoption of unilateral divorce laws significantly reduced domestic violence and suicide rates. In public policy, Hendren and Sprung-Keyser's (2020) unified welfare analysis employed DiD to measure the effectiveness of various social programs, concluding that many interventions deliver considerable welfare gains relative to their cost. Labor market dynamics have also been explored using DiD: Autor, Dorn, and Hanson's (2016) research on import competition from China exposed the severe disruption caused by trade shocks, including heightened unemployment and wage declines in local labor markets. In public health, Soni (2020) applied DiD to evaluate the effects of Medicaid expansion under the Affordable Care Act, finding that over a five-year period, the expansion increased preventive care utilization and reduced risky behaviors like heavy drinking among low-income people.
Just five years ago, we thought we had it all figured out. Run your two-way fixed effects regression, plot your event study, check for parallel pre-trends, and you're good to go. Remember when we thought two-way fixed effects were the answer to everything? Good times. But then something interesting happened: we realized these standard approaches could give us wrong answers in some very common scenarios. Like when different groups get treated at different times. Or when treatment effects aren't the same for everyone.
Now in 2025, we're in a very different place. We have better tools, sharper insights into what can go wrong, and clearer solutions for many (though not all) of the problems that plagued earlier DiD work. This post is an attempt to summarize where we are, highlighting the key developments that got us here and the challenges that might be ahead.
The first major “crack” in the DiD foundation appeared with Goodman-Bacon's (2021) decomposition paper. It revealed something unsettling: in settings with staggered treatment timing (where different units get treated at different times), the traditional two-way fixed effects (TWFE) estimator is actually a weighted average of all possible 2x2 DiD comparisons. The problem? Some of these comparisons use already-treated units as controls, effectively comparing "late" versus "early" treated units. To make matters worse, the weights can be negative (oh no I said the forbidden word), potentially leading to estimates with the wrong sign.
Let's break this down with an example. Imagine we're studying a policy that different states (you know we had to start with an American setting) adopt in different years. State A adopts in 2015, State B in 2016, and State C never adopts. The traditional two-way fixed effects regression would implicitly use:
State A vs State C (comparing treated to never-treated)
State B vs State C (comparing treated to never-treated)
State B vs State A (comparing late-treated to early-treated)
That last comparison is problematic because it uses State A, which is already being treated, as a control group. If the treatment effect grows over time, this comparison could severely *underestimate* the true effect.
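To see the decomposition in action, here is a minimal sketch using the bacondecomp R package on a simulated version of this example (the data-generating process and variable names are invented for illustration):

```r
library(bacondecomp)  # implements the Goodman-Bacon (2021) decomposition
library(fixest)

set.seed(42)
# Simulated panel in the spirit of the example: units adopt in 2015,
# in 2016, or never, and the effect grows by 1 each year after adoption
df <- expand.grid(id = 1:30, year = 2012:2019)
df$adopt  <- c(2015, 2016, Inf)[(df$id %% 3) + 1]  # staggered adoption
df$treat  <- as.numeric(df$year >= df$adopt)
df$effect <- ifelse(df$treat == 1, df$year - df$adopt + 1, 0)
df$y <- 0.1 * df$id + 0.2 * df$year + df$effect + rnorm(nrow(df), sd = 0.1)

# TWFE pools all 2x2 comparisons, including the late-vs-early ones
feols(y ~ treat | id + year, data = df)

# The decomposition lists every 2x2 comparison and its implicit weight
bacon(y ~ treat, data = df, id_var = "id", time_var = "year")
```

The "Later vs Earlier Treated" rows in the output are exactly the problematic comparisons: with a growing treatment effect, they pull the pooled TWFE coefficient away from the true average effect.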
Almost simultaneously, several researchers pointed out another issue: treatment effect heterogeneity. When the impact of a policy varies across units or over time (which, let's face it, is probably the norm rather than the exception), the TWFE estimator might not recover any sensible average treatment effect. De Chaisemartin and D'Haultfœuille (2020) showed that in extreme cases, the estimate could be negative even when the treatment effect is positive for every single unit :O
Think about a minimum wage study where the policy's impact varies by local economic conditions. Cities with strong economies might see minimal employment effects, while economically struggling areas might experience larger impacts. The traditional DiD estimator would weight these effects in ways that don't align with any clear policy-relevant parameter.
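A tiny simulation makes the sign problem concrete. In this deliberately extreme, noise-free setup (my own numbers, chosen purely for illustration), every treated observation has a strictly positive effect, yet TWFE returns a negative coefficient:

```r
library(fixest)

# Early group treated from t = 2 with an effect that grows from 1 to 4;
# late group treated from t = 3 with a constant effect of 1
df <- expand.grid(id = 1:20, t = 1:3)
df$group <- ifelse(df$id <= 10, "early", "late")
df$treat <- ifelse(df$group == "early", df$t >= 2, df$t >= 3) * 1
df$y     <- ifelse(df$group == "early",
                   c(0, 1, 4)[df$t],  # effects: +1 at t = 2, +4 at t = 3
                   c(0, 0, 1)[df$t])  # effect:  +1 at t = 3

# All true effects are positive, yet the TWFE coefficient is -0.5
feols(y ~ treat | id + t, data = df)
```

What drives the flip is the late-vs-early comparison: the early group's still-growing effect is swept into the "control" trend, dragging the pooled coefficient below zero.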
These revelations sparked a wave of methodological innovations, and three main approaches emerged:
Callaway and Sant'Anna (2021) proposed focusing on group-time treatment effects - the effect of treatment for units treated at a specific time, measured at each possible time after treatment. Their approach is particularly transparent about the comparison groups being used and allows for flexible treatment effect dynamics.
Sun and Abraham (2021) developed an interaction-weighted estimator that properly accounts for treatment effect heterogeneity in event study designs. Their method is especially useful when researchers want to examine how effects evolve over time relative to treatment.
Borusyak, Jaravel, and Spiess (2024) introduced an imputation approach that estimates untreated potential outcomes and can accommodate continuous treatments and multiple treatment variables.
These methods complement each other: Callaway and Sant'Anna's approach handles varied treatment timing with interpretable results, Sun and Abraham's method brings computational efficiency to event studies, and Borusyak et al.'s approach excels with high-dimensional fixed effects, particularly when treatment effects are homogeneous.
At this point you might be asking yourself: so which method should I choose?
To give a classic economist response: the choice largely depends on your setting and research goals. If you're interested in the average treatment effect on the treated (ATT) and have a classic staggered adoption design, Callaway and Sant'Anna's method is a solid choice. It's particularly useful when you suspect treatment effects might vary by group or time and want to examine this heterogeneity directly. Their did package in R (version 2.1.0) makes implementation straightforward.
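A minimal call looks something like this (the panel df and its column names are placeholders for your own data; gname expects the period in which a unit is first treated, coded 0 for never-treated units):

```r
library(did)  # Callaway & Sant'Anna (2021)

# Group-time ATTs, using never-treated units as the comparison group
atts <- att_gt(
  yname  = "outcome",     # outcome variable
  tname  = "year",        # time variable
  idname = "id",          # unit identifier
  gname  = "first_treat", # first treatment period (0 = never treated)
  data   = df,
  control_group = "nevertreated"
)
summary(atts)

# Aggregate the group-time effects into an event-study profile and plot it
es <- aggte(atts, type = "dynamic")
ggdid(es)
```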
Sun and Abraham's method is a nice choice if you're running an event study and want to understand treatment dynamics (effects over time). It's especially good at handling settings where early and late adopters might have different treatment effects. Their interaction-weighted estimator is now implemented in Stata's eventstudyinteract (version 1.0.0) command.
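If you work in R instead, the same interaction-weighted estimator is implemented by the sunab() function in the fixest package. A sketch, reusing the placeholder panel from above (fixest's convention is that never-treated units carry a first_treat value outside the sample's period range, e.g. 10000):

```r
library(fixest)

# Interaction-weighted event-study estimates (Sun & Abraham 2021)
res <- feols(outcome ~ sunab(first_treat, year) | id + year, data = df)

summary(res, agg = "att")  # aggregate post-treatment ATT
iplot(res)                 # event-study plot of the period-level estimates
```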
Borusyak, Jaravel, and Spiess's imputation method comes into its own when you have high-dimensional fixed effects or continuous treatments. It's also computationally efficient with large datasets. Their approach is implemented as did_imputation in both R and Stata (available on GitHub).
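One R implementation is the didimputation package; the call below reuses the same placeholder panel (I'm assuming the package's current interface, so check its documentation):

```r
library(didimputation)  # R implementation of the BJS imputation estimator

# Impute untreated potential outcomes from never/not-yet-treated
# observations, then average the implied treatment effects
bjs <- did_imputation(
  data   = df,
  yname  = "outcome",
  gname  = "first_treat",  # first treatment period (0 = never treated)
  tname  = "year",
  idname = "id"
)
bjs
```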
But since there's no free lunch, you need to put some thought in first. Some of the pitfalls include (but are not restricted to) pre-testing problems, power calculations, and anticipation effects. Many researchers use pre-trend tests to validate their design, but Roth (2022) showed this practice can lead to severe bias and highlighted two key issues. First, if you only proceed with your analysis when pre-trends look parallel, you're implicitly conditioning on a noisy estimate, potentially creating spurious effects. Second, traditional power calculations can be misleading in DiD settings, and many DiD studies are severely underpowered, especially for detecting violations of pre-trends. His DiDpower package helps researchers plan better-powered studies. And what if units react before the official treatment date (which is common in policy settings)? Anticipation like this can invalidate many DiD designs; work by Freyaldenhoven, Hansen, and Shapiro (2019) provides tools for dealing with it.
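To make the pre-testing problem tangible, here is a stylized Monte Carlo in the spirit of Roth (2022) - the three-period design, the linear trend violation, and all the numbers are my own simplification, not his:

```r
set.seed(1)
n_sims <- 10000
delta  <- 0.3   # per-period violation of parallel trends
sd_e   <- 0.2   # sampling noise in each period's treated-control gap
tau    <- 0     # the true treatment effect is zero

beta_pre <- beta_post <- numeric(n_sims)
for (s in 1:n_sims) {
  eta <- rnorm(3, sd = sd_e)          # noise at t = -1, 0, +1
  d   <- delta * c(-1, 0, 1) + eta    # treated-control gap each period
  beta_pre[s]  <- d[1] - d[2]         # event-study coefficient at t = -1
  beta_post[s] <- tau + d[3] - d[2]   # event-study coefficient at t = +1
}
pass <- abs(beta_pre / (sd_e * sqrt(2))) < 1.96  # pre-test "passes"

mean(pass)             # the underpowered pre-test passes most of the time
mean(beta_post)        # unconditional bias: roughly delta
mean(beta_post[pass])  # conditional on passing: the bias gets worse
```

The intuition: the pre- and post-period coefficients share the same reference period, so keeping only the samples where the pre-trend estimate happens to look flat also selects reference-period noise that pushes the post-period estimate further from the truth.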
Recent extensions are making these methods even more useful:
Multiple treatments: de Chaisemartin and D'Haultfœuille (2023) extend DiD methods to settings with multiple treatment variables, addressing issues of treatment effect heterogeneity and contamination from concurrent treatments. Similarly, Roller and Steinberg (2023) further develop these ideas, proposing adjusted DiD strategies that handle simultaneous treatments - particularly useful for studying policy packages or competing interventions.
Distributional effects: Building on this trajectory, Kim and Wooldridge (2024) propose an extension that adapts DiD to estimate Quantile Treatment Effects on the Treated (QTT). Their framework introduces the Distributional Difference-in-Differences (DDID) assumption, which generalizes the traditional parallel trends assumption by focusing on common changes in untreated potential outcome densities across treated and control groups. This approach not only preserves the intuitive appeal of DiD but also extends its utility to settings where distributional impacts, rather than mean effects, are of primary interest. DDID is particularly valuable for researchers studying policies with heterogeneous impacts across the income or health distribution, as it captures distributional shifts that traditional DiD misses.
Continuous treatments: Callaway, Goodman-Bacon, and Sant'Anna (2024) develop methods for settings where treatment intensity varies continuously (like studying the effects of different minimum wage levels rather than just binary minimum wage adoption).
Synthetic controls meet DiD: several papers combine insights from synthetic control methods with DiD, offering more flexible ways to construct comparison groups. Arkhangelsky et al.'s (2021) synthetic difference-in-differences approach is one of them (a minimal sketch follows right after this list).
ML integration: recent work (e.g. Athey, Tibshirani, and Wager, 2019) explores using machine learning methods to relax parallel trends assumptions and handle high-dimensional settings. ML methods could give DiD a boost by helping pick the best control groups or checking if parallel trends hold up for different subgroups. But the big challenge will be effectively integrating these tools while ensuring the results remain clear and interpretable. These approaches promise more flexible DiD estimation, though they're still in early stages (come back in a year).
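As promised above, here is a minimal synthetic DiD sketch using Arkhangelsky et al.'s synthdid R package (column names are placeholders; note that synthdid expects a balanced panel in which all treated units adopt in the same period):

```r
# remotes::install_github("synth-inference/synthdid")
library(synthdid)

# Reshape a long panel into the matrices the estimator expects
setup <- panel.matrices(df, unit = "id", time = "year",
                        outcome = "outcome", treatment = "treated")

# Synthetic DiD estimate: unit and time weights chosen so the
# comparison group's trend matches the treated group pre-treatment
tau_hat <- synthdid_estimate(setup$Y, setup$N0, setup$T0)
print(tau_hat)
plot(tau_hat)  # treated vs. weighted synthetic-control trajectories
```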
So where do we go from here? Despite all this progress, there are still plenty of open questions and challenges ahead.
First, we're still figuring out how to deal with messy real-world settings. What happens when policies overlap or interact with each other? De Chaisemartin and D'Haultfœuille's (2023) work on multiple treatments is a start, but we need more tools for handling complex policy environments. Think about studying the effect of minimum wage when states are simultaneously changing their earned income tax credits or unemployment benefits. Or consider evaluating a health policy when multiple reforms are being rolled out at different times. Time-varying confounders are another headache - imagine studying the impact of environmental regulations when both adoption and outcomes are influenced by changing economic conditions. The current solutions often involve strong assumptions that might not hold in practice.
Then there's the thorny issue of inference with few treated units (we’ve all been there at some point). While we have methods that work well *asymptotically*, many applications involve just a handful of treated units (looking at you, state-level policies). The standard errors (it’s always them) we're calculating might be too optimistic, and randomization inference might not always be the answer. Consider a study of a city-specific policy: when we have just one treated city, how confident can we really be in our uncertainty estimates? Related to this, selection on pre-trends remains a concern: what if treatment timing is related to outcome dynamics? E.g., states might adopt policies in response to worsening conditions, which then creates a relationship between treatment timing and pre-existing trends. The current solutions often involve assuming this selection works in particular ways, but we need more robust approaches.
Some exciting new frontiers are opening up. Network effects are increasingly important - policies often affect untreated units through economic or social connections. Think about evaluating a local business incentive program - firms in nearby, untreated areas might be affected through competition or supply chain links. Current DiD methods typically assume away such spillovers, but new work is trying to incorporate them explicitly. Spatial DiD methods are developing at a fast pace, acknowledging that treatment effects might spread geographically (particularly relevant for studying environmental policies or local economic development programs).
The combination of DiD with other methods is particularly promising. Regression discontinuity designs (RDD) with differential timing could help us understand how effects vary with treatment intensity. Instrumental variables (IV) in a DiD framework might help address endogenous (always a trouble) treatment timing. Some researchers are even exploring synthetic control (SC) methods that incorporate DiD-style assumptions about parallel trends. Each combination brings its challenges, but also new opportunities for credible identification.
On the practical side, we do need better software implementation. Yes, we have R packages and Stata commands for the basic methods, but they often lack features for the latest refinements. Want to incorporate time-varying covariates in a staggered DiD design while allowing for treatment effect heterogeneity? Good luck finding a package that does everything you need. Power analysis tools are improving thanks to Roth et al.'s work, but we need more guidance on designing well-powered studies - especially for detecting heterogeneous effects or testing specific dynamic patterns. And while we're at it, could we get some consensus on how to present results? Some researchers show event study plots, others focus on average treatment effects, and still others present a battery of robustness checks. The field would benefit from standardized reporting practices.
The machine learning frontier looks promising but remains largely unexplored. Current work using random forests and other ML tools to relax parallel trends assumptions is intriguing, but we're still in the early stages. The challenge is maintaining the interpretability and credibility that made DiD attractive in the first place while leveraging these more flexible approaches. Could ML help us detect heterogeneous effects more systematically? Or identify better comparison groups? The potential is there, but so are the risks.
Finally, there's the elephant in the room: what do we do with all those papers using the "old" methods? The profession is still debating how to handle this. Should every paper using two-way fixed effects be revisited? Do we need different standards for different fields or publication dates? Consider a widely cited paper from 2015 using the traditional TWFE approach - if its findings differ substantially when using newer methods, how should we update our understanding of the evidence? These questions don't have easy answers, but they'll become increasingly important as the new methods become standard.
To conclude, the evolution of DiD methods over the past few years has been nothing short of remarkable. We've gone from a simple tool with known limitations to a sophisticated framework that can handle many real-world situations. But with this sophistication come new challenges: researchers need to make more choices, consider more potential pitfalls, and wrestle with more complex estimation decisions.
Looking ahead, the field seems poised for continued innovation. As we tackle more complex policy questions and gain access to richer data, DiD methods will need to keep evolving. The integration with machine learning, the handling of spillovers and network effects, and the development of more robust inference procedures all promise to keep econometricians busy.
But perhaps the biggest challenge isn't technical - it's practical. How do we ensure these methodological advances actually improve empirical work? How do we balance the push for more robust methods with the need for accessibility and ease of use? And how do we maintain the intuitive appeal that made DiD so popular in the first place?
These questions will shape the next chapter in the DiD story. For now, researchers in social and health sciences would do well to stay informed about methodological developments while thinking carefully about which tools best suit their specific applications. After all, the goal isn't methodological sophistication for its own sake - it's credible causal inference in the service of answering important questions. I will try my best to keep you up to date :)
* I would like to thank Prof. Pedro Sant'Anna for his corrections and suggestions. He just added all his DiD materials to his website - I highly recommend checking it out!
* All mistakes are mine, and if you find one, please let me know.
References
Arkhangelsky, D., Athey, S., Hirshberg, D. A., Imbens, G. W., & Wager, S. (2021). Synthetic difference-in-differences. American Economic Review, 111(12), 4088-4118.
Athey, S., Tibshirani, J., & Wager, S. (2019). Generalized random forests. The Annals of Statistics, 47(2), 1148-1178.
Autor, D. H., Dorn, D., & Hanson, G. H. (2016). The China shock: Learning from labor-market adjustment to large changes in trade. Annual Review of Economics, 8(1), 205-240.
Borusyak, K., Jaravel, X., & Spiess, J. (2024). Revisiting event-study designs: robust and efficient estimation. Review of Economic Studies, rdae007.
Callaway, B., Goodman-Bacon, A., & Sant'Anna, P. H. (2024). Difference-in-differences with a continuous treatment (No. w32117). National Bureau of Economic Research.
Callaway, B., & Sant'Anna, P. H. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2), 200-230.
De Chaisemartin, C., & D'Haultfœuille, X. (2020). Two-way fixed effects estimators with heterogeneous treatment effects. American Economic Review, 110(9), 2964-2996.
De Chaisemartin, C., & D'Haultfœuille, X. (2023). Two-way fixed effects and differences-in-differences estimators with several treatments. Journal of Econometrics, 236(2), 105480.
Freyaldenhoven, S., Hansen, C., & Shapiro, J. M. (2019). Pre-event trends in the panel event-study design. American Economic Review, 109(9), 3307-3338.
Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. Journal of Econometrics, 225(2), 254-277.
Hendren, N., & Sprung-Keyser, B. (2020). A unified welfare analysis of government policies. The Quarterly Journal of Economics, 135(3), 1209-1318.
Kim, D., & Wooldridge, J. M. (2024). Difference-in-differences estimator of quantile treatment effect on the treated. Journal of Business & Economic Statistics, 1-12.
Roth, J. (2022). Pretest with caution: Event-study estimates after testing for parallel trends. American Economic Review: Insights, 4(3), 305-322.
Soni, A. (2020). The effects of public health insurance on health behaviors: Evidence from the fifth year of Medicaid expansion. Health Economics, 29(12), 1586-1605.
Stevenson, B., & Wolfers, J. (2006). Bargaining in the shadow of the law: Divorce laws and family distress. The Quarterly Journal of Economics, 121(1), 267-288.
Sun, L., & Abraham, S. (2021). Estimating dynamic treatment effects in event studies with heterogeneous treatment effects. Journal of Econometrics, 225(2), 175-199.
Software
Borusyak, K. (2021). did_imputation: Stata and R packages for the DiD imputation method. Available at: https://github.com/borusyak/did_imputation.
Callaway, B., & Sant'Anna, P. H. C. (2021). did: Treatment Effects with Multiple Periods and Groups (R package version 2.1.0). Comprehensive R Archive Network (CRAN). Available at: https://CRAN.R-project.org/package=did.
Roth, J. (2022). DIDpower: Power calculations for panel and DID designs (R package version 1.2.0). GitHub. Available at: https://github.com/jonathandroth/DIDpower.
Sun, L. (2021). eventstudyinteract: Stata package for event study regression with treatment effect heterogeneity (version 1.0.0). Available at: https://github.com/lsun20/eventstudyinteract.
Further readings
Athey, S., & Imbens, G. W. (2022). Design-based analysis in difference-in-differences settings with staggered adoption. Journal of Econometrics, 226(1), 62-79.
Athey, S., Bayati, M., Doudchenko, N., Imbens, G., & Khosravi, K. (2021). Matrix completion methods for causal panel data models. Journal of the American Statistical Association, 116(536), 1716-1730.
De Chaisemartin, C., & D'Haultfœuille, X. (2024). Difference-in-differences estimators of intertemporal treatment effects. Review of Economics and Statistics, 1-45.
Roth, J., & Sant'Anna, P. H. (2023). When is parallel trends sensitive to functional form?. Econometrica, 91(2), 737-747.
Roth, J., Sant’Anna, P. H., Bilinski, A., & Poe, J. (2023). What’s trending in difference-in-differences? A synthesis of the recent econometrics literature. Journal of Econometrics, 235(2), 2218-2244.