Efficient Estimation, Sequential Synthetic Control DiD, and the TWFE Debate
Some practical insights for more precise and robust estimation.
Hi there! We are back to talking specifically about DiD :) Here are the latest papers we will discuss:
Efficient Difference-in-Differences and Event Study Estimators, by Xiaohong Chen, Pedro H. C. Sant’Anna, and Haitian Xie
Sequential Synthetic Difference in Differences, by Dmitry Arkhangelsky and Aleksei Samkov
When Can We Use Two-Way Fixed-Effects (TWFE): A Comparison of TWFE and Novel Dynamic Difference-in-Differences Estimators, by Tobias Rüttenauer and Ozan Aksoy
Efficient Difference-in-Differences and Event Study Estimators
(After all, who never heard someone saying “this DiD should’ve been an event study” at a seminar…)
TL;DR: modern DiD estimators handle heterogeneity but often ignore efficiency. This paper builds estimators that hit the semiparametric efficiency bound under standard assumptions, with no functional form restrictions, no extra modelling, just better use of the data. Confidence intervals then shrink, power goes up, and all the pieces are there for anyone working with short panels and staggered treatment.
What is this paper about?
This paper is about improving the precision of DiD and Event Study estimators. Modern methods account for treatment effect heterogeneity and staggered adoption, but they tend to handle pre-treatment periods and control groups in ad hoc ways. Many estimators either drop baseline periods or assign equal weights to them, even though these choices are rarely supported by theory or data.
The authors offer a formal solution [1]: they build a framework for efficient estimation of DiD and event-study (ES) parameters that works with short panels, staggered treatment timing, and standard identification assumptions like parallel trends (PT) and no anticipation. They show how to characterise these assumptions as conditional moment restrictions and use that structure to derive semiparametric efficiency bounds.
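To make the "conditional moment restriction" framing concrete, one standard way to write covariate-conditional parallel trends for a cohort first treated at time g, using the never-treated group as the comparison (my notation, which may differ from the paper's), is

$$
\mathbb{E}\big[\,Y_t(\infty) - Y_{t-1}(\infty) \,\big|\, X,\ G = g\,\big]
\;=\;
\mathbb{E}\big[\,Y_t(\infty) - Y_{t-1}(\infty) \,\big|\, X,\ G = \infty\,\big],
$$

where $Y_t(\infty)$ is the untreated potential outcome and $G=\infty$ labels never-treated units. With several periods and comparison groups, this restriction holds for many (t, g) pairs at once, which is the overidentification the efficient estimators exploit.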
The main insight is that some pre-treatment periods and control groups carry more information than others. By using optimal weights based on the conditional covariance of outcome changes, the proposed estimators deliver tighter confidence intervals and smaller root mean squared error with no added assumptions and no loss in consistency, which then gives us a principled way to use all available data more effectively.
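To build intuition for why covariance-based weights help, here is a toy sketch (my own illustration, not the paper's estimator): given several unbiased DiD contrasts for the same target, say one per available baseline period, the minimum-variance combination weights them by the inverse of their covariance matrix rather than equally or by keeping only the last pre-period.

```python
import numpy as np

def optimal_combination(estimates, cov):
    """
    Minimum-variance linear combination of K unbiased estimates of the same
    target (e.g. one DiD contrast per pre-treatment baseline period).
    Weights are proportional to Sigma^{-1} 1 and sum to one; equal weights or
    "use only the last pre-period" are special cases, generally less efficient.
    """
    estimates = np.asarray(estimates, dtype=float)
    cov = np.asarray(cov, dtype=float)
    w = np.linalg.solve(cov, np.ones_like(estimates))
    w /= w.sum()                                  # optimal weights (can be negative)
    return w @ estimates, w, w @ cov @ w          # point estimate, weights, variance

# toy example: three baseline periods, the first one noisier than the others
est = [2.1, 1.9, 2.3]
cov = [[4.0, 1.0, 0.0],
       [1.0, 1.5, 0.2],
       [0.0, 0.2, 1.0]]
att_hat, weights, variance = optimal_combination(est, cov)
```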
What do the authors do?
This paper is dense, so we’ll go section by section (each answers one question).
After the Introduction, section 2 (Framework, causal parameters, and estimands) answers “what are we estimating?”. The authors formalise the setup: short panels, staggered treatment, absorbing states, and potential outcomes indexed by treatment timing. They define the causal parameters of interest (group-time ATTs and event-study summaries) and lay out standard identification assumptions such as random sampling, overlap, no anticipation, and PT. They allow for both post-treatment-only and full-period PT, with or without covariates.
In section 3 (Semiparametric Efficiency Bound for DiD and ES), they answer “what’s the best we can hope for?”. This is the theoretical core of the paper. They derive semiparametric efficiency bounds for both ATT and ES estimators under the two types of PT. These bounds represent the lowest possible asymptotic variance for any estimator that satisfies the DiD identification conditions. The result is that most DiD estimators are inefficient because they either ignore or misweight pre-treatment periods and comparison groups. They also provide closed-form expressions for the efficient influence function (EIF). These formulas make clear how to combine periods and groups using weights based on the conditional covariance structure of outcome changes.
In section 4 (Semiparametric efficient estimation and inference), they translate the theory into practice by proposing two-step estimators that use plug-in methods for nuisance components (like outcome regressions and group probabilities) and then apply the EIF weights to get point estimates. They show that this approach can be implemented using either flexible nonparametric methods (like forests or splines) or simple parametric working models. They also explain how to stabilise the procedure by estimating ratios of group probabilities directly.
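To show what "two-step" means operationally, here is a rough sketch of the workflow using the familiar doubly robust DiD estimator for a single group-time comparison, in the spirit of Sant'Anna and Zhao (2020). It illustrates the generic pattern (fit nuisance functions, then apply influence-function-style weights), not the paper's efficient estimator.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def dr_did_att(delta_y, d, X):
    """
    Two-step doubly robust DiD for one treated-vs-comparison contrast.
    delta_y : (n,) outcome change between the two periods being compared
    d       : (n,) 1 if the unit belongs to the treated group, 0 otherwise
    X       : (n, k) covariates
    """
    delta_y, d, X = map(np.asarray, (delta_y, d, X))
    # Step 1: plug-in nuisance estimates (any flexible learner could be used here)
    ps = LogisticRegression(max_iter=1000).fit(X, d).predict_proba(X)[:, 1]  # P(D=1 | X)
    m0 = LinearRegression().fit(X[d == 0], delta_y[d == 0]).predict(X)       # E[dY | D=0, X]
    # Step 2: combine with influence-function-style weights
    w_treat = d / d.mean()
    w_ctrl = ps * (1 - d) / (1 - ps)
    w_ctrl = w_ctrl / w_ctrl.mean()
    return np.mean((w_treat - w_ctrl) * (delta_y - m0))
```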
In section 5 (Monte Carlo simulations), they calibrate simulations using CPS and Compustat data and compare their efficient estimators to popular alternatives (TWFE, Callaway-Sant’Anna, Sun-Abraham, Borusyak et al.). Across designs, the efficient estimators reduce root-mean-squared-error and confidence interval width (often by more than 40%) without increasing bias.
In section 6 (Empirical Illustration) they reanalyse data from Dobkin et al. (2018) on the effect of hospitalisation on out-of-pocket medical spending. Using the publicly available Health and Retirement Study (HRS), they estimate treatment effects based on variation in the timing of hospitalisation. Their efficient estimators produce tighter confidence intervals than existing methods. To match the same level of precision with standard DiD approaches, researchers would need about 30% more data. Their analysis also showcases a fascinating result of the method: the ability to use a post-treatment period as a baseline by “bridging” through the never-treated group. They also use a form of specification curve analysis to visually assess the stability of the estimates across all valid modelling choices, thus providing evidence for the plausibility of the parallel trends assumption (PTA).
They conclude by providing practical advice and future directions. They suggest that we should test whether some pre-treatment periods or control groups are less informative or potentially invalid. To support this, they introduce Hausman-type overidentification tests and offer visual tools to assess sensitivity to different weighting schemes. They also point to extensions: nonlinear DiD models, switching treatments, unbalanced panels, and setups without PT. The framework is designed to be flexible, but it’s built for the standard large‑n, fixed‑T world. When that changes, so should the estimator.
Why is this important?
I love everything about efficiency, but sometimes that’s too much to ask. Most DiD papers focus on identification and stay there. Estimators are consistent, robust to heterogeneity, and safe, but they leave a lot of precision on the table. In a refreshing way, this paper argues that we can have both: credible identification and minimal variance. By working directly with the structure implied by the PTA, the authors show how to build estimators that are semiparametrically efficient. No extra assumptions, no black boxes, no functional form restrictions, and no need for large samples, just better use of the data we already have. One of the key insights of this paper is that DiD setups are nonparametrically overidentified (which means there’s more information available than most estimators use). That’s what makes these efficiency gains possible. I also have to note that while negative weights in some DiD estimators can be problematic, here they are not a concern: they arise naturally from the covariance structure and are a feature of optimal estimation, not a bug. Efficiency matters here because it gives us tighter confidence intervals, more stable estimates, and more power by using the design more intelligently. The fact that the estimators are Neyman orthogonal adds to their credibility, as it makes them less sensitive to misspecification of the preliminary estimation steps.
Who should care?
Anyone working with DiD or ES designs using short panels, staggered treatment or small samples. If you’ve ever looked at your standard errors and thought “this feels bigger than it should be”, you should read this one carefully. It’s also useful for methodologists and software developers who want to understand what makes an estimator efficient, and how to design tools that make full use of available data without relying on strong assumptions. Even if you don’t implement the estimators right away, this paper gives you a benchmark: what is the best you could be doing, given your design?
Do we have code?
Professor Pedro said “we will work on packaging everything so you can adopt all this seamlessly”. The estimators are presented in closed form and the implementation relies on standard tools (regression adjustments, propensity scores and conditional covariances), so while there’s no package, the ingredients are all there for anyone comfortable with two-step semiparametric estimation (not sure how many of us to be honest). The simulation and empirical application sections hint at how to put the pieces together and the estimators can be implemented in R or Python with off-the-shelf tools.
In summary, this paper takes the designs we already trust and shows how to push them further. The authors prove that standard DiD setups contain more information than we usually use, and they show how to recover it with estimators that are efficient, transparent, and grounded in the assumptions we already make.
Sequential Synthetic Difference-in-Differences
TL;DR: this paper introduces the Sequential Synthetic Difference-in-Differences (Sequential SDiD) estimator, a new method for event studies with staggered treatment adoption (particularly useful when you doubt the PTA). It works by applying synthetic control (SC) principles sequentially to cohort-aggregated data; it estimates effects for early-adopting groups and uses those results to impute counterfactuals for later-adopting groups. The authors’ key theoretical result is that this estimator is asymptotically equivalent to an ideal but infeasible “oracle” OLS estimator that has access to the unobserved factors driving the violation of PT.
What is this paper about?
The rise of synthetic control methods, starting with Abadie and Gardeazabal (2003) and Abadie, Diamond, and Hainmueller (2010) [2], offered applied researchers a new way to construct counterfactuals when PT are unlikely to hold. In 2021, Arkhangelsky, Athey, Hirshberg, Imbens, and Wager gave us the combination of DiD and SC (combining the structure of SC with the intuition of DiD) in their seminal paper, and now, in this one, Arkhangelsky and Samkov extend that framework to settings with staggered treatment adoption (the setting behind many modern policy evaluations).
They propose a method, Sequential Synthetic DiD, that builds counterfactuals *cohort by cohort*, using the information from early adopters to inform the estimation for later ones. It operates on cohort-aggregated data, estimating treatment effects one horizon at a time, and updating the data after each step to reflect imputed counterfactuals. This structure helps correct for violations of PT caused by unobserved time-varying confounders while preserving the transparency we value in DiD.
A key theoretical result is that this estimator behaves like an “oracle” (a benchmark) OLS regression that “knows” the unobserved interactive fixed effects. This connection provides both valid inference and formal efficiency guarantees. The method handles staggered timing, treatment effect heterogeneity, and violations of PT, and it includes standard DiD and recent imputation estimators (like Borusyak et al. 2024) as special cases, depending on how the weights are chosen.
What do the authors do?
They introduce a new estimator, Sequential Synthetic Difference-in-Differences (Sequential SDiD), designed for event studies with staggered treatment adoption, where PT may fail due to unobserved time-varying confounders. The method builds on Synthetic DiD (Arkhangelsky et al., 2021) [3], but adapts it for aggregated cohort-level data and implements it sequentially. They formalise causality by interpreting the observed outcomes using potential outcomes (Neyman, 1990; Rubin, 1974; Imbens and Rubin, 2015).
For each adoption cohort, they estimate treatment effects at each horizon and update the observed outcomes with imputed counterfactuals. These updated values are then used in the estimation for later cohorts. This recursive structure is the paper’s core innovation: it reduces the risk of bias accumulation by preventing treated observations from contaminating the control pool in subsequent steps.
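A heavily simplified sketch of that recursive structure, with plain DiD-style imputation standing in for the paper's synthetic-control weighting and with unit-level data instead of cohort aggregates (so a toy illustration of the loop, not the actual algorithm):

```python
import numpy as np

def sequential_imputation_sketch(Y, cohort, periods):
    """
    Toy cohort-by-cohort imputation loop.
    Y       : (N, T) outcome matrix
    cohort  : (N,) adoption period of each unit, np.inf if never treated
    periods : (T,) calendar period of each column of Y
    """
    Y = Y.astype(float).copy()
    effects = {}
    for g in sorted(set(cohort[np.isfinite(cohort)])):        # earliest adopters first
        treated = cohort == g
        donors = np.isinf(cohort) | (cohort < g)               # never-treated plus earlier cohorts,
                                                               # whose treated cells were already imputed
        pre, post = periods < g, periods >= g
        # DiD-style imputation: own pre-period mean + donors' average change from pre to t
        donor_pre = Y[donors][:, pre].mean(axis=1)
        trend = (Y[donors][:, post] - donor_pre[:, None]).mean(axis=0)
        baseline = Y[treated][:, pre].mean(axis=1, keepdims=True)
        Y0_hat = baseline + trend                              # imputed untreated outcomes
        effects[g] = (Y[treated][:, post] - Y0_hat).mean(axis=0)   # per-horizon effects for cohort g
        # replace treated cells with the imputed counterfactuals so this cohort
        # can serve in the donor pool when later cohorts are processed
        Y[np.ix_(treated, post)] = Y0_hat
    return effects
```

In the paper, the imputation step uses SDiD-type weights on cohort-aggregated data, but the bookkeeping is the same: estimate, impute, update, move on to the next cohort.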
Also, to make the method directly useful for applied work (you might have asked yourself by now “but how do I handle control variables?”), the authors provide significant practical guidance on how to incorporate covariates (in Section 4) [4]. They focus on time-invariant characteristics and outline three different strategies: full stratification, a recommended hybrid model that allows for direct application of their algorithm, and a specific approach for cases with group-level treatment assignment.
They formalise this estimator by connecting it to a theoretical benchmark: an “oracle” OLS regression that has access to the true unobserved interactive fixed effects. This oracle represents an idealised model of what applied researchers try to do when including unit-specific trends or unobserved factor structures. The authors then prove that their estimator is asymptotically equivalent to this oracle under “mild conditions” [5]. Their results provide strong theoretical backing, which includes valid inference, asymptotic normality, and the first formal efficiency guarantees for an SC-type method.
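For reference, the interactive fixed-effects structure that the oracle is allowed to "see" is usually written along these lines (standard notation, not taken from the paper):

$$
Y_{it} \;=\; \alpha_i + \lambda_t + \gamma_i' f_t + \tau_{it} D_{it} + \varepsilon_{it},
$$

where the $f_t$ are unobserved common factors with unit-specific loadings $\gamma_i$. Standard PT corresponds to the case where the $\gamma_i' f_t$ term does not differ systematically between treated and comparison units; the oracle OLS gets to treat $f_t$ as known regressors, which is exactly what a feasible estimator cannot do.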
A key intermediate result of the paper, presented in Proposition 3.1, is that the infeasible oracle OLS estimator can be represented and computed by a sequential algorithm (Algorithm 2) that mirrors the main Sequential SDiD estimator. The authors note this is a “result of independent interest” because it reveals the underlying mechanics of OLS in staggered settings and provides “new insight into the mechanics of modern imputation estimators” like Borusyak et al.
Inference is conducted using the Bayesian bootstrap, and the authors also propose a placebo-style validation check by artificially shifting adoption dates backward (a way to assess model fit in the absence of PT).
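Mechanically, the Bayesian bootstrap is simple: instead of resampling units with replacement, you draw continuous random weights for them (Dirichlet, or equivalently normalised exponentials) and re-evaluate the estimator under each draw. A toy sketch, with a weighted mean standing in for the real (weighted) estimator:

```python
import numpy as np

def bayesian_bootstrap(weighted_estimator, n_units, n_draws=999, seed=0):
    """
    Bayesian bootstrap: redraw Dirichlet(1,...,1) unit weights (via normalised
    exponentials) and recompute the estimator under each weight vector.
    `weighted_estimator` is any callable taking an (n_units,) weight vector.
    """
    rng = np.random.default_rng(seed)
    draws = np.array([
        weighted_estimator(w / w.sum())
        for w in rng.exponential(size=(n_draws, n_units))
    ])
    return draws.std(), np.percentile(draws, [2.5, 97.5])

# usage sketch: standard error and 95% interval for a weighted mean
y = np.random.default_rng(1).normal(loc=2.0, size=200)
se, ci = bayesian_bootstrap(lambda w: np.sum(w * y), n_units=len(y))
```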
They demonstrate the method’s performance through both an empirical application (on Community Health Centers, using Bailey and Goodman-Bacon 2015) and two simulation exercises. In settings where standard DiD performs well, Sequential SDiD yields similar results, but in scenarios with unobserved confounding, it remains stable while DiD becomes severely biased.
At the end, they outline two key limitations. The first is that the method requires reasonably large adoption cohorts to ensure that aggregation averages out idiosyncratic noise. The second is that it assumes idiosyncratic errors are independent across units, so that residuals concentrate around zero when averaged. They note that while these assumptions are common in modern DiD applications, they may be restrictive in contexts where individual-level shocks have strong aggregate effects.
Why is this important?
Most modern DiD estimators still depend on some version of the PTA, even those designed for staggered adoption and heterogeneous treatment effects. The PTA means that untreated units can serve as a valid counterfactual for treated ones, which often doesn’t hold up when treatment is timed by factors we cannot observe or measure. In practice, we try to deal with this by adding unit-specific trends or other proxies for unobservables, but these adjustments only go so far.
Sequential SDiD offers an alternative for us because it addresses violations of PT by working with aggregated cohort data and estimating effects sequentially, using SC-style weighting to build credible counterfactuals one step at a time. This setup is way more robust when unobserved time-varying confounders drive both outcomes and treatment timing, which is exactly the kind of threat that undermines “traditional” DiD.
Their method is also grounded in familiar econometric foundations. The authors work in an asymptotic regime with many units and a fixed number of time periods, connecting their framework to the classic moment-based panel data literature (like Chamberlain, 1984). They operate in a low-noise environment, similar to the conditions studied in the SC literature (by averaging within large cohorts), which makes the method both practically feasible and theoretically sound.
Sequential SDiD also connects back to simpler methods. If the regularisation is set very high, the weights collapse to uniform ones, and the estimator reduces to a version of imputation-based DiD, closely related to Borusyak et al. (2024). This makes it easy to interpret and benchmark against more familiar designs.
Finally, the paper fills a gap in the SC literature by establishing formal efficiency guarantees [6]. The authors prove that Sequential SDiD is consistent and unbiased, and that it achieves first-order efficiency, by showing it mimics an oracle OLS estimator that knows the unobserved factors. This is the first time an SC-type method has been shown to reach this level of statistical performance. It offers us a principled and robust estimator for settings with staggered adoption and unobserved confounding.
Who should care?
Anyone working with event studies, staggered adoption or policy evaluations where PT are unreliable should pay attention to this paper. If your treatment timing is potentially related to unobserved factors or if you’re worried that units are changing at different underlying rates, Sequential SDiD gives you a way to build more credible counterfactuals. This is useful in applications like health, labour, and education policy where adoption often happens gradually and for reasons we can’t fully observe.
Do we have code?
No, but the paper includes full algorithmic descriptions and detailed pseudocode for implementation. Algorithm 1 describes the Sequential SDiD procedure and Algorithm 2 outlines the oracle OLS estimator used for benchmarking. You can implement the structure they outline in your preferred programming language.
In summary, this paper extends Synthetic DiD to settings with staggered treatment, unobserved time-varying confounding and cohort-level aggregation. The proposed Sequential SDiD estimator builds counterfactuals one step at a time, drawing on the structure of synthetic control and the logic of DiD. The authors prove that it behaves like an oracle OLS estimator, offering consistency, valid inference, and formal efficiency guarantees.
When Can We Use Two-Way Fixed-Effects (TWFE): A Comparison of TWFE and Novel Dynamic Difference-in-Differences Estimators
(Here we are again, discussing heterogeneous treatment effects…)

TL;DR: this paper is about the ongoing debate around using TWFE in staggered treatment settings, where units receive treatment at different times. The authors explain how TWFE can become biased when treatment effects vary across time or groups. They compare TWFE to five newer DiD estimators using Monte Carlo simulations, testing each one under a range of scenarios, including violations of common assumptions.
What is this paper about?
This paper walks us through the growing debate over using TWFE estimators in staggered treatment settings (cases where some units are treated earlier than others). The authors explain how TWFE can be biased if treatment effects vary over time or across groups. They then compare TWFE to five alternative DiD estimators using Monte Carlo simulations and show how each performs under different scenarios, including violations of key assumptions.
What do the authors do?
They say they have three main goals in this research. First, they want to make the recent staggered DiD literature more accessible to social scientists. They start by explaining the traditional TWFE estimator and the problems that recent research has identified with it [7]. Then they walk through the new alternative estimators in a way that’s easier to understand [8]. Second, they test how these different estimators perform using real panel data [9]. They run Monte Carlo simulations with a large sample size and a staggered design where the treatment is spread out over the entire time period, which differs from macro studies where treatment only happens in a few specific periods. In their simulations, they gradually add “realistic” complications that deviate from the perfect conditions these estimators were designed for. This lets them see how well each estimator holds up when things aren’t ideal, which gives practical insights for researchers. Third, they provide recommendations for best practices when analysing this type of data.
Why is this important?
TWFE is common practice in applied work. In simple 2x2 settings, it works as designed, but once treatment is staggered across units, it becomes unclear what TWFE is estimating. Instead of a single contrast, it averages over many group comparisons, and some of those comparisons involve already-treated units acting as controls, which introduces bias. This matters because many real-world treatment effects are not static. In individual-level panel data, for example, effects often build gradually and then fade out. That inverted-U shape is common in studies of life events, shocks and policy reforms. If you only include a single treatment dummy, you are misrepresenting the data-generating process, and TWFE becomes biased as a result.
This is one of the reasons why we have so many new DiD methods, but the authors point out that the backlash against TWFE has gone a bit too far. There is nothing wrong with using it as long as you model treatment effect heterogeneity in the right way. Event-time indicators are one simple fix, and while the new estimators are designed to handle dynamic effects, they still rely on assumptions of their own, most notably PT [10]. When that assumption fails, all estimators perform poorly (and sometimes worse than TWFE).
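Concretely, the contrast is between the static specification with a single treatment dummy and an event-time version, which can be written as (standard notation, not copied from the paper):

$$
Y_{it} = \alpha_i + \lambda_t + \tau D_{it} + \varepsilon_{it}
\qquad \text{vs.} \qquad
Y_{it} = \alpha_i + \lambda_t + \sum_{k \neq -1} \tau_k \, \mathbf{1}\{t - G_i = k\} + \varepsilon_{it},
$$

where $G_i$ is unit $i$'s adoption period and $k = -1$ is the omitted reference period. The second specification lets the effect vary with time since treatment instead of forcing a single constant $\tau$.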
Who should care?
Anyone doing DiD with more than two time periods. If your treatment is staggered, your effects are dynamic, or your units might anticipate treatment, this paper is super relevant. The audience includes applied economists, sociologists and political scientists working with panel data, particularly those deciding between TWFE and one of the newer DiD estimators.
Do we have code?
The authors say at the end that “a replication package with the simulation and analysis code is available on the author’s Github repository [the link will be added]”. I will update this post when it’s released.
In summary, this paper helps bring clarity to a debate that has confused a lot of applied researchers. TWFE is not inherently flawed but it needs to be specified correctly. If treatment effects vary over time, a single treatment dummy won’t cut it. Switch to an event-time specification or use one of the newer estimators. But remember: all of them rely on assumptions like PT and no anticipation. Each has strengths and weaknesses, and the right choice depends on what you’re most worried about in your data.
[1] “We provide the first semiparametric efficiency bounds for DiD and ES estimators in settings with multiple periods and varying forms of the parallel trends assumption, including covariate-conditional and staggered adoption designs”.
[2] Professor Abadie is generally credited as the primary architect of the SC approach. The method creates a “synthetic” version of the treated unit by taking a weighted combination of control units that best matches the pre-treatment characteristics of the treated unit, which makes it possible to estimate causal effects in comparative case studies where traditional experimental methods aren’t feasible. If I’m not mistaken and didn’t miss anything, I think the order is: Abadie and Gardeazabal’s 2003 “The Economic Costs of Conflict: A Case Study of the Basque Country”; Abadie, Diamond and Hainmueller’s 2010 “Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program” and 2015 “Comparative Politics and the Synthetic Control Method”; and finally Abadie’s 2021 “Using Synthetic Controls: Feasibility, Data Requirements, and Methodological Aspects”.
[3] A key difference from the original SDiD estimator is that the weights for the synthetic controls are not constrained to be non-negative.
[4] The authors say that time-varying covariates can present theoretical challenges by acting as additional treatment variables and are thus beyond the scope of the analysis of their paper. They propose three distinct strategies for applied researchers. The first, full stratification, involves running the analysis separately for each covariate stratum, though the authors advise that this is often impractical as it may result in subsamples that are too small. Their recommended strategy is a practical hybrid model where additive fixed effects can depend on the covariate but the multiplicative factors are assumed to be common across all units. This allows us to aggregate the data in a way that makes it directly compatible with the main Sequential SDiD algorithm. The final strategy is designed for the specific case where treatment adoption is a deterministic function of the covariates (e.g., a policy assigned at the state level). In this setting, the authors suggest it is more natural to average the data within the groups defined by the covariate rather than by adoption cohort.
[5] This means that the theoretical results are shown to hold even in a “weak factor” setting, where the identifying variation from the interactive fixed effects can be small and vanish asymptotically, as long as it does so more slowly than the statistical noise. The authors note that this is especially appealing for empiricists.
[6] The paper frames its results as an answer to a “long-standing critique of SC methods”. The critique questions why a researcher should rely on SC weights instead of directly estimating the underlying factor model that motivates the method. The paper’s theoretical equivalence result (Theorem 3.1) resolves this tension by formally showing that, in large samples, their SC-based method is the same as an oracle that does use the unobserved factors directly. It clarifies that “the choice is not between balancing and direct estimation, but rather how to feasibly approximate the same ideal oracle benchmark”.
[7] When treatment is staggered over time (some units get treated earlier than others), TWFE doesn’t estimate a clean average treatment effect; it’s “just” a weighted mix of many 2×2 DiD comparisons. Some of these comparisons are valid (like early-treated vs never-treated), but others are “forbidden” (late-treated units compared to already-treated ones), and these can bias results when effects evolve over time, even if they are identical across groups. The authors argue that the issue is not that TWFE is inherently broken per se, but that using a single dummy (0 = untreated, 1 = treated) assumes effects are the same everywhere and constant over time. We all know that’s rarely the case because real-world treatment effects fade in, fade out, or vary across groups. When we ignore this, TWFE gets it wrong.
[8] Callaway and Sant’Anna (2021) breaks treatment effects into clean group-by-time contrasts and avoids problematic comparisons; Sun and Abraham (2021) is a regression-based version of that same logic; Borusyak et al. (2024) uses untreated and not-yet-treated units to impute what the treated group would have looked like without treatment; Matrix Completion (Athey et al. 2021) is an ML-based imputation method that fills in the untreated matrix using patterns over units and time; and Wooldridge’s Extended TWFE (2021) is a more flexible version of TWFE that interacts treatment with time and group indicators. On a funny note, I started my PhD in 2020 and all of this was published in the subsequent years. I still haven’t finished my PhD, so who knows what will happen in a year from now.
[9] They simulate 6 realistic scenarios that vary along 4 dimensions: how treatment effects evolve (step, trend-break, inverted-U), whether effects differ across groups, whether units anticipate treatment, and whether treated and untreated units follow parallel trends. For each setup, they look at two things: bias in the overall ATT estimate and bias in time-specific effects. TWFE works fine when effects are static and homogeneous, but once you introduce heterogeneity it breaks (unless you use event-time indicators). Even then, late-period bias can creep in. The main finding is quite an intuitive one: no estimator is perfect; some are better with anticipation (Borusyak, Matrix Completion, ETWFE), others handle PT violations better (Callaway and Sant’Anna, Sun and Abraham). The right choice? As always, “it depends” (more specifically, it depends on what you’re most worried about).
[10] “Violations of the parallel trends assumption appear to be more consequential than issues of treatment effect heterogeneity”.