DDD Estimators, Distributional Effects, Causal Diagrams, and Cross-Field Counterfactuals
New methods and perspectives in causal inference research
Hi there! Hope you’re all doing well :)
Today we have all sorts of studies, and one “treat” at the end :)
Better Understanding Triple Differences Estimators, by Marcelo Ortiz-Villavicencio and Pedro H. C. Sant’Anna
Distributional Difference-in-Differences with Multiple Time Periods, by Andrea Ciaccio
Using Causal Diagrams To Assess Parallel Trends In Difference-In-Differences Studies, by Audrey Renson, Oliver Dukes and Zach Shahn
From What Ifs to Insights: Counterfactuals in Causal Inference vs. Explainable AI, by Galit Shmueli, David Martens, Jaewon Yoo, and Travis Greene
(I apologize if there are any mistakes here - let me know in the comments! I confess I haven’t properly slept in the past 5 days, and I’ve been writing between airports lol terrible idea)
Better Understanding Triple Differences Estimators
(I am so happy to cover this paper! Marcelo is my friend and a PhD student of prof Pedro - who needs no introduction. Here you can check a thread about it in prof Pedro’s words :))
TL;DR: DDD estimators are commonly used to recover treatment effects when standard DiD PT assumptions are too strict. DDD allows for certain violations (e.g., group-specific or eligibility-specific trends) while still producing “valid” estimates if the identifying assumptions are correctly specified. But here’s the problem: in many realistic settings, those assumptions only hold after conditioning on covariates. Failing to account for this - by erroneously proceeding as if DDD were just the difference between two DiD estimators - leads to bias. Prof Pedro and Marcelo propose new regression-adjusted, inverse probability weighted, and doubly robust estimators that remain valid in these more realistic scenarios. They also show that in staggered adoption settings, pooling not-yet-treated units (as is common practice in DiD) invalidates DDD. The result is a practical, modern framework for credible DDD estimation, particularly in applications with covariates, staggered timing, or event studies.
What is this paper about?
DDD (Difference-in-Difference-in-Differences) is a common strategy when treatment depends on two conditions, like living in a treated state and being eligible for a policy. It’s attractive because it lets you relax the standard DiD assumption that treated and control groups follow the same trends in the absence of treatment. Instead, DDD allows for different trends across groups, as long as the difference in trends between treatment-eligible and ineligible units is the same in treated and control areas (and vice versa). But the problem is that, in practice, most applications assume *too much*. When covariates matter, or when treatment is staggered, the typical DiD logic fails. This paper shows why and what to do instead.
What do the authors do?
They show that the usual approaches (subtracting two DiDs or running a three-way fixed effects regression) don’t work when covariates are important. Worse, if treatment is staggered, pooling all not-yet-treated units (as in modern DiD) can introduce bias due to changing group composition. To fix this, they introduce three estimators that stay valid under the right assumptions: regression-adjusted, inverse probability weighted (IPW), and doubly robust (DR). The DR version is especially flexible because it can be used with machine learning and allows combining multiple comparison groups using GMM to boost precision.
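Just to make the contrast concrete, here is a minimal sketch in R of the “textbook” 2x2x2 DDD regression that the paper pushes back on. Everything below (the variable names y, treated_state, eligible, post and the toy data) is hypothetical and is not their estimator:

```r
# A minimal sketch (not the paper's estimators!) of the "textbook" 2x2x2 DDD.
# All names and the toy data below are hypothetical.
set.seed(1)
n  <- 4000
df <- data.frame(
  treated_state = rbinom(n, 1, 0.5),  # lives in an area that adopts the policy
  eligible      = rbinom(n, 1, 0.5),  # belongs to the policy-eligible group
  post          = rbinom(n, 1, 0.5)   # post-treatment period
)
df$y <- 1 + 0.5 * df$treated_state + 0.3 * df$eligible + 0.2 * df$post +
  2 * df$treated_state * df$eligible * df$post + rnorm(n)

# The usual three-way fixed effects regression: the coefficient on the triple
# interaction is the DDD estimate (here it should land close to the true 2).
ddd_fit <- lm(y ~ treated_state * eligible * post, data = df)
coef(ddd_fit)["treated_state:eligible:post"]

# Because the model is saturated in three dummies, this coefficient is exactly
# the difference between two DiDs of cell means:
# DiD among eligible units minus DiD among ineligible units.
```

The paper’s warning is precisely about this object: once PT only holds after conditioning on covariates, neither this regression nor the two-DiDs shortcut recovers the ATT, and that is where the regression-adjusted, IPW, and doubly robust estimators come in.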
Why is this important?
DDD is meant to be more “flexible” than a 2x2 DiD, but if you implement it like a 2x2 DiD (by ignoring covariates or throwing in controls like crazy), it won’t give you the correct estimates. Marcelo and Prof Pedro show why these choices lead to bias, and offer both a diagnosis and a solution: tools that let you actually relax the PT assumption without breaking identification. They also make a key point that’s often overlooked: DDD designs with staggered adoption cannot be treated as simple extensions of DiD methods. They emphasize that the interaction between timing, eligibility, and changing group composition introduces complexities that demand more careful attention to identification and control selection.
Who should care?
Anyone using DDD in applied work, especially if:
Your treatment depends on both where and who (e.g., geography and eligibility),
Covariates are essential for identification,
Your design involves staggered treatment,
You’re running event studies or subgroup analyses.
Empirical economists, policy evaluators, health and education researchers, and methodologists in general will find this useful. And honestly, so will we grad students running triple differences.
Do we have code?
Marcelo and prof Pedro say that they are “finishing an R package that will automate all these DDD estimators, hopefully making it easier to adopt”. So we wait :)
In summary, DDD isn’t just “two DiDs”, and treating it that way can seriously bias your results. This paper shows exactly why that happens, and more importantly, how to fix it. Marcelo and Prof Pedro give us a toolkit for doing DDD right: estimators that are flexible, robust, and grounded in solid identification logic. If your setting involves staggered timing, covariates, or layered eligibility rules, this is the framework you should check out. And when the R package drops, I’ll let you know - but keep an eye out for it in the meantime.
Distributional Difference-in-Differences with Multiple Time Periods
(Andrea is a PhD candidate at Ca' Foscari University of Venice)
TL;DR: most DiD estimators focus on average treatment effects. But what if you care about how a policy affects different parts of the outcome distribution, like the bottom 25% vs the top 25%? This paper extends the Quantile Treatment Effect on the Treated (QTT) framework to multiple time periods and staggered adoption settings. Andrea generalises and builds on Callaway and Li (2019) to recover the full counterfactual distribution (not just means or quantiles!) without needing rank invariance. The result? A flexible, nonparametric method to estimate distributional effects using either panel data or repeated cross-sections.
What is this paper about?
This paper is about moving beyond averages. In many applied settings, we care not just about whether a policy had an effect, but where in the outcome distribution those effects occurred. Imagine a minimum wage hike. It might raise *average wages*, but does it help low-income workers more, or mostly benefit those in the middle of the distribution? Average treatment effects won't tell you. Andrea proposes a method to estimate distributional treatment effects in settings with multiple periods, staggered adoption, and non-experimental data. The goal is to recover the full counterfactual distribution of untreated outcomes for the treated group. To do this, he builds on the QTT framework and generalises it to work in more realistic settings. The nicest part is that his approach does not rely on rank invariance, which is often violated in practice. Instead, he introduces tools like stochastic dominance tests to compare distributions more credibly.
What does the author do?
He starts by generalising the QTT estimator from Callaway and Li (2019) to a more realistic setting with multiple time periods, staggered treatment, and either panel or repeated cross-section data. Andrea uses a combination of distributional parallel trends and a copula invariance assumption to identify the full counterfactual distribution of untreated outcomes for the treated group. He then proposes new estimands that are robust to violations of rank invariance, such as comparisons using stochastic dominance rankings and inequality measures like the Gini index. He also provides an actual estimator. The only other paper addressing QTT identification in staggered DiD settings - Li and Lin (2024), which I mentioned in another post - stops at theory and does not propose an estimator, which limits its applicability. Andrea’s contribution fills that gap with a method researchers can use.
Why is this important?
This paper is important because heterogeneous effects matter. Policies rarely impact everyone the same way, and when we focus only on averages, we risk missing who actually benefits (or loses). Andrea’s method lets us go beyond the ATT and estimate how the entire distribution shifts, which is especially relevant for policies that aim to reduce inequality or target disadvantaged groups. It also matters methodologically: while earlier papers offered theoretical identification of distributional effects under staggered DiD, Andrea is the first to provide a fully worked-out estimator. The fact that his approach works with repeated cross-sections and avoids rank invariance makes it much more usable in real-world applications.
Who should care?
Researchers working with staggered DiD who want to go beyond average treatment effects
Applied microeconomists studying income, education, labour markets, or health, where heterogeneous impacts are expected
Anyone evaluating policies with distributional goals such as minimum wages, subsidies, school tracking, etc
People using repeated cross-sections instead of panel data
And of course: grad students who keep reading “QTT on staggered DiD” papers and wonder, okay, but how do I actually estimate that?
Do we have code?
Andrea says that “all the MC simulations were run in STATA. The ado files of the command used for implementing the methodology presented in the paper, qtt, are available upon request at the time of writing”.
In summary, Andrea’s paper gives applied researchers a way to estimate how the entire outcome distribution would have evolved in the absence of treatment, even in complex staggered DiD settings. It works with repeated cross-sections, allows for conditioning on covariates, and avoids the strong rank invariance assumption that most distributional methods rely on. Once you have the full counterfactual distribution, you’re not limited to quantiles: you can compute Gini coefficients, Lorenz curves, or test for stochastic dominance. This makes the method not just flexible, but genuinely useful for policy evaluation, especially when heterogeneity is central.
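To illustrate that last point, here is a tiny hypothetical R sketch (not Andrea’s qtt command, which is in Stata): once you have observed treated outcomes and draws from an already-estimated counterfactual distribution, quantile treatment effects, a Gini comparison, and a stochastic dominance check are just a few lines.

```r
# Hypothetical illustration, not Andrea's qtt command: pretend y_treated are
# observed outcomes for the treated and y_cf are draws from an already-estimated
# counterfactual (untreated) distribution for that same group.
set.seed(1)
y_treated <- rlnorm(5000, meanlog = 0.3, sdlog = 0.6)
y_cf      <- rlnorm(5000, meanlog = 0.0, sdlog = 0.5)

# Quantile treatment effects on the treated at a few quantiles
taus <- c(0.25, 0.50, 0.75)
quantile(y_treated, taus) - quantile(y_cf, taus)

# Change in inequality (simple Gini, no extra package needed)
gini <- function(x) {
  x <- sort(x); n <- length(x)
  2 * sum(x * seq_len(n)) / (n * sum(x)) - (n + 1) / n
}
gini(y_treated) - gini(y_cf)

# First-order stochastic dominance check on a grid of outcome values
grid <- seq(min(y_cf, y_treated), max(y_cf, y_treated), length.out = 200)
all(ecdf(y_treated)(grid) <= ecdf(y_cf)(grid))  # TRUE if treated FOS-dominates
```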
Using Causal Diagrams to Assess Parallel Trends in DiD
(This paper has brought flashbacks from Econometrics 2 hehe hi prof Ben if you’re reading! Also, it’s quite “visual”, which is nice)
TL;DR: DiD relies on the PT assumption, but we usually just hope it holds, or eyeball pre-trends. This paper proposes a more “principled” approach: the use of causal diagrams to assess whether parallel trends is plausible before running the analysis. The authors derive three graphical conditions under which parallel trends is likely to fail, and show how to apply them using partially directed SWIGs (a tool used in causal inference to represent potential outcomes within a graphical model) and a linear faithfulness assumption.
What is this paper about?
Most of us assess parallel trends by plotting outcomes before treatment and hoping the lines run in parallel. But this only works if we have multiple pre-treatment periods, and even then it might not be sufficient. This paper proposes an alternative: use causal diagrams to assess whether PT can plausibly hold, based on what we know about the data-generating model (DGM). The authors focus on the standard 2×2 DiD setup (two groups, two periods) and ask a simple but deep question: under what structural conditions can we actually expect PT to hold? To answer it, they use nonparametric structural equation models and tools from causal graph theory. One key idea is linear faithfulness, which is a principle that says if two variables are connected in the DAG (i.e., d-connected), then they should also be statistically correlated in the data. This lets the authors derive graphical conditions under which the PT assumption is likely violated, just by inspecting the structure of the DAG.
What do the authors do?
The authors show how to use causal diagrams (more specifically something called a Single World Intervention Graph, or SWIG) to evaluate whether the PT assumption is even plausible, based on what we know about the DGM. A SWIG is a type of causal diagram that represents potential outcomes under specific interventions within a single possible world, encoding counterfactual relationships in graphical form while accounting for how unmeasured confounding might affect outcomes. It’s a way to visualise whether treated and control units would have followed the same trend, had neither been treated. They identify three structural features in a SWIG that signal PT is likely violated:
If pre-treatment outcomes influence who gets treated
If different unmeasured confounders affect pre- and post-treatment outcomes
If pre-treatment outcomes directly influence post-treatment outcomes
If any of these are in your graph, then standard DiD assumptions likely fail even if your pre-trends look fine. The authors also describe the largest possible causal structure that still permits PT, and provide R code (using the dagitty package) so you can test these conditions using your own diagrams. (While I was doing research for this paper, I found this website which is super fun to play with).
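Their actual checks are in the Appendix code; just to give a flavour, here is a toy dagitty sketch of my own (a hypothetical DAG, not one from the paper) that encodes a two-period diagram containing two of the red flags above and then inspects it:

```r
library(dagitty)

# A hypothetical two-period diagram (not the authors' Appendix code):
# G = treated area, A = treatment, Y0 = pre-period outcome, Y1 = post-period
# outcome, U = unmeasured confounder. It deliberately includes two of the red
# flags above: Y0 -> A (pre-treatment outcome affects selection into treatment)
# and Y0 -> Y1 (pre-treatment outcome directly affects the post outcome).
g <- dagitty("dag {
  G -> A
  G -> Y1
  U -> Y0
  U -> Y1
  Y0 -> A
  Y0 -> Y1
  A -> Y1
}")
latents(g) <- "U"

# Inspect the structure: which arrows point into treatment and into Y1?
parents(g, "A")
parents(g, "Y1")

# d-connection queries in the spirit of the paper's graphical conditions,
# e.g. "is the pre-period outcome d-connected to treatment?"
dconnected(g, "Y0", "A", list())
```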
Why is this important?
What this paper shows is that we can do better than just eyeball pre-trends. Instead of relying on visual checks or intuition, we can use causal diagrams to *reason formally* about whether parallel trends is *even* plausible to begin with, based on our assumptions about the DGM. This is especially useful in applied settings where we already use DAGs to justify identification strategies like unconfoundedness or IV. Now, we can apply the same logic to DiD, and potentially avoid relying on a violated assumption that undermines our entire design. It’s also a step toward a more unified causal inference framework, one where DiD assumptions are made transparent and testable before estimation begins.
Who should care?
Researchers using DiD with limited pre-treatment periods, especially in observational data
Applied economists who already use causal diagrams for IVs or unconfoundedness and want to bring the same rigour to DiD
Methodologists interested in making identification assumptions explicit
Reviewers and instructors who want a clearer way to teach or check the PT assumption
Anyone who has ever said “the pre-trends look parallel, so we’re probably fine”
Do we have code?
I mean, you can kind of draw it on paper, but it does help to be able to do it on the computer. The authors provide R code in the Appendix (they use the dagitty package) to help you check whether your DAG satisfies their conditions. It’s a helpful way to make the theory actionable.
In summary, if you’re using DiD and assuming PT, this paper gives you a smarter way to ask: is that actually plausible? Instead of relying on visual pre-trend checks or hoping for the best, you can use a causal diagram to make your assumptions explicit and formally test whether PT is likely to hold. It’s clean, insightful, and genuinely useful. One of those papers that quietly upgrades how we think about something we thought we already understood. I enjoyed reading it :)
From What Ifs to Insights: Counterfactuals in Causal Inference vs. Explainable AI
TL;DR: “Counterfactual” means different things depending on who you ask. In causal inference (CI), it’s about estimating what would have happened under a different treatment. In explainable AI (XAI), it’s about tweaking inputs to change a model’s prediction. This paper lays out a unified framework to understand both, compares how counterfactuals are used and evaluated across fields, and points to ways CI and XAI can learn from each other.
What is this paper about?
Both CI and XAI are built on “what if” logic, but they use counterfactuals in very different ways and often talk “past each other”. This paper provides a much-needed comparison between the two, showing how counterfactuals are defined, used, evaluated, and interpreted in each field. CI focuses on the THEN part - what happens under an alternative treatment. XAI focuses on the IF part - what input values would lead to a different predicted outcome. This shift in emphasis leads to big differences: in the quantity of interest, the assumptions made, the level of aggregation, and what is being modified (the model vs. the data). The authors propose a general definition of counterfactuals that works for both domains and lay out a roadmap for how ideas from each can inform the other.
What do the authors do?
They break down the counterfactual into its core parts (inputs and outcomes) and then walk through how each field uses it (there’s a toy code sketch after this list):
Purpose: CI estimates causal effects; XAI explains a prediction
Assumptions: "CI assumptions are about the data-generating process; XAI typically treats the model as given for the purpose of explanation
Quantity of interest: CI compares outcomes; XAI changes inputs
Aggregation: CI typically works at the group level (e.g. ATT, ATE); XAI focuses on individual-level predictions
Modification target: CI modifies the model or treatment variable; XAI modifies the input data
Evaluation: CI cares about confidence intervals and robustness; XAI cares about feasibility, proximity, and sparsity
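If it helps to see the contrast in code, here is a toy R sketch of my own (the loan setting is inspired by the paper’s running examples, but the model and numbers are made up): the XAI counterfactual holds the fitted model fixed and searches for the smallest input change that flips its prediction, which is exactly the “modify the data, not the model” logic above.

```r
# Toy example of my own (model and numbers are made up): a fixed "loan
# approval" model, in the spirit of the paper's running examples.
set.seed(1)
n   <- 2000
dat <- data.frame(income = rnorm(n, 50, 15), debt = rnorm(n, 20, 8))
dat$approved <- rbinom(n, 1, plogis(-2 + 0.08 * dat$income - 0.10 * dat$debt))
model <- glm(approved ~ income + debt, family = binomial, data = dat)

# XAI-style counterfactual: the model stays fixed, the inputs are modified.
# For one rejected applicant, what is the smallest income increase (holding
# debt fixed) that flips the model's prediction?
applicant <- data.frame(income = 35, debt = 30)
predict(model, applicant, type = "response")  # below 0.5, so "rejected"

candidates <- data.frame(income = 35 + 0:60, debt = 30)
p_hat <- predict(model, candidates, type = "response")
candidates[which(p_hat > 0.5)[1], ]  # the counterfactual input that flips it

# CI-style counterfactual: compare *potential outcomes* for the same units
# under treatment vs no treatment (an intervention on the world, not on the
# model's inputs), which requires identifying assumptions like the ones the
# DiD/DDD papers above are about, not just a fitted prediction model.
```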
Why is this important?
The word “counterfactual” is everywhere, but depending on your background, you may be using the term without realising how much baggage it carries from other fields. This paper clears that up. More importantly, it opens the door to “cross-fertilisation”: CI researchers can borrow ideas from XAI about individual-level interpretability; XAI researchers can adopt tools from CI to anchor their explanations in causal logic; and both sides can contribute to a broader understanding of actionable, responsible decision-making in high-stakes settings like healthcare, education, or finance. It’s the kind of paper that helps us think more clearly: not just about methods, but about language and goals.
Who should care?
Empirical researchers using counterfactuals in any form (policy, medicine, business, etc.)
Anyone working at the intersection of prediction and explanation (ML folks!)
XAI researchers who want to build more meaningful and robust explanations
CI researchers curious about individual-level guidance and model-based insights
Anyone writing about algorithmic fairness, recourse, or model transparency
Do we have code?
No, it is a conceptual paper, but it *does* give you a framework you can apply to your own work. The examples (hiring decisions, loan rejections, ad targeting) are accessible and very usable for teaching or presentations.
In summary, this is a paper about translation. If you've ever used the word counterfactual in a paper, a model, or a policy brief, this helps clarify what you're really doing. CI and XAI both rely on “what ifs,” but this paper shows that they mean different things, serve different goals, and run on different assumptions. I think that understanding this is the first step toward doing better science, in both fields.