Hello! Today’s post (somewhat unintentionally) ended up being about four papers that all revolve around the same goal: to push DiD methods further so we can say more than just “it worked” or “it didn’t”. You will read about how they help us answer “for whom did it work? How did it work? And are we even measuring the right thing?”
Here they are:
Triple Difference Designs with Heterogeneous Treatment Effects, by Laura Caron
Forests for Differences: Robust Causal Inference Beyond Parametric DiD, by Hugo Gobato Souto and Francisco Louzada Neto
Estimating Treatment Effects With a Unified Semi-Parametric Difference-in-Differences Approach, by Julia C. Thome, Andrew J. Spieker, Peter F. Rebeiro, Chun Li, Tong Li, and Bryan E. Shepherd
Child Penalty Estimation and Mothers’ Age at First Birth, by Valentina Melentyeva and Lukas Riedel
Triple Difference Designs with Heterogeneous Treatment Effects
(Laura is a PhD candidate at Columbia)
TL;DR: triple-difference (3DiD) designs are everywhere, but the way we usually interpret the estimates requires careful consideration. In this paper Laura shows that the usual approach (comparing subgroups assuming one is unaffected) can mislead us and the reader, especially when people in different subgroups respond differently to the treatment (heterogeneous treatment effects). She proposes a new way to think about it: instead of comparing average effects (DATT), she focuses on causal differences (causal difference in average treatment effects on the treated, or CDATT), which isolates how much of the difference is actually because of subgroup status, not just because the groups were made up of people who were already likely to respond differently. Laura lays out what needs to be assumed for this to be valid, proposes estimators that still work when models are misspecified, and shows, through simulations and a re-analysis of real data, that this actually changes what we take away from some widely cited studies.
What is this paper about?
This paper is about how we interpret triple-difference (3DiD) estimates and how easily we can get them wrong if we’re not careful about subgroup comparisons.
The ATT (Average Treatment Effect on the Treated) compares treated and untreated groups over time. The key assumption here is that the untreated group shows us what would’ve happened to the treated group in the absence of treatment. That’s what makes it “causal”: you’re treating the untreated group as a valid counterfactual. The DATT (Difference in ATT) takes that one step further and compares the magnitude of treatment effects across two subgroups. And most importantly, it doesn’t assume either subgroup is unaffected. It just compares how strongly each group responded to the treatment.
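In symbols (notation mine, just to fix ideas): for a subgroup g, the ATT is the usual treated-vs-counterfactual contrast, and the DATT is simply the gap between the two subgroup ATTs.

$$
\mathrm{ATT}_g = E\big[\,Y_{\text{post}}(1) - Y_{\text{post}}(0)\;\big|\;D=1,\,G=g\,\big],
\qquad
\mathrm{DATT} = \mathrm{ATT}_A - \mathrm{ATT}_B .
$$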
3DiD designs use these kinds of comparisons across three dimensions, which usually encompass time (before vs. after the introduction of a policy), treatment (treated vs. untreated units), and subgroup (which can be a demographic or structural characteristic, e.g. men vs. women, South vs. North). If we were to take a policy mandating paid maternity leave, for example, time would be pre vs. post-policy, treatment could be states that implemented it vs. those that didn’t, and subgroups would be women of childbearing age vs. older women.
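For reference, the textbook way this usually gets estimated is a single regression with a triple interaction. A stylised sketch is below (made-up variable names, and this is the conventional estimator the paper interrogates, not Laura's proposal):

```r
library(fixest)

# Stylised textbook 3DiD regression for the maternity-leave example.
# df: person-year data with earnings, a post-policy dummy, a treated-state
# dummy, and a childbearing-age dummy (variable names are made up here).
est <- feols(
  earnings ~ post * treated_state * childbearing_age,
  data = df, cluster = ~state
)
# The coefficient on post:treated_state:childbearing_age is the usual 3DiD
# estimate -- i.e. the DATT, whose causal reading is exactly what is at stake.
```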
What we usually do is assume one subgroup (e.g., older women) is unaffected by the policy. That group then serves as a kind of within-treatment counterfactual. If their outcomes don’t change post-policy, any observed differences in the other group (e.g., younger women) get attributed to the policy. This gives you an ATT-style interpretation: “this group was affected by the policy, relative to a group that wasn’t”. But that’s a strong assumption. And more often than not, it’s incomplete. What if the “unaffected” subgroup was actually affected in some way, like indirectly, or just to a lesser degree? In that case, the 3DiD estimate no longer tells you what you think it does. Laura shows that once you allow for heterogeneity in treatment effects, i.e., once you admit that different subgroups might be differently sensitive to the same policy, the DATT no longer captures the causal effect of being in one group versus the other. It just tells you that outcomes moved differently. But that difference might be driven by other things like occupation, baseline risk, or other characteristics correlated with subgroup status.
To get around this, she defines a new parameter: the causal DATT (CDATT). Instead of just comparing observed differences, it asks: what would the difference in treatment effects have been if both groups had the same underlying sensitivity to treatment, and the only thing that varied was subgroup status itself? That’s a much “cleaner” question, and also a much harder one to answer. The rest of the paper is about how to answer it properly.
What does the author do?
Laura starts with a common practice: using a 3DiD design to compare how different subgroups respond to a policy, assuming one subgroup can stand in as a kind of internal control, which is the standard 3DiD logic used in studies like Gruber (1994), Baum (2003), and more recently Derenoncourt and Montialoux (2021). But she shows that this logic breaks down fast when subgroups differ in how sensitive they are to the treatment.
Even if identification assumptions are satisfied and your DATT is correctly estimated, the result doesn’t have a clean causal interpretation. Why? Because if people in one subgroup would have responded more strongly to the policy even if they had been in the other subgroup, the difference in outcomes reflects more than just subgroup status: it reflects unobserved differences in treatment sensitivity. That’s not a subtle footnote because it completely changes what the parameter means.
To fix this, she defines a new estimand: the causal DATT (CDATT). It asks: what’s the treatment effect because of belonging to one subgroup versus another, ceteris paribus, including how reactive people are to the policy? That’s a much more policy-relevant question, but it comes at a cost: stronger assumptions, more careful identification, and better estimation tools.
The DATT is still identifiable under the usual assumptions (no anticipation and parallel trends across subgroups). So if you’re just interested in whether two groups responded differently to a policy, the DATT will give you that. But if what you really want to know is why they responded differently (like whether being in subgroup A versus subgroup B caused that difference) then you need to go further. That’s what the CDATT is for. To identify it, you need one more assumption: that treatment effect heterogeneity isn’t itself correlated with subgroup membership. Without that, you’re just comparing aggregates.
Laura’s paper includes both simulations and an empirical re-analysis of Gruber’s maternity leave data. In the simulations, she shows how the usual DATT and her proposed CDATT can lead to very different conclusions depending on the data-generating process. In the empirical application, she revisits a classic 3DiD design and shows how accounting for treatment effect heterogeneity shifts the interpretation, sometimes quite substantially.
The technical framework is grounded in the Imbens and Rubin (2015) potential outcomes model, so if that’s familiar territory, the paper is easier to follow. But even if it’s not, the intuition behind her argument is very clear: comparing subgroup outcomes doesn’t tell you anything causal unless you’re explicit about what you’re holding constant.
Why is this important?
3DiD designs are everywhere in applied work. When the usual DiD assumptions do not quite hold (say, when you cannot confidently claim parallel trends between treated and untreated groups) 3DiD offers a way out. It adds a third dimension (typically a subgroup split) and tries to recover the treatment effect by leveraging variation across time, treatment, and some structural or demographic distinction. But that third layer of complexity opens the door to a whole new set of problems.
While the DiD literature has already wrestled with treatment effect heterogeneity (between treated and control groups, or over time) there was less attention paid to what happens when subgroups themselves differ in how they respond to treatment. And yet, that is exactly what most 3DiD designs rely on: that one subgroup can stand in as a control for another.
Laura’s paper fills that gap. It shows that if we are not careful, we can end up interpreting subgroup differences as causal when they are really just driven by selection, sorting, or other underlying differences in treatment sensitivity. And if the whole point of 3DiD is to recover cleaner effects in messy empirical settings, then failing to address this undermines the design at its core.
Laura offers a formal framework for thinking about causal subgroup comparisons in 3DiD, shows what needs to be assumed, and provides robust estimators that work in the presence of heterogeneity. It is a nice correction to how subgroup analyses are often handled in 3DiD settings, and it is really useful now that staggered treatments and increasingly rich data structures are becoming more common.
Who should care?
Anyone using 3DiD to compare subgroups. If your paper assumes one group is not affected by the policy, or if you say “this group was more affected than that one”, then you should check Laura’s paper out. It matters most if treatment happens at different times (staggered), there’s the possibility of spillovers, and you’re comparing groups like men vs. women, or states in the North vs. South. Also if your story is about why the policy worked more for one group than another, this paper shows you what you need to assume to make that claim.
Do we have code?
No public code, but the paper walks through both a simulation and an empirical example using Gruber (1994). The simulation shows how DATT and CDATT can tell different stories depending on how treatment effects vary. The empirical example shows how ignoring that variation can lead you to the wrong conclusion.
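If you want a feel for the mechanics, here is a toy illustration (entirely my own sketch, not Laura's simulation) of how differing treatment sensitivity shows up in the naive triple interaction even when subgroup status itself does nothing special:

```r
# Toy DGP: subgroup A happens to contain people who are more reactive to the
# policy (think occupation or baseline risk), but membership in A has no
# extra causal effect of its own.
set.seed(1)
n      <- 50000
treat  <- rbinom(n, 1, 0.5)                      # treated vs. untreated state
groupA <- rbinom(n, 1, 0.5)                      # subgroup indicator
post   <- rbinom(n, 1, 0.5)                      # pre vs. post period
react  <- rnorm(n, mean = 1 + groupA, sd = 0.5)  # individual sensitivity,
                                                 # correlated with subgroup
y <- 1 + 0.5*post + 0.3*treat + 0.2*groupA + react*treat*post + rnorm(n)

fit <- lm(y ~ post * treat * groupA)
coef(fit)["post:treat:groupA"]
# ~1: the estimated DATT picks up the gap in sensitivity between the groups,
# which a naive reading would attribute to subgroup status itself.
```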
In summary, 3DiD designs are everywhere, but the way we interpret subgroup comparisons often leans too hard on assumptions we do not check. This paper shows how easily treatment effect heterogeneity can throw things off and offers a better way to frame and estimate what we really care about. If you’re using subgroup comparisons to tell a causal story, this is the paper that tells you what that story actually needs.
Forests for Differences: Robust Causal Inference Beyond Parametric DiD
TL;DR: this paper introduces DiD-BCF, a new Bayesian ML method that makes it easier to estimate treatment effects in DiD settings, particularly under heterogeneous treatment effects and staggered adoption. It builds on Bayesian Causal Forests, but reworks the model to fit panel data and common policy setups. The key innovation is a way to make the estimation task simpler by using the parallel trends assumption more effectively. The result is more accurate and flexible estimates of average, group-specific, and conditional treatment effects.
What is this paper about?
The paper proposes a new non-parametric method for causal inference in DiD settings. It is called DiD-BCF and is designed to deal with two major challenges in modern DiD applications: staggered treatment adoption and treatment effect heterogeneity. To do this, the authors extend Bayesian Causal Forests to panel data settings. Their method estimates: average treatment effects (ATT), group-level effects (GATT), and conditional effects (CATT), all within a single flexible framework. A key feature of the method is a new way of using the parallel trends assumption to simplify the estimation task. This helps improve the accuracy and stability of the results, especially when standard parametric DiD methods struggle.
What do the authors do?
They develop a new method (DiD-BCF) that combines the credibility of DiD with the flexibility of Bayesian Causal Forests. Instead of relying on fixed-effects regressions or linear models, they propose a fully non-parametric approach that can handle staggered treatment adoption and treatment effect heterogeneity in a single unified framework.
They start by generalising the standard DiD model, allowing for complex, nonlinear relationships between outcomes, time, and covariates. Then they extend Bayesian Causal Forests to panel data, so the model can recover dynamic and covariate-specific treatment effects over time. A key part of their strategy is to reparameterise the model using the parallel trends assumption, so instead of forcing the model to learn that treatment effects should be zero before treatment begins, they build that into the structure from the start. This makes estimation more stable and less prone to error.
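To give a rough sense of the structure (this is my stylised reading, not the authors' exact notation), a BCF-style model splits the outcome into a prognostic part and a treatment-effect part, and the DiD twist is that the treatment-effect forest only switches on once a unit is actually treated:

$$
Y_{it} \;=\; \mu\!\big(x_{it},\, t\big) \;+\; \tau\!\big(x_{it}\big)\,\mathbf{1}\{t \ge G_i\} \;+\; \varepsilon_{it},
$$

where \(\mu(\cdot)\) and \(\tau(\cdot)\) are sums of regression trees with BCF-style priors and \(G_i\) is unit \(i\)'s adoption period. Because \(\tau(\cdot)\) only enters once \(t \ge G_i\), pre-treatment outcomes are fit by \(\mu(\cdot)\) alone, which is one way to read the claim that parallel trends is built into the structure from the start.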
To make the model more practical for applied researchers, they also include a procedure that speeds up convergence using a more efficient tree-fitting algorithm before switching to full Bayesian estimation. Finally, they test their method across a range of simulated settings that mirror real-world challenges, such as nonlinear trends, selection into treatment, and heterogeneity in effects, and compare it to leading alternatives like TWFE, DiD-DR, DiD2s, SDID, and DoubleML. Across the board, DiD-BCF performs well, especially in the kinds of cases where standard methods tend to break down.
Why is this important?
In observational settings, we rarely get (to do) randomised experiments. Treated and control units usually differ in ways that matter, whether in background characteristics, exposure, or timing, which makes simple comparisons misleading. That is why quasi-experimental methods like DiD have become central to applied work. DiD gives us a way to estimate causal effects from policy changes or discrete events, as long as the identifying assumptions can be justified.
Over the past few years, there has been a wave of new methods that improve DiD by dealing with common problems like staggered treatment timing or variation in treatment intensity. But most of these still focus on estimating average effects, either the overall ATT or group-time averages, which is useful, but often not enough.
In many real-world applications, the question is not just “did the policy work on average?” but “for whom did it work, and by how much?”. We need to understand heterogeneity in treatment effects if we want to say something about mechanisms, or about which groups benefit more or less from an intervention. This is where CATTs (conditional average treatment effects on the treated) come in: they tell us how effects vary across observable characteristics.
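In the usual notation (mine, not the authors'), the conditional effect is just the ATT taken at a particular covariate value:

$$
\mathrm{CATT}(x) = E\big[\,Y(1) - Y(0)\;\big|\;D=1,\,X=x\,\big].
$$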
To estimate CATTs, researchers have started turning to ML tools like Causal Forests. These models are flexible enough to recover treatment effect variation without needing to specify it upfront, and recent work has adapted them for use in DiD settings. If done carefully, this kind of approach lets us combine the identification logic of DiD with the flexibility of non-parametric estimation so we can detect not just whether something worked, but for whom it worked, and when.
This paper fits right into that agenda. It pushes the literature forward by offering a way to estimate CATTs in panel data under staggered adoption, all while maintaining interpretability and robustness. That makes it especially useful for policy evaluations where average effects might hide meaningful variation across time, space, or population subgroups.
Who should care?
This paper will be especially useful to applied researchers working with panel data, where treatment doesn’t happen all at once and where you suspect that not everyone is affected in the same way. If you work in labour, education, health, or policy evaluation more generally, and you’re kinda frustrated with the limitations of TWFE or concerned about heterogeneous effects being averaged away, this model is worth looking into.
It’s also useful for people who want unit-level or group-specific effects, not just a single ATT. And for researchers who want to use ML tools but still work within a familiar causal inference framework like DiD → this is a bridge between the two.
Do we have code?
The authors rely on existing packages like did2s, did, and DoubleML, and while their proposed model is based on well-established tools like BCF and XBART, some of the benchmark methods (like CFFE and MLDID) couldn’t be included in the simulations due to GitHub installation issues. Still, the core components for reproducing DiD-BCF and the comparisons are all available or well-documented. If you’re familiar with R and Bayesian modelling, you should be able to adapt their setup pretty easily.
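As a starting point, the benchmark estimators are straightforward to run in R. A minimal sketch with the did package (hypothetical variable names, not the authors' simulation setup):

```r
library(did)

# Callaway & Sant'Anna group-time ATTs on a generic staggered-adoption panel.
# df: unit-period panel with outcome y, unit id, period, and first_treat
# (the period in which a unit is first treated; 0 for never-treated units).
gt <- att_gt(
  yname  = "y",
  tname  = "period",
  idname = "id",
  gname  = "first_treat",
  data   = df
)
aggte(gt, type = "group")    # group-level ATTs (the GATT analogue)
aggte(gt, type = "simple")   # an overall ATT
```

The did2s and DoubleML packages follow similar workflows, so assembling the comparison set the authors describe is mostly a matter of plugging the same panel into each.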
In summary, DiD-BCF is an interesting new tool that brings flexibility to DiD. It replaces rigid parametric assumptions with a fully Bayesian, nonparametric model that can handle staggered treatment timing, heterogeneity in effects, and selection on observables, all in one go. By leaning on Bayesian Causal Forests and smartly reparameterising the treatment effect, the method improves estimation accuracy without giving up the core logic of DiD. The result is a model that performs well even when traditional approaches break down, and that opens the door to much richer treatment effect analysis in applied work.
Estimating Treatment Effects With a Unified Semi-Parametric Difference-in-Differences Approach
TL;DR: most DiD methods focus on average treatment effects (ATT) and assume parallel trends in means. But when the outcome is skewed, ordinal, or censored, estimating means can be misleading or hard to interpret. In this paper the authors introduce a new semi-parametric DiD estimator that allows researchers to estimate four treatment effects: average, quantile, probability, and Mann-Whitney, using a single model and a unified assumption. The method performs well in simulations and is applied to evaluate the impact of Medicaid expansion on CD4 counts among people living with HIV.
What is this paper about?
In this paper the authors present a new DiD estimator that can recover a quite comprehensive range of causal effects, not just averages but also quantiles, probabilities, and rank-based measures. And they do it using a single model and a single identification assumption. Their key idea is to replace traditional mean-based comparisons with a semi-parametric cumulative probability model (CPM)1. Instead of assuming a specific transformation (like logs) to deal with skewed outcomes, the CPM treats the outcome as a monotonic transformation of a latent variable.
The authors focus on four causal estimands, all defined in terms of the marginal distributions of potential outcomes for the treated group: the average effect on the treated (ATT), the effect on a specific quantile of the treated outcome distribution (QTT), the change in the probability that the outcome is below a given threshold (PTT), and a rank-based measure that captures the probability that a treated person would outperform a comparable untreated person (MTT). Because these estimands depend only on marginal, not joint, potential outcome distributions, the method requires fewer assumptions than alternative approaches. All four effects are then identified using a single conditional parallel trends assumption, made on the latent scale (rather than on the original outcome scale or after transformation). This assumption states that, conditional on covariates, the untreated change over time for the treated group would have followed the same trend as the control group’s, not in observed outcomes but in the underlying latent variable used by the model.
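Written out (notation mine; the paper's exact definitions may differ in details such as how ties are handled), with \(F_1\) and \(F_0\) the marginal post-period distributions of the treated and untreated potential outcomes for the treated group:

$$
\begin{aligned}
\mathrm{ATT} &= E[Y(1)\mid D=1] - E[Y(0)\mid D=1],\\
\mathrm{QTT}(q) &= F_{1}^{-1}(q) - F_{0}^{-1}(q),\\
\mathrm{PTT}(y) &= F_{1}(y) - F_{0}(y),\\
\mathrm{MTT} &= P\{Y(1) > Y(0)\},
\end{aligned}
$$

where the two draws in the MTT come independently from \(F_1\) and \(F_0\).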
What do the authors do?
The authors’ proposed DiD estimator can recover the four aforementioned types of causal effects (ATT, QTT, PTT, and MTT) using a single model and a single identification strategy, which is remarkable because it contrasts with most existing approaches, where estimating each of these effects would require a separate set of assumptions and often a different estimation method.
To operationalise their approach, the authors define a semi-parametric DiD estimator based on a CPM, allowing them to handle skewed, ordinal, or otherwise non-linear outcome distributions without requiring a pre-specified transformation. Then they move on to formally define the four estimands using only marginal potential outcome distributions for the treated group. This keeps things simpler by not trying to model how each treated person’s outcome would have compared to their own untreated potential outcome, which usually requires stronger, less realistic assumptions.
As mentioned, they rely on a single identification assumption, which is a conditional parallel trends assumption on the latent outcome scale. Unlike the usual parallel trends in means (or transformed means), their assumption is made on the underlying latent variable. They then evaluate performance through simulation, varying sample sizes (n = 200 to 2,000) and generating a large pseudo-population to benchmark the true values of each estimand. The simulations show good performance even in small samples and under data skewness.
In the last part of the paper, they apply the method to real data by estimating the impact of Medicaid expansion in the U.S. on CD4 cell count at HIV care entry (a classic example of a skewed, clinically relevant outcome). They find that the policy actually had a positive impact across all four estimands, not just on the average outcome.
Who should care?
Have you log-transformed your variables before? Does your data look not-normal? If so, you should read this paper. Also, pretty much anyone else wanting to model DiD with “non-standard outcomes” (skewed, ordinal, censored, etc) will find this paper useful. I’d say it’s particularly useful for applied researchers in health and labour economics, where we often find rank-based or distributional effects to matter more than mean shifts.
Do we have code?
Not yet, but hopefully in the future.
In summary, this paper introduces a flexible, semi-parametric DiD method that lets you estimate a wide range of treatment effects using a single model and a single identification assumption. It’s a compelling alternative to the usual mean-based approaches, especially when dealing with skewed or messy outcomes. And by including estimands like the Mann-Whitney2 treatment effect, it opens the door to more interpretable, rank-based causal questions that standard DiD methods often can’t answer.
Child Penalty Estimation and Mothers’ Age at First Birth
TL;DR: ever heard economists saying “the gender pay gap is mostly due to the motherhood penalty”? This paper takes that claim seriously and shows we might be underestimating the motherhood penalty (in wage terms) by about 30% because we’re averaging across mothers who are nothing alike (or, in other words, standard methods bias the estimated penalty by ignoring staggered timing and treatment heterogeneity). The authors then propose a cleaner DiD approach that estimates the penalty separately by age at first birth, and as a result they find big differences not just in size, but in what the penalty really means for younger vs older mothers.
What is this paper about?
“Motherhood is still costly for the careers of women”. In this paper, the authors take something we all say (that the gender pay gap is largely driven by the career costs of motherhood) and ask a simple, yet a bit uncomfortable, question: what if we’ve been measuring those costs incorrectly? The standard approach so far has been a big event study centred on the first birth, tracking earnings before and after, and showing that women’s earnings fall sharply while men’s barely move a centimeter. The problem, the authors argue, is that this kind of model implicitly assumes that all mothers are the same. But of course they’re not. The age at which someone has their first child is correlated with education, career stage, earnings, occupation, and parental background (and lots of other unobserved decisions). So when we pool all mothers together, we’re collapsing both very different types of women and very different types of penalties into one average, and that average not only misleads but also obscures meaningful variation policymakers should care about.
They find lots of interesting bits. First, the penalty is actually larger than previously estimated. In their preferred specification, there’s a cumulative earnings loss of nearly €30,000 by year four (about €10,000 more than what a standard event study would suggest), which isn’t a small correction. It reflects the fact that conventional event studies systematically understate what women’s earnings would have looked like in the absence of children, largely because their control groups include women who have already had children themselves. That violates the parallel trends assumption and biases the counterfactual downwards, making the penalty look smaller than it actually is.
Second, the penalty grows with age in absolute terms, but shrinks in relative terms. Older mothers lose more in euros because they were earning more to begin with. But younger mothers lose a larger share of their pre-birth income because they’re cut off just as their wage growth would have accelerated.
Third (and this is what I found most interesting), the nature of the penalty differs. For older mothers, the penalty is mostly about reducing hours, exiting temporarily, or giving up seniority. For younger mothers, it’s about missing the steepest part of the wage trajectory entirely. It’s not the same shock. One is a level shift. The other is a slope change. This is super important from a policy perspective: helping someone who missed a promotion track requires a different tool than helping someone who stepped back from senior management, for example.
What do the authors do?
The paper starts from a now-familiar problem in applied work: when treatment timing is staggered and treatment effects vary across units, conventional event study models can break down. In this case, those models do two things we definitely want to avoid. First, they make forbidden comparisons by including already-treated women in the control group. Second, they suffer from contamination, where estimates for one time period “bleed” into another, especially when pre- and post-birth windows aren’t cleanly separated.
The root of the problem is that standard event study models pool together younger and older first-time mothers, implicitly assuming they’re comparable and that the effects of childbirth are uniform across them. But that assumption, as the authors put it, is “unlikely to hold.” Mothers at different ages differ in earnings, career stage, education, and trajectories. Pooling across them “masks” both the size and the shape of the penalty.
To fix this, the authors propose a stacked DiD design (closely following Wing, Freedman, and Hollingsworth 2024, with attention to overlapping cohorts), combined with a rolling window of control groups by age at first birth. Instead of estimating one big average effect, they estimate separate effects for each age-at-birth group, and they’re very careful about who gets used as a counterfactual. For each group of treated mothers, the control group is made up of not-yet-treated women who will give birth at slightly older ages, observed before they become mothers. So if you’re looking at women who give birth at 25, the control group might be 26- or 27-year-olds who are about to give birth but haven’t yet. That way, everyone in the comparison is close in age, close in life stage, and still on the pre-birth trajectory.
As they describe it, this “combination of a stacked DiD with a rolling window of control groups enables [them] to eliminate the issues present in conventional event studies and estimate the age-at-birth-specific effects of childbirth on post-birth labor market outcomes.” No already-treated mothers in the control group. No artificial smoothing across life stages. Just clean, age-specific estimates of what motherhood does to earnings.
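To make the construction concrete, here is a rough sketch of the stacking logic (variable names, the age range, and the control window are mine, not the authors'):

```r
library(fixest)

# df: woman-year panel with id, year, age, earnings, and age_first_birth
# (the age at which each woman has her first child).
make_stack <- function(df, a, window = 2) {
  treated  <- subset(df, age_first_birth == a)
  # not-yet-treated controls: women who give birth slightly later,
  # observed only before their own first birth
  controls <- subset(df, age_first_birth > a & age_first_birth <= a + window &
                         age < age_first_birth)
  stack <- rbind(treated, controls)
  stack$treat    <- as.integer(stack$age_first_birth == a)
  stack$rel_time <- stack$age - a      # event time of this sub-experiment
  stack
}

# One event study per age at first birth -> age-specific penalty paths
fits <- lapply(25:35, function(a) {
  feols(earnings ~ i(rel_time, treat, ref = -1) | id + year,
        data = make_stack(df, a), cluster = ~id)
})
# The age-specific estimates would then be aggregated using sample shares
# as weights, as described below.
```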
Once the age-specific penalties are estimated, they’re aggregated using sample shares as weights. The resulting average is substantially larger than the pooled estimate, and, most importantly, it makes clear that the penalty is not one number, but a set of distinct experiences depending on when motherhood begins.
Why is this important?
The motherhood penalty is one of the largest contributors to gender inequality in the labour market, but if we measure it incorrectly, any policy built on that estimate risks being too late, too generic, or simply misdirected. What this paper shows is that how we estimate the penalty matters. Conventional event studies often use control groups that include women who have already had children, violating parallel trends and biasing the counterfactual downwards. That alone is enough to underestimate the true cost of motherhood, especially because what is being missed isn’t just earnings levels but the growth that would have happened absent childbirth. The timing of motherhood matters because the nature of the penalty changes with age: reducing hours or exiting work hits harder when your earnings are high, but missing wage growth early on can permanently derail a trajectory. This is a methodological paper, but it goes well beyond that. By analysing the effects of motherhood by age at first birth, it opens the door to more targeted and better-informed policy, recognising that women who have children at different stages in their life and career also respond differently to support and constraints.
Who should care?
If you work on gender gaps and use event studies, this paper is a must-read. Same goes if you’ve ever used the phrase “motherhood penalty” without checking whether your estimate holds across women with different life paths. But it also matters more broadly for anyone interested in how we measure inequality, because it shows that the timing of treatment is often a source of information in itself (not a nuisance you have to adjust for). If we keep pooling heterogeneous groups, we risk erasing the very dynamics we claim to study. So even if you do not study gender, if you work with staggered treatments, event-time estimators, or stacked DiD designs, this is a paper worth reading and assigning to your students.
Do we have code?
It’s an application paper, so no.
In summary, this paper is a methodological critique with real-world stakes. It shows that when we average across all mothers to estimate the cost of motherhood, we get a number that is not just wrong but misleading. The timing of motherhood shapes the size, shape, and meaning of the penalty, and by pooling across it, we risk flattening nuance into noise. The stacked DiD approach they propose fixes the bias and brings to the surface an economically and policy-relevant heterogeneity that would otherwise be lost. If we care about inequality, then we also need to care about how we measure it.
A cumulative probability model (CPM) models the probability that the outcome is less than or equal to a given value (P(Y ≤ y)), based on covariates. Instead of modelling Y directly, it assumes there’s a latent variable Y* that follows a linear model, and the observed outcome Y is a monotonic transformation of Y*. The function linking Y* and Y (denoted H) is unspecified and estimated from the data, making the model semi-parametric. This allows the CPM to flexibly handle skewed, ordinal, or censored outcomes without needing to pre-specify a transformation like log(Y).
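In symbols (the standard CPM formulation; notation mine):

$$
Y = H(Y^{*}), \qquad Y^{*} = \beta^{\top}X + \varepsilon,\ \ \varepsilon \sim G
\quad\Longrightarrow\quad
P(Y \le y \mid X) = G\big(\alpha(y) - \beta^{\top}X\big),
$$

where \(G\) is a chosen link distribution (e.g. logistic), \(H\) is an unspecified increasing transformation, and \(\alpha(y) = H^{-1}(y)\) is estimated from the data.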
This paper is a joint work between authors from Biostats, Public Health and Economics departments, so the Mann-Whitney parameter might not be familiar to us. The Mann-Whitney parameter comes from the Mann-Whitney U test (also called Wilcoxon rank-sum test), and it's often interpreted as the probability that a randomly selected observation from group A is larger than one from group B. So if group A is treated and group B is untreated, and the probabilistic index is 0.65, it means there’s a 65% chance a treated person scores higher than an untreated one. In the paper they define a Mann-Whitney treatment effect among the treated (MTT), which is a probabilistic measure of treatment impact. The Mann-Whitney parameter is a descriptive, non-causal rank-based comparison between two groups, while the MTT turns that idea into a causal estimand in DiD by comparing treated individuals’ actual outcomes to their estimated counterfactuals under no treatment, using the rank-based structure of the Mann-Whitney statistic. Instead of asking how much the outcome increased (like the ATT does), it asks a different question: what is the probability that a treated person has a better outcome than a comparable untreated person? As an example, if the MTT is 0.70, it means there’s a 70% chance that a randomly selected treated individual will have a higher outcome than a randomly selected untreated individual. If the MTT is 0.50, it means treatment had no effect on rank → treated and untreated units perform equally, on average. The authors show how to estimate the MTT in a DiD setup, and according to them, this is the first time anyone has done this.
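For intuition, the descriptive version of the parameter is easy to compute from two samples (this is the plain rank-based index between observed groups, not the causal MTT):

```r
# Probability that a random draw from group A exceeds one from group B,
# counting ties as one half (the Mann-Whitney / probabilistic index).
pr_index <- function(a, b) {
  mean(outer(a, b, ">")) + 0.5 * mean(outer(a, b, "=="))
}

set.seed(1)
treated   <- rnorm(100, mean = 0.5)   # toy data
untreated <- rnorm(100, mean = 0.0)
pr_index(treated, untreated)          # around 0.6-0.65 for this toy DGP
```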