In defense of Machine Learning in Economics
Why the common criticisms are outdated, and how new methods are making causal inference stronger.
Hi there. This post is a “response” to a pre-print I tweeted about last Friday. I read it and thought it was great, and I’ll write about it in the next newsletter post. But that paper per se is not the topic today, and I apologize if the subject is not of interest to you. Please feel free to ignore it if you don’t need convincing :)
This post is about me making the case FOR the use of ML in Econ. It feels kind of dated to have to still argue for this, and if “our goal as a field is to use data to solve problems” (which I know some will say it isn’t), then the criticisms warrant even less consideration. I believe the reluctance to adopt ML techniques in Econ stems from two sources1: orthodoxy (the tendency to stick to familiar models and econometric traditions) and purity (the idea that if a method isn’t explicitly tied to a structural model or doesn’t yield a “clean” causal estimate, it’s not real economics).
Part of this reluctance also comes from a misperception: that ML is only useful for prediction and therefore has little to contribute to causal inference or theory-driven analysis. But that’s increasingly out of step with how ML is actually used in empirical research. Consider causal forests, which use tree-based methods to identify heterogeneous treatment effects while maintaining the rigor economists demand. Or double machine learning, which combines ML’s predictive power with traditional causal identification strategies to reduce bias in treatment effect estimation. Targeted regularization techniques (e.g., TMLE) help us focus on the parameters we care about most while still leveraging ML’s ability to handle high-dimensional data. Rather than replacing economic reasoning, these tools enhance it, helping us extract more nuanced insights from complex data while maintaining causal rigor.
I like to think of these tools as complements, not substitutes, for economic theory. Just like we once borrowed techniques from Stats and Maths, we’re “now” borrowing from CS and engineering. That’s how fields evolve and strengthen. Professor Athey writes: “I believe that regularization and systematic model selection have many advantages over traditional approaches, and for this reason will become a standard part of empirical practice in economics”.
Another point that makes it harder to bridge these two fields, though, is that the culture around publication and validation in CS is so different: pre-prints are fast, feedback is open, and iteration happens in public. In Econ, we move at glacial speed, and nothing feels “legit” until it survives multiple rounds of refereeing. It’s also worth noting where the criticism comes from: almost always academia, and barely ever economists at Amazon or Alphabet. Many of the advances in causal ML (like uplift modeling, heterogeneous treatment effects, online experimentation) were incubated in tech companies because they had the scale, the data, and the incentives to care.
Now, I’m obviously not suggesting we should abandon academic rigor for industry speed, and I believe both approaches have their strengths. Industry settings often provide the scale and urgency that drive innovation, while academic settings provide the time and incentives for thorough validation. But we can learn from both.
Just last year, 2/3 of the Nobel in Chemistry was awarded to researchers at Google DeepMind for the creation of AlphaFold2. This kind of breakthrough didn’t come out of a “traditional” academic lab. It came out of industry research, using ML tools to push the frontier of a completely different field. If Chemistry is giving out Nobels for this work, maybe it’s time for us to stop treating ML like it’s not serious science3.
And beyond the science, there’s a very practical angle too: the academic job market is brutal. There are more brilliant PhDs than tenure-track positions, and even fewer in policy-facing or methodological roles that reward creativity. If we don’t expose students to modern tools (*especially* ones that are widely used in industry) we’re not preparing them (and I include myself here) for the full range of careers they may need or want to pursue. The industry isn’t waiting for us to catch up. The least we can do is make sure our students aren’t left behind.
I’ve had to defend these positions a couple of times already, and I think writing about the experience (in order to prepare others for what they’ll likely face if they try to do the same) is worth doing. To make this case more concrete, let me situate you in the current state of affairs. Understanding the context helps clarify why the current resistance feels both familiar and ultimately misguided.
But before we move to the critical points, it’s necessary to acknowledge the giants pushing the fields to intersect: Professors Athey at Stanford and Chernozhukov at MIT. They’re leading figures at two of the most prestigious Econ departments in the world, representing a geographic and intellectual complementarity that’s been central to the field’s development4. Professor Athey, who has one foot in Silicon Valley and the other in academia (she was Chief Economist at Microsoft Research), has done a lot to make ML legible to economists (especially by focusing on causal questions and by making the theory accessible to undergrads and postgrads). Her work on causal trees and causal forests is all about taking the structure economists care about and building in the flexibility that real-world data demands. Professor Chernozhukov, meanwhile, has taken the inference side (which makes sense since he’s an econometrician). His work on high-dimensional econometrics (aka the double/debiased ML framework with Belloni and others) showed how you can combine ML’s predictive power with solid causal identification. It’s basically the answer to the “but what about inference?” pushback you still hear in seminars. Together with Professor Athey’s frequent collaborator Professor Imbens and early adopters like Professor Mullainathan, they’ve helped shift the conversation from “prediction versus causation” to “prediction in service of causation”.
To put their contributions in perspective, it’s worth stepping back and tracing how ML and economics first started to intersect. It’s no easy task to figure out where these two “fields” first intersected, or to pinpoint the single first economics paper that used what we now broadly categorize as ML techniques, partly because the definition of "machine learning" has evolved, and partly because some foundational methods were adopted gradually. If we look at when specific techniques came into being, for example, we can start with Ridge Regression (a form of L2 regularization), introduced by Hoerl and Kennard in 1970, and LASSO (Least Absolute Shrinkage and Selection Operator, a form of L1 regularization), introduced by Robert Tibshirani in 1996.
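If you’ve never played with these estimators, here’s a minimal sketch (scikit-learn, simulated data; the sample size, penalty values, and coefficients are all invented for illustration) of how ridge and LASSO shrink coefficients relative to plain OLS:

```python
# Minimal sketch: ridge (L2) vs. LASSO (L1) on simulated high-dimensional data.
# Purely illustrative; the data-generating process and penalties are made up.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
n, p = 200, 50                             # many regressors relative to the sample size
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [2.0, -1.5, 1.0, 0.5, -0.5]     # only 5 regressors truly matter
y = X @ beta + rng.normal(size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)        # L2 penalty shrinks all coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)         # L1 penalty sets many coefficients exactly to zero

print("nonzero OLS coefficients:  ", np.sum(np.abs(ols.coef_) > 1e-8))
print("nonzero ridge coefficients:", np.sum(np.abs(ridge.coef_) > 1e-8))
print("nonzero LASSO coefficients:", np.sum(np.abs(lasso.coef_) > 1e-8))
```

The point of the sketch is just the contrast: ridge keeps every regressor but shrinks them, while LASSO performs variable selection by zeroing most of them out.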
These regularization methods weren’t created with economists in mind, but they quickly became useful for us as datasets got bigger and models more complex. Still, for a long time, they mostly lived on the margins of applied work. You might see LASSO show up in a robustness check or in a variable selection appendix, but it wasn’t central to the analysis. That started to change when the tools stopped being just about prediction and started helping us answer causal questions more transparently.
The real shift came when economists began to recognize that ML could be used to reduce overfitting and improve out-of-sample performance, and also to solve problems we already cared about, like estimating treatment effects in high-dimensional settings or uncovering heterogeneity we suspected was there but couldn’t capture with linear models. Suddenly, ML wasn’t a detour away from economic reasoning; it was more like a way to get closer to the truth in messy, real-world data.
This is where the crossover with causal inference really took off. Methods like causal forests, double ML, and targeted regularization didn’t just borrow from CS, they adapted those tools to meet our standards for identification and interpretation. That’s why they stuck. And that’s why people like Professors Athey, Chernozhukov, and others were so effective at building the bridge: they spoke the language of economists and they understood the technical machinery involved. So, when people say “ML is just prediction”, you can reply that this is an outdated perspective that overlooks the substantive evolution happening in the empirical literature. The goal was never to abandon causality. The goal was (and is) to do better causal inference with better tools5. That’s a framing I think is worth emphasizing.
Now that you more or less know where we are at the moment and why we need to talk about this, I have a list of the most common criticisms I’ve heard. The way I think we can address them is by pointing at the literature while we try to make the most out of it without losing sight of reality.
1. “It’s just curve-fitting/overfitting”
You’ve probably heard some version of this: ML models are too flexible and will fit noise rather than signal, they don’t generalize well out-of-sample, and economic relationships should be “parsimonious”, not complex. This criticism fundamentally misunderstands how modern ML works. The claim that ML models “overfit” ignores that the entire ML workflow is designed around out-of-sample validation6, which is something traditional econometrics often skips entirely. Mullainathan and Spiess (2017) make this point directly: ML’s obsession with predictive performance makes models more reliable, not less, because it forces researchers to test whether their results generalize beyond the training data. When Belloni et al. (2012, 2014) applied LASSO regularization to causal inference problems, they improved treatment effect estimation by selecting relevant controls while discarding noise, and thus avoiding overfitting. The “parsimony” argument is equally misguided. As Varian (2014) points out, when economic relationships are genuinely complex, forcing them into overly simple models creates more bias than allowing appropriate complexity. The solution is to use principled methods like regularization that balance complexity with generalizability7 and not to pretend the world is simple. We see this in practice: Kleinberg et al. (2018) showed that ML models predicting judicial decisions outperformed simple heuristics precisely because they could handle complexity without overfitting. The irony is that economists criticize ML for “curve-fitting” while often using specification searches and robustness checks that are far more prone to overfitting than properly cross-validated ML approaches.
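To make that out-of-sample discipline concrete, here’s a hedged sketch (scikit-learn again, with a data-generating process I invented for the example) of choosing the LASSO penalty by cross-validation and then judging the model on a held-out test set, which is exactly the workflow the criticism overlooks:

```python
# Sketch: penalty chosen by cross-validation, performance judged out of sample.
# Illustrative only; the data-generating process is invented for the example.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
n, p = 500, 100
X = rng.normal(size=(n, p))
y = 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)   # two real signals, 98 noise columns

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 5-fold cross-validation picks the penalty that predicts best on held-out folds,
# which is precisely the discipline that keeps the model from fitting noise.
model = LassoCV(cv=5).fit(X_train, y_train)

print("chosen penalty (alpha):", model.alpha_)
print("selected regressors:   ", np.flatnonzero(model.coef_))
print("out-of-sample R^2:     ", r2_score(y_test, model.predict(X_test)))
```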
2. “Black box problem”
This one always comes up: you can’t interpret the results, there’s no economic intuition behind the relationships, and policymakers need to understand why something works, not just that it works. But this criticism conflates older ML methods with modern approaches that are explicitly designed for interpretability. Athey and Wager’s (2019) causal forests give you treatment effects and tell you exactly which covariates drive heterogeneity and how. You can’t call that a black box; it’s more interpretable than most traditional econometric models that assume homogeneous effects. Tools like SHAP values (Lundberg and Lee, 2017) and LIME (Ribeiro et al., 2016) can decompose any model’s predictions into interpretable components, showing exactly how each variable contributes to the outcome. When the aforementioned Kleinberg et al. (2018) built ML models to predict judge decisions, they extracted insights about which case characteristics actually matter for judicial outcomes, information that was invisible in traditional analyses. We can’t simply consider every variable in isolation and judge its relevance on its own. The deeper issue is that “interpretability” often means “fits my priors” rather than “reveals true mechanisms”. A linear regression with 20 control variables isn’t inherently more interpretable than a well-designed tree-based model that shows you which interactions matter the most. As Athey (2018) argues, ML tools can enhance economic intuition by revealing patterns and relationships that traditional methods miss. The question is whether we’re willing to update our understanding when the data suggests more complex relationships than our simple models assumed.
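As one deliberately generic illustration that flexible models remain inspectable, here’s a short sketch using permutation importance in scikit-learn on simulated data. This is not Athey and Wager’s implementation, just a simple way to see which covariates actually move a forest’s predictions:

```python
# Sketch: a tree ensemble is not a sealed box; permutation importance shows
# which covariates drive its predictions. Simulated data, illustrative names.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 1000
income, age, noise = rng.normal(size=(3, n))
outcome = 2.0 * income + 0.5 * income * age + rng.normal(size=n)   # interaction a linear model would miss
X = np.column_stack([income, age, noise])

X_tr, X_te, y_tr, y_te = train_test_split(X, outcome, test_size=0.3, random_state=0)
forest = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# How much does out-of-sample accuracy drop when we scramble each covariate?
imp = permutation_importance(forest, X_te, y_te, n_repeats=20, random_state=0)
for name, score in zip(["income", "age", "noise"], imp.importances_mean):
    print(f"{name:>6}: {score:.3f}")
```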
3. “Prediction ≠ causation”
This is probably the most fundamental objection: ML is only good for forecasting, not understanding causal relationships; economics is about identifying causal effects, not just correlations; you can’t do policy analysis without causal identification. After all, we did have an entire revolution concerning this worry. But this criticism is based on a false dichotomy that ignores how modern causal ML works. Chernozhukov et al. (2018) showed exactly how to combine ML’s predictive power with rigorous causal identification in their double/debiased ML framework8: you use ML to estimate nuisance parameters (like propensity scores or outcome models) while maintaining all the causal identification you care about. Athey and Wager (2019) took this further with causal forests by demonstrating how tree-based methods can identify heterogeneous treatment effects with full causal rigor. These are causal inference methods that happen to use ML tools rather than “prediction” methods. A deeper insight comes from Kleinberg et al. (2015), who pointed out that many policy problems ARE prediction problems: predicting who will benefit most from a treatment, predicting which interventions will work in which contexts, predicting the consequences of policy changes. When you frame it this way, the prediction vs. causation distinction starts to look artificial. Belloni et al. (2014) solved the lingering inference concerns by showing how to do valid statistical inference after using ML for variable selection. The result? “Better causal inference through better prediction” rather than “prediction instead of causation”. When Davis and Heller (2017) used causal forests to understand treatment heterogeneity in job training programs, they were making causal inference more flexible and informative than traditional approaches ever could, not abandoning it.
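Here’s what “ML for the nuisance parameters, econometrics for the causal parameter” looks like in a stripped-down sketch of the partialling-out idea behind double ML in a partially linear model, with two-fold cross-fitting. The data-generating process is invented and everything is simplified; packages like DoubleML or econml implement the real procedure:

```python
# Sketch of the partialling-out / cross-fitting idea behind double ML
# (partially linear model). Simulated data; illustrative, not a full implementation.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
n, p, theta = 2000, 10, 1.0                      # true treatment effect is 1.0
X = rng.normal(size=(n, p))
T = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(size=n)   # treatment depends on X nonlinearly
Y = theta * T + np.cos(X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(size=n)

res_Y = np.zeros(n)
res_T = np.zeros(n)
for train, test in KFold(n_splits=2, shuffle=True, random_state=0).split(X):
    # ML handles the nuisance functions E[Y|X] and E[T|X] on one fold...
    m_Y = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[train], Y[train])
    m_T = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[train], T[train])
    # ...and we residualize on the other fold (cross-fitting), so no observation
    # is predicted by a model that saw it during training.
    res_Y[test] = Y[test] - m_Y.predict(X[test])
    res_T[test] = T[test] - m_T.predict(X[test])

# The causal parameter comes from a simple residual-on-residual regression (orthogonalization).
theta_hat = np.sum(res_T * res_Y) / np.sum(res_T ** 2)
print("estimated treatment effect:", round(theta_hat, 3), "(true value 1.0)")
```

The design choice is the whole point: the flexible learners never touch the causal parameter directly, they only clean out the confounding variation, and the final step is an estimator any econometrician would recognize.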
4. “No economic theory”
Ok, we are halfway through these. Here’s another one you hear all the time: ML methods are atheoretical, they let the data speak without economic reasoning; economics should be driven by theory, not algorithmic pattern recognition (whatever that means); and results lack the structural interpretation needed for counterfactuals. But this gets the relationship between theory and ML backwards. Kleinberg et al. (2015) showed that prediction problems are often at the heart of economic theory: optimal taxation requires predicting behavioral responses, welfare analysis requires predicting who benefits from policies, market design requires predicting how agents will behave under different rules. ML makes this reasoning more rigorous by forcing us to be explicit about our assumptions and test whether our theories predict well. Mullainathan and Spiess (2017) put it perfectly: ML improves traditional econometric practice by providing better tools for the things we were already trying to do. When Gentzkow et al. (2019) used text analysis to study media bias, they were testing theories about how competition affects content in ways that were impossible before ML tools existed, rather than abandoning economic theory. The “structural interpretation” concern is equally misguided. Hartford et al. (2017) showed how to combine deep learning with IVs for structural estimation (akin to what the paper that started this whole conversation was trying to do). Bajari et al. (2015) demonstrated how ML can improve demand estimation while maintaining full economic interpretation. ML doesn’t lack theory (although practice tends to precede it), and it forces us to confront whether our theories really work when tested against complex, real-world data. Sometimes they do, sometimes they don’t, and sometimes the data reveals relationships our theories missed entirely. That’s not atheoretical; that’s how science progresses.
5. “External validity concerns”
This criticism also has it backwards. The argument is that models trained on one dataset don’t work elsewhere, that economic relationships vary across time and place, and that we need models that capture stable fundamentals. The concern itself is valid: Professor Athey notes that an algorithm might find unstable relationships, like the fact that for a time “the presence of a piano in a video may thus predict cats”. But the idea that this makes ML worse than traditional methods is wrong. In fact, traditional approaches that assume homogeneous treatment effects and stable parameters across contexts have a much bigger external validity problem; they just assume it away. ML tools do the opposite: they explicitly model heterogeneity, telling you exactly why and where effects vary. This is the core of external validity. When Athey and Wager developed causal forests, they were providing a systematic way to “discover forms of heterogeneity”. Chernozhukov et al. (2018) took this further by developing formal inference procedures that help us understand when results will generalize. And we see this in practice: Kleinberg et al. (2018) built judge prediction models that worked across different courts and time periods precisely because they could identify which factors were stable and which were context-specific. The deeper point is that external validity is about understanding *systematic* variation. When Davis and Heller (2017) used causal forests to study job training programs, they identified which participant characteristics predicted where the program would work and where it wouldn’t. We should not see this as a threat to external validity, but as external validity done right. The irony is that we sometimes worry about ML models not generalizing while using methods that assume away the very heterogeneity that determines whether results will generalize in the first place. Similarly, Gu et al. (2020) demonstrated that ML methods in asset pricing consistently outperform traditional models across different time periods and market regimes. Mullainathan and Obermeyer (2017) found the same pattern in healthcare, where ML models proved more robust across different patient populations than traditional risk-adjustment methods that rely on stable, homogeneous relationships.
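To see what “explicitly modeling heterogeneity” means in practice, here’s a toy sketch on simulated experimental data. I use a simple T-learner (separate outcome models for treated and control units) as a stand-in for the causal-forest machinery, which grf and econml implement properly; every variable name and number here is invented:

```python
# Toy sketch of heterogeneous treatment effects: fit separate outcome models for
# treated and control units and compare their predictions (a "T-learner").
# A simplified stand-in for causal forests, on simulated experimental data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
n = 4000
age = rng.uniform(18, 65, size=n)
prior_earnings = rng.normal(size=n)
T = rng.integers(0, 2, size=n)                        # randomized treatment
tau = np.where(age < 30, 2.0, 0.2)                    # the program helps younger participants more
Y = prior_earnings + tau * T + rng.normal(size=n)

X = np.column_stack([age, prior_earnings])
model_1 = RandomForestRegressor(n_estimators=300, random_state=0).fit(X[T == 1], Y[T == 1])
model_0 = RandomForestRegressor(n_estimators=300, random_state=0).fit(X[T == 0], Y[T == 0])

cate = model_1.predict(X) - model_0.predict(X)        # estimated effect for each individual
print("mean estimated effect, under 30:", round(cate[age < 30].mean(), 2))
print("mean estimated effect, 30 plus: ", round(cate[age >= 30].mean(), 2))
```

The output recovers the pattern built into the simulation: rather than one averaged effect, you learn for whom the program works, which is exactly the information external validity questions hinge on.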
6. “Statistical inference problems”
Ok, I will give them that - to some extent. This criticism gets at something real but misunderstands the solution. The concerns are legitimate: traditional standard errors “break down” when you use ML for variable selection, multiple testing becomes a serious issue when algorithms explore thousands of potential relationships, and post-selection inference creates well-known statistical problems. As Professor Athey puts it, a central theme of the new literature is that ML algorithms “have to be modified to provide valid confidence intervals for estimated effects when the data is used to select the model”. What I would say here is that we shouldn’t avoid ML. We should use the statistical innovations that have specifically solved these problems. The modern literature has done exactly that, often using techniques like “sample splitting” and “orthogonalization” to ensure valid inference. For example, Hansen and Kozbur (2014) demonstrated how to do valid inference in high-dimensional panel models. Chernozhukov, Hansen, and Spindler (2015) provide a comprehensive framework for post-selection inference, while Hansen and Liao (2019) developed bootstrap-based approaches for these complex settings. The double/debiased ML framework from Chernozhukov et al. solves the problem by separating the prediction task (where ML excels) from the inference task (where econometric rigor is maintained), giving you the best of both worlds. The multiple testing critique is particularly ironic because traditional econometrics is often far more guilty of this. We frequently check “dozens or even hundreds of alternative specifications behind the scenes” without correcting for it, a practice that invalidates reported p-values (the incentives are there9). In contrast, ML-based methods are systematic and transparent. The causal forests developed by Athey and Wager (2019) come with an asymptotic theory that provides valid inference even after the algorithm explores complex interactions. The deeper point is that these are not uniquely “ML problems”; they are statistical problems that exist whenever a researcher engages in model selection. The difference is that the ML and modern econometrics literature, including work like selective inference from Tibshirani et al. (2016), has developed principled and transparent solutions where traditional practice often ignored the problem entirely.
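And here’s a minimal sketch of the sample-splitting logic itself: one half of the data selects the controls via LASSO, the other half is reserved for OLS estimation and standard errors. This is a deliberate simplification (post-double-selection in Belloni et al. refines the selection step), using simulated data and statsmodels for the inference:

```python
# Sketch of sample splitting for honest inference: one half of the data selects
# the controls, the other half is reserved for estimation and standard errors.
# Simplified illustration; post-double-selection (Belloni et al.) refines this.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(5)
n, p = 1000, 60
X = rng.normal(size=(n, p))
D = X[:, 0] + rng.normal(size=n)                     # treatment correlated with the first control
Y = 0.8 * D + 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)

half = n // 2
# Selection half: let LASSO pick which controls predict the outcome.
selected = np.flatnonzero(LassoCV(cv=5).fit(X[:half], Y[:half]).coef_)

# Inference half: plain OLS of Y on D and the selected controls,
# untouched by the selection step, so the reported standard errors are honest.
regressors = sm.add_constant(np.column_stack([D[half:], X[half:, selected]]))
fit = sm.OLS(Y[half:], regressors).fit()
print("treatment coefficient:", round(fit.params[1], 3),
      "   std. err.:", round(fit.bse[1], 3))
```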
7. “It's just a fad”
I will finish with this one (also because I’m about to hit the e-mail length limit), and I hope I was able to convince you somewhat of the benefits of using ML techniques in Econ.
My problem with people saying “it’s just a fad” is that this criticism reveals a fundamental misunderstanding of how scientific progress works. The argument goes that Econ has seen many methodological fashions come and go, that core economic insights don’t require fancy new tools, and that traditional econometric methods have worked fine for decades. But this perspective confuses *stability* with *stagnation*. Every major methodological advance in Econ was once dismissed as a fad. Regression analysis itself was once controversial. IVs are still viewed with suspicion. Even RCTs faced resistance before becoming the “gold standard” for causal inference (you can include me on that). The pattern is always the same: initial skepticism, gradual adoption by leading researchers, then widespread acceptance as the new normal. ML is following exactly this trajectory, but with one special difference: the scale and speed of adoption.
Let’s consider the institutional evidence: top economics departments are hiring ML-trained economists, leading journals are starting to publish ML-based research, and the most prestigious conferences feature ML sessions. The Nobel Committee awarded prizes to researchers using computational methods that were unimaginable decades ago. We can’t call this a fad; it’s a permanent expansion of the economist’s toolkit. The “core insights” argument is particularly misguided.
Varian (2014) argued that traditional econometric tools, while effective in many contexts, face serious limitations when confronted with the scale and complexity of modern datasets. How do you study personalized pricing algorithms? How do you analyse text from millions of social media posts or model outcomes using thousands of predictors? Do not label these as “fringe cases” when they are, in reality, central to understanding today’s economic activity. ML offers practical solutions where conventional approaches often fall short. The idea that “old methods worked fine for decades” ignores the empirical reality that those methods frequently break down in the face of high-dimensional or unstructured data. Mullainathan and Spiess (2017) demonstrated that many standard econometric practices perform poorly in out-of-sample tests, and that what “worked fine” may have been an illusion created by never testing predictive performance. The deeper issue is that dismissing ML as a fad reflects a conservative bias that mistakes familiarity for validity. As Professor Athey argues, the question isn’t whether traditional methods are good enough, but whether we can do better. The evidence increasingly suggests we can.
One can even argue about a third driver: institutional incentives (e.g., tenure metrics tied to causal identification papers). Some of the mechanisms in play that we can think of: top journals are more likely to accept papers with clean causal designs, tenure committees count those “causal hits” and discount purely predictive studies, PhD advisors channel students toward DiD rather than neural networks, grant reviewers ask “where’s the treatment?” and mark down ML-only proposals, core graduate courses leave ML out so the entry cost stays high, and authors often graft an instrument onto their models just to satisfy referees. My friend Prashant and Professor Fetzer have an excellent paper on the “rise in the share of causal claims (papers) - from roughly 4% in 1990 to nearly 28% in 2020 - reflecting the growing influence of the ‘credibility revolution’”. They find that “causal narrative complexity (e.g., the depth of causal chains) strongly predicts both publication in top-5 journals and higher citation counts, whereas non-causal complexity tends to be uncorrelated or negatively associated with these outcomes”.
AlphaFold utilizes ML (particularly deep learning) to predict the 3D structure of proteins from their amino acid sequences. It leverages large datasets of known protein structures and sequences to train a neural network that can identify patterns and relationships in these structures.
“A prediction I have is that there will be an active and important literature combining ML and causal inference to create new methods, methods that harness the strengths of ML algorithms to solve causal inference problems. (…) This new literature takes many of the strengths and innovations of ML methods, but applies them to causal inference. Doing this requires changing the objective function, since the ground truth of the causal parameter is not observed in any test set”. Athey, 2018.
I’d like to thank Professor Marica Valente for this insight. I attended her course in Machine Learning last summer at ISEG and had a great time, she’s awesome.
“Researchers often check dozens or even hundreds of alternative specifications behind the scenes, but rarely report this practice because it would invalidate the confidence intervals reported (due to concerns about multiple testing and searching for specifications with the desired results). There are many disadvantages to the traditional approach, including but not limited to the fact that researchers would find it difficult to be systematic or comprehensive in checking alternative specifications, and further because researchers were not honest about the practice, given that they did not have a way to correct for the specification search process. I believe that regularization and systematic model selection have many advantages over traditional approaches, and for this reason will become a standard part of empirical practice in economics”. Athey, 2018.
“The ML literature uses a variety of techniques to balance expressiveness against over-fitting. The most common approach is cross-validation whereby the analyst repeatedly estimates a model on part of the data (a “training fold”) and then evaluates it on the complement (the “test fold”). The complexity of the model is selected to minimize the average of the mean-squared error of the prediction (the squared difference between the model prediction and the actual outcome) on the test folds. Other approaches used to control over-fitting include averaging many different models, sometimes estimating each model on a subsample of the data (one can interpret the random forest in this way)”. Athey, 2018.
“There are discussions of what interpretability means, and whether simpler models have advantages. Of course, economists have long understood that simple models can also be misleading. In social sciences data, it is typical that many attributes of individuals or locations are positively correlated–parents’ education, parents’ income, child’s education, and so on. (…) So, simpler models can sometimes be misleading; they may seem easy to understand, but the understanding gained from them may be incomplete or wrong”. Athey, 2018.
Chernozhukov et al. have a new pre-print where they provide a practical introduction to Double/Debiased Machine Learning (DML).
“We examine how the evaluation of research studies in economics depends on whether a study yielded a null result. Studies with null results are perceived to be less publishable, of lower quality, less important and less precisely estimated than studies with large and statistically significant results, even when holding constant all other study features, including the sample size and the precision of the estimates. The null result penalty is of similar magnitude among PhD students and journal editors. The penalty is larger when experts predict a large effect and when statistical uncertainty is communicated with p-values rather than standard errors. Our findings highlight the value of a pre-result review”. Chopra et al., 2024.
Some recommended readings, in no specific order:
A Hands-on Machine Learning Primer for Social Scientists: Math, Algorithms and Code, by Nikos Askitas, 2024 (if you are a social scientist who got to this point and want to get your hands dirty, I would recommend reading this guide as a starter).
Machine Learning For Causal Inference In Economics, by Anthony Strittmatter, 2025.
Machine Learning Methods That Economists Should Know About, by Susan Athey and Guido W. Imbens, 2019
Financial Machine Learning, by Bryan T. Kelly and Dacheng Xiu, 2023
Lawrence R. Klein’s Principles in Modeling and Contributions in Nowcasting, Real-Time Forecasting, and Machine Learning, by Roberto S. Mariano and Suleyman Ozmucur, 2020
Statistical Modeling: The Two Cultures (paper / slides), by Leo Breiman, 2001 (thanks Gian for this tip!)
Copied from X: Interesting post. On the Theory bit: I think Econ moved in the wrong direction recently where 'proper' papers are expected to do both theory and empirical work. Often, neither part is fully convincing, and both are engineered to complement each other. I'd prefer a world where some papers focus solely on establishing robust statistical relationships, i.e. stylized facts, and then theorists come up with mechanisms explaining these. Finally, structural people take competing explanations and empirically test which one fits the data best. Currently many papers do all of these 3 steps jointly, and I think that's far from ideal. Circling back to your post, I think ML techniques are definitely useful in step 1 and probably also in step 3, and there's no reason to dismiss them.