Why DML IRM is the Best Modern Choice for Causal Inference

Double Machine Learning (DML) with the Interactive Regression Model (IRM) framework represents the state-of-the-art approach for causal inference from observational data. Here's why it has become the preferred method for modern causal analysis:

1. Combines Flexibility with Rigor

DML IRM achieves what was previously thought impossible: using highly flexible machine learning models (random forests, gradient boosting, neural networks) while maintaining valid statistical inference with proper confidence intervals and p-values. Traditional parametric methods require restrictive functional form assumptions, while naive ML approaches sacrifice inferential validity.

2. Doubly Robust Protection

The IRM score functions (AIPW/DR scores) provide double robustness: the causal estimate remains consistent if either the outcome model $\mathbb{E}[Y|X,D]$ or the propensity score $\mathbb{E}[D|X]$ is correctly specified. You don't need both models to be perfect—getting one right is sufficient. This is a massive safety net that traditional regression lacks.

3. Neyman Orthogonality: Immunity to Regularization Bias

The key innovation is orthogonalized moment conditions that make the causal parameter estimate insensitive to small errors in nuisance function estimation. Even if your ML models for $g(D,X)$ and $m(X)$ have regularization bias or aren't perfectly converging to the truth, the treatment effect estimate maintains its $\sqrt{n}$ -convergence rate and asymptotic normality. This is formalized through:

$\frac{\partial}{\partial \eta}\mathbb{E}[\psi(W;\theta_0,\eta_0)] = 0$

This orthogonality condition ensures first-order insensitivity to nuisance estimation errors.

4. Cross-Fitting Eliminates Overfitting Bias

Sample splitting (cross-fitting) ensures that:

The same data is never used for both training nuisance models and computing scores
Out-of-sample predictions prevent overfitting-induced bias
Multiple splits with averaging reduce sampling variability

This makes DML robust even with complex, high-capacity learners that would otherwise overfit.

5. High-Dimensional Capability

DML IRM naturally handles high-dimensional confounders ( $p \gg n$ scenarios) through:

Modern ML methods (LASSO, random forests, boosting, neural nets) for nuisance estimation
Theoretical guarantees that hold under sparsity or approximate sparsity
No need to manually select which confounders to include

6. Unified Framework for Multiple Estimands

The same IRM framework seamlessly estimates:

ATE: Average treatment effect across the entire population
ATTE/ATT: Effect on the treated (more policy-relevant in many contexts)
CATE: Conditional effects for heterogeneity analysis
GATE: Group average treatment effects

Simply by changing the score function, you get valid inference for different causal questions.

7. Built-In Diagnostic Framework

Modern DML IRM implementations provide comprehensive diagnostics:

Overlap/positivity checks: Ensure common support between treated and control
Orthogonality validation: Verify Neyman orthogonality via out-of-sample moment checks
Sensitivity analysis: Assess robustness to potential unobserved cofounding
Score diagnostics: Check for extreme influence values or distributional issues

8. Proven Theoretical Foundation

DML rests on rigorous semiparametric theory (Chernozhukov et al., 2018):

Formal asymptotic normality results
Convergence rates under weak assumptions
Uniform inference guarantees
Multiple robustness properties

9. Practical Performance

Empirical studies consistently show DML IRM outperforms alternatives:

Better finite-sample coverage of confidence intervals than standard regression
Lower bias than single-model approaches
More stable estimates than purely parametric methods in complex settings
Competitive or superior to propensity score methods while being more automatic

10. Transparent Cofounding Adjustment

Unlike black-box methods, DML IRM makes explicit:

What confounders are being adjusted for
How the adjustment is performed (via outcome and propensity models)
The quality of overlap and model fit
The influence function decomposition showing where the estimate comes from

When DML IRM Shines

DML IRM is particularly powerful for:

Observational studies with rich covariate sets
Digital experiments with high-dimensional user features
Policy evaluation where selection into treatment depends on many factors
Heterogeneous effects requiring flexible modeling
Business applications demanding both accuracy and valid uncertainty quantification

Key Reference: Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1-C68.

Math of DML IRM

Observed data: $(Y_i, D_i, X_i)_{i=1}^n$ .
- $Y\in\mathbb R$ (or $\{0,1\}$ ) is the outcome.
- $D\in\{0,1\}$ is a binary treatment.
- $X\in\mathbb R^p$ are observed confounders.
Potential outcomes: $Y(1), Y(0)$ .
Targets:
- ATE: $\theta_{\mathrm{ATE}}=\mathbb E\big[Y(1)-Y(0)\big]$ .
- ATTE: $\theta_{\mathrm{ATTE}}=\mathbb E\big[Y(1)-Y(0)\mid D=1\big]$ .

Assumptions:

Uncofoundedness $(Y(1),Y(0))\perp D\mid X$
Positivity $0<\Pr(D=1\mid X)<1$
SUTVA
Regularity for cross-fitting and ML.

Scores

u_1 = Y-\hat g_1,\quad u_0 = Y-\hat g_0,\quad h_1 = \frac{D}{\hat m},\quad h_0 = \frac{1-D}{1-\hat m},\quad \hat p_1=\mathbb{E}_n[D].

ATE score (AIPW/DR)

\psi_a^{\mathrm{ATE}}=-1,\qquad \psi_b^{\mathrm{ATE}}=(\hat g_1-\hat g_0)+u_1h_1-u_0h_0.

Estimator and per-unit influence value:

\hat\theta_{\mathrm{ATE}}=\mathbb{E}n\big[\psi_b^{\mathrm{ATE}}\big],\qquad \hat\psi_i^{\mathrm{ATE}}=\psi_{b,i}^{\mathrm{ATE}}-\hat\theta_{\mathrm{ATE}}.

ATTE score

Let $p_1=\mathbb{E}[D]$ (estimate with $\hat p_1$ ). Then

\psi_a^{\mathrm{ATTE}}=-\frac{D}{p_1},\qquad \psi_b^{\mathrm{ATTE}}=\frac{D}{p_1}(\hat g_1-\hat g_0) +\frac{D}{p_1}(Y-\hat g_1) -\frac{1-D}{p_1}\frac{\hat m}{1-\hat m}(Y-\hat g_0).

Estimator and per-unit influence value:

\hat\theta_{\mathrm{ATTE}}=\mathbb{E}n\big[\psi_b^{\mathrm{ATTE}}\big],\qquad \hat\psi_i^{\mathrm{ATTE}}=\psi_{b,i}^{\mathrm{ATTE}}+\psi_{a,i}^{\mathrm{ATTE}}\hat\theta_{\mathrm{ATTE}} \quad(\text{since }\mathbb{E}_n[D/\hat p_1]=1).

Variance & CI (both ATE and ATTE)

For either target, using its corresponding $\hat\psi_i$ :

\widehat{\mathrm{Var}}(\hat\theta)=\frac{1}{n^2}\sum_{i=1}^n \hat\psi_i^2,\qquad \mathrm{se}=\sqrt{\widehat{\mathrm{Var}}(\hat\theta)},\qquad \mathrm{CI}_{1-\alpha}=\hat\theta\pm z_{1-\alpha/2}se