
UnconfoundednessModel (IRM)


0) Assumptions

  • SUTVA / consistency:

    • No interference: unit $i$'s outcome does not depend on other units' treatment assignments.
    • No hidden versions: "treatment" and "control" correspond to well-defined interventions.
    • Consistency: the observed outcome equals the potential outcome under the realized treatment: $Y_i = Y_i(D_i)$.
  • Unconfoundedness (Conditional Independence Assumption): Potential outcomes are independent of treatment assignment given observed confounders $X$: $(Y(1), Y(0)) \perp D \mid X$. This ensures that after adjusting for $X$, there are no unobserved factors affecting both treatment and outcome.

  • Overlap / positivity: The probability of receiving treatment is strictly between 0 and 1 for all $X$: $0 < \Pr(D=1 \mid X) < 1$. In practice, this is enforced via trimming: $\hat{m}(X) \in [\varepsilon, 1-\varepsilon]$, where $\varepsilon$ is the trimming_threshold.

1) Data + target estimands

You observe i.i.d. units $i = 1, \dots, n$ with:

  • outcome $Y_i \in \mathbb{R}$,
  • treatment $D_i \in \{0, 1\}$,
  • confounders $X_i \in \mathbb{R}^p$.

Targets:

  • ATE (Average Treatment Effect): $\tau = \mathbb{E}[Y_i(1) - Y_i(0)]$
  • ATTE (Average Treatment Effect on the Treated): $\tau_{\mathrm{att}} = \mathbb{E}[Y_i(1) - Y_i(0) \mid D_i = 1]$

2) Cross-fitting (Sample Splitting)

The data is partitioned into $K$ folds $\{I_k\}_{k=1}^K$ (typically n_folds=5). For each fold $k$:

  1. Train nuisance models $\hat{g}_{0,k}, \hat{g}_{1,k}, \hat{m}_k$ using all data except fold $k$ ($I_k^c$).
  2. Generate out-of-sample predictions for units in fold $k$: $\hat{g}_0(X_i) = \hat{g}_{0,k}(X_i), \quad \hat{g}_1(X_i) = \hat{g}_{1,k}(X_i), \quad \hat{m}(X_i) = \hat{m}_k(X_i) \quad \text{for } i \in I_k$
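The split-train-predict loop above can be sketched as follows. This is a minimal illustration with scikit-learn random forests and synthetic data, not the actual implementation; the variable names (g0_hat, g1_hat, m_hat) are hypothetical.

```python
# Illustrative K-fold cross-fitting for the IRM nuisances (a sketch,
# not the DoubleML implementation). Data below is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n, p = 500, 5
X = rng.normal(size=(n, p))
D = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))  # treatment, confounded by X
Y = 0.5 * D + X[:, 0] + rng.normal(size=n)       # true ATE = 0.5

g0_hat = np.empty(n)  # out-of-fold predictions of E[Y | X, D=0]
g1_hat = np.empty(n)  # out-of-fold predictions of E[Y | X, D=1]
m_hat = np.empty(n)   # out-of-fold predictions of Pr(D=1 | X)

for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # 1) fit nuisances on the complement of fold k
    g0 = RandomForestRegressor(random_state=0).fit(
        X[train][D[train] == 0], Y[train][D[train] == 0])
    g1 = RandomForestRegressor(random_state=0).fit(
        X[train][D[train] == 1], Y[train][D[train] == 1])
    m = RandomForestClassifier(random_state=0).fit(X[train], D[train])
    # 2) predict only on the held-out fold k, so no unit is scored
    #    by a model that saw it during training
    g0_hat[test] = g0.predict(X[test])
    g1_hat[test] = g1.predict(X[test])
    m_hat[test] = m.predict_proba(X[test])[:, 1]
```

Fitting the two outcome models on the treated and control subsets of the training folds separately is what makes $\hat{g}_1$ and $\hat{g}_0$ estimates of the two conditional means.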

3) Nuisance function estimation

The IRM estimator requires three nuisance components:

  • Outcome models: $g_1(X) = \mathbb{E}[Y \mid X, D=1], \quad g_0(X) = \mathbb{E}[Y \mid X, D=0]$
  • Propensity score: $m(X) = \Pr(D=1 \mid X)$

These are typically estimated using ML models like CatBoost or Random Forest. Propensity scores are clipped to [ε,1ε][\varepsilon, 1-\varepsilon] to ensure stability.
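The clipping step is a one-liner; a sketch (the function name clip_propensity and the trimming_threshold default are illustrative, not the library's API):

```python
import numpy as np

def clip_propensity(m_hat, trimming_threshold=0.01):
    """Clip estimated propensities into [eps, 1 - eps] to keep the
    inverse-probability weights 1/m and 1/(1-m) bounded."""
    return np.clip(m_hat, trimming_threshold, 1 - trimming_threshold)

clipped = clip_propensity(np.array([0.001, 0.3, 0.999]))
```

Here the extreme scores 0.001 and 0.999 are pulled to 0.01 and 0.99 while the interior value 0.3 is untouched.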


4) Scores and Moment Equations (AIPW/DR)

Define residuals: $u_1 = Y - \hat{g}_1, \quad u_0 = Y - \hat{g}_0$

Define (optionally normalized) IPW terms: $h_1 = \frac{D}{\hat{m}}, \quad h_0 = \frac{1-D}{1-\hat{m}}$

The estimator solves the sample moment equation $\mathbb{E}_n[\psi_a \theta + \psi_b] = 0$.

  • ATE score (doubly robust): $\psi_a = -1, \quad \psi_b = (\hat{g}_1 - \hat{g}_0) + u_1 h_1 - u_0 h_0$. If normalize_ipw=True (Hájek), $h_1$ and $h_0$ are scaled by their empirical means: $\bar{h}_1 = h_1 / \mathbb{E}_n[h_1]$.

  • ATTE score: Let $p_1 = \mathbb{E}[D]$. $\psi_a = -\frac{D}{p_1}, \quad \psi_b = \frac{D}{p_1}(\hat{g}_1 - \hat{g}_0) + \frac{D}{p_1} u_1 - \frac{1-D}{p_1} \cdot \frac{\hat{m}}{1-\hat{m}} u_0$
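Both score variants can be written as one small helper; this is an illustrative NumPy sketch (irm_scores is a hypothetical name), not the library's implementation:

```python
import numpy as np

def irm_scores(y, d, g0, g1, m, score="ATE", normalize_ipw=False):
    """Return (psi_a, psi_b) for the moment equation E_n[psi_a*theta + psi_b] = 0."""
    u1, u0 = y - g1, y - g0                 # outcome residuals
    h1, h0 = d / m, (1 - d) / (1 - m)       # IPW terms
    if normalize_ipw:                       # Hájek normalization
        h1, h0 = h1 / h1.mean(), h0 / h0.mean()
    if score == "ATE":
        psi_a = -np.ones_like(y)
        psi_b = (g1 - g0) + u1 * h1 - u0 * h0
    elif score == "ATTE":
        p1 = d.mean()
        psi_a = -d / p1
        psi_b = (d / p1) * (g1 - g0) + (d / p1) * u1 \
            - ((1 - d) / p1) * (m / (1 - m)) * u0
    else:
        raise ValueError(f"unknown score: {score}")
    return psi_a, psi_b
```

A useful sanity check: with noiseless data and exact nuisances, $-\mathbb{E}_n[\psi_b]/\mathbb{E}_n[\psi_a]$ reduces to the sample mean of $\hat{g}_1 - \hat{g}_0$ (over all units for ATE, over the treated for ATTE).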


5) Point estimate and Influence Functions

The treatment effect estimate $\hat{\theta}$ is the solution to the moment equation: $\hat{\theta} = -\frac{\mathbb{E}_n[\psi_b]}{\mathbb{E}_n[\psi_a]}$

The influence function (IF) for unit $i$ (which captures the contribution of that unit to the total estimate) is: $\psi_i = -\frac{\psi_{b,i} + \psi_{a,i} \hat{\theta}}{\mathbb{E}_n[\psi_a]}$

By construction, $\mathbb{E}_n[\psi_i] = 0$.
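Both formulas are two lines of NumPy; a sketch (solve_moment_equation is a hypothetical helper name):

```python
import numpy as np

def solve_moment_equation(psi_a, psi_b):
    """theta_hat and per-unit influence values for the linear score
    E_n[psi_a * theta + psi_b] = 0."""
    theta = -psi_b.mean() / psi_a.mean()
    psi = -(psi_b + psi_a * theta) / psi_a.mean()  # mean-zero by construction
    return theta, psi
```

For the ATE score ($\psi_a \equiv -1$) this collapses to $\hat{\theta} = \mathbb{E}_n[\psi_b]$ and $\psi_i = \psi_{b,i} - \hat{\theta}$.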


6) Variance and Robust Standard Errors

The asymptotic variance is estimated as the sample variance of the influence function: $\widehat{\mathrm{Var}}(\hat{\theta}) = \frac{1}{n^2} \sum_{i=1}^n \psi_i^2$. The standard error is: $\widehat{\mathrm{SE}}(\hat{\theta}) = \sqrt{\widehat{\mathrm{Var}}(\hat{\theta})}$

Confidence interval (at level $1-\alpha$): $\mathrm{CI}_{\mathrm{abs}} = \hat{\theta} \pm z_{1-\alpha/2} \, \widehat{\mathrm{SE}}(\hat{\theta})$
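Assuming per-unit influence values psi from the previous step, the SE and CI are a few lines; a sketch using scipy.stats.norm for the critical value (se_and_ci is a hypothetical name):

```python
import numpy as np
from scipy.stats import norm

def se_and_ci(theta, psi, alpha=0.05):
    """Standard error from the influence function and a two-sided
    (1 - alpha) confidence interval."""
    n = psi.size
    se = np.sqrt((psi ** 2).sum() / n ** 2)  # = sqrt(Var_hat(theta_hat))
    z = norm.ppf(1 - alpha / 2)              # e.g. 1.96 for alpha = 0.05
    return se, (theta - z * se, theta + z * se)
```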


7) Relative effect (Delta method)

The relative effect (%) is defined as: $\hat{\tau}_{\mathrm{rel}} = 100 \cdot \frac{\hat{\theta}}{\hat{\mu}_c}$, where $\hat{\mu}_c$ is the estimated baseline outcome:

  • ATE: $\hat{\mu}_c = \mathbb{E}_n[\psi_{\mu_c}]$ where $\psi_{\mu_c} = \hat{g}_0 + u_0 h_0$
  • ATTE: $\hat{\mu}_c = \mathbb{E}_n[\psi_{\mu_c}]$ where $\psi_{\mu_c} = \frac{D}{p_1}\hat{g}_0 + \frac{1-D}{p_1} \cdot \frac{\hat{m}}{1-\hat{m}} u_0$

Let $\hat{\psi}_{\mu, i} = \psi_{\mu_c, i} - \hat{\mu}_c$. The IF for the relative effect (using the delta method) is: $\psi_{\mathrm{rel}, i} = 100 \cdot \left( \frac{\psi_i}{\hat{\mu}_c} - \frac{\hat{\theta} \, \hat{\psi}_{\mu, i}}{\hat{\mu}_c^2} \right)$

This formulation correctly handles the covariance between the treatment effect and the baseline mean: $\widehat{\mathrm{SE}}(\hat{\tau}_{\mathrm{rel}}) = \sqrt{\frac{1}{n^2} \sum_i \psi_{\mathrm{rel}, i}^2}$
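The delta-method step can be sketched as follows, assuming the per-unit arrays psi (effect IF) and psi_mu_c (baseline score) are already available; relative_effect is a hypothetical helper name:

```python
import numpy as np

def relative_effect(theta, psi, psi_mu_c):
    """Relative effect (%) and its SE via the delta method, from the
    effect influence values psi and the baseline score psi_mu_c."""
    mu_c = psi_mu_c.mean()                 # estimated baseline outcome
    tau_rel = 100 * theta / mu_c
    psi_mu = psi_mu_c - mu_c               # centered baseline IF
    psi_rel = 100 * (psi / mu_c - theta * psi_mu / mu_c ** 2)
    n = psi.size
    se_rel = np.sqrt((psi_rel ** 2).sum() / n ** 2)
    return tau_rel, se_rel
```

The second term in psi_rel is what propagates the uncertainty in $\hat{\mu}_c$; dropping it would understate the SE whenever the effect and baseline estimates are correlated.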


8) Math pseudocode

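A compact end-to-end sketch of the procedure, restating the definitions of Sections 2–7:

```text
Input: (Y_i, D_i, X_i) for i = 1..n; learners for g0, g1, m;
       K folds; trimming threshold eps; level alpha

1. Split {1..n} into folds I_1..I_K.
2. For each k: fit g0_k, g1_k, m_k on I_k^c; predict ĝ0, ĝ1, m̂ on I_k.
3. Clip m̂ into [eps, 1 - eps].
4. u1 = Y - ĝ1;  u0 = Y - ĝ0;  h1 = D/m̂;  h0 = (1-D)/(1-m̂)
   (if normalize_ipw: h1 /= mean(h1); h0 /= mean(h0))
5. Build (psi_a, psi_b) for ATE or ATTE as in Section 4.
6. θ̂ = -mean(psi_b)/mean(psi_a);  ψ_i = -(psi_b_i + psi_a_i·θ̂)/mean(psi_a)
7. SE = sqrt(Σ ψ_i² / n²);  CI = θ̂ ± z_{1-α/2}·SE
8. Relative effect and its SE via the delta method (Section 7).
```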

References

1) IRM / DML core: cross-fitting + orthogonal scores (your fit() + psi_a/psi_b moment equation)

  • Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, Robins (2018) — Double/debiased machine learning for treatment and structural parameters (Econometrics Journal). This is the canonical DML reference that (i) formalizes Neyman-orthogonal scores, (ii) motivates and analyzes cross-fitting, and (iii) includes the IRM score for ATE/ATT under unconfoundedness (exactly the structure your psi_b is implementing).

  • Chernozhukov et al. (2016/2017) — earlier accessible versions / summaries of the same DML program (useful if you want an arXiv cite alongside the journal cite).

  • Newey & McFadden (1994) — large-sample theory for M/Z-estimators (solving moment equations, the Jacobian $J$, influence function / sandwich variance). This matches your _solve_moment_equation() pattern: $\hat{\theta}$ from an estimating equation plus IF $= -\psi(\hat{\theta})/J$.


2) ATE / ATTE efficient influence functions and doubly-robust/AIPW structure (your psi_b form)

  • Hahn (1998) — On the role of the propensity score in efficient semiparametric estimation of average treatment effects (Econometrica). Derives efficiency bounds and EIF structure; also highlights that ATE and ATT behave differently (your ATTE weighting structure aligns with this literature).

  • Hirano, Imbens & Ridder (2003) — efficient estimation of ATE with an (estimated) propensity score; foundational for modern IPW/AIPW practice under unconfoundedness.

  • Bang & Robins (2005) — classic doubly robust estimators for missing-data/causal models (Biometrics). This is the standard cite for the DR/AIPW form you're using (outcome regression + propensity correction).

  • Rosenbaum & Rubin (1983) — propensity score + ignorability framework (the identifying assumption your IRM scenario relies on).


3) IPW ancestry + Hájek (normalized) vs Horvitz–Thompson (your normalize_ipw option and the warning about “ratio-style” inference)

  • Horvitz & Thompson (1952) — the original HT inverse-probability weighting estimator. This is the standard historical/technical cite for IPW.

  • Hájek ratio/normalized estimator — "Hájek" normalization is the ratio form (normalize the weights / denominators). A clean modern citation that explicitly discusses the Hájek ratio estimator (and uses the term in print) is Aronow et al. (2017, AOAS), which references Hájek's original work. (Variance-estimation-focused references for Hájek-style ratios also exist in applied survey settings.)


4) Trimming / overlap (your _clip_propensity + trimming_threshold)

  • Crump, Hotz, Imbens & Mitnik (2009) — Dealing with limited overlap… (Biometrika). This is the go-to cite for why trimming/extreme propensities matter and for principled trimming rules.

5) GATE / BLP on an orthogonal signal (your gate() that regresses orth_signal on group basis)

  • Chernozhukov, Demirer, Duflo, Fernández-Val (2018/2020) — Generic Machine Learning Inference on Heterogeneous Treatment Effects in Randomized Experiments (widely circulated working paper). This is the standard cite for BLP / GATES style inference built on an orthogonalized score/pseudo-outcome.

(Your exact setup—compute an orthogonal signal, then do a linear projection on a basis of group indicators—is directly in this family.)


6) Sensitivity analysis with $R^2$-style parameters (your sensitivity_analysis() + _sensitivity_element_est() objects like the Riesz representer / $\nu^2$)

  • Chernozhukov, Cinelli, Newey, Sharma, Syrgkanis (NBER 2022; arXiv 2021/2024 versions) — Omitted Variable Bias in Causal Machine Learning (includes the "long story short" version; develops sensitivity/OVB theory using $R^2$-type parameterizations and constructs the key components that show up as "sensitivity elements").