Unconfoundedness Model (IRM)
0) Assumptions
- SUTVA / consistency:
  - No interference: unit $i$'s outcome does not depend on other units' treatment assignments.
  - No hidden versions: "treatment" and "control" correspond to well-defined interventions.
  - Consistency: the observed outcome equals the potential outcome under the realized treatment: $Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0)$.
- Unconfoundedness (Conditional Independence Assumption): potential outcomes are independent of treatment assignment given the observed confounders $X$: $(Y(0), Y(1)) \perp D \mid X$. This ensures that, after adjusting for $X$, there are no unobserved factors affecting both treatment and outcome.
- Overlap / positivity: the probability of receiving treatment is strictly between 0 and 1 for all $x$: $0 < P(D = 1 \mid X = x) < 1$. In practice, this is enforced via trimming: $\varepsilon \le \hat m(X) \le 1 - \varepsilon$, where $\varepsilon$ is the `trimming_threshold`.
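As a small illustration, the trimming step can be sketched as below; `trimming_threshold` matches the parameter named above, while the function name and the sample propensity values are made up for this example.

```python
import numpy as np

# Enforce overlap by clipping estimated propensities into [eps, 1 - eps].
# `clip_propensity` is a hypothetical helper; the array values are made up.
trimming_threshold = 0.01

def clip_propensity(m_hat, eps=trimming_threshold):
    """Clip propensity estimates into the interval [eps, 1 - eps]."""
    return np.clip(m_hat, eps, 1.0 - eps)

m_hat = np.array([0.001, 0.25, 0.999])   # one value below, one above the bounds
m_clipped = clip_propensity(m_hat)
```

Clipping (rather than dropping units) keeps the sample size fixed at the cost of a small bias when propensities are genuinely extreme.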
1) Data + target estimands
You observe $n$ i.i.d. units $(Y_i, D_i, X_i)$ with:
- outcome $Y_i \in \mathbb{R}$,
- treatment $D_i \in \{0, 1\}$,
- confounders $X_i \in \mathbb{R}^p$.
Targets:
- ATE (Average Treatment Effect): $\theta_0 = E[Y(1) - Y(0)]$
- ATTE (Average Treatment Effect on the Treated): $\theta_0 = E[Y(1) - Y(0) \mid D = 1]$
2) Cross-fitting (Sample Splitting)
The data are partitioned into $K$ folds (typically `n_folds=5`). For each fold $k = 1, \dots, K$:
- Train nuisance models using all data except fold $k$ (i.e., on $I_k^c$).
- Generate out-of-sample predictions $\hat g(0, X_i)$, $\hat g(1, X_i)$, $\hat m(X_i)$ for units $i \in I_k$.
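The fold logic above can be sketched with plain NumPy; since only the splitting pattern matters here, a training-fold mean stands in for the actual nuisance learner (a deliberate simplification):

```python
import numpy as np

# Cross-fitting skeleton: partition indices into K folds, "fit" on the
# complement of each fold, predict out-of-sample on the held-out fold.
rng = np.random.default_rng(0)
n, n_folds = 100, 5
y = rng.normal(size=n)

perm = rng.permutation(n)
folds = np.array_split(perm, n_folds)     # disjoint folds I_1, ..., I_K

y_hat = np.empty(n)
for test_idx in folds:
    train_idx = np.setdiff1d(perm, test_idx)   # all data except this fold
    y_hat[test_idx] = y[train_idx].mean()      # out-of-sample "prediction"
```

Every unit's prediction is produced by a model that never saw that unit, which is what removes the own-observation overfitting bias in the DML analysis.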
3) Nuisance function estimation
The IRM estimator requires three nuisance components:
- Outcome models: $g_0(d, x) = E[Y \mid D = d, X = x]$ for $d \in \{0, 1\}$
- Propensity score: $m_0(x) = P(D = 1 \mid X = x)$
These are typically estimated using ML models like CatBoost or Random Forest. Propensity scores are clipped to $[\varepsilon, 1 - \varepsilon]$ to ensure stability.
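A minimal stand-in for this step, assuming simulated data and ordinary least squares in place of the ML learners (CatBoost or Random Forest would slot into `ols_predict`'s role the same way):

```python
import numpy as np

# Stand-in nuisance fits: OLS per treatment arm for g(d, x), and a clipped
# linear probability fit as a crude placeholder for the propensity model.
rng = np.random.default_rng(1)
n, p, eps = 500, 3, 0.01
X = rng.normal(size=(n, p))
D = (rng.uniform(size=n) < 0.5).astype(float)
Y = 2.0 * D + X @ np.array([1.0, -0.5, 0.25]) + rng.normal(size=n)

Xc = np.column_stack([np.ones(n), X])            # add an intercept column

def ols_predict(X_train, y_train, X_all):
    beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return X_all @ beta

g0_hat = ols_predict(Xc[D == 0], Y[D == 0], Xc)  # E[Y | D=0, X]
g1_hat = ols_predict(Xc[D == 1], Y[D == 1], Xc)  # E[Y | D=1, X]
m_hat = np.clip(ols_predict(Xc, D, Xc), eps, 1 - eps)  # P(D=1 | X), clipped
```

Note that both outcome models are evaluated on *all* units, treated and untreated, since the score needs counterfactual predictions for everyone.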
4) Scores and Moment Equations (AIPW/DR)
Define residuals:
$$\hat u^{(1)}_i = Y_i - \hat g(1, X_i), \qquad \hat u^{(0)}_i = Y_i - \hat g(0, X_i).$$
Define (optionally normalized) IPW terms:
$$\hat w^{(1)}_i = \frac{D_i}{\hat m(X_i)}, \qquad \hat w^{(0)}_i = \frac{1 - D_i}{1 - \hat m(X_i)}.$$
The estimator solves the sample moment equation $\frac{1}{n}\sum_{i=1}^n \psi(W_i; \hat\theta, \hat\eta) = 0$, where the score is linear in $\theta$: $\psi(W; \theta, \eta) = \psi_a(W; \eta)\,\theta + \psi_b(W; \eta)$.
- ATE score (doubly robust):
  $$\psi_a = -1, \qquad \psi_b = \hat g(1, X) - \hat g(0, X) + \hat w^{(1)}\,\hat u^{(1)} - \hat w^{(0)}\,\hat u^{(0)}.$$
  If `normalize_ipw=True` (Hájek), $\hat w^{(1)}$ and $\hat w^{(0)}$ are rescaled by their empirical means: $\hat w^{(d)}_i \mapsto \hat w^{(d)}_i \big/ \tfrac{1}{n}\sum_j \hat w^{(d)}_j$.
- ATTE score: let $\hat p = \frac{1}{n}\sum_i D_i$. Then
  $$\psi_a = -\frac{D}{\hat p}, \qquad \psi_b = \frac{D\,\hat u^{(0)}}{\hat p} - \frac{\hat m(X)\,(1 - D)\,\hat u^{(0)}}{\hat p\,(1 - \hat m(X))}.$$
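The ATE score and the Hájek option can be sketched on simulated inputs; the names `psi_a`, `psi_b`, and `normalize_ipw` mirror the text, but the data and nuisance values here are synthetic with the nuisances taken as known:

```python
import numpy as np

# AIPW/ATE score: psi = psi_a * theta + psi_b, with optional Hajek
# normalization of the IPW weights. True effect in this simulation is 2.0.
rng = np.random.default_rng(2)
n = 1000
m_hat = np.clip(rng.uniform(0.2, 0.8, size=n), 0.01, 0.99)
D = (rng.uniform(size=n) < m_hat).astype(float)
g0_hat = rng.normal(size=n)
g1_hat = g0_hat + 2.0
Y = D * g1_hat + (1 - D) * g0_hat + rng.normal(size=n)

w1 = D / m_hat                     # Horvitz-Thompson weights
w0 = (1 - D) / (1 - m_hat)
normalize_ipw = True
if normalize_ipw:                  # Hajek: rescale by empirical means
    w1 = w1 / w1.mean()
    w0 = w0 / w0.mean()

psi_a = -np.ones(n)
psi_b = g1_hat - g0_hat + w1 * (Y - g1_hat) - w0 * (Y - g0_hat)
theta_hat = -psi_b.mean() / psi_a.mean()
```

After Hájek rescaling each weight vector averages to one, which tends to stabilize the estimate when some propensities are close to the bounds.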
5) Point estimate and Influence Functions
The treatment effect estimate is the solution to the moment equation:
$$\hat\theta = -\,\frac{\frac{1}{n}\sum_i \psi_b(W_i; \hat\eta)}{\frac{1}{n}\sum_i \psi_a(W_i; \hat\eta)}.$$
The influence function (IF) for unit $i$ (which captures that unit's contribution to the total estimate) is:
$$\mathrm{IF}_i = -\,\frac{\psi(W_i; \hat\theta, \hat\eta)}{\frac{1}{n}\sum_j \psi_a(W_j; \hat\eta)}.$$
By construction, $\frac{1}{n}\sum_i \mathrm{IF}_i = 0$.
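A toy numerical check of the moment-equation solution and the mean-zero IF property (the score vectors are simulated; in the ATE case $\psi_a \equiv -1$):

```python
import numpy as np

# Solve the linear moment equation and form the per-unit influence
# function on toy psi_a / psi_b vectors (ATE case: psi_a = -1 everywhere).
rng = np.random.default_rng(6)
n = 300
psi_a = -np.ones(n)
psi_b = 1.0 + rng.normal(size=n)           # toy scores centered near theta = 1

theta_hat = -psi_b.mean() / psi_a.mean()   # reduces to mean(psi_b) here
psi = psi_a * theta_hat + psi_b            # score evaluated at theta_hat
inf_fn = -psi / psi_a.mean()               # per-unit influence function
```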
6) Variance and Robust Standard Errors
The asymptotic variance is estimated as the sample variance of the influence function:
$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n \mathrm{IF}_i^2.$$
The standard error is $\widehat{\mathrm{SE}} = \hat\sigma / \sqrt{n}$.
Confidence interval (at level $1 - \alpha$): $\hat\theta \pm z_{1-\alpha/2} \cdot \widehat{\mathrm{SE}}$.
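A sketch of the IF-based standard error and a 95% normal confidence interval; the influence-function values are simulated, and the 97.5% normal quantile is hard-coded rather than taken from a stats library:

```python
import numpy as np

# IF-based variance, standard error, and 95% normal confidence interval.
rng = np.random.default_rng(3)
n = 400
theta_hat = 2.0
inf_fn = rng.normal(scale=3.0, size=n)
inf_fn = inf_fn - inf_fn.mean()          # the IF is mean-zero by construction

sigma2_hat = np.mean(inf_fn ** 2)        # asymptotic variance estimate
se = np.sqrt(sigma2_hat / n)
z = 1.959963984540054                    # standard normal 97.5% quantile
ci_low, ci_high = theta_hat - z * se, theta_hat + z * se
```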
7) Relative effect (Delta method)
The relative effect (%) is defined as $\hat\theta_{\text{rel}} = \dfrac{\hat\theta}{\hat\mu_0} \times 100\%$, where $\hat\mu_0$ is the estimated baseline outcome:
- ATE: $\hat\mu_0 = \frac{1}{n}\sum_i \hat\mu_{0,i}$, where $\hat\mu_{0,i} = \hat g(0, X_i) + \hat w^{(0)}_i \hat u^{(0)}_i$ (the AIPW estimate of $E[Y(0)]$)
- ATTE: $\hat\mu_0 = \frac{1}{\sum_i D_i}\sum_i \big[ D_i\,\hat g(0, X_i) + \hat m(X_i)\,\hat w^{(0)}_i \hat u^{(0)}_i \big]$ (the DR estimate of $E[Y(0) \mid D = 1]$)
Let $\mathrm{IF}_{\mu,i}$ denote the influence function of $\hat\mu_0$. The IF for the relative effect (using the delta method) is:
$$\mathrm{IF}_{\text{rel},i} = \left( \frac{\mathrm{IF}_{\theta,i}}{\hat\mu_0} - \frac{\hat\theta\,\mathrm{IF}_{\mu,i}}{\hat\mu_0^2} \right) \times 100\%.$$
This formulation correctly handles the covariance between the treatment effect and the baseline mean.
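The delta-method combination can be checked numerically. The vectors `if_theta` and `mu0_contrib` below are simulated placeholders, not the estimator's actual influence functions:

```python
import numpy as np

# Delta-method IF for the relative effect theta / mu0 * 100%, combining
# the effect IF and the baseline IF in a single per-unit vector.
rng = np.random.default_rng(4)
n = 500
if_theta = rng.normal(size=n)             # toy IF of the effect estimate
mu0_contrib = 5.0 + rng.normal(size=n)    # toy per-unit baseline contributions
theta_hat = 2.0
mu0_hat = mu0_contrib.mean()
if_mu = mu0_contrib - mu0_hat             # baseline IF (mean zero)

rel_hat = theta_hat / mu0_hat * 100.0
if_rel = (if_theta / mu0_hat - theta_hat * if_mu / mu0_hat**2) * 100.0
se_rel = np.sqrt(np.mean(if_rel ** 2) / n)
```

Because `if_rel` combines both IFs unit by unit, the resulting variance automatically includes the cross-term between $\hat\theta$ and $\hat\mu_0$.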
8) Math pseudocode
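Putting sections 2 through 6 together, a compact end-to-end sketch under simplifying assumptions: simulated data, OLS standing in for the ML nuisance learners, and the ATE score only.

```python
import numpy as np

# End-to-end IRM/ATE sketch: cross-fit nuisances, AIPW score, IF-based SE.
# Illustrative only; all data are simulated and OLS replaces the ML models.
rng = np.random.default_rng(5)
n, K, eps, theta_true = 2000, 5, 0.01, 1.5
X = rng.normal(size=(n, 3))
m0 = 1 / (1 + np.exp(-X[:, 0]))                  # true propensity
D = (rng.uniform(size=n) < m0).astype(float)
Y = theta_true * D + X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)

Xc = np.column_stack([np.ones(n), X])

def ols(Xtr, ytr, Xte):
    beta, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
    return Xte @ beta

g0_hat, g1_hat, m_hat = (np.empty(n) for _ in range(3))
perm = rng.permutation(n)
for te in np.array_split(perm, K):               # cross-fitting loop
    tr = np.setdiff1d(perm, te)                  # all data except fold te
    g0_hat[te] = ols(Xc[tr][D[tr] == 0], Y[tr][D[tr] == 0], Xc[te])
    g1_hat[te] = ols(Xc[tr][D[tr] == 1], Y[tr][D[tr] == 1], Xc[te])
    m_hat[te] = np.clip(ols(Xc[tr], D[tr], Xc[te]), eps, 1 - eps)

psi_b = (g1_hat - g0_hat
         + D * (Y - g1_hat) / m_hat
         - (1 - D) * (Y - g0_hat) / (1 - m_hat))
theta_hat = psi_b.mean()                         # psi_a = -1 for ATE
inf_fn = psi_b - theta_hat                       # mean-zero influence function
se = np.sqrt(np.mean(inf_fn ** 2) / n)
```

Even though the linear-probability propensity model is misspecified here, the correctly specified outcome model keeps the doubly robust estimate close to the true effect.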
References
1) IRM / DML core: cross-fitting + orthogonal scores (your `fit()` + `psi_a`/`psi_b` moment equation)
- Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, Robins (2018) — Double/debiased machine learning for treatment and structural parameters (Econometrics Journal). This is the canonical DML reference that (i) formalizes Neyman-orthogonal scores, (ii) motivates and analyzes cross-fitting, and (iii) includes the IRM score for ATE/ATT under unconfoundedness (exactly the structure your `psi_b` is implementing). (OUP Academic)
- Chernozhukov et al. (2016/2017) — earlier accessible versions / summaries of the same DML program (useful if you want an arXiv cite alongside the journal cite). (OUP Academic)
- Newey & McFadden (1994) — large-sample theory for M/Z-estimators (solving moments, Jacobian $J$, influence function / sandwich variance). This matches your `_solve_moment_equation()` pattern: $\hat\theta$ from an estimating equation + IF $= -\psi(\hat\theta)/J$. (Cambridge Assets)
2) ATE / ATTE efficient influence functions and doubly-robust/AIPW structure (your `psi_b` form)
- Hahn (1998) — On the role of the propensity score in efficient semiparametric estimation of average treatment effects (Econometrica). Derives efficiency bounds and EIF structure; also highlights that ATE vs. ATT behave differently (your ATTE weighting structure aligns with this literature). (EconPapers)
- Hirano, Imbens & Ridder (2003) — efficient estimation of ATE with (estimated) propensity score; foundational for modern IPW/AIPW practice under unconfoundedness. (NBER)
- Bang & Robins (2005) — classic doubly-robust estimators for missing-data/causal models (Biometrics). This is the standard cite for the DR/AIPW form you're using (outcome regression + propensity correction). (Wiley Online Library)
- Rosenbaum & Rubin (1983) — propensity score + ignorability framework (the identifying assumption your IRM scenario relies on). (OUP Academic)
3) IPW ancestry + Hájek (normalized) vs Horvitz–Thompson (your `normalize_ipw` option and the warning about "ratio-style" inference)
- Horvitz & Thompson (1952) — the origin of the HT inverse-probability weighting estimator. This is the standard historical/technical cite for IPW. (CMU Stats & Data Science)
- Hájek ratio/normalized estimator — "Hájek" normalization is the ratio form (normalized weights / denominators). A clean modern citation that explicitly discusses the Hájek ratio estimator (and uses the term in print) is Aronow et al. (2017, AOAS), which references Hájek's original work. (Project Euclid) (If you want a variance-estimation-focused reference for Hájek-style ratios, these also exist in applied survey settings.) (ERIC)
4) Trimming / overlap (your `_clip_propensity` + `trimming_threshold`)
- Crump, Hotz, Imbens & Mitnik (2009) — Dealing with limited overlap… (Biometrika). This is the go-to cite for why trimming/extreme propensities matter and for principled trimming rules. (Duke Economics)
5) GATE / BLP on an orthogonal signal (your `gate()` that regresses `orth_signal` on a group basis)
- Chernozhukov, Demirer, Duflo, Fernández-Val (2018/2020) — Generic Machine Learning Inference on Heterogeneous Treatment Effects in Randomized Experiments (widely circulated working paper/PDF). This is the standard cite for BLP/GATES-style inference built on an orthogonalized score/pseudo-outcome. (Taylor & Francis Online)
(Your exact setup—compute an orthogonal signal, then do a linear projection on a basis of group indicators—is directly in this family.)
6) Sensitivity analysis with $R^2$-style parameters (your `sensitivity_analysis()` + `_sensitivity_element_est()` objects like the Riesz representer / $\nu^2$)
- Chernozhukov, Cinelli, Newey, Sharma, Syrgkanis (NBER 2022; arXiv 2021/2024 versions) — Omitted Variable Bias in Causal Machine Learning (includes the "long story short" version; develops sensitivity/OVB theory using $R^2$-type parameterizations and constructs the key components that show up as "sensitivity elements"). (NBER)