
MultiTreatmentIRM


0) Assumptions

  • SUTVA / consistency

    • No interference across units.
    • No hidden treatment versions.
    • Observed outcome matches the potential outcome under realized arm.
  • Multi-arm unconfoundedness

$$\left(Y(0), Y(1), \dots, Y(K-1)\right)\perp D \mid X$$

where $D$ is one-hot with $K$ treatment columns (column 0 is baseline/control).

  • Positivity / overlap
$$\Pr(D_k=1\mid X=x)>0,\quad \forall k\in\{0,\dots,K-1\},$$

and in the implementation this is stabilized by multiclass trimming.

  • Fold-level support for cross-fitting. Each training fold must contain all treatment arms; otherwise the nuisance models for the missing arms are not identifiable.
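The fold-support requirement in particular can be verified mechanically before any model fitting. A minimal sketch (hypothetical helper, pure Python):

```python
def check_fold_support(d_labels, fold_ids, n_arms):
    """Return True if every training fold (the complement of each fold)
    contains at least one unit from every treatment arm."""
    for f in set(fold_ids):
        train_arms = {d for d, g in zip(d_labels, fold_ids) if g != f}
        if len(train_arms) < n_arms:
            return False
    return True

# 3 arms, 2 folds: each training fold covers arms {0, 1, 2}.
d = [0, 1, 2, 0, 1, 2]
print(check_fold_support(d, [0, 1, 0, 1, 0, 1], 3))  # True
# Here arm 2 appears only in fold 1, so fold 1's training data lacks it.
print(check_fold_support(d, [0, 0, 1, 0, 0, 1], 3))  # False
```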

1) Data and estimand

For i.i.d. units $i=1,\dots,n$, observe:

  • outcome: $Y_i\in\mathbb{R}$,
  • confounders: $X_i\in\mathbb{R}^p$,
  • one-hot treatment vector: $D_i=(D_{i0},\dots,D_{i,K-1})$ with $\sum_k D_{ik}=1$.

Target is a vector of baseline contrasts:

$$\theta = \begin{bmatrix} \theta_1\\ \vdots\\ \theta_{K-1} \end{bmatrix}, \qquad \theta_k = \mathbb{E}[Y(k)-Y(0)],\quad k=1,\dots,K-1.$$

2) Nuisance functions

For each arm $k$, the outcome regression is

$$g_k(x)=\mathbb{E}[Y\mid X=x,\, D_k=1].$$

Generalized propensity (multiclass):

$$m_k(x)=\Pr(D_k=1\mid X=x),\qquad \sum_{k=0}^{K-1}m_k(x)=1.$$

Cross-fitted predictions are denoted $\hat g_k(X_i)$ and $\hat m_k(X_i)$.


3) Cross-fitting

Split the sample into folds $I_1,\dots,I_F$ (n_folds=F). For each fold $f$:

  1. Train the multiclass propensity model on $I_f^c$ and predict $\hat m_k$ on $I_f$.
  2. For each arm $k$, train the outcome model only on rows in $I_f^c$ with $D_k=1$, then predict $\hat g_k$ on $I_f$.

Binary-outcome edge case used in the implementation:

If within a fold+arm the training outcome is single-class (all 0 or all 1), use a deterministic constant predictor for that arm/fold instead of fitting a classifier.
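The cross-fitting loop and the single-class edge case can be sketched as follows. The "learner" here is a stand-in (arm-wise training mean), not a real ML model:

```python
def cross_fit_outcome(y, d_labels, fold_ids, n_arms):
    """Cross-fitted outcome predictions g_hat[i][k]: for each fold, fit per-arm
    on the complement and predict on the held-out fold. Falls back to a
    deterministic constant predictor when the fold+arm outcome is single-class."""
    n = len(y)
    g_hat = [[0.0] * n_arms for _ in range(n)]
    for f in set(fold_ids):
        train = [i for i in range(n) if fold_ids[i] != f]
        test = [i for i in range(n) if fold_ids[i] == f]
        for k in range(n_arms):
            yk = [y[i] for i in train if d_labels[i] == k]  # fold support assumed
            if len(set(yk)) == 1:
                pred = yk[0]                  # single-class: constant predictor
            else:
                pred = sum(yk) / len(yk)      # stand-in for a fitted learner
            for i in test:
                g_hat[i][k] = pred
    return g_hat
```

With real learners, `pred` would come from a model fitted on the `train` rows of arm `k` and evaluated at each $X_i$ in `test`; the fold/arm bookkeeping is the same.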


4) Trimming in multi-class setup

The implementation uses lower-bound trimming followed by row renormalization:

$$\tilde m_{ik}=\max(\hat m_{ik}, \varepsilon),\qquad \hat m^{\mathrm{trim}}_{ik}=\frac{\tilde m_{ik}}{\sum_{j=0}^{K-1}\tilde m_{ij}}.$$

So each row remains a valid probability vector after trimming.
Constraint: $\varepsilon < 1/K$.
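The trimming rule above is a one-liner per row; a minimal sketch:

```python
def trim_propensities(m_row, eps):
    """Clip each class probability at eps from below, then renormalize
    so the row sums to 1 (a valid probability vector)."""
    clipped = [max(p, eps) for p in m_row]
    total = sum(clipped)
    return [p / total for p in clipped]

row = trim_propensities([0.001, 0.499, 0.5], eps=0.01)
print(abs(sum(row) - 1.0) < 1e-12)  # True: still on the simplex
```

Note that after renormalization an entry can dip slightly below $\varepsilon$ again; the clip only bounds the pre-normalization value, which is enough to stabilize the inverse weights.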


5) Orthogonal score for multi-arm ATE

Define residuals for each arm:

$$u_{ik}=Y_i-\hat g_k(X_i),\qquad k=0,\dots,K-1.$$

Define IPW representers:

$$h_{ik}=\frac{D_{ik}}{\hat m^{\mathrm{trim}}_{ik}}.$$

If normalize_ipw=True, apply column-wise Hájek normalization:

$$h_{ik}\leftarrow \frac{h_{ik}}{\mathbb{E}_n[h_{\cdot k}]}.$$

For each contrast $k=1,\dots,K-1$:

$$\psi_{b,ik} = \big(\hat g_k(X_i)-\hat g_0(X_i)\big) + u_{ik}h_{ik} - u_{i0}h_{i0}.$$

Moment system:

$$\mathbb{E}_n[\psi_a\theta+\psi_b]=0,\qquad \psi_a=-1.$$

Thus with $J=\mathbb{E}_n[\psi_a]=-1$:

$$\hat\theta_k = -\frac{\mathbb{E}_n[\psi_{b,\cdot k}]}{J} = \mathbb{E}_n[\psi_{b,\cdot k}].$$
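Numerically, the score and point estimate reduce to a few lines. A minimal sketch (Hájek normalization omitted; $\hat g$ and $\hat m^{\mathrm{trim}}$ assumed precomputed by cross-fitting):

```python
def theta_hat(y, d_onehot, g_hat, m_hat, k):
    """theta_hat_k = sample mean of psi_b for contrast k vs baseline 0
    (since J = -1, the moment equation gives theta_hat = mean(psi_b))."""
    n = len(y)
    total = 0.0
    for i in range(n):
        u_k = y[i] - g_hat[i][k]             # residual under arm k
        u_0 = y[i] - g_hat[i][0]             # residual under baseline
        h_k = d_onehot[i][k] / m_hat[i][k]   # IPW representer, arm k
        h_0 = d_onehot[i][0] / m_hat[i][0]   # IPW representer, baseline
        total += (g_hat[i][k] - g_hat[i][0]) + u_k * h_k - u_0 * h_0
    return total / n

# Toy check: if g_hat is exact (zero residuals), theta_hat is the mean
# of the regression contrasts g_k - g_0.
y = [1.0, 3.0]
d = [[1, 0], [0, 1]]
g = [[1.0, 3.0], [1.0, 3.0]]
m = [[0.5, 0.5], [0.5, 0.5]]
print(theta_hat(y, d, g, m, 1))  # 2.0
```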

6) Influence function and inference

Per-contrast influence function:

$$\widehat{IF}_{ik} = -\frac{\psi_{b,ik}+\psi_a\hat\theta_k}{J} = \psi_{b,ik}-\hat\theta_k.$$

Variance/SE:

$$\widehat{\mathrm{Var}}(\hat\theta_k)=\frac{\mathrm{Var}_n(\widehat{IF}_{\cdot k})}{n}, \qquad \widehat{SE}(\hat\theta_k)=\sqrt{\widehat{\mathrm{Var}}(\hat\theta_k)}.$$

Wald interval:

$$CI_k^{\mathrm{abs}} = \hat\theta_k \pm z_{1-\alpha/2}\,\widehat{SE}(\hat\theta_k).$$

P-values are computed per contrast via the normal approximation; the significance flag uses a Bonferroni correction:

$$p_k < \frac{\alpha}{K-1}.$$
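The inference step for one contrast, from its vector of scores $\psi_{b,\cdot k}$, can be sketched with the standard library's normal distribution (`n_contrasts` would be $K-1$ for the Bonferroni flag):

```python
import math
from statistics import NormalDist

def inference(psi_b_k, alpha=0.05, n_contrasts=1):
    """theta_hat, SE, Wald CI, normal p-value and Bonferroni flag
    for one contrast, from its per-unit scores psi_b."""
    n = len(psi_b_k)
    theta = sum(psi_b_k) / n
    infl = [p - theta for p in psi_b_k]          # IF_ik = psi_b_ik - theta_hat_k
    se = math.sqrt(sum(v * v for v in infl) / n) / math.sqrt(n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (theta - z * se, theta + z * se)
    pval = 2 * (1 - NormalDist().cdf(abs(theta / se))) if se > 0 else 0.0
    significant = pval < alpha / n_contrasts     # Bonferroni over contrasts
    return theta, se, ci, pval, significant
```

Here $\mathrm{Var}_n$ is taken as the population-style (divide-by-$n$) variance of the influence function, matching the display above.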

7) Relative effect

Baseline orthogonal signal:

$$\psi_{\mu_c,i} = \hat g_0(X_i)+u_{i0}h_{i0}, \qquad \hat\mu_c=\mathbb{E}_n[\psi_{\mu_c}].$$

Relative effect (%):

$$\hat\tau_k^{\mathrm{rel}}=100\cdot \frac{\hat\theta_k}{\hat\mu_c}.$$

The implementation uses a delta-method plug-in variance:

$$d_\theta=\frac{100}{\hat\mu_c},\qquad d_{\mu,k}=-\frac{100\,\hat\theta_k}{\hat\mu_c^2},$$

$$\widehat{\mathrm{Var}}(\hat\tau_k^{\mathrm{rel}}) \approx d_\theta^2\,\widehat{SE}(\hat\theta_k)^2 + d_{\mu,k}^2\,\widehat{SE}(\hat\mu_c)^2 + 2\,d_\theta d_{\mu,k}\,\widehat{\mathrm{Cov}}(\hat\theta_k,\hat\mu_c),$$

then

$$CI_k^{\mathrm{rel}}=\hat\tau_k^{\mathrm{rel}}\pm z_{1-\alpha/2}\,\widehat{SE}(\hat\tau_k^{\mathrm{rel}}).$$
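The delta-method step is mechanical given the pieces; a sketch, where `se_theta`, `se_mu`, and `cov` are assumed precomputed from the corresponding influence functions:

```python
import math

def relative_effect_ci(theta, mu_c, se_theta, se_mu, cov, z=1.959963984540054):
    """Delta-method SE and Wald CI for tau_rel = 100 * theta / mu_c."""
    tau = 100.0 * theta / mu_c
    d_theta = 100.0 / mu_c                    # gradient w.r.t. theta
    d_mu = -100.0 * theta / mu_c ** 2         # gradient w.r.t. mu_c
    var_tau = (d_theta ** 2 * se_theta ** 2
               + d_mu ** 2 * se_mu ** 2
               + 2.0 * d_theta * d_mu * cov)
    se_tau = math.sqrt(max(var_tau, 0.0))     # clamp tiny negative round-off
    return tau, se_tau, (tau - z * se_tau, tau + z * se_tau)
```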

8) Math pseudocode

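A hedged end-to-end sketch assembling Sections 3–6 on toy data. The learners are stand-ins (arm-wise training means for $g_k$, empirical arm shares for $m_k$); a real implementation would plug in cross-fitted ML models:

```python
import math
from statistics import NormalDist

def multi_irm(y, d_labels, fold_ids, n_arms, eps=0.01, alpha=0.05):
    """Cross-fit nuisances, trim propensities, build orthogonal scores,
    and return (theta_hat, SE, Wald CI) per contrast k vs baseline 0."""
    n = len(y)
    g = [[0.0] * n_arms for _ in range(n)]   # cross-fitted g_hat[i][k]
    m = [[0.0] * n_arms for _ in range(n)]   # cross-fitted m_hat[i][k]
    for f in set(fold_ids):
        tr = [i for i in range(n) if fold_ids[i] != f]
        te = [i for i in range(n) if fold_ids[i] == f]
        for k in range(n_arms):
            yk = [y[i] for i in tr if d_labels[i] == k]  # fold support assumed
            gk = sum(yk) / len(yk)           # stand-in outcome model
            mk = len(yk) / len(tr)           # stand-in propensity model
            for i in te:
                g[i][k], m[i][k] = gk, mk
    # Section 4: lower-bound trimming + row renormalization
    for i in range(n):
        clipped = [max(p, eps) for p in m[i]]
        s = sum(clipped)
        m[i] = [p / s for p in clipped]
    # Sections 5-6: orthogonal score and Wald inference per contrast
    z = NormalDist().inv_cdf(1 - alpha / 2)
    results = {}
    for k in range(1, n_arms):
        psi = []
        for i in range(n):
            h0 = (d_labels[i] == 0) / m[i][0]
            hk = (d_labels[i] == k) / m[i][k]
            psi.append(g[i][k] - g[i][0]
                       + (y[i] - g[i][k]) * hk - (y[i] - g[i][0]) * h0)
        theta = sum(psi) / n
        se = math.sqrt(sum((p - theta) ** 2 for p in psi)) / n
        results[k] = (theta, se, (theta - z * se, theta + z * se))
    return results
```

On a toy dataset where the outcome equals the arm label, every score is exactly the contrast and the SE collapses to zero, which is a convenient sanity check for the bookkeeping.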

References

Foundations: propensity score + unconfoundedness

  • Rosenbaum & Rubin (1983) — introduces the propensity score as a balancing score under ignorability (your whole $Y(d)\perp D\mid X$ setup).
  • Hahn (1998) — semiparametric efficiency / influence-function view of ATE estimators using propensity scores and outcome regression (conceptual basis for IF-based SEs).

Doubly-robust / AIPW (your score psi_b)

  • Robins, Rotnitzky & Zhao (1994) — classic IPW estimating equations paper; widely cited as the origin of AIPW-style augmentation logic.
  • Bang & Robins (2005) — formal “doubly robust” theory: consistency if either the outcome model or the propensity model is correct (your estimator’s key robustness property).
  • Glynn & Quinn (2010) — practitioner-friendly AIPW exposition (good to cite in docs as an accessible reference).

Multi-valued / multi-treatment propensity score & effects (your $K\ge 2$ one-hot setting)

  • Imbens (2000) — extends propensity score ideas to multi-valued treatments / dose-response; canonical reference for generalized propensity score logic.
  • Lechner (1999, IZA DP 91) — identification and estimation under CIA with multiple mutually exclusive treatments (balancing scores beyond binary).
  • Lopez & Gutman (2017, Stat Sci) — review + methods for categorical multiple treatments (matching/weighting/regression variants; good for positioning your approach).

Cross-fitting + Neyman-orthogonal / DoubleML framing (your “DoubleML-style cross-fitting” claim)

  • Chernozhukov et al. (2018, Econometrics Journal) — the modern reference for Neyman-orthogonal scores + cross-fitting delivering $\sqrt{n}$ inference with ML nuisances.
  • Chernozhukov et al. (2016, arXiv:1608.00060) — earlier/longer technical version focused on treatment/causal parameters (often cited in implementations).
  • (Optional efficiency detail) Hirano, Imbens & Ridder (2003) — shows efficiency gains/conditions when using an estimated propensity score; useful background for the weighting component.

Sensitivity analysis (your cf_y, r2_d, rho style)

  • Cinelli & Hazlett (2020, JRSSB) — modern sensitivity analysis framed via omitted-variable bias with partial $R^2$-style parameters (matches your r2_d / confounding-strength parameterization).
  • Cinelli, Ferwerda & Hazlett (sensemakr paper) — practical companion describing the implemented sensitivity summaries/statistics.
  • (Alternative tradition) Oster (2019, J Business & Econ Stats) — coefficient-stability approach (different parameterization, but often cited alongside Cinelli–Hazlett in “how sensitive is this?” docs).