
MultiTreatmentIRM


0) Assumptions

  • SUTVA / consistency

    • No interference across units.
    • No hidden treatment versions.
    • Observed outcome matches the potential outcome under realized arm.
  • Multi-arm unconfoundedness

$$\left(Y(0), Y(1), \dots, Y(K-1)\right)\perp D \mid X$$

where $D$ is one-hot with $K$ treatment columns (column 0 is baseline/control).

  • Positivity / overlap
$$\Pr(D_k=1\mid X=x)>0,\quad \forall k\in\{0,\dots,K-1\},$$

and in the implementation this is stabilized by multiclass trimming.

  • Fold-level support for cross-fitting. Each training fold must contain all treatment arms; otherwise the nuisance models for the missing arms are not identifiable.
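The fold-support requirement in particular can be verified mechanically before any model fitting. A minimal sketch (hypothetical helper, pure Python):

```python
def check_fold_support(d_labels, fold_ids, n_arms):
    """Return True if every training fold (the complement of each fold)
    contains at least one unit from every treatment arm."""
    for f in set(fold_ids):
        train_arms = {d for d, g in zip(d_labels, fold_ids) if g != f}
        if len(train_arms) < n_arms:
            return False
    return True

# 3 arms, 2 folds: each training fold covers arms {0, 1, 2}.
d = [0, 1, 2, 0, 1, 2]
print(check_fold_support(d, [0, 1, 0, 1, 0, 1], 3))  # True
# Here arm 2 appears only in fold 1, so fold 1's training data lacks it.
print(check_fold_support(d, [0, 0, 1, 0, 0, 1], 3))  # False
```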

1) Data and estimand

For i.i.d. units $i=1,\dots,n$, observe:

  • outcome: $Y_i\in\mathbb{R}$,
  • confounders: $X_i\in\mathbb{R}^p$,
  • one-hot treatment vector: $D_i=(D_{i0},\dots,D_{i,K-1})$ with $\sum_k D_{ik}=1$.

Target is a vector of baseline contrasts:

$$\theta = \begin{bmatrix} \theta_1\\ \vdots\\ \theta_{K-1} \end{bmatrix}, \qquad \theta_k = \mathbb{E}[Y(k)-Y(0)],\quad k=1,\dots,K-1.$$

2) Nuisance functions

For each arm $k$, the outcome regression is

$$g_k(x)=\mathbb{E}[Y\mid X=x,\, D_k=1].$$

Generalized propensity (multiclass):

$$m_k(x)=\Pr(D_k=1\mid X=x),\qquad \sum_{k=0}^{K-1}m_k(x)=1.$$

Cross-fitted predictions are denoted $\hat g_k(X_i)$ and $\hat m_k(X_i)$.


3) Cross-fitting

Split the sample into folds $I_1,\dots,I_F$ (n_folds=F). For each fold $f$:

  1. Train the multiclass propensity model on $I_f^c$ and predict $\hat m_k$ on $I_f$.
  2. For each arm $k$, train the outcome model only on rows in $I_f^c$ with $D_k=1$, then predict $\hat g_k$ on $I_f$.

Binary-outcome edge case used in the implementation:

If within a fold+arm the training outcome is single-class (all 0 or all 1), use a deterministic constant predictor for that arm/fold instead of fitting a classifier.
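The cross-fitting loop and the single-class edge case can be sketched as follows. The "learner" here is a stand-in (arm-wise training mean), not a real ML model:

```python
def cross_fit_outcome(y, d_labels, fold_ids, n_arms):
    """Cross-fitted outcome predictions g_hat[i][k]: for each fold, fit per-arm
    on the complement and predict on the held-out fold. Falls back to a
    deterministic constant predictor when the fold+arm outcome is single-class."""
    n = len(y)
    g_hat = [[0.0] * n_arms for _ in range(n)]
    for f in set(fold_ids):
        train = [i for i in range(n) if fold_ids[i] != f]
        test = [i for i in range(n) if fold_ids[i] == f]
        for k in range(n_arms):
            yk = [y[i] for i in train if d_labels[i] == k]  # fold support assumed
            if len(set(yk)) == 1:
                pred = yk[0]                  # single-class: constant predictor
            else:
                pred = sum(yk) / len(yk)      # stand-in for a fitted learner
            for i in test:
                g_hat[i][k] = pred
    return g_hat
```

With real learners, `pred` would come from a model fitted on the `train` rows of arm `k` and evaluated at each $X_i$ in `test`; the fold/arm bookkeeping is the same.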


4) Trimming in multi-class setup

The implementation uses lower-bound trimming followed by row renormalization:

$$\tilde m_{ik}=\max(\hat m_{ik}, \varepsilon),\qquad \hat m^{\mathrm{trim}}_{ik}=\frac{\tilde m_{ik}}{\sum_{j=0}^{K-1}\tilde m_{ij}}.$$

So each row remains a valid probability vector after trimming.
Constraint: $\varepsilon < 1/K$.
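The trimming rule above is a one-liner per row; a minimal sketch:

```python
def trim_propensities(m_row, eps):
    """Clip each class probability at eps from below, then renormalize
    so the row sums to 1 (a valid probability vector)."""
    clipped = [max(p, eps) for p in m_row]
    total = sum(clipped)
    return [p / total for p in clipped]

row = trim_propensities([0.001, 0.499, 0.5], eps=0.01)
print(abs(sum(row) - 1.0) < 1e-12)  # True: still on the simplex
```

Note that after renormalization an entry can dip slightly below $\varepsilon$ again; the clip only bounds the pre-normalization value, which is enough to stabilize the inverse weights.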


5) Orthogonal score for multi-arm ATE

Define residuals for each arm:

$$u_{ik}=Y_i-\hat g_k(X_i),\qquad k=0,\dots,K-1.$$

Define IPW representers:

$$h_{ik}=\frac{D_{ik}}{\hat m^{\mathrm{trim}}_{ik}}.$$

If normalize_ipw=True, apply column-wise Hájek normalization:

$$h_{ik}\leftarrow \frac{h_{ik}}{\mathbb{E}_n[h_{\cdot k}]}.$$

For each contrast $k=1,\dots,K-1$:

$$\psi_{b,ik} = \big(\hat g_k(X_i)-\hat g_0(X_i)\big) + u_{ik}h_{ik} - u_{i0}h_{i0}.$$

Moment system:

$$\mathbb{E}_n[\psi_a\theta+\psi_b]=0,\qquad \psi_a=-1.$$

Thus with $J=\mathbb{E}_n[\psi_a]=-1$:

$$\hat\theta_k = -\frac{\mathbb{E}_n[\psi_{b,\cdot k}]}{J} = \mathbb{E}_n[\psi_{b,\cdot k}].$$
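Numerically, the score and point estimate reduce to a few lines. A minimal sketch (Hájek normalization omitted; $\hat g$ and $\hat m^{\mathrm{trim}}$ assumed precomputed by cross-fitting):

```python
def theta_hat(y, d_onehot, g_hat, m_hat, k):
    """theta_hat_k = sample mean of psi_b for contrast k vs baseline 0
    (since J = -1, the moment equation gives theta_hat = mean(psi_b))."""
    n = len(y)
    total = 0.0
    for i in range(n):
        u_k = y[i] - g_hat[i][k]             # residual under arm k
        u_0 = y[i] - g_hat[i][0]             # residual under baseline
        h_k = d_onehot[i][k] / m_hat[i][k]   # IPW representer, arm k
        h_0 = d_onehot[i][0] / m_hat[i][0]   # IPW representer, baseline
        total += (g_hat[i][k] - g_hat[i][0]) + u_k * h_k - u_0 * h_0
    return total / n

# Toy check: if g_hat is exact (zero residuals), theta_hat is the mean
# of the regression contrasts g_k - g_0.
y = [1.0, 3.0]
d = [[1, 0], [0, 1]]
g = [[1.0, 3.0], [1.0, 3.0]]
m = [[0.5, 0.5], [0.5, 0.5]]
print(theta_hat(y, d, g, m, 1))  # 2.0
```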

6) Influence function and inference

Per-contrast influence function:

$$\widehat{IF}_{ik} = -\frac{\psi_{b,ik}+\psi_a\hat\theta_k}{J} = \psi_{b,ik}-\hat\theta_k.$$

Variance/SE:

$$\widehat{\mathrm{Var}}(\hat\theta_k)=\frac{\mathrm{Var}_n(\widehat{IF}_{\cdot k})}{n}, \qquad \widehat{SE}(\hat\theta_k)=\sqrt{\widehat{\mathrm{Var}}(\hat\theta_k)}.$$

Wald interval:

$$CI_k^{\mathrm{abs}} = \hat\theta_k \pm z_{1-\alpha/2}\,\widehat{SE}(\hat\theta_k).$$

P-values are computed per contrast via the normal approximation; the significance flag uses a Bonferroni correction:

$$p_k < \frac{\alpha}{K-1}.$$
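The inference step for one contrast, from its vector of scores $\psi_{b,\cdot k}$, can be sketched with the standard library's normal distribution (`n_contrasts` would be $K-1$ for the Bonferroni flag):

```python
import math
from statistics import NormalDist

def inference(psi_b_k, alpha=0.05, n_contrasts=1):
    """theta_hat, SE, Wald CI, normal p-value and Bonferroni flag
    for one contrast, from its per-unit scores psi_b."""
    n = len(psi_b_k)
    theta = sum(psi_b_k) / n
    infl = [p - theta for p in psi_b_k]          # IF_ik = psi_b_ik - theta_hat_k
    se = math.sqrt(sum(v * v for v in infl) / n) / math.sqrt(n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (theta - z * se, theta + z * se)
    pval = 2 * (1 - NormalDist().cdf(abs(theta / se))) if se > 0 else 0.0
    significant = pval < alpha / n_contrasts     # Bonferroni over contrasts
    return theta, se, ci, pval, significant
```

Here $\mathrm{Var}_n$ is taken as the population-style (divide-by-$n$) variance of the influence function, matching the display above.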

7) Relative effect

Baseline orthogonal signal:

$$\psi_{\mu_c,i} = \hat g_0(X_i)+u_{i0}h_{i0}, \qquad \hat\mu_c=\mathbb{E}_n[\psi_{\mu_c}].$$

Relative effect (%):

$$\hat\tau_k^{\mathrm{rel}}=100\cdot \frac{\hat\theta_k}{\hat\mu_c}.$$

The implementation uses a delta-method plug-in variance:

$$d_\theta=\frac{100}{\hat\mu_c},\qquad d_{\mu,k}=-\frac{100\,\hat\theta_k}{\hat\mu_c^2},$$

$$\widehat{\mathrm{Var}}(\hat\tau_k^{\mathrm{rel}}) \approx d_\theta^2\,\widehat{SE}(\hat\theta_k)^2 + d_{\mu,k}^2\,\widehat{SE}(\hat\mu_c)^2 + 2\,d_\theta d_{\mu,k}\,\widehat{\mathrm{Cov}}(\hat\theta_k,\hat\mu_c),$$

then

$$CI_k^{\mathrm{rel}}=\hat\tau_k^{\mathrm{rel}}\pm z_{1-\alpha/2}\,\widehat{SE}(\hat\tau_k^{\mathrm{rel}}).$$
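The delta-method step is mechanical given the pieces; a sketch, where `se_theta`, `se_mu`, and `cov` are assumed precomputed from the corresponding influence functions:

```python
import math

def relative_effect_ci(theta, mu_c, se_theta, se_mu, cov, z=1.959963984540054):
    """Delta-method SE and Wald CI for tau_rel = 100 * theta / mu_c."""
    tau = 100.0 * theta / mu_c
    d_theta = 100.0 / mu_c                    # gradient w.r.t. theta
    d_mu = -100.0 * theta / mu_c ** 2         # gradient w.r.t. mu_c
    var_tau = (d_theta ** 2 * se_theta ** 2
               + d_mu ** 2 * se_mu ** 2
               + 2.0 * d_theta * d_mu * cov)
    se_tau = math.sqrt(max(var_tau, 0.0))     # clamp tiny negative round-off
    return tau, se_tau, (tau - z * se_tau, tau + z * se_tau)
```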

8) Math pseudocode

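A hedged end-to-end sketch assembling Sections 3–6 on toy data. The learners are stand-ins (arm-wise training means for $g_k$, empirical arm shares for $m_k$); a real implementation would plug in cross-fitted ML models:

```python
import math
from statistics import NormalDist

def multi_irm(y, d_labels, fold_ids, n_arms, eps=0.01, alpha=0.05):
    """Cross-fit nuisances, trim propensities, build orthogonal scores,
    and return (theta_hat, SE, Wald CI) per contrast k vs baseline 0."""
    n = len(y)
    g = [[0.0] * n_arms for _ in range(n)]   # cross-fitted g_hat[i][k]
    m = [[0.0] * n_arms for _ in range(n)]   # cross-fitted m_hat[i][k]
    for f in set(fold_ids):
        tr = [i for i in range(n) if fold_ids[i] != f]
        te = [i for i in range(n) if fold_ids[i] == f]
        for k in range(n_arms):
            yk = [y[i] for i in tr if d_labels[i] == k]  # fold support assumed
            gk = sum(yk) / len(yk)           # stand-in outcome model
            mk = len(yk) / len(tr)           # stand-in propensity model
            for i in te:
                g[i][k], m[i][k] = gk, mk
    # Section 4: lower-bound trimming + row renormalization
    for i in range(n):
        clipped = [max(p, eps) for p in m[i]]
        s = sum(clipped)
        m[i] = [p / s for p in clipped]
    # Sections 5-6: orthogonal score and Wald inference per contrast
    z = NormalDist().inv_cdf(1 - alpha / 2)
    results = {}
    for k in range(1, n_arms):
        psi = []
        for i in range(n):
            h0 = (d_labels[i] == 0) / m[i][0]
            hk = (d_labels[i] == k) / m[i][k]
            psi.append(g[i][k] - g[i][0]
                       + (y[i] - g[i][k]) * hk - (y[i] - g[i][0]) * h0)
        theta = sum(psi) / n
        se = math.sqrt(sum((p - theta) ** 2 for p in psi)) / n
        results[k] = (theta, se, (theta - z * se, theta + z * se))
    return results
```

On a toy dataset where the outcome equals the arm label, every score is exactly the contrast and the SE collapses to zero, which is a convenient sanity check for the bookkeeping.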

References

Foundations: propensity score + unconfoundedness

  • Rosenbaum & Rubin (1983) — introduces the propensity score as a balancing score under ignorability (your whole $Y(d)\perp D\mid X$ setup).
  • Hahn (1998) — semiparametric efficiency / influence-function view of ATE estimators using propensity scores and outcome regression (conceptual basis for IF-based SEs).

Doubly-robust / AIPW (your score psi_b)

  • Robins, Rotnitzky & Zhao (1994) — classic IPW estimating equations paper; widely cited as the origin of AIPW-style augmentation logic.
  • Bang & Robins (2005) — formal “doubly robust” theory: consistency if either the outcome model or the propensity model is correct (your estimator’s key robustness property).
  • Glynn & Quinn (2010) — practitioner-friendly AIPW exposition (good to cite in docs as an accessible reference).

Multi-valued / multi-treatment propensity score & effects (your $K\ge 2$ one-hot setting)

  • Imbens (2000) — extends propensity score ideas to multi-valued treatments / dose-response; canonical reference for generalized propensity score logic.
  • Lechner (1999, IZA DP 91) — identification and estimation under CIA with multiple mutually exclusive treatments (balancing scores beyond binary).
  • Lopez & Gutman (2017, Stat Sci) — review + methods for categorical multiple treatments (matching/weighting/regression variants; good for positioning your approach).

Cross-fitting + Neyman-orthogonal / DoubleML framing (your “DoubleML-style cross-fitting” claim)

  • Chernozhukov et al. (2018, Econometrics Journal) — the modern reference for Neyman-orthogonal scores + cross-fitting delivering $\sqrt{n}$ inference with ML nuisances.
  • Chernozhukov et al. (2016, arXiv:1608.00060) — earlier/longer technical version focused on treatment/causal parameters (often cited in implementations).
  • (Optional efficiency detail) Hirano, Imbens & Ridder (2003) — shows efficiency gains/conditions when using an estimated propensity score; useful background for the weighting component.

Sensitivity analysis (your cf_y, r2_d, rho style)

  • Cinelli & Hazlett (2020, JRSSB) — modern sensitivity analysis framed via omitted-variable bias with partial $R^2$-style parameters (matches your r2_d / confounding-strength parameterization).
  • Cinelli, Ferwerda & Hazlett (sensemakr paper) — practical companion describing the implemented sensitivity summaries/statistics.
  • (Alternative tradition) Oster (2019, J Business & Econ Stats) — coefficient-stability approach (different parameterization, but often cited alongside Cinelli–Hazlett in “how sensitive is this?” docs).