
Math notation for EDA, IRM, and refutation diagnostics (Causalis)


This note summarizes the core notation and formulas used by Causalis’s EDA helpers, the IRM estimator, and refutation diagnostics. The estimator and scores follow the Double/Debiased Machine Learning (DoubleML) formulation; we implement them on top of CausalData. Throughout, $\mathbb E_n[\cdot]$ denotes the sample mean.

1. Variables and parameters

  • Observed data: $(Y_i, D_i, X_i)_{i=1}^n$.

    • $Y\in\mathbb R$ (or $\{0,1\}$) is the outcome.
    • $D\in\{0,1\}$ is a binary treatment.
    • $X\in\mathbb R^p$ are observed confounders.
  • Potential outcomes: $Y(1), Y(0)$.

  • Targets:

    • ATE: $\theta_{\mathrm{ATE}}=\mathbb E\big[Y(1)-Y(0)\big]$.
    • ATT (a.k.a. ATET/ATTE): $\theta_{\mathrm{ATT}}=\mathbb E\big[Y(1)-Y(0)\mid D=1\big]$.

Assumptions (standard): unconfoundedness $(Y(1),Y(0))\perp D\mid X$; positivity $0<\Pr(D=1\mid X)<1$ a.s.; SUTVA; and regularity conditions for cross-fitting and ML learners.

2. Nuisance functions (IRM)

  • Propensity: $m(x)=\Pr(D=1\mid X=x)$.
  • Outcome regressions: $g_1(x)=\mathbb E[Y\mid X=x,D=1]$, $g_0(x)=\mathbb E[Y\mid X=x,D=0]$.
  • Cross-fitted predictions are denoted $\hat m,\hat g_1,\hat g_0$ (length $n$).
  • Clipping: $\hat m\gets \mathrm{clip}(\hat m,\varepsilon,1-\varepsilon)$ with user trimming threshold $\varepsilon>0$.

Binary outcomes. If $Y$ is binary and the outcome learner is a classifier with predict_proba, Causalis uses the class-1 probability as $\hat g_d(x)$. For numerical stability you may also clip $\hat g_d$ into $[\delta,1-\delta]$ with a tiny $\delta$ (e.g., $10^{-6}$).
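As a minimal numpy sketch of this clipping step (the function and argument names are illustrative, not the Causalis API):

```python
import numpy as np

def clip_nuisances(m_hat, eps=1e-2, g_hat=None, delta=1e-6):
    """Clip propensity scores into [eps, 1-eps]; optionally clip a
    binary-outcome regression into [delta, 1-delta]."""
    m_clipped = np.clip(m_hat, eps, 1 - eps)
    g_clipped = None if g_hat is None else np.clip(g_hat, delta, 1 - delta)
    return m_clipped, g_clipped

# Extreme propensities get pulled back to the trimming threshold
m_clipped, _ = clip_nuisances(np.array([0.001, 0.5, 0.999]), eps=0.01)
```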

3. Scores, EIFs, and estimators

Let $u_1=Y-\hat g_1$, $u_0=Y-\hat g_0$, and

$$h_1=\frac{D}{\hat m},\qquad h_0=\frac{1-D}{1-\hat m}.$$

(Optionally, Causalis can normalize $h_1,h_0$ to have sample mean 1; this is a Hájek-style, second-order variance tweak. If you normalize in estimation, also use the normalized influence values in the variance formula.)

3.1 ATE (AIPW/DR)

Score decomposition as $\mathbb E[\psi_a\,\theta+\psi_b]=0$:

$$\psi_a=-1,\qquad \psi_b=(\hat g_1-\hat g_0)+u_1 h_1 - u_0 h_0.$$

Estimator and influence function:

$$\hat\theta_{\text{ATE}}=\mathbb E_n[\psi_b],\qquad \hat\psi_i=\psi_{b,i}-\hat\theta_{\text{ATE}}.$$

Efficient influence function (at the truth $(g_d,m,\theta)$):

$$\psi_{\text{ATE}}(W)=\big(g_1-g_0-\theta\big)+\frac{D}{m}(Y-g_1)-\frac{1-D}{1-m}(Y-g_0).$$

Compact identity (useful in code). With $g_D(X)=Dg_1(X)+(1-D)g_0(X)$ and $r=Y-g_D(X)$,

$$\psi_{\text{ATE}}=(g_1-g_0-\theta)+\Big(\tfrac{D}{m}-\tfrac{1-D}{1-m}\Big)\,r.$$
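The estimator in §3.1 can be sketched directly from these formulas, assuming numpy arrays of cross-fitted predictions (all names illustrative, not the Causalis API):

```python
import numpy as np

def ate_aipw(y, d, g0_hat, g1_hat, m_hat):
    """AIPW/DR: psi_b = (g1 - g0) + D/m * (Y - g1) - (1-D)/(1-m) * (Y - g0);
    theta_hat = mean(psi_b); influence values psi_i = psi_b_i - theta_hat."""
    h1 = d / m_hat
    h0 = (1 - d) / (1 - m_hat)
    psi_b = (g1_hat - g0_hat) + h1 * (y - g1_hat) - h0 * (y - g0_hat)
    theta_hat = psi_b.mean()
    return theta_hat, psi_b - theta_hat

# Toy check: exact nuisances, noiseless constant effect of 1
d = np.array([1.0, 0.0, 1.0, 0.0])
y = d.copy()                                  # Y(1)=1, Y(0)=0
theta, psi = ate_aipw(y, d, np.zeros(4), np.ones(4), np.full(4, 0.5))
```

With exact nuisances and no noise, the residual terms vanish, so the estimate is exactly 1 and the influence values are identically zero.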

3.2 ATT (a.k.a. ATET/ATTE)

Let $p_1=\mathbb E[D]$ (estimated by the sample mean $\hat p_1=\mathbb E_n[D]$). Define the control reweighting factor $r_0(X)=\frac{\hat m}{1-\hat m}$. Then

$$\psi_a=-\frac{D}{p_1},\qquad \psi_b=\frac{D}{p_1}(\hat g_1-\hat g_0)+\frac{D}{p_1}(Y-\hat g_1)-\frac{1-D}{p_1}\frac{\hat m}{1-\hat m}(Y-\hat g_0).$$

Estimator and influence function:

$$\hat\theta_{\text{ATT}}=\mathbb E_n[\psi_b],\qquad \hat\psi_i=\psi_{b,i}+\psi_{a,i}\,\hat\theta_{\text{ATT}}.$$

Because $\mathbb E_n[D/\hat p_1]=1$, this choice centers $\hat\psi$ at zero in-sample. If you use fold-specific $\hat p_{1,k}$, either center $\hat\psi$ per fold or re-express everything with the global $\hat p_1$.
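A minimal numpy sketch of the ATT score and estimator above, using the global $\hat p_1$ (names illustrative, not the Causalis API):

```python
import numpy as np

def att_score(y, d, g0_hat, g1_hat, m_hat):
    """ATT score with the global p1_hat = mean(D):
    theta_hat = mean(psi_b); psi_i = psi_b_i + psi_a_i * theta_hat."""
    p1 = d.mean()
    psi_a = -d / p1
    psi_b = ((d / p1) * (g1_hat - g0_hat)
             + (d / p1) * (y - g1_hat)
             - ((1 - d) / p1) * (m_hat / (1 - m_hat)) * (y - g0_hat))
    theta_hat = psi_b.mean()
    return theta_hat, psi_b + psi_a * theta_hat

# Toy check: exact nuisances, noiseless constant effect of 1
d = np.array([1.0, 0.0, 1.0, 0.0])
y = d.copy()
theta_att, psi_att = att_score(y, d, np.zeros(4), np.ones(4), np.full(4, 0.5))
```

Note how the in-sample centering works: because $\mathbb E_n[D/\hat p_1]=1$, adding $\psi_{a,i}\,\hat\theta$ makes the influence values average to zero.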

> Equivalent residual-weight form. With $\bar w=\hat m/p_1$:
>
> $$\bar w\,(u_1 h_1-u_0 h_0)=\frac{D}{p_1}(Y-\hat g_1)-\frac{1-D}{p_1}\frac{\hat m}{1-\hat m}(Y-\hat g_0).$$

Efficient influence function (at the truth $(g_d,m,\theta)$):

$$\psi_{\text{ATT}}(W)=\frac{D}{p_1}\big(g_1-g_0-\theta\big)+\frac{D}{p_1}(Y-g_1)-\frac{1-D}{p_1}\frac{m}{1-m}(Y-g_0).$$

Weight-sum identity (diagnostic).

$$\mathbb{E}\!\left[\frac{1-D}{p_1}\frac{m}{1-m}\right]=\frac{1}{p_1}\,\mathbb{E}\!\left[(1-D)\frac{m}{1-m}\right]=\frac{1}{p_1}\,\mathbb{E}[m]=\frac{p_1}{p_1}=1.$$

Equivalently, without the $1/p_1$ factor:

$$\mathbb{E}\!\left[(1-D)\frac{m}{1-m}\right]=\mathbb{E}[D]=p_1.$$

So in-sample diagnostics can be phrased as either:

  • $\sum_{i}(1-d_i)\tfrac{\hat m_i}{1-\hat m_i}\approx \sum_i d_i$ (raw factors), or
  • $\sum_{i}(1-d_i)\tfrac{\hat m_i}{(1-\hat m_i)\,\hat p_1}\approx n$ (ATT control weights match the treated weights $\sum_i d_i/\hat p_1 = n$).
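A sketch of the raw-factor version of this diagnostic (illustrative names; the toy call assumes a constant $\hat m=0.5$, so the two masses balance exactly):

```python
import numpy as np

def att_weight_diagnostic(d, m_hat):
    """Return (control mass, treated mass); the two should roughly agree
    when m_hat is well calibrated."""
    control = np.sum((1 - d) * m_hat / (1 - m_hat))
    treated = np.sum(d)
    return control, treated

d = np.array([1.0, 0.0, 1.0, 0.0])
control, treated = att_weight_diagnostic(d, np.full(4, 0.5))
```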

3.3 Orthogonality (Neyman)

For $\eta=(g_0,g_1,m)$,

$$\frac{\partial}{\partial t}\,\mathbb E\big[\psi(W;\theta_0,\eta_0+t\,h)\big]\Big|_{t=0}=0$$

for rich directions $h$. Causalis provides OOS moment checks and numerical derivative diagnostics to assess this.

Useful partial derivatives (ATE):

$$\frac{\partial\psi}{\partial g_1}=1-\frac{D}{m},\quad \frac{\partial\psi}{\partial g_0}=-1+\frac{1-D}{1-m},\quad \frac{\partial\psi}{\partial m}=-\frac{D}{m^2}(Y-g_1)-\frac{1-D}{(1-m)^2}(Y-g_0),$$

whose conditional expectations given $X$ vanish at the truth.

Useful partial derivatives (ATT):

$$\frac{\partial\psi}{\partial g_1}=\tfrac{D}{p_1}-\tfrac{D}{p_1}=0,\quad \frac{\partial\psi}{\partial g_0}=-\tfrac{D}{p_1}+\tfrac{1-D}{p_1}\tfrac{m}{1-m},\quad \frac{\partial\psi}{\partial m}=-\frac{1-D}{p_1}\frac{1}{(1-m)^2}(Y-g_0),$$

and each has zero conditional mean at the truth.
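These zero-derivative claims can be illustrated numerically on simulated data with known nuisances (a hand-rolled check, not the Causalis diagnostic): perturb $g_1$ in a direction $h(X)$ and finite-difference the ATE moment.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
m = 1.0 / (1.0 + np.exp(-x))                  # true propensity
d = rng.binomial(1, m).astype(float)
g1, g0 = x + 1.0, x                           # true regressions (theta = 1)
y = np.where(d == 1, g1, g0) + rng.normal(size=n)

def ate_moment(t, h):
    """E_n[psi_ATE] with g_1 perturbed to g_1 + t*h."""
    g1_t = g1 + t * h
    psi = (g1_t - g0 - 1.0) + d / m * (y - g1_t) - (1 - d) / (1 - m) * (y - g0)
    return psi.mean()

h = np.cos(x)                                 # an arbitrary direction h(X)
t = 1e-3
deriv = (ate_moment(t, h) - ate_moment(-t, h)) / (2 * t)
# |deriv| is small: the moment is first-order insensitive to errors in g_1
```

Here the central difference recovers $\mathbb E_n[(1-D/m)h]$ exactly (the score is linear in $g_1$), and its population value is zero at the truth.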

4. Estimation (cross-fitting)

  • Split into $K$ folds (stratified by $D$).
  • On each train fold, fit learners for $g_0,g_1,m$; predict on the held-out fold to build cross-fitted $\hat g_0,\hat g_1,\hat m$.
  • Compute $\hat\theta$ as above. Let $\hat\psi_i$ be the estimated influence function values (computed OOS per fold).
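A dependency-free sketch of the fold mechanics, with deliberately trivial stand-in "learners" (train-fold means instead of ML fits on $X$); everything here is illustrative, not the Causalis API:

```python
import numpy as np

def crossfit_nuisances(y, d, n_folds=5, seed=0):
    """K-fold cross-fitting, stratified by D. Each observation's nuisance
    prediction comes from a model 'fit' only on the other folds."""
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.empty(n, dtype=int)
    for arm in (0, 1):                                   # stratify by D
        idx = rng.permutation(np.flatnonzero(d == arm))
        folds[idx] = np.arange(len(idx)) % n_folds
    g0_hat, g1_hat, m_hat = np.empty(n), np.empty(n), np.empty(n)
    for k in range(n_folds):
        train, test = folds != k, folds == k
        g1_hat[test] = y[train & (d == 1)].mean()        # stand-in for E[Y|X,D=1]
        g0_hat[test] = y[train & (d == 0)].mean()        # stand-in for E[Y|X,D=0]
        m_hat[test] = d[train].mean()                    # stand-in for Pr(D=1|X)
    return g0_hat, g1_hat, m_hat

rng = np.random.default_rng(1)
d = rng.binomial(1, 0.4, size=300).astype(float)
y = 2.0 * d + rng.normal(size=300)
g0_hat, g1_hat, m_hat = crossfit_nuisances(y, d)
```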

Variance and CI (single parameter):

$$\widehat{\mathrm{Var}}(\hat\theta)=\frac{1}{n}\Big(\frac{1}{n}\sum_{i=1}^n \hat\psi_i^2\Big)=\frac{1}{n^2}\sum_{i=1}^n \hat\psi_i^2,\quad \mathrm{se}=\sqrt{\widehat{\mathrm{Var}}(\hat\theta)},\quad \text{CI}_{1-\alpha}=\hat\theta\pm z_{1-\alpha/2}\,\mathrm{se}.$$
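A sketch of this variance/CI computation from influence values (illustrative names; the stdlib `NormalDist` supplies $z_{1-\alpha/2}$):

```python
import numpy as np
from statistics import NormalDist

def ci_from_influence(theta_hat, psi, alpha=0.05):
    """Var_hat = mean(psi^2) / n; CI = theta_hat +/- z_{1-alpha/2} * se.
    Assumes psi is already centered (mean ~ 0)."""
    psi = np.asarray(psi, dtype=float)
    se = float(np.sqrt(np.mean(psi ** 2) / psi.size))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return theta_hat - z * se, theta_hat + z * se, se

# n=100 unit-variance influence values -> se = 0.1
lo, hi, se = ci_from_influence(1.0, np.ones(100))
```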

Hájek normalization (optional, ATE). Replace $h_1=D/\hat m$ with $h_1^\star=\dfrac{D/\hat m}{\mathbb E_n[D/\hat m]}$ and similarly $h_0^\star$. This preserves asymptotics (orthogonality) and can reduce finite-sample variance; it slightly alters the finite-sample IF, so use the normalized $\hat\psi$ in variance calculations. For ATT, it is common to normalize control weights so their sum matches the count of treated units (the diagnostic above already ensures this in expectation).

5. Positivity (overlap) & trimming

  • EDA reports the distribution of $\hat m$ and the share of observations near 0 or 1.
  • Clipping $\hat m$ stabilizes the IPW terms. Under positivity (true $m$ bounded away from 0 and 1) it is asymptotically innocuous for ATE/ATT but may introduce small finite-sample bias; hard trimming (dropping units by a propensity threshold) changes the target population and should be interpreted accordingly.

6. Refutation diagnostics

6.1 OOS moment checks

Compute $\hat\psi$ on each test fold using fold-specific nuisances. Verify that the empirical mean of $\hat\psi$ (and its conditional moments against simple basis functions of $X$) is close to zero. APIs: refute_irm_orthogonality, oos_moment_check, influence_summary.

6.2 Orthogonality derivatives

Numerically evaluate derivatives of the moment condition w.r.t. perturbations in $(g_0,g_1,m)$ at $\hat\theta$. Small magnitudes support Neyman orthogonality in practice. APIs: orthogonality_derivatives (ATE) and ATT variants. (Derivatives summarized in §3.3.)

6.3 Sensitivity (heuristic bias bounds)

Following the DoubleML structure, Causalis exposes a Cauchy–Schwarz–style worst-case bias bound aligned with the score:

  • Outcome noise scale (pooled):
$$\hat\sigma^2=\mathbb E_n\big[(Y-\hat g_D(X))^2\big],\qquad \hat g_D(X)=D\hat g_1(X)+(1-D)\hat g_0(X).$$
  • Score-weight norm (ATE):
$$\hat v^2_{\text{ATE}}=\mathbb E_n\Big[\Big(\tfrac{D}{\hat m}-\tfrac{1-D}{1-\hat m}\Big)^2\Big].$$

(Equivalently $\mathbb E_n[(D/\hat m)^2]+\mathbb E_n[((1-D)/(1-\hat m))^2]$, since $D(1-D)=0$ pointwise.)

  • Score-weight norm (ATT):
$$\hat v^2_{\text{ATT}}=\mathbb E_n\!\big[(D/p_1)^2\big]+\mathbb E_n\!\big[\big((1-D)\,\hat m/\{p_1(1-\hat m)\}\big)^2\big].$$

For user-chosen sensitivity multipliers $c_y,c_d\ge 0$ and correlation cap $\rho\in[0,1]$,

$$\mathrm{max\_bias}(\rho)\;\le\;\rho\,\sqrt{(c_y\,\hat\sigma^2)\,(c_d\,\hat v^2)}.$$
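The pooled ATE bound can be sketched as (illustrative names, not the Causalis `sensitivity_analysis` signature):

```python
import numpy as np

def max_bias_bound_ate(y, d, g0_hat, g1_hat, m_hat, cy=1.0, cd=1.0, rho=1.0):
    """rho * sqrt(cy*sigma2 * cd*v2), with pooled outcome-noise scale
    sigma2 and the ATE score-weight norm v2."""
    g_d = np.where(d == 1, g1_hat, g0_hat)
    sigma2 = np.mean((y - g_d) ** 2)
    v2 = np.mean((d / m_hat - (1 - d) / (1 - m_hat)) ** 2)
    return rho * np.sqrt((cy * sigma2) * (cd * v2))

# Unit residuals and m_hat = 0.5 give sigma2 = 1, v2 = 4, bound = 2
d = np.array([1.0, 0.0])
bound = max_bias_bound_ate(d + 1.0, d, np.zeros(2), np.ones(2), np.full(2, 0.5))
```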

Optional refinement (tighter but still heuristic). Split outcome variance by arm:

$$\hat\sigma_1^2=\mathbb E_n[(Y-\hat g_1)^2\mid D=1],\qquad \hat\sigma_0^2=\mathbb E_n[(Y-\hat g_0)^2\mid D=0],$$

and use

$$\sqrt{\hat\sigma_1^2\,\mathbb E_n[(D/\hat m)^2]+\hat\sigma_0^2\,\mathbb E_n[((1-D)/(1-\hat m))^2]}$$

in place of $\sqrt{\hat\sigma^2\,\hat v^2}$ for the ATE. APIs: IRM.sensitivity_analysis, refutation/sensitivity.py. Bounds remain heuristic (not identified).

7. Implementation guardrails

  • Abort/warn if $\hat p_1\in\{0,1\}$ (ATT undefined / division by zero).
  • Enforce small clipping on $\hat m$ and (for classifiers) on $\hat g_d$ to prevent exploding residual-weight products.
  • Stratify folds by $D$ when cross-fitting.
  • If using fold-specific denominators (e.g., $\hat p_{1,k}$), ensure fold-wise centering of $\hat\psi$ or re-express with the global $\hat p_1$.