Research · 24 min read

Refutation of DML IRM inference

Automated conversion of refutation_flow.ipynb

We will use a DGP from Causalis. Read more at https://causalis.causalcraft.com/articles/generate_obs_hte_26_rich

Result
   y          d    tenure_months  avg_sessions_week  spend_last_month  age_years  income_monthly  prior_purchases_12m  support_tickets_90d  premium_user  mobile_user  urban_resident  referred_user  m         m_obs     tau_link  g0         g1         cate
0  0.000000   0.0  28.814654      1.0                77.936767         50.234101  1926.698301     1.0                  2.0                  1.0           1.0          1.0             0.0            0.045453  0.045453  0.089095  8.137981   9.142395   1.004414
1  80.099611  1.0  25.913345      3.0                53.777740         28.115859  5104.271509     3.0                  0.0                  1.0           1.0          0.0             1.0            0.041514  0.041514  0.246679  60.459257  78.817307  18.358049
2  6.400482   1.0  24.969929      10.0               134.764322        22.907062  5267.938255     8.0                  3.0                  0.0           1.0          1.0             0.0            0.052593  0.052593  0.162968  7.712855   9.138577   1.425723
3  2.788238   0.0  40.655089      5.0                59.517074         31.970490  6597.327018     3.0                  2.0                  1.0           1.0          1.0             0.0            0.036221  0.036221  0.188755  25.386510  31.159932  5.773422
4  0.000000   0.0  18.560899      3.0                74.370930         39.237248  4930.009628     5.0                  1.0                  1.0           1.0          0.0             0.0            0.036343  0.036343  0.174757  15.359250  18.600227  3.240977
Result

CausalData(df=(100000, 13), treatment='d', outcome='y', confounders=['tenure_months', 'avg_sessions_week', 'spend_last_month', 'age_years', 'income_monthly', 'prior_purchases_12m', 'support_tickets_90d', 'premium_user', 'mobile_user', 'urban_resident', 'referred_user'])

Inference

Result
field            value
estimand         ATTE
model            IRM
value            11.9222 (ci_abs: 7.3425, 16.5018)
value_relative   25.5928 (ci_rel: 15.5972, 35.5885)
alpha            0.0500
p_value          0.0000
is_significant   True
n_treated        4949
n_control        95051
treatment_mean   58.5062
control_mean     76.0871
time             2026-02-20

Overlap

What “overlap/positivity” means

Binary treatment T \in \{0,1\}: for all confounder values x in your target population,

0 < e(x) := P(T = 1 \mid X = x) < 1,

often strengthened to strong positivity: there exists an \varepsilon > 0 such that

\varepsilon \le e(x) \le 1 - \varepsilon \quad \text{almost surely.}

Why it matters

  • Identification: Overlap + unconfoundedness are the two pillars that identify causal effects from observational data. Without overlap, the effect is not identified — you must extrapolate into regions where one treatment arm is never observed.

  • Estimation stability: IPW/DR estimators use weights

w_1 = \frac{D}{e(X)}, \qquad w_0 = \frac{1 - D}{1 - e(X)}.

If e(X) is near 0 or 1, these weights explode, causing huge variance and fragile estimates.

  • Target population: With trimming or restriction, you may change who the effect describes — e.g., ATE on the region of common support, not on the full population.
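The weight formulas above can be sketched in a few lines of numpy (the helper name `ipw_weights` is mine):

```python
import numpy as np

def ipw_weights(d, e):
    """ATE-style inverse-propensity weights: w1 = D/e(X), w0 = (1-D)/(1-e(X))."""
    d = np.asarray(d, dtype=float)
    e = np.asarray(e, dtype=float)
    return d / e, (1 - d) / (1 - e)

# One treated unit with a near-zero propensity dominates the treated arm:
d = np.array([1, 1, 0, 0])
e = np.array([0.5, 0.001, 0.5, 0.5])
w1, w0 = ipw_weights(d, e)
# w1 = [2, 1000, 0, 0] -- a single unit carries almost all the weight,
# which is exactly the variance explosion described above.
```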

Summary of overlap diagnostics

Result
    metric               value      flag
0   edge_0.01_below      0.018540   GREEN
1   edge_0.01_above      0.000000   GREEN
2   edge_0.02_below      0.112650   RED
3   edge_0.02_above      0.000000   RED
4   KS                   0.190677   GREEN
5   AUC                  0.626645   GREEN
6   ESS_treated_ratio    0.464357   GREEN
7   ESS_control_ratio    0.997961   GREEN
8   tails_w1_q99/med     5.191133   YELLOW
9   tails_w0_q99/med     1.184178   GREEN
10  ATT_identity_relerr  0.012196   GREEN
11  clip_m_total         0.018540   GREEN
12  calib_ECE            0.005415   GREEN
13  calib_slope          0.671576   YELLOW
14  calib_intercept      -0.907254  RED

edge_mass

edge_0.01_below, edge_0.01_above, edge_0.02_below, edge_0.02_above are the shares of units whose propensity falls below \varepsilon or above 1 - \varepsilon, for \varepsilon = 0.01 and \varepsilon = 0.02.

To keep in mind: DML IRM clips propensities to the interval [0.02, 0.98].

Large shares here are dangerous for estimation: the corresponding IPW weights explode.

Flags in Causalis:

For \varepsilon = 0.01: YELLOW if either side ≥ 0.02 (2%), RED if ≥ 0.05 (5%).

For \varepsilon = 0.02: YELLOW if either side ≥ 0.05 (5%), RED if ≥ 0.10 (10%).

Result

{'share_below_001': 0.01854, 'share_above_001': 0.0, 'share_below_002': 0.11265, 'share_above_002': 0.0, 'min_m': 0.00026454670780584426, 'max_m': 0.6376540197829689}
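A minimal sketch of how these edge-mass shares can be computed from estimated propensities (the helper name `edge_mass` is mine):

```python
import numpy as np

def edge_mass(m, eps):
    """Shares of units with propensity below eps or above 1 - eps."""
    m = np.asarray(m, dtype=float)
    return float(np.mean(m < eps)), float(np.mean(m > 1 - eps))

m = np.array([0.005, 0.015, 0.20, 0.50, 0.985, 0.995])
below_001, above_001 = edge_mass(m, 0.01)   # 1/6 below, 1/6 above
below_002, above_002 = edge_mass(m, 0.02)   # 2/6 below, 2/6 above
```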

ks - Kolmogorov–Smirnov statistic

Here KS is the two-sample Kolmogorov–Smirnov statistic comparing the distributions of the propensities for treated vs control:

D = \max_t |\hat F_A(t) - \hat F_B(t)|

Interpretation:

  • D = 0: identical distributions (perfect overlap).
  • D = 1: complete separation (no overlap).
  • Your value KS = 0.1907 means there exists a threshold t such that the share of treated with m \le t differs from the share of controls with m \le t by ~19 percentage points. That is below the RED threshold (these diagnostics mark RED when KS > 0.35), so it’s flagged GREEN: treatment assignment is only moderately predictable from covariates, consistent with workable overlap.
Result

0.19067652887832237
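The KS statistic above is straightforward to compute directly from the two propensity samples (the helper name `ks_statistic` is mine; it should agree with `scipy.stats.ks_2samp`):

```python
import numpy as np

def ks_statistic(m_treated, m_control):
    """Two-sample KS statistic: max gap between the two empirical CDFs."""
    grid = np.sort(np.concatenate([m_treated, m_control]))
    f_t = np.searchsorted(np.sort(m_treated), grid, side="right") / len(m_treated)
    f_c = np.searchsorted(np.sort(m_control), grid, side="right") / len(m_control)
    return float(np.max(np.abs(f_t - f_c)))

a = np.array([0.1, 0.2, 0.3])
b = np.array([0.7, 0.8, 0.9])
d_same = ks_statistic(a, a)   # 0.0: identical distributions, perfect overlap
d_full = ks_statistic(a, b)   # 1.0: disjoint supports, complete separation
```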

AUC

Probability definition (most intuitive)

\text{AUC} = \Pr\big(s^+ > s^-\big) + \tfrac12 \Pr\big(s^+ = s^-\big),

where s^+ is a score from a random positive and s^- from a random negative. So AUC is the fraction of all n_1 n_0 positive–negative pairs that are correctly ordered by the score (ties get half credit).

Rank / Mann–Whitney formulation

  1. Rank all scores together (ascending). If there are ties, assign average ranks within each tied block.
  2. Let R_1 be the sum of ranks for the positives.
  3. Compute the Mann–Whitney U statistic for positives:
     U_1 = R_1 - \frac{n_1(n_1+1)}{2}.
  4. Convert to AUC by normalizing:
     \boxed{\text{AUC} = \frac{U_1}{n_1 n_0} = \frac{R_1 - \frac{n_1(n_1+1)}{2}}{n_1 n_0}}

This is exactly what your function returns (with stable sorting and tie-averaged ranks).
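The rank/Mann–Whitney recipe above can be sketched as (the helper name `rank_auc` is mine, not the library implementation):

```python
import numpy as np

def rank_auc(scores, labels):
    """AUC via the Mann-Whitney U statistic with tie-averaged ranks."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(scores, kind="stable")
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    for v in np.unique(scores):          # average ranks within tied blocks
        tie = scores == v
        ranks[tie] = ranks[tie].mean()
    n1 = labels.sum()
    n0 = len(labels) - n1
    u1 = ranks[labels == 1].sum() - n1 * (n1 + 1) / 2
    return float(u1 / (n1 * n0))

auc = rank_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])   # 3 of 4 pairs ordered correctly -> 0.75
```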

ROC-integral view (equivalent)

If \text{TPR}(t) and \text{FPR}(t) trace the ROC curve as the threshold t moves,

\text{AUC} = \int_0^1 \text{TPR}\big(\text{FPR}^{-1}(u)\big)\,du,

i.e., the geometric area under the ROC curve.

Properties you should remember

  • Range: 0 \le \text{AUC} \le 1; 0.5 = random ranking; 1 = perfect separation.
  • Symmetry: \text{AUC}(s, y) = 1 - \text{AUC}(s, 1-y).
  • Monotone invariance: any strictly increasing transform f leaves AUC unchanged (only ranks matter).
  • Ties: averaged ranks automatically add the \tfrac12 \Pr(s^+ = s^-) term.

In the propensity/overlap context

  • A higher AUC means treatment (D) is more predictable from covariates (bad for overlap/positivity).
  • For good overlap you actually want AUC close to 0.5.
Result

0.6266454580150003

ESS_treated_ratio

Weights used

For ATE-style IPW, the treated-arm weights are

w_i = \frac{D_i}{m_i} \quad\Rightarrow\quad w_i = \begin{cases} 1/m_i, & D_i = 1 \\ 0, & D_i = 0 \end{cases}

so on the treated subset \{i : D_i = 1\} the weights are simply 1/m_i.

Effective sample size (ESS)

Given the treated-arm weights w_1, \ldots, w_{n_1} (only for D = 1),

\mathrm{ESS} = \frac{\left(\sum_{i=1}^{n_1} w_i\right)^2}{\sum_{i=1}^{n_1} w_i^2}.

This is exactly what _ess(w) computes.

  • If all treated weights are equal, ESS = n_1 (full efficiency).
  • If a few weights dominate, ESS drops (information concentrated in few units).

The reported metric

\boxed{ \mathrm{ESS}_{\text{treated ratio}} = \frac{\mathrm{ESS}}{n_1} = \frac{\left(\sum_i w_i\right)^2}{n_1 \sum_i w_i^2} }

This lies in (0, 1]. Near 1 ⇒ well-behaved weights; near 0 ⇒ severe instability.

Why it reflects overlap

When propensities m_i approach 0 for treated units, the weights 1/m_i explode → large CV → low ESS_treated_ratio. Hence this metric is a direct, quantitative read on how much usable information remains in the treated group after IPW.

Result

{'ess_w1': 2298.1025570851816, 'ess_w0': 94857.16349432156, 'ess_ratio_w1': 0.4643569523308106, 'ess_ratio_w0': 0.9979607105061657}
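The ESS formula can be sketched in a few lines (the helper name `ess_ratio` is mine):

```python
import numpy as np

def ess_ratio(w):
    """Kish effective sample size of weights w, as a fraction of len(w)."""
    w = np.asarray(w, dtype=float)
    ess = w.sum() ** 2 / (w ** 2).sum()
    return float(ess / len(w))

r_equal = ess_ratio([1, 1, 1, 1])      # 1.0: equal weights, full efficiency
r_skewed = ess_ratio([100, 1, 1, 1])   # far below 1: one unit dominates
```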

tails_w1_q99/med

\boxed{ \texttt{tails\_w1\_q99/med} = \frac{Q_{0.99}(W_1)}{\mathrm{median}(W_1)} }

Interpretation

  • It’s a tail-heaviness index for treated weights: how large the 99th-percentile weight is relative to a typical (median) weight.
  • Scale-invariant: if you re-scale weights (e.g., Hájek normalization), both numerator and denominator scale equally, so the ratio is unchanged.
  • Bigger \Rightarrow heavier right tail \Rightarrow more variance inflation for IPW (since the variance depends on large w_i^2). It typically coincides with a low \mathrm{ESS}_{\text{treated ratio}}.

Edge cases & thresholds

  • If \mathrm{median}(W_1) = 0 or undefined, the ratio is not meaningful (your code returns “NA” in that case; with positive treated weights this is rare).
  • Defaults: YELLOW if any of \{q95/\mathrm{med},\, q99/\mathrm{med},\, q999/\mathrm{med},\, \max/\mathrm{med}\} exceeds 10; RED if any exceeds 100. tails_w1_q99/med is one of these checks, focusing specifically on the 99th percentile.

Quick example

If \mathrm{median}(W_1) = 1.2 and Q_{0.99}(W_1) = 46.8, then

\texttt{tails\_w1\_q99/med} = \frac{46.8}{1.2} = 39,

indicating heavy tails and a likely unstable ATE IPW.

Result

{'w1': {'q95': 54.954191239648, 'q99': 103.63534845395834, 'q999': 326.6877024294826, 'max': 1080.1560842371114, 'median': 19.963916032820617}, 'w0': {'q95': 1.126007261158915, 'q99': 1.2305275677823044, 'q999': 1.4996542994796511, 'max': 2.759793276583444, 'median': 1.0391406792873217}}
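A quick sketch of the tail-ratio computation (the helper name `tail_ratio` is mine; the lognormal weights are synthetic):

```python
import numpy as np

def tail_ratio(w, q=0.99):
    """Quantile-to-median ratio of the weight distribution (NaN if median is 0)."""
    w = np.asarray(w, dtype=float)
    med = np.median(w)
    return float(np.quantile(w, q) / med) if med > 0 else float("nan")

# Heavy-tailed synthetic weights: the q99/median ratio clears the warn threshold of 10.
rng = np.random.default_rng(0)
w = rng.lognormal(mean=0.0, sigma=1.5, size=100_000)
ratio = tail_ratio(w)
```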

ATT_identity_relerr

With estimated propensities m_i = \hat m(X_i) and D_i \in \{0,1\}:

  • Left-hand side (controls’ odds sum):
    \text{LHS} = \sum_{i=1}^n (1 - D_i)\,\frac{m_i}{1 - m_i}.
  • Right-hand side (treated count):
    \text{RHS} = \sum_{i=1}^n D_i = n_1.

If \hat m \approx m and overlap is OK, LHS \approx RHS.

You report the relative error:

\boxed{ \texttt{ATT\_identity\_relerr} = \frac{\big|\mathrm{LHS} - \mathrm{RHS}\big|}{\mathrm{RHS}} }

(when n_1 > 0; otherwise it is set to \infty).

How to read the number

  • Small relerr (e.g., \le 5\%) ⇒ propensities are reasonably calibrated (especially on the control side) and ATT weights won’t be wildly off in total mass.
  • Large relerr ⇒ possible miscalibration of \hat m (e.g., over/underestimation for controls), poor overlap (many controls with m_i \to 1 inflating m_i/(1 - m_i)), or clipping/trimming effects.

Your default flags (same as in the code):

  • GREEN if relerr \le 0.05
  • YELLOW if 0.05 < relerr \le 0.10
  • RED if relerr > 0.10

Quick intuition

The term m/(1-m) is the odds of treatment. Summing it over controls should reconstruct the treated count. If it doesn’t, either the odds are off (propensity miscalibration) or the data lack support where you need it—both are red flags for ATT-IPW stability.

Result

{'lhs_sum': 4888.639934058683, 'rhs_sum': 4949.0, 'rel_err': 0.012196416637970673}
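The identity check above takes only a few lines to reproduce (the helper name `att_identity_relerr` and the simulated propensities are mine):

```python
import numpy as np

def att_identity_relerr(d, m):
    """Relative error of the ATT identity: sum of control odds vs treated count."""
    d = np.asarray(d, dtype=float)
    m = np.asarray(m, dtype=float)
    lhs = np.sum((1 - d) * m / (1 - m))
    rhs = d.sum()
    return float(abs(lhs - rhs) / rhs) if rhs > 0 else float("inf")

err_exact = att_identity_relerr([1, 0], [0.5, 0.5])   # odds of the one control = 1 = n_1

# With well-calibrated propensities the identity holds up to sampling noise:
rng = np.random.default_rng(1)
m = rng.uniform(0.05, 0.5, size=50_000)
d = (rng.uniform(size=m.size) < m).astype(float)
err = att_identity_relerr(d, m)   # small, typically well inside the GREEN band
```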

clip_m_total

See edge_mass above: clip_m_total is the total share of propensities clipped at either edge.

Result

{'m_clip_lower': 0.01854, 'm_clip_upper': 0.0}

calib_ECE, calib_slope, calib_intercept

calib_ECE = 0.0054 (GREEN)

Math: with 10 equal-width bins,

\text{ECE} = \sum_{k=1}^{10} \frac{n_k}{n}\,\big|\bar y_k - \bar p_k\big|

(weighted average gap between the observed rate \bar y_k and the mean prediction \bar p_k per bin). Result: ~0.5% average miscalibration → overall probabilities track outcomes well. Note the biggest bin errors sit in the sparse upper bins (e.g., 0.6–0.7 with abs_error ≈ 0.626), but those bins are tiny (4 of 100,000 units), so ECE stays low.

calib_slope (β) = 0.672 (YELLOW)

Math (logistic recalibration):

\Pr(D = 1 \mid p) = \sigma\big(\alpha + \beta\,\mathrm{logit}(p)\big).

Interpretation: \beta < 1 ⇒ predictions are over-confident (too extreme); the optimal recalibration flattens them toward 0.5. Here \beta \approx 0.67 is far enough from 1 to earn a YELLOW flag.

calib_intercept (α) = −0.907 (RED)

Math: same model as above; \alpha is a vertical shift on the log-odds scale. Interpretation: a negative \alpha nudges probabilities downward overall (your model is, on average, too high), consistent with bins like 0.5–0.6 where \bar p_k > \bar y_k. At −0.907 the shift is large enough for a RED flag.

ECE is comfortably GREEN, but the slope (YELLOW) and especially the intercept (RED) point to systematic overprediction of the propensity; recalibrating \hat m is worth considering despite the low average error.

Result

{'n': 100000, 'n_bins': 10, 'auc': 0.6266454580150003, 'brier': 0.046645568204131085, 'ece': 0.005414555766917116, 'reliability_table': bin lower upper count mean_p frac_pos abs_error 0 0 0.0 0.1 92701 0.040462 0.044045 0.003583 1 1 0.1 0.2 6411 0.131200 0.112463 0.018737 2 2 0.2 0.3 709 0.236655 0.172073 0.064582 3 3 0.3 0.4 135 0.337102 0.133333 0.203769 4 4 0.4 0.5 32 0.435279 0.093750 0.341529 5 5 0.5 0.6 8 0.559499 0.250000 0.309499 6 6 0.6 0.7 4 0.626106 0.000000 0.626106 7 7 0.7 0.8 0 NaN NaN NaN 8 8 0.8 0.9 0 NaN NaN NaN 9 9 0.9 1.0 0 NaN NaN NaN, 'recalibration': {'intercept': -0.9072535677540645, 'slope': 0.6715760602070323}, 'flags': {'ece': 'GREEN', 'slope': 'YELLOW', 'intercept': 'RED'}}

Result
   bin  lower  upper  count  mean_p    frac_pos  abs_error
0  0    0.0    0.1    92701  0.040462  0.044045  0.003583
1  1    0.1    0.2    6411   0.131200  0.112463  0.018737
2  2    0.2    0.3    709    0.236655  0.172073  0.064582
3  3    0.3    0.4    135    0.337102  0.133333  0.203769
4  4    0.4    0.5    32     0.435279  0.093750  0.341529
5  5    0.5    0.6    8      0.559499  0.250000  0.309499
6  6    0.6    0.7    4      0.626106  0.000000  0.626106
7  7    0.7    0.8    0      NaN       NaN       NaN
8  8    0.8    0.9    0      NaN       NaN       NaN
9  9    0.9    1.0    0      NaN       NaN       NaN
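The binned ECE formula can be sketched as (the helper name `ece` is mine, not the library implementation):

```python
import numpy as np

def ece(p, y, n_bins=10):
    """Expected calibration error over equal-width probability bins."""
    p, y = np.asarray(p, dtype=float), np.asarray(y, dtype=float)
    edges = np.linspace(0, 1, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # last bin is closed on the right so p = 1 is not dropped
        mask = (p >= lo) & (p < hi) if hi < 1 else (p >= lo) & (p <= hi)
        if mask.any():
            total += mask.mean() * abs(y[mask].mean() - p[mask].mean())
    return float(total)

p_good = np.full(10, 0.3)
y_good = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
e_good = ece(p_good, y_good)   # ~0: predictions match the observed frequency
```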

Score

We need these score refutation tests to:

  • Catch overfitting/leakage: The out-of-sample moment check verifies that the AIPW score averages to ~0 on held-out folds using fold-specific θ and nuisances. If this fails, your effect can be an artifact of leakage or overfit learners rather than a real signal.
  • Verify Neyman orthogonality in practice: The Gateaux-derivative tests (orthogonality_derivatives) check that small, targeted perturbations to the nuisances (g₀, g₁, m) don’t move the score mean. Large |t| values flag miscalibration (e.g., biased propensity or outcome models) that breaks the orthogonality protection DML relies on.
  • Assess finite-sample stability: The influence diagnostics reveal heavy tails (p99/median, kurtosis) and top-influential points. Spiky ψ implies high variance and sensitivity—often due to near-0/1 propensities, poor overlap, or outliers.
  • ATTE-specific risks: For ATT/ATTE, only g₀ and m matter in the score. The added overlap metrics and trim curves show how reliant your estimate is on scarce, high-m controls—common failure mode for ATT.
Result
   metric            value        flag
0  se_plugin         2.336587     NA
1  psi_p99_over_med  64.756207    RED
2  psi_kurtosis      3085.930676  RED
3  max_|t|_g1        0.000000     GREEN
4  max_|t|_g0        0.738424     GREEN
5  max_|t|_m         1.612125     GREEN
6  oos_tstat_fold    -0.000196    GREEN
7  oos_tstat_strict  -0.000196    GREEN

psi_p99_over_med

  • Let \psi_i be the per-unit influence value (EIF score) for your estimator. We look at magnitudes a_i \equiv |\psi_i|.

  • Define the 99th percentile and the median of these magnitudes:

q_{0.99} \equiv \operatorname{Quantile}_{0.99}(a_1, \dots, a_n), \qquad m \equiv \operatorname{median}(a_1, \dots, a_n).

  • The metric is the scale-free tail ratio:
    \boxed{ \texttt{psi\_p99\_over\_med} = \frac{q_{0.99}}{m} }

Why this works (brief):

  • Uses |\psi_i| to ignore sign (only tail size matters).
  • Dividing by the median makes it scale-invariant and robust to a few large values.
  • Large values (\gg 1) mean a small fraction of observations dominates the uncertainty (heavy tails → unstable SE).

Quick read:

  • \approx 1\text{–}5: tails tame/stable
  • \gtrsim 10: caution (heavy tails)
  • \gtrsim 20: likely unstable; check overlap, trim/clamp propensities, or robustify learners.
Result

{'se_plugin': 2.33658747572632, 'kurtosis': 3085.930675575231, 'p99_over_med': 64.7562071038307, 'top_influential': i psi m res_t res_c 0 67446 71506.706325 0.014464 4675.799768 0.000000 1 65005 64940.749295 0.017769 3408.282373 0.000000 2 3618 57509.364010 0.022839 2825.678117 0.000000 3 79652 -57249.417825 0.610355 0.000000 1808.738763 4 96126 53276.215692 0.052542 2509.536113 0.000000 5 48708 46302.067362 0.026377 2213.743705 0.000000 6 87181 43245.822176 0.059510 2246.844037 0.000000 7 56222 35052.876932 0.035378 1578.986396 0.000000 8 57635 34399.580082 0.074460 1855.306056 0.000000 9 47882 32356.549510 0.015467 1369.168501 0.000000}

psi_kurtosis

  • Let \psi_i be the per-unit influence values and define centered residuals
    \tilde\psi_i \equiv \psi_i - \bar\psi, \qquad \bar\psi \equiv \frac{1}{n}\sum_{i=1}^n \psi_i.
  • Sample variance (with Bessel correction):
    s^2 \equiv \frac{1}{n-1}\sum_{i=1}^n \tilde\psi_i^2.
  • Sample 4th central moment:
    \hat\mu_4 \equiv \frac{1}{n}\sum_{i=1}^n \tilde\psi_i^4.
  • The reported metric (raw kurtosis, not excess):
    \boxed{ \texttt{psi\_kurtosis} = \frac{\hat\mu_4}{s^4} }

Interpretation (quick):

  • Normal reference \approx 3 (excess kurtosis = 0).
  • Much larger \Rightarrow heavier tails / more extreme \psi_i outliers.
  • Rules of thumb used in the diagnostics: \ge 10 = caution, \ge 30 = severe.
Result

{'se_plugin': 2.33658747572632, 'kurtosis': 3085.930675575231, 'p99_over_med': 64.7562071038307, 'top_influential': i psi m res_t res_c 0 67446 71506.706325 0.014464 4675.799768 0.000000 1 65005 64940.749295 0.017769 3408.282373 0.000000 2 3618 57509.364010 0.022839 2825.678117 0.000000 3 79652 -57249.417825 0.610355 0.000000 1808.738763 4 96126 53276.215692 0.052542 2509.536113 0.000000 5 48708 46302.067362 0.026377 2213.743705 0.000000 6 87181 43245.822176 0.059510 2246.844037 0.000000 7 56222 35052.876932 0.035378 1578.986396 0.000000 8 57635 34399.580082 0.074460 1855.306056 0.000000 9 47882 32356.549510 0.015467 1369.168501 0.000000}
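The raw-kurtosis formula can be sketched as (the helper name `raw_kurtosis` is mine; the normal and Student-t samples are synthetic):

```python
import numpy as np

def raw_kurtosis(psi):
    """Raw (non-excess) sample kurtosis: mu4_hat / s^4, with Bessel-corrected s^2."""
    psi = np.asarray(psi, dtype=float)
    c = psi - psi.mean()
    s2 = (c ** 2).sum() / (len(psi) - 1)
    mu4 = (c ** 4).mean()
    return float(mu4 / s2 ** 2)

rng = np.random.default_rng(0)
k_normal = raw_kurtosis(rng.standard_normal(100_000))      # close to the normal reference 3
k_heavy = raw_kurtosis(rng.standard_t(df=3, size=100_000)) # far above 3: heavy tails
```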

max_|t|_g1, max_|t|_g0, max_|t|_m

We work with a basis of functions

\{h_b(X)\}_{b=0}^{B-1} \quad \text{(columns of } X_{\text{basis}}\text{; } h_0 \equiv 1 \text{ is the constant).}

Let m_i^\tau \equiv \mathrm{clip}(m_i, \tau, 1 - \tau) be the clipped propensity (guards against division by zero).

ATE case

For each basis function bb, form a sample mean (Gateaux derivative estimator) and its standard error, then compute a t-statistic; finally take the maximum absolute value across bases.


g1g_1 direction

\widehat d_{g_1,b} = \frac{1}{n}\sum_{i=1}^n h_b(X_i)\Big(1 - \frac{D_i}{m_i^\tau}\Big), \qquad \mathrm{se}(\widehat d_{g_1,b}) = \frac{\operatorname{sd}\!\left[h_b(X_i)\left(1 - \frac{D_i}{m_i^\tau}\right)\right]}{\sqrt{n}}.

t_{g_1,b} = \frac{\widehat d_{g_1,b}}{\mathrm{se}(\widehat d_{g_1,b})}, \qquad \boxed{\max|t|_{g_1} = \max_b |t_{g_1,b}|}.

g0g_0 direction

\widehat d_{g_0,b} = \frac{1}{n}\sum_{i=1}^n h_b(X_i)\Big(\frac{1-D_i}{1-m_i^\tau} - 1\Big), \qquad \mathrm{se}(\widehat d_{g_0,b}) = \frac{\operatorname{sd}\!\left[h_b(X_i)\left(\frac{1-D_i}{1-m_i^\tau} - 1\right)\right]}{\sqrt{n}}.

t_{g_0,b} = \frac{\widehat d_{g_0,b}}{\mathrm{se}(\widehat d_{g_0,b})}, \qquad \boxed{\max|t|_{g_0} = \max_b |t_{g_0,b}|}.

mm direction

S_i \equiv \frac{D_i(Y_i - g_{1,i})}{(m_i^\tau)^2} + \frac{(1-D_i)(Y_i - g_{0,i})}{(1 - m_i^\tau)^2}.

\widehat d_{m,b} = -\frac{1}{n}\sum_{i=1}^n h_b(X_i)\,S_i, \qquad \mathrm{se}(\widehat d_{m,b}) = \frac{\operatorname{sd}\!\left[h_b(X_i)\,S_i\right]}{\sqrt{n}}.

t_{m,b} = \frac{\widehat d_{m,b}}{\mathrm{se}(\widehat d_{m,b})}, \qquad \boxed{\max|t|_{m} = \max_b |t_{m,b}|}.

Interpretation: under Neyman orthogonality, each derivative mean \widehat d_{\bullet,b} should be approximately zero, so all |t_{\bullet,b}| should be small. Large \max|t| values flag miscalibration of the corresponding nuisance.


ATTE / ATT case

Let p_1 = \mathbb{E}[D] and define the odds o_i = m_i^\tau / (1 - m_i^\tau).

  • The g_1 derivative is identically zero:
    \Rightarrow \quad \max|t|_{g_1} = 0.
  • g_0 direction
    \widehat d_{g_0,b} = \frac{1}{n}\sum_i h_b(X_i)\,\frac{(1-D_i)\,o_i - D_i}{p_1}, \qquad t_{g_0,b} = \frac{\widehat d_{g_0,b}}{\mathrm{se}(\widehat d_{g_0,b})}, \qquad \max|t|_{g_0} = \max_b |t_{g_0,b}|.
  • m direction
    \widehat d_{m,b} = -\frac{1}{n}\sum_i h_b(X_i)\,\frac{(1-D_i)(Y_i - g_{0,i})}{p_1(1 - m_i^\tau)^2}, \qquad \max|t|_{m} = \max_b |t_{m,b}|.

Rule of thumb: \max|t| \lesssim 2 is “okay”; larger values indicate orthogonality breakdown — fix by recalibrating that nuisance, changing learners or features, or trimming.

Result
    basis  d_g1  se_g1  t_g1  d_g0       se_g0     t_g0       d_m         se_m       t_m
0   0      0.0   0.0    0.0   -0.010980  0.014869  -0.738424  -3.802089   12.314935  -0.308738
1   1      0.0   0.0    0.0   -0.001591  0.013315  -0.119473  -12.823242  23.914527  -0.536211
2   2      0.0   0.0    0.0   -0.001187  0.013974  -0.084931  -21.700341  38.284507  -0.566818
3   3      0.0   0.0    0.0   0.000453   0.013133  0.034514   -15.948031  27.507951  -0.579761
4   4      0.0   0.0    0.0   -0.005118  0.014957  -0.342200  -12.112879  15.384346  -0.787351
5   5      0.0   0.0    0.0   -0.001189  0.013786  -0.086218  -21.536865  29.004829  -0.742527
6   6      0.0   0.0    0.0   -0.001266  0.014546  -0.087069  -13.437196  16.311453  -0.823789
7   7      0.0   0.0    0.0   -0.010544  0.017775  -0.593202  18.305941   11.355159  1.612125
8   8      0.0   0.0    0.0   0.006540   0.017810  0.367205   -9.892706   7.936701   -1.246451
9   9      0.0   0.0    0.0   0.010467   0.015084  0.693912   -9.071946   10.246696  -0.885353
10  10     0.0   0.0    0.0   0.003309   0.015068  0.219632   -6.515310   11.636338  -0.559911
11  11     0.0   0.0    0.0   -0.003109  0.015489  -0.200714  -9.613623   13.281803  -0.723819
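The ATTE-case derivative checks can be sketched as follows (the helper name `atte_ortho_tstats` and the toy DGP are mine, not the library implementation):

```python
import numpy as np

def atte_ortho_tstats(h, d, y, g0, m, tau=0.01):
    """Max |t| over basis columns for the ATTE g0- and m-direction derivative checks.

    h: (n, B) basis matrix (first column typically constant); d, y, g0, m: length-n arrays.
    """
    m_tau = np.clip(m, tau, 1 - tau)
    p1 = d.mean()
    odds = m_tau / (1 - m_tau)

    def max_abs_t(cols):  # cols: (n, B) per-unit derivative contributions
        means = cols.mean(axis=0)
        ses = cols.std(axis=0, ddof=1) / np.sqrt(cols.shape[0])
        return float(np.max(np.abs(means / ses)))

    g0_dir = h * (((1 - d) * odds - d) / p1)[:, None]
    m_dir = h * (-(1 - d) * (y - g0) / (p1 * (1 - m_tau) ** 2))[:, None]
    return max_abs_t(g0_dir), max_abs_t(m_dir)

# Well-specified toy nuisances: both statistics should stay around O(1).
rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)
m = 1 / (1 + np.exp(-(x - 1)))                 # true propensity
d = (rng.uniform(size=n) < m).astype(float)
y = 2 * x + rng.normal(size=n)                 # no treatment effect; g0(x) = 2x is correct
h = np.column_stack([np.ones(n), x])
t_g0, t_m = atte_ortho_tstats(h, d, y, 2 * x, m)
```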

oos_tstat_fold, oos_tstat_strict

Here’s the math behind the two OOS (out-of-sample) moment t-stats used in the diagnostics. Assume K-fold cross-fitting with held-out index sets I_k (of size n_k) and complements R_k.


Step 1 — Leave-fold-out \hat\theta_{-k}

For the moment condition \mathbb{E}[\psi_a(W)\,\theta + \psi_b(W)] = 0, the leave-fold-out estimate used on fold k is

\hat\theta_{-k} = -\frac{\bar\psi_{b,R_k}}{\bar\psi_{a,R_k}}, \qquad \bar\psi_{\cdot,R_k} = \frac{1}{|R_k|}\sum_{i\in R_k}\psi_{\cdot}(W_i).

Step 2 — Held-out scores on fold k

Define the fold-specific held-out score for i \in I_k:

\psi_i^{(k)} = \psi_b(W_i) + \psi_a(W_i)\,\hat\theta_{-k}.

Compute per-fold mean and variance:

\bar\psi_k = \frac{1}{n_k}\sum_{i\in I_k}\psi_i^{(k)}, \qquad s_k^2 = \frac{1}{n_k-1}\sum_{i\in I_k}\!\big(\psi_i^{(k)}-\bar\psi_k\big)^2.

OOS t-stat diagnostics


oos_tstat_fold

A fold-aggregated, variance-weighted t-statistic:

\boxed{ \texttt{oos\_tstat\_fold} = \frac{\displaystyle \sum_{k=1}^K n_k\,\bar\psi_k}{\displaystyle \sqrt{\sum_{k=1}^K n_k\,s_k^2}} }

Intuition: averages fold means and scales by a fold-pooled standard error.


oos_tstat_strict

A “strict” t-stat using every held-out observation directly:

N = \sum_{k=1}^K n_k, \qquad \bar\psi_{\text{all}} = \frac{1}{N}\sum_{k=1}^K \sum_{i\in I_k} \psi_i^{(k)}.

s_{\text{all}}^2 = \frac{1}{N-1} \sum_{k=1}^K \sum_{i\in I_k} \big(\psi_i^{(k)} - \bar\psi_{\text{all}}\big)^2.

\boxed{ \texttt{oos\_tstat\_strict} = \frac{\bar\psi_{\text{all}}}{s_{\text{all}}/\sqrt{N}} }

Intuition: computes a single overall mean and standard error across all held-out scores (often slightly more conservative).


Interpretation

Under a valid design and correct cross-fitting (so that \mathbb{E}[\psi] = 0 out of sample), both statistics are approximately standard normal:

\text{two-sided p-value} \approx 2\big(1 - \Phi(|t|)\big).

Values near 0 indicate that the moment condition holds out of sample. Large |t| suggests overfitting, leakage, or nuisance miscalibration.

Result

{'available': True, 'oos_tstat_fold': -0.00019613538740434564, 'oos_tstat_strict': -0.00019613466293191552, 'p_value_fold': 0.9998435066035664, 'p_value_strict': 0.9998435071816117, 'fold_table': fold n theta_minus_k psi_mean psi_var 0 0 20000 10.107551 9.073026 956612.184500 1 1 20000 12.516796 -2.973198 355816.110740 2 2 20000 11.546563 1.877966 551143.298220 3 3 20000 12.959614 -5.187289 541712.274267 4 4 20000 12.480716 -2.792798 325274.281795}
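The strict OOS t-stat can be sketched as (the helper name `oos_tstat_strict` and the toy linear score are mine):

```python
import numpy as np

def oos_tstat_strict(psi_a, psi_b, folds):
    """Strict OOS moment t-stat: pool held-out scores built with leave-fold-out theta."""
    held_scores = []
    for k in np.unique(folds):
        held, rest = folds == k, folds != k
        theta_mk = -psi_b[rest].mean() / psi_a[rest].mean()   # leave-fold-out estimate
        held_scores.append(psi_b[held] + psi_a[held] * theta_mk)
    psi = np.concatenate(held_scores)
    return float(psi.mean() / (psi.std(ddof=1) / np.sqrt(len(psi))))

# Linear score psi = psi_a * theta + psi_b with psi_a = -1, so theta-hat is a fold mean.
rng = np.random.default_rng(0)
n = 10_000
psi_a = -np.ones(n)
psi_b = 12.0 + rng.normal(size=n)            # "true" theta = 12
folds = np.arange(n) % 5
t = oos_tstat_strict(psi_a, psi_b, folds)    # near 0 when the moment holds OOS
```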

Result

{'params': {'score': 'ATTE', 'trimming_threshold': 0.01, 'normalize_ipw': False}, 'orthogonality_derivatives': basis d_g1 se_g1 t_g1 d_g0 se_g0 t_g0 d_m
0 0 0.0 0.0 0.0 -0.010980 0.014869 -0.738424 -3.802089
1 1 0.0 0.0 0.0 -0.001591 0.013315 -0.119473 -12.823242
2 2 0.0 0.0 0.0 -0.001187 0.013974 -0.084931 -21.700341
3 3 0.0 0.0 0.0 0.000453 0.013133 0.034514 -15.948031
4 4 0.0 0.0 0.0 -0.005118 0.014957 -0.342200 -12.112879
5 5 0.0 0.0 0.0 -0.001189 0.013786 -0.086218 -21.536865
6 6 0.0 0.0 0.0 -0.001266 0.014546 -0.087069 -13.437196
7 7 0.0 0.0 0.0 -0.010544 0.017775 -0.593202 18.305941
8 8 0.0 0.0 0.0 0.006540 0.017810 0.367205 -9.892706
9 9 0.0 0.0 0.0 0.010467 0.015084 0.693912 -9.071946
10 10 0.0 0.0 0.0 0.003309 0.015068 0.219632 -6.515310
11 11 0.0 0.0 0.0 -0.003109 0.015489 -0.200714 -9.613623

se_m t_m
0 12.314935 -0.308738
1 23.914527 -0.536211
2 38.284507 -0.566818
3 27.507951 -0.579761
4 15.384346 -0.787351
5 29.004829 -0.742527
6 16.311453 -0.823789
7 11.355159 1.612125
8 7.936701 -1.246451
9 10.246696 -0.885353
10 11.636338 -0.559911
11 13.281803 -0.723819 , 'influence_diagnostics': {'se_plugin': 2.33658747572632, 'kurtosis': 3085.930675575231, 'p99_over_med': 64.7562071038307, 'top_influential': i psi m res_t res_c 0 67446 71506.706325 0.014464 4675.799768 0.000000 1 65005 64940.749295 0.017769 3408.282373 0.000000 2 3618 57509.364010 0.022839 2825.678117 0.000000 3 79652 -57249.417825 0.610355 0.000000 1808.738763 4 96126 53276.215692 0.052542 2509.536113 0.000000 5 48708 46302.067362 0.026377 2213.743705 0.000000 6 87181 43245.822176 0.059510 2246.844037 0.000000 7 56222 35052.876932 0.035378 1578.986396 0.000000 8 57635 34399.580082 0.074460 1855.306056 0.000000 9 47882 32356.549510 0.015467 1369.168501 0.000000}, 'oos_moment_test': {'available': True, 'oos_tstat_fold': -0.00019613538740434564, 'oos_tstat_strict': -0.00019613466293191552, 'p_value_fold': 0.9998435066035664, 'p_value_strict': 0.9998435071816117, 'fold_table': fold n theta_minus_k psi_mean psi_var 0 0 20000 10.107551 9.073026 956612.184500 1 1 20000 12.516796 -2.973198 355816.110740 2 2 20000 11.546563 1.877966 551143.298220 3 3 20000 12.959614 -5.187289 541712.274267 4 4 20000 12.480716 -2.792798 325274.281795}, 'flags': {'psi_tail_ratio': 'RED', 'psi_kurtosis': 'RED', 'ortho_max_|t|g1': 'GREEN', 'ortho_max|t|g0': 'GREEN', 'ortho_max|t|m': 'GREEN', 'oos_moment': 'GREEN'}, 'thresholds': {'tail_ratio_warn': 10.0, 'tail_ratio_strong': 20.0, 'kurt_warn': 10.0, 'kurt_strong': 30.0, 't_warn': 2.0, 't_strong': 4.0}, 'overall_flag': 'RED', 'meta': {'n': 100000, 'score': 'ATTE', 'used_estimator_psi': True, 'uses_custom_weights': False}, 'summary': metric value flag 0 se_plugin 2.336587 NA 1 psi_p99_over_med 64.756207 RED 2 psi_kurtosis 3085.930676 RED 3 max|t|g1 0.000000 GREEN 4 max|t|g0 0.738424 GREEN 5 max|t|_m 1.612125 GREEN 6 oos_tstat_fold -0.000196 GREEN 7 oos_tstat_strict -0.000196 GREEN}

SUTVA

Result

1.) Are your clients independent, i.e., does no one's outcome depend on another's treatment? 2.) Do all clients have the full window to measure metrics? 3.) Do you measure confounders before treatment and the outcome after? 4.) Do you have a consistent treatment label, e.g., a person who does not receive the treatment is labeled 0?

These assumptions are statistically untestable; they must be justified by the design of the research.

Unconfoundedness

Result
   metric                   value     flag
0  balance_max_smd          0.010351  GREEN
1  balance_frac_violations  0.000000  GREEN

balance_max_smd

For each covariate X_j, the (weighted) standardized mean difference is

\mathrm{SMD}_j = \frac{\big|\mu_{1j} - \mu_{0j}\big|}{\sqrt{\tfrac{1}{2}\big(\sigma_{1j}^2 + \sigma_{0j}^2\big)}}.

Group means and variances are computed under the IPW weights implied by your estimand:

  • ATE: w_{1i} = \tfrac{D_i}{\hat m_i}, \quad w_{0i} = \tfrac{1-D_i}{1-\hat m_i}
  • ATTE: w_{1i} = D_i, \quad w_{0i} = (1-D_i)\,\tfrac{\hat m_i}{1-\hat m_i}

(If normalize=True, each weight vector is divided by its mean.)

Weighted means and variances:

\mu_{gj} = \frac{\sum_i w_{gi} X_{ij}}{\sum_i w_{gi}}, \qquad \sigma_{gj}^2 = \frac{\sum_i w_{gi}(X_{ij} - \mu_{gj})^2}{\sum_i w_{gi}}, \qquad g \in \{0,1\}.

Special cases in the code:

  • If both variances are \approx 0 and |\mu_{1j} - \mu_{0j}| \approx 0 ⇒ \mathrm{SMD}_j = 0
  • If both variances are \approx 0 but the means differ ⇒ \mathrm{SMD}_j = \infty
  • If the denominator is \approx 0 otherwise ⇒ \mathrm{SMD}_j = \text{NaN}

Then

\textbf{balance\_max\_smd} = \max_j \mathrm{SMD}_j,

implemented as a nanmax over the vector of \mathrm{SMD}_j. NaNs are ignored; if any feature produced \infty, the max is \infty.

balance_frac_violations

Let the SMD threshold be \tau (default 0.10). Define the set of finite SMDs:

\mathcal{J} = \{ j : \ \mathrm{SMD}_j \text{ is finite} \}.

Then the fraction of violations is

\textbf{balance\_frac\_violations} = \frac{1}{|\mathcal{J}|} \sum_{j \in \mathcal{J}} \mathbf{1}\{ \mathrm{SMD}_j \ge \tau \}.

So it’s the share of covariates whose weighted SMD exceeds the threshold, computed only over finite SMDs (NaN / Inf are excluded from the denominator).


Quick interpretation

  • Smaller is better. A common rule of thumb is \mathrm{SMD} \le 0.10.
  • balance_max_smd tells you the worst residual imbalance across covariates.
  • balance_frac_violations tells you how many covariates (as a fraction) still exceed the chosen threshold.
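The weighted SMD for a single covariate can be sketched as (the helper name `weighted_smd` is mine):

```python
import numpy as np

def weighted_smd(x, w1, w0):
    """Weighted standardized mean difference for one covariate (formulas above)."""
    def wmoments(x, w):
        mu = np.average(x, weights=w)
        var = np.average((x - mu) ** 2, weights=w)
        return mu, var
    mu1, v1 = wmoments(x, w1)
    mu0, v0 = wmoments(x, w0)
    return float(abs(mu1 - mu0) / np.sqrt(0.5 * (v1 + v0)))

x = np.array([2.0, 3.0, 1.0, 2.0])
w1 = np.array([1.0, 1.0, 0.0, 0.0])   # treated-arm weights
w0 = np.array([0.0, 0.0, 1.0, 1.0])   # control-arm weights
smd = weighted_smd(x, w1, w0)         # |2.5 - 1.5| / sqrt(0.5*(0.25+0.25)) = 2.0
```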

Sensitivity analysis

1) sensitivity_analysis: bias-aware CI

Goal. Start from your estimator \hat\theta with sampling standard error se. Allow a controlled amount of worst-case hidden confounding through three knobs cf_y, cf_d, \rho. Inflate the uncertainty by an additive “max bias”.


Step A — Sampling part

  • Point estimate \hat\theta, standard error se, and z_\alpha for level 1 - \alpha.
  • Usual sampling CI:
    [\,\hat\theta - z_\alpha\,se,\ \hat\theta + z_\alpha\,se\,].

Step B — Confounding geometry

The code pulls sensitivity elements from the fitted IRM:

  • \sigma^2: the asymptotic variance of the estimator’s EIF (so that se = \sqrt{\sigma^2} in the module’s normalization).

  • m_\alpha(i) \ge 0: per-unit weight for the outcome channel (how outcome-model misspecification moves the EIF).

  • r(i) (“riesz_rep”): per-unit weight for the treatment channel (how propensity-model misspecification moves the EIF).

We turn the user’s sensitivity knobs into a quadratic budget for adversarial confounding:

\begin{aligned} a_i &:= \sqrt{2\,m_\alpha(i)}, \\ b_i &:= \begin{cases} |r(i)|, & \text{(default, worst-case sign)} \\ r(i), & \text{(if \texttt{use\_signed\_rr=True})} \end{cases} \\ \text{base}_i &= a_i^2\,cf_y + b_i^2\,cf_d + 2\,\rho\,\sqrt{cf_y\,cf_d}\,a_i b_i \ge 0, \\ \nu^2 &:= \mathbb{E}_n[\text{base}_i]. \end{aligned}
  • cf_y \ge 0: strength of the unobserved outcome disturbance
  • cf_d \ge 0: strength of the unobserved treatment disturbance
  • \rho \in [-1, 1]: their correlation

This \nu^2 is a dimensionless bias multiplier — how sensitive the EIF is to those perturbations.


Step C — Max bias and intervals

Two equivalent forms appear in the code:

\text{max\_bias} = \sqrt{\sigma^2\,\nu^2} = \big(\sqrt{\nu^2}\big)\,se.

Then the module reports:

  • Cofounding bounds for θ\theta:
[θ^max_bias,  θ^+max_bias]. [\,\hat\theta - \text{max\_bias},\; \hat\theta + \text{max\_bias}\,].
  • Bias-aware CI (sampling + confounding, worst-case additive):
[θ^(max_bias+zαse),  θ^+(max_bias+zαse)]. \Big[\,\hat\theta - (\text{max\_bias} + z_\alpha\,se),\; \hat\theta + (\text{max\_bias} + z_\alpha\,se)\,\Big].

(So you’re adding sampling error and the adversarial bias linearly for a conservative envelope.)
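Putting Steps A–C together, a hedged sketch with illustrative numbers, following the formulas exactly as stated above rather than the module's internals:

```python
import math

theta_hat, se, z = 11.92, 2.34, 1.96   # illustrative values
nu2 = 20.88                            # confounding budget from Step B

# max_bias = sqrt(sigma^2 * nu^2) = sqrt(nu^2) * se under se = sqrt(sigma^2)
max_bias = math.sqrt(nu2) * se

bounds_confounding = (theta_hat - max_bias, theta_hat + max_bias)
bias_aware_ci = (theta_hat - (max_bias + z * se),
                 theta_hat + (max_bias + z * se))
print(bounds_confounding)
print(bias_aware_ci)
```

The bias-aware interval always contains the confounding bounds, since it adds the sampling half-width zαsez_\alpha\,se on top of the adversarial bias.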


Notes & edge handling

  • Numeric PSD clamping ensures basei0\text{base}_i \ge 0; ρ\rho is clipped to [1,1][-1,1].
  • If cfy=cfd=0ν2=0cf_y = cf_d = 0 \Rightarrow \nu^2 = 0 \Rightarrow bias-aware CI collapses to the sampling CI.
  • Internally, a delta-method IF for max_bias\text{max\_bias} is
ψmax(i)=σ2ψν2(i)+ν2ψσ2(i)2max_bias, \psi_{\text{max}}(i) = \frac{\sigma^2\,\psi_{\nu^2}(i) + \nu^2\,\psi_{\sigma^2}(i)} {2\,\text{max\_bias}},

matching max_bias=σ2ν2\text{max\_bias} = \sqrt{\sigma^2\nu^2} (used for coherent summaries).
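A tiny numeric sketch of that delta-method influence function, with made-up per-unit IFs standing in for ψσ2(i)\psi_{\sigma^2}(i) and ψν2(i)\psi_{\nu^2}(i):

```python
import math

sigma2, nu2 = 4.0, 9.0
max_bias = math.sqrt(sigma2 * nu2)   # = 6.0 here

psi_sigma2 = [0.5, -0.2, 0.1]        # IF of sigma^2 (illustrative)
psi_nu2    = [0.3,  0.0, -0.4]       # IF of nu^2 (illustrative)

# psi_max(i) = (sigma^2 * psi_nu2(i) + nu^2 * psi_sigma2(i)) / (2 * max_bias)
psi_max = [(sigma2 * pn + nu2 * ps) / (2.0 * max_bias)
           for ps, pn in zip(psi_sigma2, psi_nu2)]
print(psi_max)
```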


2) sensitivity_benchmark: calibrating cfy,cfd,ρcf_y, cf_d, \rho from omitted covariates

Goal. Pick a set ZZ of candidate “omitted” covariates (the benchmarking_set). Refit a short IRM that excludes ZZ and compare it to the long (original) model. Use how well ZZ explains residual variation to derive plausible cfy,cfd,ρcf_y, cf_d, \rho.


Step A — Long vs short estimates

  • Long: θ^long\hat\theta_{\text{long}} (original model).
  • Short: θ^short\hat\theta_{\text{short}} (drop ZZ, same learners/hyperparams).
  • Report Δ=θ^longθ^short\Delta = \hat\theta_{\text{long}} - \hat\theta_{\text{short}}.

Step B — Residuals from the long model

Let g1,g0,m^g_1, g_0, \hat m be the outcome and propensity learners:

ry:=Y(Dg1+(1D)g0),rd:=Dm^.r_y := Y - \big(D g_1 + (1-D) g_0\big), \qquad r_d := D - \hat m.

These are the EIF’s outcome and treatment residual components.
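These residuals are cheap to form once the learners' fitted values are in hand. A toy sketch, with short illustrative arrays standing in for YY, DD, and the fitted g1,g0,m^g_1, g_0, \hat m:

```python
y     = [3.0, 1.0, 4.0, 2.0]
d     = [1.0, 0.0, 1.0, 0.0]
g1    = [2.5, 1.2, 3.8, 2.1]    # outcome model under treatment
g0    = [1.0, 0.9, 2.0, 1.8]    # outcome model under control
m_hat = [0.6, 0.3, 0.7, 0.2]    # propensity model

# r_y = Y - (D*g1 + (1-D)*g0),  r_d = D - m_hat
r_y = [yi - (di * g1i + (1 - di) * g0i)
       for yi, di, g1i, g0i in zip(y, d, g1, g0)]
r_d = [di - mi for di, mi in zip(d, m_hat)]
print(r_y)   # outcome residuals
print(r_d)   # treatment residuals
```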


Step C — How much of each residual does ZZ explain?

Regress ryr_y on ZZ and rdr_d on ZZ (unweighted OLS; ATT case uses ATT weights):

  • Obtain Ry2R^2_y and Rd2R^2_d.
  • Convert to signal-to-noise ratios (the “strength” of confounding channels):
cfy=Ry21Ry2,cfd=Rd21Rd2. cf_y = \frac{R^2_y}{1 - R^2_y}, \qquad cf_d = \frac{R^2_d}{1 - R^2_d}.

(These are the same R2/(1R2)R^2 / (1 - R^2) maps used in modern partial-R2R^2 robustness frameworks.)

Compute the correlation between the fitted pieces from those two regressions:

ρ=corr ⁣(r^y(Z), r^d(Z)),\rho = \operatorname{corr}\!\big(\widehat r_y(Z),\ \widehat r_d(Z)\big),

weighted for ATT when applicable, then clipped to [1,1][-1,1].
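This step can be sketched end-to-end with plain OLS, assuming numpy is available. Synthetic data below; this mirrors the recipe described above (unweighted case), not the library's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
Z = rng.normal(size=(n, 2))                       # candidate omitted covariates
r_y = 0.5 * Z[:, 0] + rng.normal(size=n)          # outcome residual (toy)
r_d = 0.3 * Z[:, 0] - 0.2 * Z[:, 1] + rng.normal(size=n)

X = np.column_stack([np.ones(n), Z])              # add an intercept

def ols_fit(resid):
    """OLS of a residual on Z; return fitted values and R^2."""
    beta, *_ = np.linalg.lstsq(X, resid, rcond=None)
    fitted = X @ beta
    r2 = 1 - np.sum((resid - fitted) ** 2) / np.sum((resid - resid.mean()) ** 2)
    return fitted, r2

fit_y, r2_y = ols_fit(r_y)
fit_d, r2_d = ols_fit(r_d)

cf_y = r2_y / (1 - r2_y)                          # R^2 / (1 - R^2) maps
cf_d = r2_d / (1 - r2_d)
rho = float(np.clip(np.corrcoef(fit_y, fit_d)[0, 1], -1.0, 1.0))
print(cf_y, cf_d, rho)
```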


Outputs

A one-row DataFrame (indexed by the treatment name) with

{cfy, cfd, ρ, θ^long, θ^short, Δ}.\{\, cf_y,\ cf_d,\ \rho,\ \hat\theta_{\text{long}},\ \hat\theta_{\text{short}},\ \Delta \,\}.

You can pass cfy,cfd,ρcf_y, cf_d, \rho straight into sensitivity_analysis to get the associated bias-aware interval. Intuitively, this calibrates how strong hidden confounding would need to be, using the concrete, observed proxy ZZ.


How to read them together

  1. Use sensitivity_benchmark with a plausible omitted set ZZ to derive cfy,cfd,ρcf_y, cf_d, \rho and observe the actual estimate shift Δ\Delta.

  2. Plug those cfy,cfd,ρcf_y, cf_d, \rho into sensitivity_analysis to get:

max_bias=ν2se,Bias-aware CI=θ^±(max_bias+zαse). \text{max\_bias} = \sqrt{\nu^2}\,se, \qquad \text{Bias-aware CI} = \hat\theta \pm (\text{max\_bias} + z_\alpha\,se).

Small cfcf values (or ρ0\rho \approx 0) ⇒ tiny ν2\nu^2 ⇒ bias-aware CI near the sampling CI. Large cfcf values and ρ1|\rho|\approx 1 widen it, reflecting stronger plausible hidden confounding.
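A tiny numeric illustration of this takeaway (illustrative values, reusing the formulas above):

```python
import math

theta, se, z = 11.92, 2.34, 1.96   # illustrative values

def half_width(nu2):
    """Half-width of the bias-aware CI: max_bias + z * se."""
    return math.sqrt(nu2) * se + z * se

print(half_width(0.0))    # cf_y = cf_d = 0 -> just the sampling half-width
print(half_width(20.0))   # a large confounding budget widens it substantially
```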

Result

{'theta': 11.922156671701723, 'se': 2.3365874757263203, 'alpha': 0.05, 'z': 1.959963984540054, 'H0': 0.0, 'sampling_ci': (7.342529372550778, 16.501783970852667), 'theta_bounds_cofounding': (3.455971789602664, 20.38834155380078), 'bias_aware_ci': (-1.1816579084569545, 25.05171921299312), 'max_bias_base': 838.1523033278067, 'max_bias': 8.466184882099059, 'bound_width': 8.466184882099059, 'sigma2': 33645.29014705864, 'nu2': 20.879572757529697, 'rv': 0.014024838096781095, 'rva': 0.008684298340535173, 'params': {'r2_y': 0.01, 'r2_d': 0.01, 'rho': 1.0, 'use_signed_rr': False}}

Result
|   | r2_y | r2_d | rho | theta_long | theta_short | delta |
| --- | --- | --- | --- | --- | --- | --- |
| d | 0.000208 | 0.04634 | 1.0 | 11.922157 | 13.553909 | -1.631752 |