Research · 24 min read

Refutation of DML IRM inference

Automated conversion of refutation_flow.ipynb

We will use a DGP from Causalis. Read more at https://causalis.causalcraft.com/articles/generate_obs_hte_26_rich

Result
   y          d    tenure_months  avg_sessions_week  spend_last_month  age_years  income_monthly  prior_purchases_12m  support_tickets_90d  premium_user  mobile_user  urban_resident  referred_user  m         m_obs     tau_link  g0         g1         cate
0  0.000000   0.0  28.814654      1.0                77.936767         50.234101  1926.698301     1.0                  2.0                  1.0           1.0          1.0             0.0            0.045453  0.045453  0.089095  8.137981   9.142395   1.004414
1  80.099611  1.0  25.913345      3.0                53.777740         28.115859  5104.271509     3.0                  0.0                  1.0           1.0          0.0             1.0            0.041514  0.041514  0.246679  60.459257  78.817307  18.358049
2  6.400482   1.0  24.969929      10.0               134.764322        22.907062  5267.938255     8.0                  3.0                  0.0           1.0          1.0             0.0            0.052593  0.052593  0.162968  7.712855   9.138577   1.425723
3  2.788238   0.0  40.655089      5.0                59.517074         31.970490  6597.327018     3.0                  2.0                  1.0           1.0          1.0             0.0            0.036221  0.036221  0.188755  25.386510  31.159932  5.773422
4  0.000000   0.0  18.560899      3.0                74.370930         39.237248  4930.009628     5.0                  1.0                  1.0           1.0          0.0             0.0            0.036343  0.036343  0.174757  15.359250  18.600227  3.240977
Result

CausalData(df=(100000, 13), treatment='d', outcome='y', confounders=['tenure_months', 'avg_sessions_week', 'spend_last_month', 'age_years', 'income_monthly', 'prior_purchases_12m', 'support_tickets_90d', 'premium_user', 'mobile_user', 'urban_resident', 'referred_user'])

Inference

Result
field            value
estimand         ATTE
model            IRM
value            11.9222 (ci_abs: 7.3425, 16.5018)
value_relative   25.5928 (ci_rel: 15.5972, 35.5885)
alpha            0.0500
p_value          0.0000
is_significant   True
n_treated        4949
n_control        95051
treatment_mean   58.5062
control_mean     76.0871
time             2026-02-20

Overlap

What “overlap/positivity” means

Binary treatment T \in \{0,1\}: for all confounder values x in your target population,

0 < e(x) := P(T = 1 \mid X = x) < 1,

often strengthened to strong positivity: there exists an \varepsilon > 0 such that

\varepsilon \le e(x) \le 1 - \varepsilon \quad \text{almost surely.}

Why it matters

  • Identification: Overlap + unconfoundedness are the two pillars that identify causal effects from observational data. Without overlap, the effect is not identified — you must extrapolate into regions where one treatment arm is never observed.

  • Estimation stability: IPW/DR estimators use weights

w_1 = \frac{D}{e(X)}, \qquad w_0 = \frac{1 - D}{1 - e(X)}.

If e(X) is near 0 or 1, these weights explode, causing huge variance and fragile estimates.

  • Target population: With trimming or restriction, you may change who the effect describes — e.g., ATE on the region of common support, not on the full population.
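The weight formulas above can be sketched in a few lines of numpy (the helper name `ipw_weights` is mine):

```python
import numpy as np

def ipw_weights(d, e):
    """ATE-style inverse-propensity weights: w1 = D/e(X), w0 = (1-D)/(1-e(X))."""
    d = np.asarray(d, dtype=float)
    e = np.asarray(e, dtype=float)
    return d / e, (1 - d) / (1 - e)

# One treated unit with a near-zero propensity dominates the treated arm:
d = np.array([1, 1, 0, 0])
e = np.array([0.5, 0.001, 0.5, 0.5])
w1, w0 = ipw_weights(d, e)
# w1 = [2, 1000, 0, 0] -- a single unit carries almost all the weight,
# which is exactly the variance explosion described above.
```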

Summary of overlap diagnostics

Result
    metric               value      flag
0   edge_0.01_below      0.018540   GREEN
1   edge_0.01_above      0.000000   GREEN
2   edge_0.02_below      0.112650   RED
3   edge_0.02_above      0.000000   RED
4   KS                   0.190677   GREEN
5   AUC                  0.626645   GREEN
6   ESS_treated_ratio    0.464357   GREEN
7   ESS_control_ratio    0.997961   GREEN
8   tails_w1_q99/med     5.191133   YELLOW
9   tails_w0_q99/med     1.184178   GREEN
10  ATT_identity_relerr  0.012196   GREEN
11  clip_m_total         0.018540   GREEN
12  calib_ECE            0.005415   GREEN
13  calib_slope          0.671576   YELLOW
14  calib_intercept      -0.907254  RED

edge_mass

edge_0.01_below, edge_0.01_above, edge_0.02_below, edge_0.02_above are the shares of units whose propensity falls below \varepsilon or above 1 - \varepsilon, for \varepsilon = 0.01 and \varepsilon = 0.02.

To keep in mind: DML IRM clips propensities to the interval [0.02, 0.98].

Large shares here are dangerous for estimation: the corresponding IPW weights explode.

Flags in Causalis:

For \varepsilon = 0.01: YELLOW if either side ≥ 0.02 (2%), RED if ≥ 0.05 (5%).

For \varepsilon = 0.02: YELLOW if either side ≥ 0.05 (5%), RED if ≥ 0.10 (10%).

Result

{'share_below_001': 0.01854, 'share_above_001': 0.0, 'share_below_002': 0.11265, 'share_above_002': 0.0, 'min_m': 0.00026454670780584426, 'max_m': 0.6376540197829689}
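A minimal sketch of how these edge-mass shares can be computed from estimated propensities (the helper name `edge_mass` is mine):

```python
import numpy as np

def edge_mass(m, eps):
    """Shares of units with propensity below eps or above 1 - eps."""
    m = np.asarray(m, dtype=float)
    return float(np.mean(m < eps)), float(np.mean(m > 1 - eps))

m = np.array([0.005, 0.015, 0.20, 0.50, 0.985, 0.995])
below_001, above_001 = edge_mass(m, 0.01)   # 1/6 below, 1/6 above
below_002, above_002 = edge_mass(m, 0.02)   # 2/6 below, 2/6 above
```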

ks - Kolmogorov–Smirnov statistic

Here KS is the two-sample Kolmogorov–Smirnov statistic comparing the distributions of the propensities for treated vs control:

D = \max_t |\hat F_A(t) - \hat F_B(t)|

Interpretation:

  • D = 0: identical distributions (perfect overlap).
  • D = 1: complete separation (no overlap).
  • Your value KS = 0.1907 means there exists a threshold t such that the share of treated with m \le t differs from the share of controls with m \le t by ~19 percentage points. That is below the RED threshold (these diagnostics mark RED when KS > 0.35), so it’s flagged GREEN: treatment assignment is only moderately predictable from covariates, consistent with workable overlap.
Result

0.19067652887832237
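The KS statistic above is straightforward to compute directly from the two propensity samples (the helper name `ks_statistic` is mine; it should agree with `scipy.stats.ks_2samp`):

```python
import numpy as np

def ks_statistic(m_treated, m_control):
    """Two-sample KS statistic: max gap between the two empirical CDFs."""
    grid = np.sort(np.concatenate([m_treated, m_control]))
    f_t = np.searchsorted(np.sort(m_treated), grid, side="right") / len(m_treated)
    f_c = np.searchsorted(np.sort(m_control), grid, side="right") / len(m_control)
    return float(np.max(np.abs(f_t - f_c)))

a = np.array([0.1, 0.2, 0.3])
b = np.array([0.7, 0.8, 0.9])
d_same = ks_statistic(a, a)   # 0.0: identical distributions, perfect overlap
d_full = ks_statistic(a, b)   # 1.0: disjoint supports, complete separation
```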

AUC

Probability definition (most intuitive)

\text{AUC} = \Pr\big(s^+ > s^-\big) + \tfrac12 \Pr\big(s^+ = s^-\big),

where s^+ is a score from a random positive and s^- from a random negative. So AUC is the fraction of all n_1 n_0 positive–negative pairs that are correctly ordered by the score (ties get half credit).

Rank / Mann–Whitney formulation

  1. Rank all scores together (ascending). If there are ties, assign average ranks within each tied block.
  2. Let R_1 be the sum of ranks for the positives.
  3. Compute the Mann–Whitney U statistic for positives:
     U_1 = R_1 - \frac{n_1(n_1+1)}{2}.
  4. Convert to AUC by normalizing:
     \boxed{\text{AUC} = \frac{U_1}{n_1 n_0} = \frac{R_1 - \frac{n_1(n_1+1)}{2}}{n_1 n_0}}

This is exactly what your function returns (with stable sorting and tie-averaged ranks).
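The rank/Mann–Whitney recipe above can be sketched as (the helper name `rank_auc` is mine, not the library implementation):

```python
import numpy as np

def rank_auc(scores, labels):
    """AUC via the Mann-Whitney U statistic with tie-averaged ranks."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(scores, kind="stable")
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    for v in np.unique(scores):          # average ranks within tied blocks
        tie = scores == v
        ranks[tie] = ranks[tie].mean()
    n1 = labels.sum()
    n0 = len(labels) - n1
    u1 = ranks[labels == 1].sum() - n1 * (n1 + 1) / 2
    return float(u1 / (n1 * n0))

auc = rank_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])   # 3 of 4 pairs ordered correctly -> 0.75
```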

ROC-integral view (equivalent)

If \text{TPR}(t) and \text{FPR}(t) trace the ROC curve as the threshold t moves,

\text{AUC} = \int_0^1 \text{TPR}\big(\text{FPR}^{-1}(u)\big)\,du,

i.e., the geometric area under the ROC curve.

Properties you should remember

  • Range: 0 \le \text{AUC} \le 1; 0.5 = random ranking; 1 = perfect separation.
  • Symmetry: \text{AUC}(s, y) = 1 - \text{AUC}(s, 1-y).
  • Monotone invariance: any strictly increasing transform f leaves AUC unchanged (only ranks matter).
  • Ties: averaged ranks automatically add the \tfrac12 \Pr(s^+ = s^-) term.

In the propensity/overlap context

  • A higher AUC means treatment (D) is more predictable from covariates (bad for overlap/positivity).
  • For good overlap you actually want AUC close to 0.5.
Result

0.6266454580150003

ESS_treated_ratio

Weights used

For ATE-style IPW, the treated-arm weights are

w_i = \frac{D_i}{m_i} \quad\Rightarrow\quad w_i = \begin{cases} 1/m_i, & D_i = 1 \\ 0, & D_i = 0 \end{cases}

so on the treated subset \{i : D_i = 1\} the weights are simply 1/m_i.

Effective sample size (ESS)

Given the treated-arm weights w_1, \ldots, w_{n_1} (only for D = 1),

\mathrm{ESS} = \frac{\left(\sum_{i=1}^{n_1} w_i\right)^2}{\sum_{i=1}^{n_1} w_i^2}.

This is exactly what _ess(w) computes.

  • If all treated weights are equal, ESS = n_1 (full efficiency).
  • If a few weights dominate, ESS drops (information concentrated in few units).

The reported metric

\boxed{ \mathrm{ESS}_{\text{treated ratio}} = \frac{\mathrm{ESS}}{n_1} = \frac{\left(\sum_i w_i\right)^2}{n_1 \sum_i w_i^2} }

This lies in (0, 1]. Near 1 ⇒ well-behaved weights; near 0 ⇒ severe instability.

Why it reflects overlap

When propensities m_i approach 0 for treated units, the weights 1/m_i explode → large CV → low ESS_treated_ratio. Hence this metric is a direct, quantitative read on how much usable information remains in the treated group after IPW.

Result

{'ess_w1': 2298.1025570851816, 'ess_w0': 94857.16349432156, 'ess_ratio_w1': 0.4643569523308106, 'ess_ratio_w0': 0.9979607105061657}
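The ESS formula can be sketched in a few lines (the helper name `ess_ratio` is mine):

```python
import numpy as np

def ess_ratio(w):
    """Kish effective sample size of weights w, as a fraction of len(w)."""
    w = np.asarray(w, dtype=float)
    ess = w.sum() ** 2 / (w ** 2).sum()
    return float(ess / len(w))

r_equal = ess_ratio([1, 1, 1, 1])      # 1.0: equal weights, full efficiency
r_skewed = ess_ratio([100, 1, 1, 1])   # far below 1: one unit dominates
```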

tails_w1_q99/med

\boxed{ \texttt{tails\_w1\_q99/med} = \frac{Q_{0.99}(W_1)}{\mathrm{median}(W_1)} }

Interpretation

  • It’s a tail-heaviness index for treated weights: how large the 99th-percentile weight is relative to a typical (median) weight.
  • Scale-invariant: if you re-scale weights (e.g., Hájek normalization), both numerator and denominator scale equally, so the ratio is unchanged.
  • Bigger \Rightarrow heavier right tail \Rightarrow more variance inflation for IPW (since the variance depends on large w_i^2). It typically coincides with a low \mathrm{ESS}_{\text{treated ratio}}.

Edge cases & thresholds

  • If \mathrm{median}(W_1) = 0 or undefined, the ratio is not meaningful (your code returns “NA” in that case; with positive treated weights this is rare).
  • Defaults: YELLOW if any of \{q95/\mathrm{med},\, q99/\mathrm{med},\, q999/\mathrm{med},\, \max/\mathrm{med}\} exceeds 10; RED if any exceeds 100. tails_w1_q99/med is one of these checks, focusing specifically on the 99th percentile.

Quick example

If \mathrm{median}(W_1) = 1.2 and Q_{0.99}(W_1) = 46.8, then

\texttt{tails\_w1\_q99/med} = \frac{46.8}{1.2} = 39,

indicating heavy tails and a likely unstable ATE IPW.

Result

{'w1': {'q95': 54.954191239648, 'q99': 103.63534845395834, 'q999': 326.6877024294826, 'max': 1080.1560842371114, 'median': 19.963916032820617}, 'w0': {'q95': 1.126007261158915, 'q99': 1.2305275677823044, 'q999': 1.4996542994796511, 'max': 2.759793276583444, 'median': 1.0391406792873217}}
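A quick sketch of the tail-ratio computation (the helper name `tail_ratio` is mine; the lognormal weights are synthetic):

```python
import numpy as np

def tail_ratio(w, q=0.99):
    """Quantile-to-median ratio of the weight distribution (NaN if median is 0)."""
    w = np.asarray(w, dtype=float)
    med = np.median(w)
    return float(np.quantile(w, q) / med) if med > 0 else float("nan")

# Heavy-tailed synthetic weights: the q99/median ratio clears the warn threshold of 10.
rng = np.random.default_rng(0)
w = rng.lognormal(mean=0.0, sigma=1.5, size=100_000)
ratio = tail_ratio(w)
```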

ATT_identity_relerr

With estimated propensities m_i = \hat m(X_i) and D_i \in \{0,1\}:

  • Left-hand side (controls’ odds sum):
    \text{LHS} = \sum_{i=1}^n (1 - D_i)\,\frac{m_i}{1 - m_i}.
  • Right-hand side (treated count):
    \text{RHS} = \sum_{i=1}^n D_i = n_1.

If \hat m \approx m and overlap is OK, LHS \approx RHS.

You report the relative error:

\boxed{ \texttt{ATT\_identity\_relerr} = \frac{\big|\mathrm{LHS} - \mathrm{RHS}\big|}{\mathrm{RHS}} }

(when n_1 > 0; otherwise it is set to \infty).

How to read the number

  • Small relerr (e.g., \le 5\%) ⇒ propensities are reasonably calibrated (especially on the control side) and ATT weights won’t be wildly off in total mass.
  • Large relerr ⇒ possible miscalibration of \hat m (e.g., over/underestimation for controls), poor overlap (many controls with m_i \to 1 inflating m_i/(1 - m_i)), or clipping/trimming effects.

Your default flags (same as in the code):

  • GREEN if relerr \le 0.05
  • YELLOW if 0.05 < relerr \le 0.10
  • RED if relerr > 0.10

Quick intuition

The term m/(1-m) is the odds of treatment. Summing it over controls should reconstruct the treated count. If it doesn’t, either the odds are off (propensity miscalibration) or the data lack support where you need it—both are red flags for ATT-IPW stability.

Result

{'lhs_sum': 4888.639934058683, 'rhs_sum': 4949.0, 'rel_err': 0.012196416637970673}
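The identity check above takes only a few lines to reproduce (the helper name `att_identity_relerr` and the simulated propensities are mine):

```python
import numpy as np

def att_identity_relerr(d, m):
    """Relative error of the ATT identity: sum of control odds vs treated count."""
    d = np.asarray(d, dtype=float)
    m = np.asarray(m, dtype=float)
    lhs = np.sum((1 - d) * m / (1 - m))
    rhs = d.sum()
    return float(abs(lhs - rhs) / rhs) if rhs > 0 else float("inf")

err_exact = att_identity_relerr([1, 0], [0.5, 0.5])   # odds of the one control = 1 = n_1

# With well-calibrated propensities the identity holds up to sampling noise:
rng = np.random.default_rng(1)
m = rng.uniform(0.05, 0.5, size=50_000)
d = (rng.uniform(size=m.size) < m).astype(float)
err = att_identity_relerr(d, m)   # small, typically well inside the GREEN band
```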

clip_m_total

See edge_mass above: clip_m_total is the total share of propensities clipped at either edge.

Result

{'m_clip_lower': 0.01854, 'm_clip_upper': 0.0}

calib_ECE, calib_slope, calib_intercept

calib_ECE = 0.0054 (GREEN)

Math: with 10 equal-width bins,

\text{ECE} = \sum_{k=1}^{10} \frac{n_k}{n}\,\big|\bar y_k - \bar p_k\big|

(weighted average gap between the observed rate \bar y_k and the mean prediction \bar p_k per bin). Result: ~0.5% average miscalibration → overall probabilities track outcomes well. Note the biggest bin errors sit in the sparse upper bins (e.g., 0.6–0.7 with abs_error ≈ 0.626), but those bins are tiny (4 of 100,000 units), so ECE stays low.

calib_slope (β) = 0.672 (YELLOW)

Math (logistic recalibration):

\Pr(D = 1 \mid p) = \sigma\big(\alpha + \beta\,\mathrm{logit}(p)\big).

Interpretation: \beta < 1 ⇒ predictions are over-confident (too extreme); the optimal recalibration flattens them toward 0.5. Here \beta \approx 0.67 is far enough from 1 to earn a YELLOW flag.

calib_intercept (α) = −0.907 (RED)

Math: same model as above; \alpha is a vertical shift on the log-odds scale. Interpretation: a negative \alpha nudges probabilities downward overall (your model is, on average, too high), consistent with bins like 0.5–0.6 where \bar p_k > \bar y_k. At −0.907 the shift is large enough for a RED flag.

ECE is comfortably GREEN, but the slope (YELLOW) and especially the intercept (RED) point to systematic overprediction of the propensity; recalibrating \hat m is worth considering despite the low average error.

Result

{'n': 100000, 'n_bins': 10, 'auc': 0.6266454580150003, 'brier': 0.046645568204131085, 'ece': 0.005414555766917116, 'reliability_table': bin lower upper count mean_p frac_pos abs_error 0 0 0.0 0.1 92701 0.040462 0.044045 0.003583 1 1 0.1 0.2 6411 0.131200 0.112463 0.018737 2 2 0.2 0.3 709 0.236655 0.172073 0.064582 3 3 0.3 0.4 135 0.337102 0.133333 0.203769 4 4 0.4 0.5 32 0.435279 0.093750 0.341529 5 5 0.5 0.6 8 0.559499 0.250000 0.309499 6 6 0.6 0.7 4 0.626106 0.000000 0.626106 7 7 0.7 0.8 0 NaN NaN NaN 8 8 0.8 0.9 0 NaN NaN NaN 9 9 0.9 1.0 0 NaN NaN NaN, 'recalibration': {'intercept': -0.9072535677540645, 'slope': 0.6715760602070323}, 'flags': {'ece': 'GREEN', 'slope': 'YELLOW', 'intercept': 'RED'}}

Result
   bin  lower  upper  count  mean_p    frac_pos  abs_error
0  0    0.0    0.1    92701  0.040462  0.044045  0.003583
1  1    0.1    0.2    6411   0.131200  0.112463  0.018737
2  2    0.2    0.3    709    0.236655  0.172073  0.064582
3  3    0.3    0.4    135    0.337102  0.133333  0.203769
4  4    0.4    0.5    32     0.435279  0.093750  0.341529
5  5    0.5    0.6    8      0.559499  0.250000  0.309499
6  6    0.6    0.7    4      0.626106  0.000000  0.626106
7  7    0.7    0.8    0      NaN       NaN       NaN
8  8    0.8    0.9    0      NaN       NaN       NaN
9  9    0.9    1.0    0      NaN       NaN       NaN
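The binned ECE formula can be sketched as (the helper name `ece` is mine, not the library implementation):

```python
import numpy as np

def ece(p, y, n_bins=10):
    """Expected calibration error over equal-width probability bins."""
    p, y = np.asarray(p, dtype=float), np.asarray(y, dtype=float)
    edges = np.linspace(0, 1, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # last bin is closed on the right so p = 1 is not dropped
        mask = (p >= lo) & (p < hi) if hi < 1 else (p >= lo) & (p <= hi)
        if mask.any():
            total += mask.mean() * abs(y[mask].mean() - p[mask].mean())
    return float(total)

p_good = np.full(10, 0.3)
y_good = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
e_good = ece(p_good, y_good)   # ~0: predictions match the observed frequency
```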

Score

We need these score refutation tests to:

  • Catch overfitting/leakage: The out-of-sample moment check verifies that the AIPW score averages to ~0 on held-out folds using fold-specific θ and nuisances. If this fails, your effect can be an artifact of leakage or overfit learners rather than a real signal.
  • Verify Neyman orthogonality in practice: The Gateaux-derivative tests (orthogonality_derivatives) check that small, targeted perturbations to the nuisances (g₀, g₁, m) don’t move the score mean. Large |t| values flag miscalibration (e.g., biased propensity or outcome models) that breaks the orthogonality protection DML relies on.
  • Assess finite-sample stability: The influence diagnostics reveal heavy tails (p99/median, kurtosis) and top-influential points. Spiky ψ implies high variance and sensitivity—often due to near-0/1 propensities, poor overlap, or outliers.
  • ATTE-specific risks: For ATT/ATTE, only g₀ and m matter in the score. The added overlap metrics and trim curves show how reliant your estimate is on scarce, high-m controls—common failure mode for ATT.
Result
   metric            value        flag
0  se_plugin         2.336587     NA
1  psi_p99_over_med  64.756207    RED
2  psi_kurtosis      3085.930676  RED
3  max_|t|_g1        0.000000     GREEN
4  max_|t|_g0        0.738424     GREEN
5  max_|t|_m         1.612125     GREEN
6  oos_tstat_fold    -0.000196    GREEN
7  oos_tstat_strict  -0.000196    GREEN

psi_p99_over_med

  • Let \psi_i be the per-unit influence value (EIF score) for your estimator. We look at magnitudes a_i \equiv |\psi_i|.

  • Define the 99th percentile and the median of these magnitudes:

q_{0.99} \equiv \operatorname{Quantile}_{0.99}(a_1, \dots, a_n), \qquad m \equiv \operatorname{median}(a_1, \dots, a_n).

  • The metric is the scale-free tail ratio:
    \boxed{ \texttt{psi\_p99\_over\_med} = \frac{q_{0.99}}{m} }

Why this works (brief):

  • Uses |\psi_i| to ignore sign (only tail size matters).
  • Dividing by the median makes it scale-invariant and robust to a few large values.
  • Large values (\gg 1) mean a small fraction of observations dominates the uncertainty (heavy tails → unstable SE).

Quick read:

  • \approx 1\text{–}5: tails tame/stable
  • \gtrsim 10: caution (heavy tails)
  • \gtrsim 20: likely unstable; check overlap, trim/clamp propensities, or robustify learners.
Result

{'se_plugin': 2.33658747572632, 'kurtosis': 3085.930675575231, 'p99_over_med': 64.7562071038307, 'top_influential': i psi m res_t res_c 0 67446 71506.706325 0.014464 4675.799768 0.000000 1 65005 64940.749295 0.017769 3408.282373 0.000000 2 3618 57509.364010 0.022839 2825.678117 0.000000 3 79652 -57249.417825 0.610355 0.000000 1808.738763 4 96126 53276.215692 0.052542 2509.536113 0.000000 5 48708 46302.067362 0.026377 2213.743705 0.000000 6 87181 43245.822176 0.059510 2246.844037 0.000000 7 56222 35052.876932 0.035378 1578.986396 0.000000 8 57635 34399.580082 0.074460 1855.306056 0.000000 9 47882 32356.549510 0.015467 1369.168501 0.000000}

psi_kurtosis

  • Let \psi_i be the per-unit influence values and define centered residuals
    \tilde\psi_i \equiv \psi_i - \bar\psi, \qquad \bar\psi \equiv \frac{1}{n}\sum_{i=1}^n \psi_i.
  • Sample variance (with Bessel correction):
    s^2 \equiv \frac{1}{n-1}\sum_{i=1}^n \tilde\psi_i^2.
  • Sample 4th central moment:
    \hat\mu_4 \equiv \frac{1}{n}\sum_{i=1}^n \tilde\psi_i^4.
  • The reported metric (raw kurtosis, not excess):
    \boxed{ \texttt{psi\_kurtosis} = \frac{\hat\mu_4}{s^4} }

Interpretation (quick):

  • Normal reference \approx 3 (excess kurtosis = 0).
  • Much larger \Rightarrow heavier tails / more extreme \psi_i outliers.
  • Rules of thumb used in the diagnostics: \ge 10 = caution, \ge 30 = severe.
Result

{'se_plugin': 2.33658747572632, 'kurtosis': 3085.930675575231, 'p99_over_med': 64.7562071038307, 'top_influential': i psi m res_t res_c 0 67446 71506.706325 0.014464 4675.799768 0.000000 1 65005 64940.749295 0.017769 3408.282373 0.000000 2 3618 57509.364010 0.022839 2825.678117 0.000000 3 79652 -57249.417825 0.610355 0.000000 1808.738763 4 96126 53276.215692 0.052542 2509.536113 0.000000 5 48708 46302.067362 0.026377 2213.743705 0.000000 6 87181 43245.822176 0.059510 2246.844037 0.000000 7 56222 35052.876932 0.035378 1578.986396 0.000000 8 57635 34399.580082 0.074460 1855.306056 0.000000 9 47882 32356.549510 0.015467 1369.168501 0.000000}
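The raw-kurtosis formula can be sketched as (the helper name `raw_kurtosis` is mine; the normal and Student-t samples are synthetic):

```python
import numpy as np

def raw_kurtosis(psi):
    """Raw (non-excess) sample kurtosis: mu4_hat / s^4, with Bessel-corrected s^2."""
    psi = np.asarray(psi, dtype=float)
    c = psi - psi.mean()
    s2 = (c ** 2).sum() / (len(psi) - 1)
    mu4 = (c ** 4).mean()
    return float(mu4 / s2 ** 2)

rng = np.random.default_rng(0)
k_normal = raw_kurtosis(rng.standard_normal(100_000))      # close to the normal reference 3
k_heavy = raw_kurtosis(rng.standard_t(df=3, size=100_000)) # far above 3: heavy tails
```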

max_|t|_g1, max_|t|_g0, max_|t|_m

We work with a basis of functions

\{h_b(X)\}_{b=0}^{B-1} \quad \text{(columns of } X_{\text{basis}}\text{; } h_0 \equiv 1 \text{ is the constant).}

Let m_i^\tau \equiv \mathrm{clip}(m_i, \tau, 1 - \tau) be the clipped propensity (guards against division by zero).

ATE case

For each basis function bb, form a sample mean (Gateaux derivative estimator) and its standard error, then compute a t-statistic; finally take the maximum absolute value across bases.


g1g_1 direction

\widehat d_{g_1,b} = \frac{1}{n}\sum_{i=1}^n h_b(X_i)\Big(1 - \frac{D_i}{m_i^\tau}\Big), \qquad \mathrm{se}(\widehat d_{g_1,b}) = \frac{\operatorname{sd}\!\left[h_b(X_i)\left(1 - \frac{D_i}{m_i^\tau}\right)\right]}{\sqrt{n}}.

t_{g_1,b} = \frac{\widehat d_{g_1,b}}{\mathrm{se}(\widehat d_{g_1,b})}, \qquad \boxed{\max|t|_{g_1} = \max_b |t_{g_1,b}|}.

g0g_0 direction

\widehat d_{g_0,b} = \frac{1}{n}\sum_{i=1}^n h_b(X_i)\Big(\frac{1-D_i}{1-m_i^\tau} - 1\Big), \qquad \mathrm{se}(\widehat d_{g_0,b}) = \frac{\operatorname{sd}\!\left[h_b(X_i)\left(\frac{1-D_i}{1-m_i^\tau} - 1\right)\right]}{\sqrt{n}}.

t_{g_0,b} = \frac{\widehat d_{g_0,b}}{\mathrm{se}(\widehat d_{g_0,b})}, \qquad \boxed{\max|t|_{g_0} = \max_b |t_{g_0,b}|}.

mm direction

S_i \equiv \frac{D_i(Y_i - g_{1,i})}{(m_i^\tau)^2} + \frac{(1-D_i)(Y_i - g_{0,i})}{(1 - m_i^\tau)^2}.

\widehat d_{m,b} = -\frac{1}{n}\sum_{i=1}^n h_b(X_i)\,S_i, \qquad \mathrm{se}(\widehat d_{m,b}) = \frac{\operatorname{sd}\!\left[h_b(X_i)\,S_i\right]}{\sqrt{n}}.

t_{m,b} = \frac{\widehat d_{m,b}}{\mathrm{se}(\widehat d_{m,b})}, \qquad \boxed{\max|t|_{m} = \max_b |t_{m,b}|}.

Interpretation: under Neyman orthogonality, each derivative mean \widehat d_{\bullet,b} should be approximately zero, so all |t_{\bullet,b}| should be small. Large \max|t| values flag miscalibration of the corresponding nuisance.


ATTE / ATT case

Let p_1 = \mathbb{E}[D] and define the odds o_i = m_i^\tau / (1 - m_i^\tau).

  • The g_1 derivative is identically zero:
    \Rightarrow \quad \max|t|_{g_1} = 0.
  • g_0 direction
    \widehat d_{g_0,b} = \frac{1}{n}\sum_i h_b(X_i)\,\frac{(1-D_i)\,o_i - D_i}{p_1}, \qquad t_{g_0,b} = \frac{\widehat d_{g_0,b}}{\mathrm{se}(\widehat d_{g_0,b})}, \qquad \max|t|_{g_0} = \max_b |t_{g_0,b}|.
  • m direction
    \widehat d_{m,b} = -\frac{1}{n}\sum_i h_b(X_i)\,\frac{(1-D_i)(Y_i - g_{0,i})}{p_1(1 - m_i^\tau)^2}, \qquad \max|t|_{m} = \max_b |t_{m,b}|.

Rule of thumb: \max|t| \lesssim 2 is “okay”; larger values indicate orthogonality breakdown — fix by recalibrating that nuisance, changing learners or features, or trimming.

Result
    basis  d_g1  se_g1  t_g1  d_g0       se_g0     t_g0       d_m         se_m       t_m
0   0      0.0   0.0    0.0   -0.010980  0.014869  -0.738424  -3.802089   12.314935  -0.308738
1   1      0.0   0.0    0.0   -0.001591  0.013315  -0.119473  -12.823242  23.914527  -0.536211
2   2      0.0   0.0    0.0   -0.001187  0.013974  -0.084931  -21.700341  38.284507  -0.566818
3   3      0.0   0.0    0.0   0.000453   0.013133  0.034514   -15.948031  27.507951  -0.579761
4   4      0.0   0.0    0.0   -0.005118  0.014957  -0.342200  -12.112879  15.384346  -0.787351
5   5      0.0   0.0    0.0   -0.001189  0.013786  -0.086218  -21.536865  29.004829  -0.742527
6   6      0.0   0.0    0.0   -0.001266  0.014546  -0.087069  -13.437196  16.311453  -0.823789
7   7      0.0   0.0    0.0   -0.010544  0.017775  -0.593202  18.305941   11.355159  1.612125
8   8      0.0   0.0    0.0   0.006540   0.017810  0.367205   -9.892706   7.936701   -1.246451
9   9      0.0   0.0    0.0   0.010467   0.015084  0.693912   -9.071946   10.246696  -0.885353
10  10     0.0   0.0    0.0   0.003309   0.015068  0.219632   -6.515310   11.636338  -0.559911
11  11     0.0   0.0    0.0   -0.003109  0.015489  -0.200714  -9.613623   13.281803  -0.723819
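The ATTE-case derivative checks can be sketched as follows (the helper name `atte_ortho_tstats` and the toy DGP are mine, not the library implementation):

```python
import numpy as np

def atte_ortho_tstats(h, d, y, g0, m, tau=0.01):
    """Max |t| over basis columns for the ATTE g0- and m-direction derivative checks.

    h: (n, B) basis matrix (first column typically constant); d, y, g0, m: length-n arrays.
    """
    m_tau = np.clip(m, tau, 1 - tau)
    p1 = d.mean()
    odds = m_tau / (1 - m_tau)

    def max_abs_t(cols):  # cols: (n, B) per-unit derivative contributions
        means = cols.mean(axis=0)
        ses = cols.std(axis=0, ddof=1) / np.sqrt(cols.shape[0])
        return float(np.max(np.abs(means / ses)))

    g0_dir = h * (((1 - d) * odds - d) / p1)[:, None]
    m_dir = h * (-(1 - d) * (y - g0) / (p1 * (1 - m_tau) ** 2))[:, None]
    return max_abs_t(g0_dir), max_abs_t(m_dir)

# Well-specified toy nuisances: both statistics should stay around O(1).
rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)
m = 1 / (1 + np.exp(-(x - 1)))                 # true propensity
d = (rng.uniform(size=n) < m).astype(float)
y = 2 * x + rng.normal(size=n)                 # no treatment effect; g0(x) = 2x is correct
h = np.column_stack([np.ones(n), x])
t_g0, t_m = atte_ortho_tstats(h, d, y, 2 * x, m)
```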

oos_tstat_fold, oos_tstat_strict

Here’s the math behind the two OOS (out-of-sample) moment t-stats used in the diagnostics. Assume K-fold cross-fitting with held-out index sets I_k (of size n_k) and complements R_k.


Step 1 — Leave-fold-out \hat\theta_{-k}

For the moment condition \mathbb{E}[\psi_a(W)\,\theta + \psi_b(W)] = 0, the leave-fold-out estimate used on fold k is

\hat\theta_{-k} = -\frac{\bar\psi_{b,R_k}}{\bar\psi_{a,R_k}}, \qquad \bar\psi_{\cdot,R_k} = \frac{1}{|R_k|}\sum_{i\in R_k}\psi_{\cdot}(W_i).

Step 2 — Held-out scores on fold k

Define the fold-specific held-out score for i \in I_k:

\psi_i^{(k)} = \psi_b(W_i) + \psi_a(W_i)\,\hat\theta_{-k}.

Compute per-fold mean and variance:

\bar\psi_k = \frac{1}{n_k}\sum_{i\in I_k}\psi_i^{(k)}, \qquad s_k^2 = \frac{1}{n_k-1}\sum_{i\in I_k}\!\big(\psi_i^{(k)}-\bar\psi_k\big)^2.

OOS t-stat diagnostics


oos_tstat_fold

A fold-aggregated, variance-weighted t-statistic:

\boxed{ \texttt{oos\_tstat\_fold} = \frac{\displaystyle \sum_{k=1}^K n_k\,\bar\psi_k}{\displaystyle \sqrt{\sum_{k=1}^K n_k\,s_k^2}} }

Intuition: averages fold means and scales by a fold-pooled standard error.


oos_tstat_strict

A “strict” t-stat using every held-out observation directly:

N = \sum_{k=1}^K n_k, \qquad \bar\psi_{\text{all}} = \frac{1}{N}\sum_{k=1}^K \sum_{i\in I_k} \psi_i^{(k)}.

s_{\text{all}}^2 = \frac{1}{N-1} \sum_{k=1}^K \sum_{i\in I_k} \big(\psi_i^{(k)} - \bar\psi_{\text{all}}\big)^2.

\boxed{ \texttt{oos\_tstat\_strict} = \frac{\bar\psi_{\text{all}}}{s_{\text{all}}/\sqrt{N}} }

Intuition: computes a single overall mean and standard error across all held-out scores (often slightly more conservative).


Interpretation

Under a valid design and correct cross-fitting (so that \mathbb{E}[\psi] = 0 out of sample), both statistics are approximately standard normal:

\text{two-sided p-value} \approx 2\big(1 - \Phi(|t|)\big).

Values near 0 indicate that the moment condition holds out of sample. Large |t| suggests overfitting, leakage, or nuisance miscalibration.

Result

{'available': True, 'oos_tstat_fold': -0.00019613538740434564, 'oos_tstat_strict': -0.00019613466293191552, 'p_value_fold': 0.9998435066035664, 'p_value_strict': 0.9998435071816117, 'fold_table': fold n theta_minus_k psi_mean psi_var 0 0 20000 10.107551 9.073026 956612.184500 1 1 20000 12.516796 -2.973198 355816.110740 2 2 20000 11.546563 1.877966 551143.298220 3 3 20000 12.959614 -5.187289 541712.274267 4 4 20000 12.480716 -2.792798 325274.281795}
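The strict OOS t-stat can be sketched as (the helper name `oos_tstat_strict` and the toy linear score are mine):

```python
import numpy as np

def oos_tstat_strict(psi_a, psi_b, folds):
    """Strict OOS moment t-stat: pool held-out scores built with leave-fold-out theta."""
    held_scores = []
    for k in np.unique(folds):
        held, rest = folds == k, folds != k
        theta_mk = -psi_b[rest].mean() / psi_a[rest].mean()   # leave-fold-out estimate
        held_scores.append(psi_b[held] + psi_a[held] * theta_mk)
    psi = np.concatenate(held_scores)
    return float(psi.mean() / (psi.std(ddof=1) / np.sqrt(len(psi))))

# Linear score psi = psi_a * theta + psi_b with psi_a = -1, so theta-hat is a fold mean.
rng = np.random.default_rng(0)
n = 10_000
psi_a = -np.ones(n)
psi_b = 12.0 + rng.normal(size=n)            # "true" theta = 12
folds = np.arange(n) % 5
t = oos_tstat_strict(psi_a, psi_b, folds)    # near 0 when the moment holds OOS
```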

Result

{'params': {'score': 'ATTE', 'trimming_threshold': 0.01, 'normalize_ipw': False}, 'orthogonality_derivatives': basis d_g1 se_g1 t_g1 d_g0 se_g0 t_g0 d_m
0 0 0.0 0.0 0.0 -0.010980 0.014869 -0.738424 -3.802089
1 1 0.0 0.0 0.0 -0.001591 0.013315 -0.119473 -12.823242
2 2 0.0 0.0 0.0 -0.001187 0.013974 -0.084931 -21.700341
3 3 0.0 0.0 0.0 0.000453 0.013133 0.034514 -15.948031
4 4 0.0 0.0 0.0 -0.005118 0.014957 -0.342200 -12.112879
5 5 0.0 0.0 0.0 -0.001189 0.013786 -0.086218 -21.536865
6 6 0.0 0.0 0.0 -0.001266 0.014546 -0.087069 -13.437196
7 7 0.0 0.0 0.0 -0.010544 0.017775 -0.593202 18.305941
8 8 0.0 0.0 0.0 0.006540 0.017810 0.367205 -9.892706
9 9 0.0 0.0 0.0 0.010467 0.015084 0.693912 -9.071946
10 10 0.0 0.0 0.0 0.003309 0.015068 0.219632 -6.515310
11 11 0.0 0.0 0.0 -0.003109 0.015489 -0.200714 -9.613623

se_m t_m
0 12.314935 -0.308738
1 23.914527 -0.536211
2 38.284507 -0.566818
3 27.507951 -0.579761
4 15.384346 -0.787351
5 29.004829 -0.742527
6 16.311453 -0.823789
7 11.355159 1.612125
8 7.936701 -1.246451
9 10.246696 -0.885353
10 11.636338 -0.559911
11 13.281803 -0.723819 , 'influence_diagnostics': {'se_plugin': 2.33658747572632, 'kurtosis': 3085.930675575231, 'p99_over_med': 64.7562071038307, 'top_influential': i psi m res_t res_c 0 67446 71506.706325 0.014464 4675.799768 0.000000 1 65005 64940.749295 0.017769 3408.282373 0.000000 2 3618 57509.364010 0.022839 2825.678117 0.000000 3 79652 -57249.417825 0.610355 0.000000 1808.738763 4 96126 53276.215692 0.052542 2509.536113 0.000000 5 48708 46302.067362 0.026377 2213.743705 0.000000 6 87181 43245.822176 0.059510 2246.844037 0.000000 7 56222 35052.876932 0.035378 1578.986396 0.000000 8 57635 34399.580082 0.074460 1855.306056 0.000000 9 47882 32356.549510 0.015467 1369.168501 0.000000}, 'oos_moment_test': {'available': True, 'oos_tstat_fold': -0.00019613538740434564, 'oos_tstat_strict': -0.00019613466293191552, 'p_value_fold': 0.9998435066035664, 'p_value_strict': 0.9998435071816117, 'fold_table': fold n theta_minus_k psi_mean psi_var 0 0 20000 10.107551 9.073026 956612.184500 1 1 20000 12.516796 -2.973198 355816.110740 2 2 20000 11.546563 1.877966 551143.298220 3 3 20000 12.959614 -5.187289 541712.274267 4 4 20000 12.480716 -2.792798 325274.281795}, 'flags': {'psi_tail_ratio': 'RED', 'psi_kurtosis': 'RED', 'ortho_max_|t|g1': 'GREEN', 'ortho_max|t|g0': 'GREEN', 'ortho_max|t|m': 'GREEN', 'oos_moment': 'GREEN'}, 'thresholds': {'tail_ratio_warn': 10.0, 'tail_ratio_strong': 20.0, 'kurt_warn': 10.0, 'kurt_strong': 30.0, 't_warn': 2.0, 't_strong': 4.0}, 'overall_flag': 'RED', 'meta': {'n': 100000, 'score': 'ATTE', 'used_estimator_psi': True, 'uses_custom_weights': False}, 'summary': metric value flag 0 se_plugin 2.336587 NA 1 psi_p99_over_med 64.756207 RED 2 psi_kurtosis 3085.930676 RED 3 max|t|g1 0.000000 GREEN 4 max|t|g0 0.738424 GREEN 5 max|t|_m 1.612125 GREEN 6 oos_tstat_fold -0.000196 GREEN 7 oos_tstat_strict -0.000196 GREEN}

SUTVA

Result

1.) Are your clients independent, i.e., does no one's outcome depend on another's treatment? 2.) Do all clients have the full window to measure metrics? 3.) Do you measure confounders before treatment and the outcome after? 4.) Do you have a consistent treatment label, e.g., a person who does not receive the treatment is labeled 0?

These assumptions are statistically untestable; they must be justified by the design of the research.

Unconfoundedness

Result
   metric                   value     flag
0  balance_max_smd          0.010351  GREEN
1  balance_frac_violations  0.000000  GREEN

balance_max_smd

For each covariate X_j, the (weighted) standardized mean difference is

\mathrm{SMD}_j = \frac{\big|\mu_{1j} - \mu_{0j}\big|}{\sqrt{\tfrac{1}{2}\big(\sigma_{1j}^2 + \sigma_{0j}^2\big)}}.

Group means and variances are computed under the IPW weights implied by your estimand:

  • ATE: w_{1i} = \tfrac{D_i}{\hat m_i}, \quad w_{0i} = \tfrac{1-D_i}{1-\hat m_i}
  • ATTE: w_{1i} = D_i, \quad w_{0i} = (1-D_i)\,\tfrac{\hat m_i}{1-\hat m_i}

(If normalize=True, each weight vector is divided by its mean.)

Weighted means and variances:

\mu_{gj} = \frac{\sum_i w_{gi} X_{ij}}{\sum_i w_{gi}}, \qquad \sigma_{gj}^2 = \frac{\sum_i w_{gi}(X_{ij} - \mu_{gj})^2}{\sum_i w_{gi}}, \qquad g \in \{0,1\}.

Special cases in the code:

  • If both variances are \approx 0 and |\mu_{1j} - \mu_{0j}| \approx 0 ⇒ \mathrm{SMD}_j = 0
  • If both variances are \approx 0 but the means differ ⇒ \mathrm{SMD}_j = \infty
  • If the denominator is \approx 0 otherwise ⇒ \mathrm{SMD}_j = \text{NaN}

Then

\textbf{balance\_max\_smd} = \max_j \mathrm{SMD}_j,

implemented as a nanmax over the vector of \mathrm{SMD}_j. NaNs are ignored; if any feature produced \infty, the max is \infty.

balance_frac_violations

Let the SMD threshold be \tau (default 0.10). Define the set of finite SMDs:

\mathcal{J} = \{ j : \ \mathrm{SMD}_j \text{ is finite} \}.

Then the fraction of violations is

\textbf{balance\_frac\_violations} = \frac{1}{|\mathcal{J}|} \sum_{j \in \mathcal{J}} \mathbf{1}\{ \mathrm{SMD}_j \ge \tau \}.

So it’s the share of covariates whose weighted SMD exceeds the threshold, computed only over finite SMDs (NaN / Inf are excluded from the denominator).


Quick interpretation

  • Smaller is better. A common rule of thumb is \mathrm{SMD} \le 0.10.
  • balance_max_smd tells you the worst residual imbalance across covariates.
  • balance_frac_violations tells you how many covariates (as a fraction) still exceed the chosen threshold.
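The weighted SMD for a single covariate can be sketched as (the helper name `weighted_smd` is mine):

```python
import numpy as np

def weighted_smd(x, w1, w0):
    """Weighted standardized mean difference for one covariate (formulas above)."""
    def wmoments(x, w):
        mu = np.average(x, weights=w)
        var = np.average((x - mu) ** 2, weights=w)
        return mu, var
    mu1, v1 = wmoments(x, w1)
    mu0, v0 = wmoments(x, w0)
    return float(abs(mu1 - mu0) / np.sqrt(0.5 * (v1 + v0)))

x = np.array([2.0, 3.0, 1.0, 2.0])
w1 = np.array([1.0, 1.0, 0.0, 0.0])   # treated-arm weights
w0 = np.array([0.0, 0.0, 1.0, 1.0])   # control-arm weights
smd = weighted_smd(x, w1, w0)         # |2.5 - 1.5| / sqrt(0.5*(0.25+0.25)) = 2.0
```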

Sensitivity analysis

1) sensitivity_analysis: bias-aware CI

Goal. Start from your estimator \hat\theta with sampling standard error se. Allow a controlled amount of worst-case hidden confounding through three knobs cf_y, cf_d, \rho. Inflate the uncertainty by an additive “max bias”.


Step A — Sampling part

  • Point estimate \hat\theta, standard error se, and z_\alpha for level 1 - \alpha.
  • Usual sampling CI:
    [\,\hat\theta - z_\alpha\,se,\ \hat\theta + z_\alpha\,se\,].

Step B — Confounding geometry

The code pulls sensitivity elements from the fitted IRM:

  • \sigma^2: the asymptotic variance of the estimator’s EIF (so that se = \sqrt{\sigma^2} in the module’s normalization).

  • m_\alpha(i) \ge 0: per-unit weight for the outcome channel (how outcome-model misspecification moves the EIF).

  • r(i) (“riesz_rep”): per-unit weight for the treatment channel (how propensity-model misspecification moves the EIF).

We turn the user’s sensitivity knobs into a quadratic budget for adversarial confounding:

\begin{aligned} a_i &:= \sqrt{2\,m_\alpha(i)}, \\ b_i &:= \begin{cases} |r(i)|, & \text{(default, worst-case sign)} \\ r(i), & \text{(if \texttt{use\_signed\_rr=True})} \end{cases} \\ \text{base}_i &= a_i^2\,cf_y + b_i^2\,cf_d + 2\,\rho\,\sqrt{cf_y\,cf_d}\,a_i b_i \ge 0, \\ \nu^2 &:= \mathbb{E}_n[\text{base}_i]. \end{aligned}
  • cf_y \ge 0: strength of the unobserved outcome disturbance
  • cf_d \ge 0: strength of the unobserved treatment disturbance
  • \rho \in [-1, 1]: their correlation

This \nu^2 is a dimensionless bias multiplier — how sensitive the EIF is to those perturbations.


Step C — Max bias and intervals

Two equivalent forms appear in the code:

\text{max\_bias} = \sqrt{\sigma^2\,\nu^2} = \big(\sqrt{\nu^2}\big)\,se.

Then the module reports:

  • Cofounding bounds for θ\theta:
[θ^max_bias,  θ^+max_bias]. [\,\hat\theta - \text{max\_bias},\; \hat\theta + \text{max\_bias}\,].
  • Bias-aware CI (sampling + confounding, worst-case additive):
[θ^(max_bias+zαse),  θ^+(max_bias+zαse)]. \Big[\,\hat\theta - (\text{max\_bias} + z_\alpha\,se),\; \hat\theta + (\text{max\_bias} + z_\alpha\,se)\,\Big].

(So you’re adding sampling error and the adversarial bias linearly for a conservative envelope.)
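Putting Steps A–C together, a hedged sketch with illustrative numbers, following the formulas exactly as stated above rather than the module's internals:

```python
import math

theta_hat, se, z = 11.92, 2.34, 1.96   # illustrative values
nu2 = 20.88                            # confounding budget from Step B

# max_bias = sqrt(sigma^2 * nu^2) = sqrt(nu^2) * se under se = sqrt(sigma^2)
max_bias = math.sqrt(nu2) * se

bounds_confounding = (theta_hat - max_bias, theta_hat + max_bias)
bias_aware_ci = (theta_hat - (max_bias + z * se),
                 theta_hat + (max_bias + z * se))
print(bounds_confounding)
print(bias_aware_ci)
```

The bias-aware interval always contains the confounding bounds, since it adds the sampling half-width zαsez_\alpha\,se on top of the adversarial bias.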


Notes & edge handling

  • Numeric PSD clamping ensures basei0\text{base}_i \ge 0; ρ\rho is clipped to [1,1][-1,1].
  • If cfy=cfd=0ν2=0cf_y = cf_d = 0 \Rightarrow \nu^2 = 0 \Rightarrow bias-aware CI collapses to the sampling CI.
  • Internally, a delta-method IF for max_bias\text{max\_bias} is
ψmax(i)=σ2ψν2(i)+ν2ψσ2(i)2max_bias, \psi_{\text{max}}(i) = \frac{\sigma^2\,\psi_{\nu^2}(i) + \nu^2\,\psi_{\sigma^2}(i)} {2\,\text{max\_bias}},

matching max_bias=σ2ν2\text{max\_bias} = \sqrt{\sigma^2\nu^2} (used for coherent summaries).
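A tiny numeric sketch of that delta-method influence function, with made-up per-unit IFs standing in for ψσ2(i)\psi_{\sigma^2}(i) and ψν2(i)\psi_{\nu^2}(i):

```python
import math

sigma2, nu2 = 4.0, 9.0
max_bias = math.sqrt(sigma2 * nu2)   # = 6.0 here

psi_sigma2 = [0.5, -0.2, 0.1]        # IF of sigma^2 (illustrative)
psi_nu2    = [0.3,  0.0, -0.4]       # IF of nu^2 (illustrative)

# psi_max(i) = (sigma^2 * psi_nu2(i) + nu^2 * psi_sigma2(i)) / (2 * max_bias)
psi_max = [(sigma2 * pn + nu2 * ps) / (2.0 * max_bias)
           for ps, pn in zip(psi_sigma2, psi_nu2)]
print(psi_max)
```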


2) sensitivity_benchmark: calibrating cfy,cfd,ρcf_y, cf_d, \rho from omitted covariates

Goal. Pick a set ZZ of candidate “omitted” covariates (the benchmarking_set). Refit a short IRM that excludes ZZ and compare it to the long (original) model. Use how well ZZ explains residual variation to derive plausible cfy,cfd,ρcf_y, cf_d, \rho.


Step A — Long vs short estimates

  • Long: θ^long\hat\theta_{\text{long}} (original model).
  • Short: θ^short\hat\theta_{\text{short}} (drop ZZ, same learners/hyperparams).
  • Report Δ=θ^longθ^short\Delta = \hat\theta_{\text{long}} - \hat\theta_{\text{short}}.

Step B — Residuals from the long model

Let g1,g0,m^g_1, g_0, \hat m be the outcome and propensity learners:

ry:=Y(Dg1+(1D)g0),rd:=Dm^.r_y := Y - \big(D g_1 + (1-D) g_0\big), \qquad r_d := D - \hat m.

These are the EIF’s outcome and treatment residual components.
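These residuals are cheap to form once the learners' fitted values are in hand. A toy sketch, with short illustrative arrays standing in for YY, DD, and the fitted g1,g0,m^g_1, g_0, \hat m:

```python
y     = [3.0, 1.0, 4.0, 2.0]
d     = [1.0, 0.0, 1.0, 0.0]
g1    = [2.5, 1.2, 3.8, 2.1]    # outcome model under treatment
g0    = [1.0, 0.9, 2.0, 1.8]    # outcome model under control
m_hat = [0.6, 0.3, 0.7, 0.2]    # propensity model

# r_y = Y - (D*g1 + (1-D)*g0),  r_d = D - m_hat
r_y = [yi - (di * g1i + (1 - di) * g0i)
       for yi, di, g1i, g0i in zip(y, d, g1, g0)]
r_d = [di - mi for di, mi in zip(d, m_hat)]
print(r_y)   # outcome residuals
print(r_d)   # treatment residuals
```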


Step C — How much of each residual does ZZ explain?

Regress ryr_y on ZZ and rdr_d on ZZ (unweighted OLS; ATT case uses ATT weights):

  • Obtain Ry2R^2_y and Rd2R^2_d.
  • Convert to signal-to-noise ratios (the “strength” of confounding channels):
cfy=Ry21Ry2,cfd=Rd21Rd2. cf_y = \frac{R^2_y}{1 - R^2_y}, \qquad cf_d = \frac{R^2_d}{1 - R^2_d}.

(These are the same R2/(1R2)R^2 / (1 - R^2) maps used in modern partial-R2R^2 robustness frameworks.)

Compute the correlation between the fitted pieces from those two regressions:

ρ=corr ⁣(r^y(Z), r^d(Z)),\rho = \operatorname{corr}\!\big(\widehat r_y(Z),\ \widehat r_d(Z)\big),

weighted for ATT when applicable, then clipped to [1,1][-1,1].
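This step can be sketched end-to-end with plain OLS, assuming numpy is available. Synthetic data below; this mirrors the recipe described above (unweighted case), not the library's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
Z = rng.normal(size=(n, 2))                       # candidate omitted covariates
r_y = 0.5 * Z[:, 0] + rng.normal(size=n)          # outcome residual (toy)
r_d = 0.3 * Z[:, 0] - 0.2 * Z[:, 1] + rng.normal(size=n)

X = np.column_stack([np.ones(n), Z])              # add an intercept

def ols_fit(resid):
    """OLS of a residual on Z; return fitted values and R^2."""
    beta, *_ = np.linalg.lstsq(X, resid, rcond=None)
    fitted = X @ beta
    r2 = 1 - np.sum((resid - fitted) ** 2) / np.sum((resid - resid.mean()) ** 2)
    return fitted, r2

fit_y, r2_y = ols_fit(r_y)
fit_d, r2_d = ols_fit(r_d)

cf_y = r2_y / (1 - r2_y)                          # R^2 / (1 - R^2) maps
cf_d = r2_d / (1 - r2_d)
rho = float(np.clip(np.corrcoef(fit_y, fit_d)[0, 1], -1.0, 1.0))
print(cf_y, cf_d, rho)
```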


Outputs

A one-row DataFrame (indexed by the treatment name) with

{cfy, cfd, ρ, θ^long, θ^short, Δ}.\{\, cf_y,\ cf_d,\ \rho,\ \hat\theta_{\text{long}},\ \hat\theta_{\text{short}},\ \Delta \,\}.

You can pass cfy,cfd,ρcf_y, cf_d, \rho straight into sensitivity_analysis to get the associated bias-aware interval. Intuitively, this calibrates how strong hidden confounding would need to be, using the concrete, observed proxy ZZ.


How to read them together

  1. Use sensitivity_benchmark with a plausible omitted set ZZ to derive cfy,cfd,ρcf_y, cf_d, \rho and observe the actual estimate shift Δ\Delta.

  2. Plug those cfy,cfd,ρcf_y, cf_d, \rho into sensitivity_analysis to get:

max_bias=ν2se,Bias-aware CI=θ^±(max_bias+zαse). \text{max\_bias} = \sqrt{\nu^2}\,se, \qquad \text{Bias-aware CI} = \hat\theta \pm (\text{max\_bias} + z_\alpha\,se).

Small cfcf values (or ρ0\rho \approx 0) ⇒ tiny ν2\nu^2 ⇒ bias-aware CI near the sampling CI. Large cfcf values and ρ1|\rho|\approx 1 widen it, reflecting stronger plausible hidden confounding.
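A tiny numeric illustration of this takeaway (illustrative values, reusing the formulas above):

```python
import math

theta, se, z = 11.92, 2.34, 1.96   # illustrative values

def half_width(nu2):
    """Half-width of the bias-aware CI: max_bias + z * se."""
    return math.sqrt(nu2) * se + z * se

print(half_width(0.0))    # cf_y = cf_d = 0 -> just the sampling half-width
print(half_width(20.0))   # a large confounding budget widens it substantially
```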

Result

{'theta': 11.922156671701723, 'se': 2.3365874757263203, 'alpha': 0.05, 'z': 1.959963984540054, 'H0': 0.0, 'sampling_ci': (7.342529372550778, 16.501783970852667), 'theta_bounds_cofounding': (3.455971789602664, 20.38834155380078), 'bias_aware_ci': (-1.1816579084569545, 25.05171921299312), 'max_bias_base': 838.1523033278067, 'max_bias': 8.466184882099059, 'bound_width': 8.466184882099059, 'sigma2': 33645.29014705864, 'nu2': 20.879572757529697, 'rv': 0.014024838096781095, 'rva': 0.008684298340535173, 'params': {'r2_y': 0.01, 'r2_d': 0.01, 'rho': 1.0, 'use_signed_rr': False}}

Result
|   | r2_y | r2_d | rho | theta_long | theta_short | delta |
| --- | --- | --- | --- | --- | --- | --- |
| d | 0.000208 | 0.04634 | 1.0 | 11.922157 | 13.553909 | -1.631752 |