Refutation flow for multi-treatment IRM

Automated conversion of refutation_flow_multi_unconfoundedness.ipynb

This notebook mirrors refutation_flow.ipynb, but for the causalis.scenarios.multi_unconfoundedness scenario.

We estimate pairwise ATE contrasts against the baseline treatment d_0, then run:

  • overlap diagnostics,
  • score diagnostics,
  • unconfoundedness balance checks,
  • sensitivity analysis.
Result
          y  d_0  d_1  d_2  tenure_months  avg_sessions_week  spend_last_month  premium_user  urban_resident  support_tickets_q  ...
0  1.724081  1.0  0.0  0.0      27.656605           3.198667         89.609464           0.0             1.0                0.0  ...
1  0.658436  0.0  1.0  0.0      23.798386           3.362415        102.337236           0.0             0.0                3.0  ...
2  3.894951  0.0  1.0  0.0      28.425009           3.391819        102.660712           0.0             1.0                1.0  ...
3  2.363204  0.0  1.0  0.0      18.860066           4.071175         83.593417           0.0             0.0                2.0  ...
4  6.232463  0.0  0.0  1.0      17.853087           3.140075         79.209870           0.0             1.0                1.0  ...

   ...  m_obs_d_1  tau_link_d_1     m_d_2  m_obs_d_2  tau_link_d_2     g_d_0     g_d_1     g_d_2   cate_d_1  cate_d_2
0  ...   0.245737     -0.352005  0.220739   0.220739      0.494166  3.279384  2.306314  5.375338  -0.973070  2.095954
1  ...   0.178640     -0.307360  0.236830   0.236830      0.420278  2.807850  2.064853  4.274630  -0.742997  1.466780
2  ...   0.209711     -0.320189  0.218158   0.218158      0.502415  3.069919  2.228798  5.073677  -0.841121  2.003758
3  ...   0.175985     -0.316241  0.237508   0.237508      0.441677  2.716805  1.980234  4.225485  -0.736571  1.508680
4  ...   0.231590     -0.350130  0.246973   0.246973      0.493624  3.224354  2.271869  5.282273  -0.952485  2.057919

5 rows × 26 columns

Result

MultiCausalData(df=(20000, 12), treatment_names=['d_0', 'd_1', 'd_2'], control_treatment='d_0')
outcome='y', confounders=['tenure_months', 'avg_sessions_week', 'spend_last_month', 'premium_user', 'urban_resident', 'support_tickets_q', 'discount_eligible', 'credit_utilization'], user_id=None,

Inference

Result
field           d_1 vs d_0                             d_2 vs d_0
estimand        ATE                                    ATE
model           MultiTreatmentIRM                      MultiTreatmentIRM
value           -1.2254 (ci_abs: -1.3296, -1.1212)     2.5704 (ci_abs: 2.3562, 2.7846)
value_relative  -31.0949 (ci_rel: -33.3772, -28.8125)  65.2240 (ci_rel: 59.3040, 71.1441)
alpha           0.0500                                 0.0500
p_value         0.0000                                 0.0000
is_significant  True                                   True
n_treated       5003                                   4877
n_control       10120                                  10120
treatment_mean  2.9112                                 6.5755
control_mean    3.7633                                 3.7633
time            2026-02-22                             2026-02-22
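
The ci_abs bounds can be cross-checked by hand. The sketch below assumes they are ordinary normal-approximation intervals (theta plus or minus z times the standard error) and takes the standard errors from the se_plugin rows of the score diagnostics; wald_ci is an illustrative helper, not a causalis function.

```python
from statistics import NormalDist


def wald_ci(theta, se, alpha=0.05):
    """Two-sided normal-approximation interval: theta +/- z * se."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return theta - z * se, theta + z * se


# Point estimates and standard errors as reported in this notebook.
for label, theta, se in [
    ("d_1 vs d_0", -1.225401, 0.053151),
    ("d_2 vs d_0", 2.570377, 0.109287),
]:
    lo, hi = wald_ci(theta, se)
    print(label, round(lo, 4), round(hi, 4))
```

Under that assumption the rounded bounds agree with the ci_abs values in the table.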

Overlap

For multi-treatment IRM, overlap is checked pairwise: baseline d_0 vs each active treatment. Key metrics are reported by comparison (d_0 vs d_1, d_0 vs d_2, ...).

Result
    comparison  metric              value     flag
0   d_0 vs d_1  edge_0.01_below     0.000066  GREEN
1   d_0 vs d_1  edge_0.01_above     0.0       GREEN
2   d_0 vs d_1  KS                  0.100367  GREEN
3   d_0 vs d_1  AUC                 0.566421  GREEN
4   d_0 vs d_1  ESS_treated_ratio   0.721938  GREEN
5   d_0 vs d_1  ESS_baseline_ratio  0.896655  GREEN
6   d_0 vs d_1  clip_m_total        0.000066  GREEN
7   d_0 vs d_1  overlap_pass        True      GREEN
8   d_0 vs d_2  edge_0.01_below     0.0       GREEN
9   d_0 vs d_2  edge_0.01_above     0.0       GREEN
10  d_0 vs d_2  KS                  0.056139  GREEN
11  d_0 vs d_2  AUC                 0.533669  GREEN
12  d_0 vs d_2  ESS_treated_ratio   0.814103  GREEN
13  d_0 vs d_2  ESS_baseline_ratio  0.896655  GREEN
14  d_0 vs d_2  clip_m_total        0.0       GREEN
15  d_0 vs d_2  overlap_pass        True      GREEN

edge_0.01_below, edge_0.01_above

Share of pairwise propensity mass near 0 or 1.

Result
   comparison  edge_0.01_below  edge_0.01_above  flag_edge_001
0  d_0 vs d_1  0.000066         0.0              GREEN
1  d_0 vs d_2  0.000000         0.0              GREEN
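
As a rough illustration of what an edge share measures, the sketch below counts the fraction of propensities that fall below eps or above 1 - eps. The helper edge_shares and the example values are hypothetical, and the way the pairwise propensity is formed (for instance m_k / (m_0 + m_k) on rows in either arm) is an assumption rather than the documented causalis definition.

```python
def edge_shares(p, eps=0.01):
    """Return (share below eps, share above 1 - eps) for a list of propensities."""
    if not p:
        raise ValueError("p must be non-empty")
    n = len(p)
    below = sum(1 for v in p if v < eps) / n
    above = sum(1 for v in p if v > 1 - eps) / n
    return below, above


# Hypothetical pairwise propensities for four observations.
p_pair = [0.005, 0.20, 0.35, 0.50]
print(edge_shares(p_pair))  # one of four values lies below 0.01
```

For scale, a share of 0.000066 is about what a single observation out of the 10120 + 5003 rows in the d_0 vs d_1 comparison would produce, if the share is taken over those rows.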

ks

Kolmogorov-Smirnov distance between pairwise score distributions.

Result
   comparison  ks        flag_ks
0  d_0 vs d_1  0.100367  GREEN
1  d_0 vs d_2  0.056139  GREEN
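
For reference, the two-sample Kolmogorov-Smirnov distance is the largest gap between two empirical CDFs. The sketch below is a plain-Python version; ks_distance and the sample scores are illustrative, and which score samples causalis actually compares is not shown in this notebook.

```python
from bisect import bisect_right


def ks_distance(a, b):
    """Largest absolute gap between the empirical CDFs of samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # The supremum is attained at one of the observed points.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical propensity scores in the baseline arm and in one active arm.
baseline_scores = [0.1, 0.2, 0.3, 0.4]
active_scores = [0.3, 0.4, 0.5, 0.6]
print(ks_distance(baseline_scores, active_scores))
```

A value near 0 means the two arms have nearly the same score distribution; a value near 1 means they barely overlap.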

auc

AUC for separating baseline vs active treatment using pairwise propensity score.

Result
   comparison  auc       flag_auc
0  d_0 vs d_1  0.566421  GREEN
1  d_0 vs d_2  0.533669  GREEN
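
The AUC here can be read as the probability that a randomly drawn active-arm unit has a higher propensity score than a randomly drawn baseline unit, with ties counted as one half. A minimal sketch of that pair-counting definition follows; auc is an illustrative helper and the inputs are made up.

```python
from bisect import bisect_left, bisect_right


def auc(scores_pos, scores_neg):
    """P(pos > neg) + 0.5 * P(pos == neg) over all (pos, neg) pairs."""
    neg_sorted = sorted(scores_neg)
    total = 0.0
    for s in scores_pos:
        wins = bisect_left(neg_sorted, s)           # negatives strictly below s
        ties = bisect_right(neg_sorted, s) - wins   # negatives equal to s
        total += wins + 0.5 * ties
    return total / (len(scores_pos) * len(neg_sorted))


# Hypothetical scores: active arm (positives) vs baseline arm (negatives).
print(auc([0.6, 0.4, 0.7], [0.5, 0.3, 0.4]))
```

Values close to 0.5, such as the 0.53 to 0.57 range reported above, indicate that the score separates the two arms only weakly, which is what good overlap looks like.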

ess_ratio_treated, ess_ratio_baseline

Effective sample size ratios implied by inverse-propensity weights.

Result
   comparison  ess_ratio_treated  ess_ratio_baseline  flag_ess_treated  flag_ess_baseline
0  d_0 vs d_1  0.721938           0.896655            GREEN             GREEN
1  d_0 vs d_2  0.814103           0.896655            GREEN             GREEN
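
A common way to define the effective sample size of a weighted sample is Kish's formula, (sum of w)^2 / sum of w^2, divided by n to obtain a ratio. The sketch below assumes that definition and inverse-propensity weights w = 1 / p within one arm; whether causalis uses exactly this form is an assumption.

```python
def ess_ratio(weights):
    """Kish effective sample size divided by the number of observations."""
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    return (s1 * s1 / s2) / len(weights)


# Hypothetical propensities of receiving the treatment, for treated units only.
p_treated = [0.5, 0.5, 0.25, 0.1]
ipw = [1.0 / p for p in p_treated]
print(ess_ratio(ipw))
```

Equal weights give a ratio of 1; a few very large weights pull the ratio toward 1 / n.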

clip_m_total

Share of observations affected by propensity trimming in each comparison.

Result
   comparison  clip_m_total  flag_clip_m
0  d_0 vs d_1  0.000066      GREEN
1  d_0 vs d_2  0.000000      GREEN

Overall overlap verdict

Result

{'overall_flag': 'GREEN', 'all_comparisons_pass': True}

Score

Score diagnostics check the orthogonal moment conditions and the tail behavior of the influence values for each contrast against the baseline.

Result
    comparison  metric            value          flag
0   d_1 vs d_0  se_plugin         5.315128e-02   NA
1   d_1 vs d_0  psi_p99_over_med  1.018200e+01   YELLOW
2   d_1 vs d_0  psi_kurtosis      9.374657e+01   RED
3   d_1 vs d_0  max_|t|_gk        8.303548e+00   RED
4   d_1 vs d_0  max_|t|_g0        6.044510e+00   RED
5   d_1 vs d_0  max_|t|_mk        1.082170e+00   RED
6   d_1 vs d_0  max_|t|_m0        1.220473e+00   RED
7   d_1 vs d_0  max_|t|           8.303548e+00   RED
8   d_1 vs d_0  oos_tstat_fold    -1.176496e-15  GREEN
9   d_1 vs d_0  oos_tstat_strict  -1.497106e-15  GREEN
10  d_2 vs d_0  se_plugin         1.092872e-01   NA
11  d_2 vs d_0  psi_p99_over_med  1.790495e+01   YELLOW
12  d_2 vs d_0  psi_kurtosis      2.876548e+02   RED
13  d_2 vs d_0  max_|t|_gk        8.270394e+00   RED
14  d_2 vs d_0  max_|t|_g0        6.044510e+00   RED
15  d_2 vs d_0  max_|t|_mk        1.596762e+00   RED
16  d_2 vs d_0  max_|t|_m0        1.220473e+00   RED
17  d_2 vs d_0  max_|t|           8.270394e+00   RED
18  d_2 vs d_0  oos_tstat_fold    -5.201603e-16  GREEN
19  d_2 vs d_0  oos_tstat_strict  -1.248197e-15  GREEN

psi_p99_over_med, psi_kurtosis

Tail diagnostics of influence values by comparison.

Result
   comparison  se_plugin  kurtosis    p99_over_med
0  d_1 vs d_0  0.053151   93.746570   10.182000
1  d_2 vs d_0  0.109287   287.654766  17.904955
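
To make the two tail metrics concrete, the sketch below computes a (non-excess) kurtosis and the ratio of the 99th percentile of |psi| to its median. Both definitions, the nearest-rank percentile rule, and the helper name tail_diagnostics are assumptions for illustration; the library may use different conventions.

```python
import math
from statistics import fmean, median


def tail_diagnostics(psi):
    """Kurtosis (m4 / m2^2) and p99(|psi|) / median(|psi|) for influence values."""
    mu = fmean(psi)
    m2 = fmean([(v - mu) ** 2 for v in psi])
    m4 = fmean([(v - mu) ** 4 for v in psi])
    abs_sorted = sorted(abs(v) for v in psi)
    # Nearest-rank 99th percentile (an assumed convention).
    k = max(0, math.ceil(0.99 * len(abs_sorted)) - 1)
    return {
        "kurtosis": m4 / (m2 * m2),
        "p99_over_med": abs_sorted[k] / median(abs_sorted),
    }


# Hypothetical influence values: mostly +/-1 with two large outliers.
psi = [1.0, -1.0] * 49 + [10.0, -10.0]
print(tail_diagnostics(psi))
```

A normal distribution has kurtosis 3 under this definition, so values in the range of 90 to 290 point to a small number of observations carrying much of the variance.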

Top influential observations

Result
    comparison  i      psi          m_k       residual_k  residual_0
0   d_1 vs d_0  2387   -206.584670  0.010101  -2.105333   -1.495193
1   d_1 vs d_0  13045  188.047220   0.014999  2.817467    1.790805
2   d_1 vs d_0  7609   166.136647   0.059897  9.972243    8.393727
3   d_1 vs d_0  18521  -150.717871  0.451285  21.968336   19.422518
4   d_1 vs d_0  5145   140.704496   0.030226  4.271055    2.447766
5   d_1 vs d_0  15110  112.291193   0.128562  14.566815   12.326445
6   d_1 vs d_0  13471  103.747853   0.186516  20.086296   14.916909
7   d_1 vs d_0  6257   -88.685294   0.412085  23.849319   21.475964
8   d_1 vs d_0  7117   -81.913914   0.524250  14.666899   15.073730
9   d_1 vs d_0  10254  -78.445073   0.328219  17.977090   17.290261
10  d_2 vs d_0  18180  596.534557   0.041395  24.446368   32.983599
11  d_2 vs d_0  13177  528.114246   0.125323  66.057598   69.642200
12  d_2 vs d_0  10961  459.450176   0.096902  44.837277   44.149535
13  d_2 vs d_0  16985  -385.827564  0.015946  -6.193279   -1.063880
14  d_2 vs d_0  6952   327.543647   0.039154  12.682247   18.885766
15  d_2 vs d_0  2193   320.003662   0.110369  35.553554   35.993882
16  d_2 vs d_0  12001  310.479698   0.026207  8.121555    11.274115
17  d_2 vs d_0  9928   235.719287   0.048578  11.526824   12.532611
18  d_2 vs d_0  17424  206.428142   0.173928  35.439848   40.677004
19  d_2 vs d_0  15506  195.400476   0.112676  21.971006   24.949844

max_|t|_gk, max_|t|_g0, max_|t|_mk, max_|t|_m0

Orthogonality derivative checks by comparison.

Result
   comparison  max_|t|_gk  max_|t|_g0  max_|t|_mk  max_|t|_m0  max_|t|
0  d_1 vs d_0  8.303548    6.04451     1.082170    1.220473    8.303548
1  d_2 vs d_0  8.270394    6.04451     1.596762    1.220473    8.270394

oos_tstat_fold, oos_tstat_strict

Out-of-sample moment tests.

Result
   comparison  oos_tstat_fold  oos_tstat_strict  p_value_fold  p_value_strict
0  d_1 vs d_0  -1.176496e-15   -1.497106e-15     1.0           1.0
1  d_2 vs d_0  -5.201603e-16   -1.248197e-15     1.0           1.0

Result

{'overall_flag': 'RED',
 'flags': {'psi_tail_ratio': 'YELLOW', 'psi_kurtosis': 'RED', 'ortho_max_|t|': 'RED', 'oos_moment': 'GREEN', 'ortho_max_|t|_gk': 'RED', 'ortho_max_|t|_g0': 'RED', 'ortho_max_|t|_mk': 'GREEN', 'ortho_max_|t|_m0': 'GREEN'},
 'flags_by_comparison':
    comparison  psi_tail_ratio  psi_kurtosis  ortho_max_|t|  oos_moment  overall_flag
 0  d_1 vs d_0  YELLOW          RED           RED            GREEN       RED
 1  d_2 vs d_0  YELLOW          RED           RED            GREEN       RED}

SUTVA

Result

1.) Are your clients independent, i.e. does the outcome of one client not depend on the others?
2.) Do all clients have a full window in which to measure metrics?
3.) Do you measure confounders before treatment and the outcome after it?
4.) Do you have a consistent treatment label, e.g. a person who does not receive a treatment has label 0?

Unconfoundedness

Result
   comparison  metric                   value     flag
0  d_0 vs d_1  balance_max_smd          0.05714   GREEN
1  d_0 vs d_1  balance_frac_violations  0.0       GREEN
2  d_0 vs d_1  balance_pass             True      GREEN
3  d_0 vs d_2  balance_max_smd          0.022568  GREEN
4  d_0 vs d_2  balance_frac_violations  0.0       GREEN
5  d_0 vs d_2  balance_pass             True      GREEN
6  overall     balance_max_smd          0.05714   GREEN
7  overall     balance_frac_violations  0.0       GREEN
8  overall     balance_pass             True      GREEN

balance_max_smd, balance_frac_violations

Weighted SMD checks for each baseline-vs-treatment comparison.

Result
   comparison  smd_max   frac_violations  pass  flag_max_smd  flag_violations  overall_flag
0  d_0 vs d_1  0.057140  0.0              True  GREEN         GREEN            GREEN
1  d_0 vs d_2  0.022568  0.0              True  GREEN         GREEN            GREEN
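
A weighted standardized mean difference (SMD) for one covariate can be sketched as the gap between the weighted arm means divided by a pooled standard deviation. The version below uses weighted variances in the denominator and takes the absolute value; both choices, and the helper names, are assumptions rather than the causalis definition.

```python
import math


def weighted_mean(x, w):
    return sum(wi * xi for xi, wi in zip(x, w)) / sum(w)


def weighted_var(x, w):
    mu = weighted_mean(x, w)
    return sum(wi * (xi - mu) ** 2 for xi, wi in zip(x, w)) / sum(w)


def weighted_smd(x_treated, w_treated, x_baseline, w_baseline):
    """|weighted mean gap| / sqrt of the average of the two weighted variances."""
    gap = weighted_mean(x_treated, w_treated) - weighted_mean(x_baseline, w_baseline)
    pooled_sd = math.sqrt(
        (weighted_var(x_treated, w_treated) + weighted_var(x_baseline, w_baseline)) / 2
    )
    return abs(gap) / pooled_sd


# Hypothetical covariate values and inverse-propensity weights for two arms.
x1, w1 = [3.0, 4.0, 5.0], [2.0, 1.0, 1.0]
x0, w0 = [2.5, 4.0, 4.5], [1.0, 1.0, 2.0]
print(weighted_smd(x1, w1, x0, w0))
```

A widely used rule of thumb treats an SMD below 0.1 as acceptable balance; all values reported here are well under that level.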

Worst covariates by weighted SMD

Result

avg_sessions_week     0.057140
tenure_months         0.046042
urban_resident        0.028845
premium_user          0.024638
support_tickets_q     0.016798
spend_last_month      0.013420
discount_eligible     0.007124
credit_utilization    0.004336
dtype: float64

Result

{'overall_flag': 'GREEN', 'flags': {'balance_max_smd': 'GREEN', 'balance_violations': 'GREEN'}, 'overall_balance_pass': True}

Sensitivity analysis

Result

================== Bias-aware Interval ==================

------------------ Scenario ------------------
Significance Level: alpha=0.05
Null Hypothesis: H0=0.0
Sensitivity parameters: cf_y=0.010101010101010102; r2_d=[0.01 0.01], rho=[1. 1.], use_signed_rr=False

                theta        se  max_bias  max_bias_base  bound_width     sigma2       nu2
d_1 vs d_0  -1.225401  0.053151  0.071358       7.064466     0.071358  11.726175  4.256007
d_2 vs d_0   2.570377  0.109287  0.075791       7.503273     0.075791  11.726175  4.801148

            sampling_ci_l  sampling_ci_u    theta_l    theta_u  bias_aware_ci_l  bias_aware_ci_u        rv       rva
d_1 vs d_0      -1.329575      -1.121226  -1.296759  -1.154043        -1.400830        -1.048932  0.748664  0.713779
d_2 vs d_0       2.356178       2.784576   2.494587   2.646168         2.281413         2.861468  0.920747  0.907083

{'theta': array([-1.22540079, 2.57037715]),
 'se': array([0.05315128, 0.10928717]),
 'alpha': 0.05,
 'z': 1.959963984540054,
 'H0': 0.0,
 'sampling_ci': array([[-1.32957538, -1.1212262 ], [ 2.35617823, 2.78457606]]),
 'theta_bounds_cofounding': array([[-1.29675903, -1.15404255], [ 2.49458652, 2.64616778]]),
 'bias_aware_ci': array([[-1.40082993, -1.0489322 ], [ 2.28141319, 2.86146838]]),
 'max_bias_base': array([7.06446594, 7.50327258]),
 'max_bias': array([0.07135824, 0.07579063]),
 'bound_width': array([0.07135824, 0.07579063]),
 'sigma2': 11.726175012924882,
 'nu2': array([4.25600667, 4.8011478 ]),
 'rv': array([0.74866425, 0.92074748]),
 'rva': array([0.71377934, 0.90708265]),
 'contrast_labels': ['d_1 vs d_0', 'd_2 vs d_0'],
 'params': {'cf_y': 0.010101010101010102, 'r2_d': array([0.01, 0.01]), 'rho': array([1., 1.]), 'use_signed_rr': False}}
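
The reported max_bias_base, max_bias and theta bounds are numerically consistent with a formula often used in sensitivity analysis for debiased machine learning: max_bias = |rho| * sqrt(sigma2 * nu2) * sqrt(cf_y * r2_d / (1 - r2_d)), with bounds theta minus and plus max_bias. The sketch below checks that arithmetic against the printed values; it is a consistency check inferred from the numbers, not a copy of the causalis implementation, and the function name is hypothetical.

```python
import math


def max_bias(sigma2, nu2, cf_y, r2_d, rho=1.0):
    """Assumed bias bound: |rho| * sqrt(sigma2 * nu2) * sqrt(cf_y * cf_d)."""
    cf_d = r2_d / (1.0 - r2_d)              # treatment-side confounding strength
    base = math.sqrt(sigma2 * nu2)          # corresponds to max_bias_base
    return abs(rho) * base * math.sqrt(cf_y * cf_d)


# Values taken from the sensitivity output in this notebook.
sigma2 = 11.726175012924882
cf_y = 0.010101010101010102
for label, theta, nu2 in [
    ("d_1 vs d_0", -1.22540079, 4.25600667),
    ("d_2 vs d_0", 2.57037715, 4.8011478),
]:
    b = max_bias(sigma2, nu2, cf_y, r2_d=0.01)
    print(label, round(b, 6), round(theta - b, 6), round(theta + b, 6))
```

Note that the bias-aware interval is slightly narrower than the theta bounds widened by z * se with the reported se, so the library presumably applies a somewhat different standard error at the bounds.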

Result
            cf_y          r2_y          r2_d          rho  theta_long  theta_short  delta
d_1 vs d_0  8.091184e-08  8.091183e-08  4.653160e-06  1.0  -1.225401   -1.243423    0.018023
d_2 vs d_0  8.091184e-08  8.091183e-08  3.465145e-07  1.0  2.570377    2.500075     0.070302