Refutation flow for multi-treatment IRM
This notebook mirrors refutation_flow.ipynb, but for the
causalis.scenarios.multi_unconfoundedness scenario.
We estimate pairwise ATE contrasts against the baseline treatment d_0, then run:
- overlap diagnostics,
- score diagnostics,
- unconfoundedness balance checks,
- sensitivity analysis.
| | y | d_0 | d_1 | d_2 | tenure_months | avg_sessions_week | spend_last_month | premium_user | urban_resident | support_tickets_q | ... | m_obs_d_1 | tau_link_d_1 | m_d_2 | m_obs_d_2 | tau_link_d_2 | g_d_0 | g_d_1 | g_d_2 | cate_d_1 | cate_d_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.724081 | 1.0 | 0.0 | 0.0 | 27.656605 | 3.198667 | 89.609464 | 0.0 | 1.0 | 0.0 | ... | 0.245737 | -0.352005 | 0.220739 | 0.220739 | 0.494166 | 3.279384 | 2.306314 | 5.375338 | -0.973070 | 2.095954 |
| 1 | 0.658436 | 0.0 | 1.0 | 0.0 | 23.798386 | 3.362415 | 102.337236 | 0.0 | 0.0 | 3.0 | ... | 0.178640 | -0.307360 | 0.236830 | 0.236830 | 0.420278 | 2.807850 | 2.064853 | 4.274630 | -0.742997 | 1.466780 |
| 2 | 3.894951 | 0.0 | 1.0 | 0.0 | 28.425009 | 3.391819 | 102.660712 | 0.0 | 1.0 | 1.0 | ... | 0.209711 | -0.320189 | 0.218158 | 0.218158 | 0.502415 | 3.069919 | 2.228798 | 5.073677 | -0.841121 | 2.003758 |
| 3 | 2.363204 | 0.0 | 1.0 | 0.0 | 18.860066 | 4.071175 | 83.593417 | 0.0 | 0.0 | 2.0 | ... | 0.175985 | -0.316241 | 0.237508 | 0.237508 | 0.441677 | 2.716805 | 1.980234 | 4.225485 | -0.736571 | 1.508680 |
| 4 | 6.232463 | 0.0 | 0.0 | 1.0 | 17.853087 | 3.140075 | 79.209870 | 0.0 | 1.0 | 1.0 | ... | 0.231590 | -0.350130 | 0.246973 | 0.246973 | 0.493624 | 3.224354 | 2.271869 | 5.282273 | -0.952485 | 2.057919 |
5 rows × 26 columns
MultiCausalData(df=(20000, 12), treatment_names=['d_0', 'd_1', 'd_2'], control_treatment='d_0')outcome='y', confounders=['tenure_months', 'avg_sessions_week', 'spend_last_month', 'premium_user', 'urban_resident', 'support_tickets_q', 'discount_eligible', 'credit_utilization'], user_id=None,
Inference
| field | d_1 vs d_0 | d_2 vs d_0 |
|---|---|---|
| estimand | ATE | ATE |
| model | MultiTreatmentIRM | MultiTreatmentIRM |
| value | -1.2254 (ci_abs: -1.3296, -1.1212) | 2.5704 (ci_abs: 2.3562, 2.7846) |
| value_relative | -31.0949 (ci_rel: -33.3772, -28.8125) | 65.2240 (ci_rel: 59.3040, 71.1441) |
| alpha | 0.0500 | 0.0500 |
| p_value | 0.0000 | 0.0000 |
| is_significant | True | True |
| n_treated | 5003 | 4877 |
| n_control | 10120 | 10120 |
| treatment_mean | 2.9112 | 6.5755 |
| control_mean | 3.7633 | 3.7633 |
| time | 2026-02-22 | 2026-02-22 |
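The absolute confidence intervals in the table are consistent with a standard Wald interval, theta ± z · se. The sketch below is a minimal check under that assumption, using the point estimates and standard errors that appear later in this notebook's output (`theta`, `se`); the helper name `normal_ci` is hypothetical and is not part of causalis.

```python
from statistics import NormalDist


def normal_ci(theta, se, alpha=0.05):
    """Two-sided Wald interval: theta +/- z_{1 - alpha/2} * se."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return theta - z * se, theta + z * se


# Point estimates and standard errors copied from this notebook's output.
reported = [
    ("d_1 vs d_0", -1.225401, 0.053151),
    ("d_2 vs d_0", 2.570377, 0.109287),
]

for label, theta, se in reported:
    lower, upper = normal_ci(theta, se)
    print(f"{label}: ({lower:.4f}, {upper:.4f})")
```

Comparing the printed bounds with the `ci_abs` values is a quick way to confirm which standard error the report uses.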
Overlap
For multi-treatment IRM, overlap is checked pairwise: baseline d_0 vs each active treatment.
Key metrics are reported by comparison (d_0 vs d_1, d_0 vs d_2, ...).
| | comparison | metric | value | flag |
|---|---|---|---|---|
| 0 | d_0 vs d_1 | edge_0.01_below | 0.000066 | GREEN |
| 1 | d_0 vs d_1 | edge_0.01_above | 0.0 | GREEN |
| 2 | d_0 vs d_1 | KS | 0.100367 | GREEN |
| 3 | d_0 vs d_1 | AUC | 0.566421 | GREEN |
| 4 | d_0 vs d_1 | ESS_treated_ratio | 0.721938 | GREEN |
| 5 | d_0 vs d_1 | ESS_baseline_ratio | 0.896655 | GREEN |
| 6 | d_0 vs d_1 | clip_m_total | 0.000066 | GREEN |
| 7 | d_0 vs d_1 | overlap_pass | True | GREEN |
| 8 | d_0 vs d_2 | edge_0.01_below | 0.0 | GREEN |
| 9 | d_0 vs d_2 | edge_0.01_above | 0.0 | GREEN |
| 10 | d_0 vs d_2 | KS | 0.056139 | GREEN |
| 11 | d_0 vs d_2 | AUC | 0.533669 | GREEN |
| 12 | d_0 vs d_2 | ESS_treated_ratio | 0.814103 | GREEN |
| 13 | d_0 vs d_2 | ESS_baseline_ratio | 0.896655 | GREEN |
| 14 | d_0 vs d_2 | clip_m_total | 0.0 | GREEN |
| 15 | d_0 vs d_2 | overlap_pass | True | GREEN |
edge_0.01_below, edge_0.01_above
Share of pairwise propensity scores below 0.01 (edge_0.01_below) or above 0.99 (edge_0.01_above).
| | comparison | edge_0.01_below | edge_0.01_above | flag_edge_001 |
|---|---|---|---|---|
| 0 | d_0 vs d_1 | 0.000066 | 0.0 | GREEN |
| 1 | d_0 vs d_2 | 0.000000 | 0.0 | GREEN |
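A minimal sketch of how such edge shares could be computed. It assumes the pairwise propensity for treatment k against the baseline is m_k / (m_0 + m_k), which is a common convention but not confirmed by this notebook; the function names and the toy propensities are hypothetical.

```python
def pairwise_propensity(m_0, m_k):
    """Probability of arm k among units in {baseline, k} (assumed definition)."""
    return [mk / (m0 + mk) for m0, mk in zip(m_0, m_k)]


def edge_shares(scores, eps=0.01):
    """Fractions of scores strictly below eps and strictly above 1 - eps."""
    n = len(scores)
    below = sum(s < eps for s in scores) / n
    above = sum(s > 1 - eps for s in scores) / n
    return below, above


# Toy multi-arm propensities for four units (illustrative values only).
m_0 = [0.50, 0.60, 0.995, 0.40]
m_1 = [0.25, 0.20, 0.004, 0.30]

scores = pairwise_propensity(m_0, m_1)
below, above = edge_shares(scores)
print(f"edge_0.01_below={below:.3f}, edge_0.01_above={above:.3f}")
```

Shares close to zero, as in the table above, indicate that almost no unit has a near-deterministic assignment between the two arms.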
ks
Kolmogorov-Smirnov distance between pairwise score distributions.
| | comparison | ks | flag_ks |
|---|---|---|---|
| 0 | d_0 vs d_1 | 0.100367 | GREEN |
| 1 | d_0 vs d_2 | 0.056139 | GREEN |
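For reference, the two-sample Kolmogorov-Smirnov distance is the largest gap between the two empirical CDFs. The sketch below computes it with the standard library only; the function name and the toy scores are hypothetical, and whether the notebook applies the statistic to exactly these score vectors is an assumption.

```python
from bisect import bisect_right


def ks_distance(a, b):
    """Max absolute difference between the empirical CDFs of a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    n_a, n_b = len(a_sorted), len(b_sorted)
    # The supremum is attained at one of the pooled sample points.
    return max(
        abs(bisect_right(a_sorted, x) / n_a - bisect_right(b_sorted, x) / n_b)
        for x in a_sorted + b_sorted
    )


# Toy pairwise scores for baseline and active units (illustrative only).
baseline_scores = [0.28, 0.31, 0.33, 0.35, 0.40]
active_scores = [0.30, 0.34, 0.36, 0.41, 0.45]
print(f"KS = {ks_distance(baseline_scores, active_scores):.3f}")
```

A value of 0 means the two score distributions coincide; a value of 1 means they do not overlap at all.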
auc
AUC for separating baseline vs active treatment using pairwise propensity score.
| | comparison | auc | flag_auc |
|---|---|---|---|
| 0 | d_0 vs d_1 | 0.566421 | GREEN |
| 1 | d_0 vs d_2 | 0.533669 | GREEN |
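The AUC can be read as the probability that a randomly chosen active unit has a higher pairwise score than a randomly chosen baseline unit, with ties counted as one half. The sketch below uses that rank-based definition directly; it is quadratic in the sample size and meant only for illustration, and the function name and toy scores are hypothetical.

```python
def pairwise_auc(scores_active, scores_baseline):
    """P(score_active > score_baseline) + 0.5 * P(tie), by direct counting."""
    wins = 0.0
    for s_a in scores_active:
        for s_b in scores_baseline:
            if s_a > s_b:
                wins += 1.0
            elif s_a == s_b:
                wins += 0.5
    return wins / (len(scores_active) * len(scores_baseline))


# Toy scores (illustrative only).
active = [0.30, 0.34, 0.36, 0.41]
baseline = [0.28, 0.31, 0.35, 0.40]
print(f"AUC = {pairwise_auc(active, baseline):.3f}")
```

An AUC near 0.5, as reported above, means the score barely separates the two groups, which is what good overlap looks like.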
ess_ratio_treated, ess_ratio_baseline
Effective sample size ratios implied by inverse-propensity weights.
| | comparison | ess_ratio_treated | ess_ratio_baseline | flag_ess_treated | flag_ess_baseline |
|---|---|---|---|---|---|
| 0 | d_0 vs d_1 | 0.721938 | 0.896655 | GREEN | GREEN |
| 1 | d_0 vs d_2 | 0.814103 | 0.896655 | GREEN | GREEN |
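A minimal sketch of an effective-sample-size ratio, assuming the usual Kish formula ESS = (Σw)² / Σw² with inverse-propensity weights w = 1 / m for units in the arm of interest. The notebook does not state the exact formula, so treat this as an assumption; the function name and the toy propensities are hypothetical.

```python
def ess_ratio(weights):
    """Kish effective sample size divided by the raw number of units."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return (total * total / total_sq) / len(weights)


# Toy propensities of receiving the arm, for units observed in that arm.
propensities = [0.10, 0.50, 0.50, 0.90]
weights = [1.0 / m for m in propensities]
print(f"ESS ratio = {ess_ratio(weights):.3f}")
```

The ratio equals 1 when all weights are equal and falls toward 1/n when a single unit dominates, so values such as 0.72 and 0.90 indicate only moderate weight concentration.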
clip_m_total
Share of observations affected by propensity trimming in each comparison.
| | comparison | clip_m_total | flag_clip_m |
|---|---|---|---|
| 0 | d_0 vs d_1 | 0.000066 | GREEN |
| 1 | d_0 vs d_2 | 0.000000 | GREEN |
Overall overlap verdict
{'overall_flag': 'GREEN', 'all_comparisons_pass': True}
Score
Score diagnostics check the orthogonal moment conditions and the tail behavior of the influence values for each contrast against the baseline.
| | comparison | metric | value | flag |
|---|---|---|---|---|
| 0 | d_1 vs d_0 | se_plugin | 5.315128e-02 | NA |
| 1 | d_1 vs d_0 | psi_p99_over_med | 1.018200e+01 | YELLOW |
| 2 | d_1 vs d_0 | psi_kurtosis | 9.374657e+01 | RED |
| 3 | d_1 vs d_0 | max_\|t\|_gk | 8.303548e+00 | RED |
| 4 | d_1 vs d_0 | max_\|t\|_g0 | 6.044510e+00 | RED |
| 5 | d_1 vs d_0 | max_\|t\|_mk | 1.082170e+00 | RED |
| 6 | d_1 vs d_0 | max_\|t\|_m0 | 1.220473e+00 | RED |
| 7 | d_1 vs d_0 | max_\|t\| | 8.303548e+00 | RED |
| 8 | d_1 vs d_0 | oos_tstat_fold | -1.176496e-15 | GREEN |
| 9 | d_1 vs d_0 | oos_tstat_strict | -1.497106e-15 | GREEN |
| 10 | d_2 vs d_0 | se_plugin | 1.092872e-01 | NA |
| 11 | d_2 vs d_0 | psi_p99_over_med | 1.790495e+01 | YELLOW |
| 12 | d_2 vs d_0 | psi_kurtosis | 2.876548e+02 | RED |
| 13 | d_2 vs d_0 | max_\|t\|_gk | 8.270394e+00 | RED |
| 14 | d_2 vs d_0 | max_\|t\|_g0 | 6.044510e+00 | RED |
| 15 | d_2 vs d_0 | max_\|t\|_mk | 1.596762e+00 | RED |
| 16 | d_2 vs d_0 | max_\|t\|_m0 | 1.220473e+00 | RED |
| 17 | d_2 vs d_0 | max_\|t\| | 8.270394e+00 | RED |
| 18 | d_2 vs d_0 | oos_tstat_fold | -5.201603e-16 | GREEN |
| 19 | d_2 vs d_0 | oos_tstat_strict | -1.248197e-15 | GREEN |
psi_p99_over_med, psi_kurtosis
Tail diagnostics of influence values by comparison.
| | comparison | se_plugin | kurtosis | p99_over_med |
|---|---|---|---|---|
| 0 | d_1 vs d_0 | 0.053151 | 93.746570 | 10.182000 |
| 1 | d_2 vs d_0 | 0.109287 | 287.654766 | 17.904955 |
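A minimal sketch of the two tail metrics, under assumed definitions: kurtosis as the fourth central moment divided by the squared variance (not excess kurtosis), and `p99_over_med` as the 99th percentile of |psi| divided by the median of |psi|. The library may use different conventions; the function name and the toy data are hypothetical.

```python
from statistics import fmean, median, quantiles


def tail_diagnostics(psi):
    """Kurtosis and p99/median ratio of influence values (assumed definitions)."""
    mu = fmean(psi)
    m2 = fmean([(x - mu) ** 2 for x in psi])
    m4 = fmean([(x - mu) ** 4 for x in psi])
    abs_psi = [abs(x) for x in psi]
    # quantiles(..., n=100) returns 99 cut points; index 98 is the 99th percentile.
    p99 = quantiles(abs_psi, n=100, method="inclusive")[98]
    return {"kurtosis": m4 / m2 ** 2, "p99_over_med": p99 / median(abs_psi)}


# Toy influence values: a well-behaved bulk plus one extreme observation.
psi = [-1.0, 1.0] * 50 + [50.0]
print(tail_diagnostics(psi))
```

A single extreme value is enough to push the kurtosis far above the Gaussian reference of 3, which is the pattern the RED flags above point to.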
Top influential observations
| | comparison | i | psi | m_k | residual_k | residual_0 |
|---|---|---|---|---|---|---|
| 0 | d_1 vs d_0 | 2387 | -206.584670 | 0.010101 | -2.105333 | -1.495193 |
| 1 | d_1 vs d_0 | 13045 | 188.047220 | 0.014999 | 2.817467 | 1.790805 |
| 2 | d_1 vs d_0 | 7609 | 166.136647 | 0.059897 | 9.972243 | 8.393727 |
| 3 | d_1 vs d_0 | 18521 | -150.717871 | 0.451285 | 21.968336 | 19.422518 |
| 4 | d_1 vs d_0 | 5145 | 140.704496 | 0.030226 | 4.271055 | 2.447766 |
| 5 | d_1 vs d_0 | 15110 | 112.291193 | 0.128562 | 14.566815 | 12.326445 |
| 6 | d_1 vs d_0 | 13471 | 103.747853 | 0.186516 | 20.086296 | 14.916909 |
| 7 | d_1 vs d_0 | 6257 | -88.685294 | 0.412085 | 23.849319 | 21.475964 |
| 8 | d_1 vs d_0 | 7117 | -81.913914 | 0.524250 | 14.666899 | 15.073730 |
| 9 | d_1 vs d_0 | 10254 | -78.445073 | 0.328219 | 17.977090 | 17.290261 |
| 10 | d_2 vs d_0 | 18180 | 596.534557 | 0.041395 | 24.446368 | 32.983599 |
| 11 | d_2 vs d_0 | 13177 | 528.114246 | 0.125323 | 66.057598 | 69.642200 |
| 12 | d_2 vs d_0 | 10961 | 459.450176 | 0.096902 | 44.837277 | 44.149535 |
| 13 | d_2 vs d_0 | 16985 | -385.827564 | 0.015946 | -6.193279 | -1.063880 |
| 14 | d_2 vs d_0 | 6952 | 327.543647 | 0.039154 | 12.682247 | 18.885766 |
| 15 | d_2 vs d_0 | 2193 | 320.003662 | 0.110369 | 35.553554 | 35.993882 |
| 16 | d_2 vs d_0 | 12001 | 310.479698 | 0.026207 | 8.121555 | 11.274115 |
| 17 | d_2 vs d_0 | 9928 | 235.719287 | 0.048578 | 11.526824 | 12.532611 |
| 18 | d_2 vs d_0 | 17424 | 206.428142 | 0.173928 | 35.439848 | 40.677004 |
| 19 | d_2 vs d_0 | 15506 | 195.400476 | 0.112676 | 21.971006 | 24.949844 |
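The columns above suggest that each influence value combines outcome-model predictions with inverse-propensity-weighted residuals. A minimal sketch, assuming the textbook AIPW score for a contrast of arm k against the baseline; this is not necessarily the exact score used by MultiTreatmentIRM, and all names and numbers below are hypothetical.

```python
def aipw_contrast_score(y, d, g_k, g_0, m_k, m_0, k, theta, baseline=0):
    """AIPW score for arm k vs. baseline for a single unit (assumed form).

    y: observed outcome, d: observed arm index,
    g_k, g_0: predicted outcomes under arm k and under the baseline,
    m_k, m_0: propensities of arm k and of the baseline,
    theta: the contrast estimate being centered on.
    """
    ipw_k = (y - g_k) / m_k if d == k else 0.0
    ipw_0 = (y - g_0) / m_0 if d == baseline else 0.0
    return g_k - g_0 + ipw_k - ipw_0 - theta


# A unit observed in arm 1 with a small propensity m_k: a modest residual
# is inflated by the factor 1 / m_k and dominates the score.
score = aipw_contrast_score(
    y=0.5, d=1, g_k=2.5, g_0=3.0, m_k=0.01, m_0=0.6, k=1, theta=-1.2
)
print(f"psi = {score:.2f}")
```

Under this form, small `m_k` and large residuals both inflate |psi|, which matches the two kinds of rows at the top of the table: tiny propensities with moderate residuals, and moderate propensities with very large residuals.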
max_|t|_gk, max_|t|_g0, max_|t|_mk, max_|t|_m0
Orthogonality derivative checks by comparison.
| | comparison | max_\|t\|_gk | max_\|t\|_g0 | max_\|t\|_mk | max_\|t\|_m0 | max_\|t\| |
|---|---|---|---|---|---|---|
| 0 | d_1 vs d_0 | 8.303548 | 6.04451 | 1.082170 | 1.220473 | 8.303548 |
| 1 | d_2 vs d_0 | 8.270394 | 6.04451 | 1.596762 | 1.220473 | 8.270394 |
oos_tstat_fold, oos_tstat_strict
Out-of-sample moment tests.
| | comparison | oos_tstat_fold | oos_tstat_strict | p_value_fold | p_value_strict |
|---|---|---|---|---|---|
| 0 | d_1 vs d_0 | -1.176496e-15 | -1.497106e-15 | 1.0 | 1.0 |
| 1 | d_2 vs d_0 | -5.201603e-16 | -1.248197e-15 | 1.0 | 1.0 |
{'overall_flag': 'RED',
'flags': {'psi_tail_ratio': 'YELLOW',
'psi_kurtosis': 'RED',
'ortho_max_|t|': 'RED',
'oos_moment': 'GREEN',
'ortho_max_|t|_gk': 'RED',
'ortho_max_|t|_g0': 'RED',
'ortho_max_|t|_mk': 'GREEN',
'ortho_max_|t|_m0': 'GREEN'},
'flags_by_comparison':    comparison  psi_tail_ratio  psi_kurtosis  ortho_max_|t|  oos_moment  overall_flag
0  d_1 vs d_0  YELLOW  RED  RED  GREEN  RED
1  d_2 vs d_0  YELLOW  RED  RED  GREEN  RED}
SUTVA
1. Are your clients independent, so that one client's outcome does not depend on the others?
2. Do all clients have a full window over which metrics are measured?
3. Do you measure confounders before treatment and the outcome after it?
4. Is the treatment label consistent, so that, for example, a person who does not receive a treatment has label 0?
Unconfoundedness
| | comparison | metric | value | flag |
|---|---|---|---|---|
| 0 | d_0 vs d_1 | balance_max_smd | 0.05714 | GREEN |
| 1 | d_0 vs d_1 | balance_frac_violations | 0.0 | GREEN |
| 2 | d_0 vs d_1 | balance_pass | True | GREEN |
| 3 | d_0 vs d_2 | balance_max_smd | 0.022568 | GREEN |
| 4 | d_0 vs d_2 | balance_frac_violations | 0.0 | GREEN |
| 5 | d_0 vs d_2 | balance_pass | True | GREEN |
| 6 | overall | balance_max_smd | 0.05714 | GREEN |
| 7 | overall | balance_frac_violations | 0.0 | GREEN |
| 8 | overall | balance_pass | True | GREEN |
balance_max_smd, balance_frac_violations
Weighted SMD checks for each baseline-vs-treatment comparison.
| | comparison | smd_max | frac_violations | pass | flag_max_smd | flag_violations | overall_flag |
|---|---|---|---|---|---|---|---|
| 0 | d_0 vs d_1 | 0.057140 | 0.0 | True | GREEN | GREEN | GREEN |
| 1 | d_0 vs d_2 | 0.022568 | 0.0 | True | GREEN | GREEN | GREEN |
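A minimal sketch of a weighted standardized mean difference for one covariate, assuming weighted means and variances within each group and a pooled standard deviation of sqrt((var_a + var_b) / 2). The weighting scheme and pooling rule used by the library are not stated in this notebook, so both are assumptions; the function names and the toy data are hypothetical.

```python
from math import sqrt


def weighted_mean(x, w):
    return sum(wi * xi for xi, wi in zip(x, w)) / sum(w)


def weighted_var(x, w):
    mu = weighted_mean(x, w)
    return sum(wi * (xi - mu) ** 2 for xi, wi in zip(x, w)) / sum(w)


def weighted_smd(x_a, w_a, x_b, w_b):
    """Absolute weighted mean difference scaled by the pooled weighted SD."""
    diff = abs(weighted_mean(x_a, w_a) - weighted_mean(x_b, w_b))
    pooled_sd = sqrt((weighted_var(x_a, w_a) + weighted_var(x_b, w_b)) / 2)
    if pooled_sd == 0.0:
        # Degenerate covariate: no spread in either group.
        return 0.0 if diff == 0.0 else float("inf")
    return diff / pooled_sd


# Toy covariate values and inverse-propensity weights (illustrative only).
x_base, w_base = [3.1, 3.4, 2.9, 3.6], [1.5, 1.8, 1.4, 2.0]
x_act, w_act = [3.2, 3.5, 3.0, 3.3], [4.0, 3.5, 4.2, 3.8]
print(f"weighted SMD = {weighted_smd(x_base, w_base, x_act, w_act):.3f}")
```

Taking the maximum of this quantity over all confounders gives a statistic comparable in spirit to `smd_max` in the table above.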
Worst covariates by weighted SMD
| covariate | weighted SMD |
|---|---|
| avg_sessions_week | 0.057140 |
| tenure_months | 0.046042 |
| urban_resident | 0.028845 |
| premium_user | 0.024638 |
| support_tickets_q | 0.016798 |
| spend_last_month | 0.013420 |
| discount_eligible | 0.007124 |
| credit_utilization | 0.004336 |
{'overall_flag': 'GREEN', 'flags': {'balance_max_smd': 'GREEN', 'balance_violations': 'GREEN'}, 'overall_balance_pass': True}
Sensitivity analysis
================== Bias-aware Interval ==================
------------------ Scenario ------------------
Significance Level: alpha=0.05
Null Hypothesis: H0=0.0
Sensitivity parameters: cf_y=0.010101010101010102; r2_d=[0.01 0.01], rho=[1. 1.], use_signed_rr=False

| | theta | se | max_bias | max_bias_base | bound_width | sigma2 | nu2 | sampling_ci_l | sampling_ci_u | theta_l | theta_u | bias_aware_ci_l | bias_aware_ci_u | rv | rva |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| d_1 vs d_0 | -1.225401 | 0.053151 | 0.071358 | 7.064466 | 0.071358 | 11.726175 | 4.256007 | -1.329575 | -1.121226 | -1.296759 | -1.154043 | -1.400830 | -1.048932 | 0.748664 | 0.713779 |
| d_2 vs d_0 | 2.570377 | 0.109287 | 0.075791 | 7.503273 | 0.075791 | 11.726175 | 4.801148 | 2.356178 | 2.784576 | 2.494587 | 2.646168 | 2.281413 | 2.861468 | 0.920747 | 0.907083 |
{'theta': array([-1.22540079, 2.57037715]),
'se': array([0.05315128, 0.10928717]),
'alpha': 0.05,
'z': 1.959963984540054,
'H0': 0.0,
'sampling_ci': array([[-1.32957538, -1.1212262 ], [ 2.35617823, 2.78457606]]),
'theta_bounds_cofounding': array([[-1.29675903, -1.15404255], [ 2.49458652, 2.64616778]]),
'bias_aware_ci': array([[-1.40082993, -1.0489322 ], [ 2.28141319, 2.86146838]]),
'max_bias_base': array([7.06446594, 7.50327258]),
'max_bias': array([0.07135824, 0.07579063]),
'bound_width': array([0.07135824, 0.07579063]),
'sigma2': 11.726175012924882,
'nu2': array([4.25600667, 4.8011478 ]),
'rv': array([0.74866425, 0.92074748]),
'rva': array([0.71377934, 0.90708265]),
'contrast_labels': ['d_1 vs d_0', 'd_2 vs d_0'],
'params': {'cf_y': 0.010101010101010102, 'r2_d': array([0.01, 0.01]), 'rho': array([1., 1.]), 'use_signed_rr': False}}
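The reported `max_bias_base`, `max_bias`, and theta bounds are numerically consistent with the relations max_bias_base = sqrt(sigma2 · nu2), max_bias = |rho| · max_bias_base · sqrt(cf_y · r2_d / (1 − r2_d)), and theta_l, theta_u = theta ∓ max_bias. The sketch below checks this arithmetic against the numbers printed above. That the library computes the quantities in exactly this way is an assumption inferred from the output, and the function name is hypothetical.

```python
from math import sqrt


def confounding_bounds(theta, sigma2, nu2, cf_y, r2_d, rho=1.0):
    """Bias bound and point-estimate bounds under the assumed relations."""
    cf_d = r2_d / (1.0 - r2_d)          # treatment-side confounding strength
    max_bias_base = sqrt(sigma2 * nu2)  # scale of the worst-case bias
    max_bias = abs(rho) * max_bias_base * sqrt(cf_y * cf_d)
    return max_bias_base, max_bias, theta - max_bias, theta + max_bias


# Inputs copied from the sensitivity output above.
sigma2 = 11.726175012924882
cf_y = 0.010101010101010102
contrasts = {
    "d_1 vs d_0": {"theta": -1.22540079, "nu2": 4.25600667},
    "d_2 vs d_0": {"theta": 2.57037715, "nu2": 4.8011478},
}

for label, c in contrasts.items():
    base, bias, lower, upper = confounding_bounds(
        c["theta"], sigma2, c["nu2"], cf_y, r2_d=0.01
    )
    print(f"{label}: base={base:.6f} max_bias={bias:.6f} "
          f"theta_l={lower:.6f} theta_u={upper:.6f}")
```

The bias-aware confidence interval additionally widens these bounds for sampling uncertainty; that step is not reproduced here because its exact standard-error adjustment is not shown in the output.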
| | cf_y | r2_y | r2_d | rho | theta_long | theta_short | delta |
|---|---|---|---|---|---|---|---|
| d_1 vs d_0 | 8.091184e-08 | 8.091183e-08 | 4.653160e-06 | 1.0 | -1.225401 | -1.243423 | 0.018023 |
| d_2 vs d_0 | 8.091184e-08 | 8.091183e-08 | 3.465145e-07 | 1.0 | 2.570377 | 2.500075 | 0.070302 |