Multi Unconfoundedness
Multi Unconfoundedness is an observational (quasi-experimental) causal inference scenario with multiple treatments. Treatments are not randomly assigned, so we need to control for confounders to estimate causal effects.
Data
In this study we estimate the effect of different gamification mechanics. The outcome is the number of sessions per month.
Treatments:
- d_0: No mechanics used
- d_1: Used first set of mechanics
- d_2: Used second set of mechanics
The data-generating process (DGP) comes from Causalis. Read more at https://causalis.causalcraft.com/articles/generate_multitreatment_gamma_26
| | y | d_0 | d_1 | d_2 | tenure_months | avg_sessions_week | spend_last_month | premium_user | urban_resident | support_tickets_q | ... | m_obs_d_1 | tau_link_d_1 | m_d_2 | m_obs_d_2 | tau_link_d_2 | g_d_0 | g_d_1 | g_d_2 | cate_d_1 | cate_d_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.422769 | 1.0 | 0.0 | 0.0 | 27.656605 | 3.198667 | 89.609464 | 0.0 | 1.0 | 0.0 | ... | 0.246687 | -0.352005 | 0.220781 | 0.220781 | 0.494166 | 3.279384 | 2.306314 | 5.375338 | -0.973070 | 2.095954 |
| 1 | 7.566231 | 1.0 | 0.0 | 0.0 | 23.798386 | 3.362415 | 102.337236 | 0.0 | 0.0 | 3.0 | ... | 0.179393 | -0.307360 | 0.236958 | 0.236958 | 0.420278 | 2.807850 | 2.064853 | 4.274630 | -0.742997 | 1.466780 |
| 2 | 1.702662 | 0.0 | 0.0 | 1.0 | 28.425009 | 3.391819 | 102.660712 | 0.0 | 1.0 | 1.0 | ... | 0.210566 | -0.320189 | 0.218245 | 0.218245 | 0.502415 | 3.069919 | 2.228798 | 5.073677 | -0.841121 | 2.003758 |
| 3 | 1.827530 | 1.0 | 0.0 | 0.0 | 18.860066 | 4.071175 | 83.593417 | 0.0 | 0.0 | 2.0 | ... | 0.176729 | -0.316241 | 0.237639 | 0.237639 | 0.441677 | 2.716805 | 1.980234 | 4.225485 | -0.736571 | 1.508680 |
| 4 | 1.429843 | 0.0 | 1.0 | 0.0 | 17.853087 | 3.140075 | 79.209870 | 0.0 | 1.0 | 1.0 | ... | 0.232492 | -0.350130 | 0.247027 | 0.247027 | 0.493624 | 3.224354 | 2.271869 | 5.282273 | -0.952485 | 2.057919 |
5 rows × 26 columns
Ground truth ATE for d_1 vs d_0 is -1.1950325692907122

Ground truth ATE for d_2 vs d_0 is 2.530398527003894
MultiCausalData(df=(100000, 12), treatment_names=['d_0', 'd_1', 'd_2'], control_treatment='d_0', outcome='y', confounders=['tenure_months', 'avg_sessions_week', 'spend_last_month', 'premium_user', 'urban_resident', 'support_tickets_q', 'discount_eligible', 'credit_utilization'], user_id=None)
EDA
| | treatment | count | mean | std | min | p10 | p25 | median | p75 | p90 | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | d_0 | 50115 | 3.758417 | 3.106725 | 0.015427 | 0.887906 | 1.626326 | 2.937863 | 4.957415 | 7.577785 | 50.239323 |
| 1 | d_2 | 25008 | 6.541717 | 5.539708 | 0.043125 | 1.512610 | 2.775637 | 5.102611 | 8.584913 | 13.348761 | 79.125235 |
| 2 | d_1 | 24877 | 2.980817 | 2.412763 | 0.009022 | 0.711997 | 1.306774 | 2.352234 | 3.946463 | 5.985070 | 25.169272 |


| | treatment | n | outlier_count | outlier_rate | lower_bound | upper_bound | has_outliers | method | tail |
|---|---|---|---|---|---|---|---|---|---|
| 0 | d_0 | 50115 | 2288 | 0.045655 | -3.370308 | 9.954048 | True | iqr | both |
| 1 | d_2 | 25008 | 1173 | 0.046905 | -5.938277 | 17.298826 | True | iqr | both |
| 2 | d_1 | 24877 | 1067 | 0.042891 | -2.652760 | 7.905997 | True | iqr | both |
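The outcome distribution has a heavy right tail in every arm: all lower bounds are negative while the observed minima are positive, so every flagged outlier lies above the upper bound. We do not trim these observations, because highly active users are the ones who bring in the most revenue.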
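The bounds in the outlier table are consistent with the usual Tukey rule, which flags values more than 1.5 IQR beyond the quartiles. The sketch below recomputes them from the quartiles in the summary table. The helper name `iqr_bounds` and the 1.5 multiplier are assumptions made for illustration; this is not the Causalis API.

```python
def iqr_bounds(p25: float, p75: float, k: float = 1.5) -> tuple[float, float]:
    """Tukey fences: values outside [p25 - k*IQR, p75 + k*IQR] are flagged."""
    iqr = p75 - p25
    return p25 - k * iqr, p75 + k * iqr


# (p25, p75) copied from the per-arm summary table above.
quartiles = {
    "d_0": (1.626326, 4.957415),
    "d_1": (1.306774, 3.946463),
    "d_2": (2.775637, 8.584913),
}

bounds = {arm: iqr_bounds(*q) for arm, q in quartiles.items()}
for arm, (lower, upper) in bounds.items():
    print(f"{arm}: lower_bound={lower:.6f}, upper_bound={upper:.6f}")
```

The recomputed values agree with the `lower_bound` and `upper_bound` columns up to rounding of the quartiles.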
| | confounders | mean_d_0 | mean_d_1 | abs_diff | smd | ks_pvalue |
|---|---|---|---|---|---|---|
| 0 | avg_sessions_week | 4.827957 | 5.330050 | 0.502092 | 0.253187 | 0.00000 |
| 1 | premium_user | 0.217440 | 0.296861 | 0.079421 | 0.182466 | 0.00000 |
| 2 | tenure_months | 23.672462 | 25.752063 | 2.079601 | 0.176703 | 0.00000 |
| 3 | spend_last_month | 82.894719 | 96.062898 | 13.168180 | 0.149709 | 0.00000 |
| 4 | discount_eligible | 0.325870 | 0.395626 | 0.069756 | 0.145642 | 0.00000 |
| 5 | urban_resident | 0.585912 | 0.638421 | 0.052509 | 0.107919 | 0.00000 |
| 6 | support_tickets_q | 1.478140 | 1.492302 | 0.014162 | 0.011558 | 0.47358 |
| 7 | credit_utilization | 0.449627 | 0.448996 | 0.000632 | -0.005811 | 0.86692 |
| | confounders | mean_d_0 | mean_d_1 | abs_diff | smd | ks_pvalue |
|---|---|---|---|---|---|---|
| 0 | premium_user | 0.217440 | 0.274072 | 0.056632 | 0.131823 | 0.00000 |
| 1 | avg_sessions_week | 4.827957 | 5.059494 | 0.231536 | 0.116747 | 0.00000 |
| 2 | spend_last_month | 82.894719 | 89.334021 | 6.439302 | 0.076205 | 0.00000 |
| 3 | support_tickets_q | 1.478140 | 1.569378 | 0.091238 | 0.073883 | 0.00000 |
| 4 | discount_eligible | 0.325870 | 0.356006 | 0.030136 | 0.063605 | 0.00000 |
| 5 | urban_resident | 0.585912 | 0.604687 | 0.018774 | 0.038256 | 0.00002 |
| 6 | tenure_months | 23.672462 | 23.391337 | 0.281125 | -0.024131 | 0.00373 |
| 7 | credit_utilization | 0.449627 | 0.451855 | 0.002228 | 0.020493 | 0.02836 |
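Several confounders are imbalanced across arms (absolute SMD up to 0.25, with KS p-values near zero), so a naive comparison of arm means would be biased by confounding.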
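For binary covariates the SMD can be recomputed from the reported means alone, because the variance of a 0/1 variable is p(1 - p). The sketch below does this for `premium_user` and `urban_resident` from the first balance table. It assumes the pooled standard deviation is the square root of the average of the two arm variances; that definition is an assumption that happens to reproduce the reported values, and `smd_binary` is a hypothetical helper, not a Causalis function.

```python
import math


def smd_binary(p_treated: float, p_control: float) -> float:
    """SMD for a 0/1 covariate, using var = p * (1 - p) in each arm."""
    var_t = p_treated * (1.0 - p_treated)
    var_c = p_control * (1.0 - p_control)
    pooled_sd = math.sqrt((var_t + var_c) / 2.0)
    return (p_treated - p_control) / pooled_sd


# (treated-arm mean, mean_d_0) copied from the first balance table.
smd_premium = smd_binary(0.296861, 0.217440)
smd_urban = smd_binary(0.638421, 0.585912)
print(f"premium_user SMD:   {smd_premium:.4f}")
print(f"urban_resident SMD: {smd_urban:.4f}")
```

A common rule of thumb treats an absolute SMD above 0.1 as meaningful imbalance, which both covariates exceed here.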
Inference
Explanation of MultiTreatmentIRM
0) Assumptions
- SUTVA / consistency: no interference, no hidden treatment versions, and observed outcome equals the potential outcome under realized arm.
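- Multi-arm unconfoundedness:
$$(Y(0), Y(1), \dots, Y(K-1)) \perp D \mid X,$$
where $D = (D_0, \dots, D_{K-1})$ is one-hot with baseline arm $k = 0$.
- Positivity / overlap:
$$m_k(x) = \mathbb{P}(D_k = 1 \mid X = x) > 0 \quad \text{for every arm } k \text{ and almost every } x.$$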
In practice, the implementation enforces stability with propensity trimming.
1) Data and estimand
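For each unit $i$ we observe $(Y_i, D_i, X_i)$ with one-hot $D_i \in \{0, 1\}^K$ and $\sum_{k=0}^{K-1} D_{ik} = 1$.
MultiTreatmentIRM estimates vector contrasts against baseline:
$$\theta_k = \mathbb{E}[Y(k) - Y(0)], \qquad k = 1, \dots, K-1.$$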
So outputs are pairwise ATEs such as d_1 vs d_0, d_2 vs d_0.
2) Nuisance functions
For each arm $k$:
$$g_k(X) = \mathbb{E}[Y \mid X, D_k = 1], \qquad m_k(X) = \mathbb{P}(D_k = 1 \mid X).$$
These are estimated out-of-fold by cross-fitting (to reduce overfitting bias in final moments).
3) Cross-fitting logic
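With folds $I_1, \dots, I_F$, where $I_f^c$ denotes the rows outside fold $I_f$:
- Train multiclass propensity model on $I_f^c$, predict on $I_f$.
- For each arm $k$, train outcome model on rows in $I_f^c$ where $D_k = 1$, predict on $I_f$.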
- Repeat for all folds and stitch predictions.
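The fold logic can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not the MultiTreatmentIRM implementation: the learners (`LogisticRegression`, `LinearRegression`), the function name `cross_fit`, and the synthetic data are all chosen here for illustration, and the treatment is passed as integer arm labels rather than one-hot columns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import KFold


def cross_fit(X, y, d, n_arms, n_folds=5, seed=0):
    """Out-of-fold propensities m_hat (n, K) and outcome predictions g_hat (n, K).

    d holds integer arm labels 0..K-1. Assumes every arm appears in every
    training split, so predict_proba columns line up with arm indices.
    """
    n = len(y)
    m_hat = np.zeros((n, n_arms))
    g_hat = np.zeros((n, n_arms))
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X):
        # Multiclass propensity model: fit off-fold, predict on the held-out fold.
        prop = LogisticRegression(max_iter=1000).fit(X[train_idx], d[train_idx])
        m_hat[test_idx] = prop.predict_proba(X[test_idx])
        # One outcome model per arm, fit only on off-fold rows that received that arm.
        for k in range(n_arms):
            rows = train_idx[d[train_idx] == k]
            reg = LinearRegression().fit(X[rows], y[rows])
            g_hat[test_idx, k] = reg.predict(X[test_idx])
    return m_hat, g_hat


# Synthetic data with three arms and true effects -1 (arm 1) and +2 (arm 2).
rng = np.random.default_rng(0)
n, K = 3000, 3
X = rng.normal(size=(n, 2))
logits = np.column_stack([np.zeros(n), 0.5 * X[:, 0], -0.5 * X[:, 1]])
p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
# Inverse-CDF sampling of one arm label per row.
d = (rng.random(n)[:, None] > np.cumsum(p, axis=1)[:, :-1]).sum(axis=1)
y = X[:, 0] + np.array([0.0, -1.0, 2.0])[d] + rng.normal(size=n)

m_hat, g_hat = cross_fit(X, y, d, n_arms=K)
print(m_hat[:3].round(3))
print((g_hat[:, 1:] - g_hat[:, [0]]).mean(axis=0).round(3))
```

Every row of `m_hat` and `g_hat` is predicted by models that never saw that row, which is the property the orthogonal score relies on.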
4) Multiclass trimming
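Predicted propensities are stabilized by lower-bound trimming and row renormalization, with trimming threshold $\varepsilon$:
$$\tilde m_k(X_i) = \max\{\hat m_k(X_i), \varepsilon\}, \qquad \hat m_k^{\mathrm{trim}}(X_i) = \frac{\tilde m_k(X_i)}{\sum_{j=0}^{K-1} \tilde m_j(X_i)}.$$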
This keeps each row on the probability simplex and avoids exploding IPW weights.
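A minimal NumPy sketch of the trimming step follows. The function name `trim_and_renormalize` and the threshold `eps=0.01` are assumptions for illustration; the threshold actually used by the library is not stated here.

```python
import numpy as np


def trim_and_renormalize(m_hat, eps=0.01):
    """Clip propensities from below at eps, then rescale each row to sum to 1."""
    clipped = np.maximum(np.asarray(m_hat, dtype=float), eps)
    return clipped / clipped.sum(axis=1, keepdims=True)


m_hat = np.array([
    [0.001, 0.499, 0.500],  # arm 0 is nearly impossible for this unit
    [0.300, 0.300, 0.400],  # nothing below eps, so the row is unchanged
])
m_trim = trim_and_renormalize(m_hat, eps=0.01)

print(m_trim.round(5))
# Largest inverse-propensity weight for arm 0, before and after trimming.
print(1.0 / m_hat[0, 0], 1.0 / m_trim[0, 0])
```

Because rows are rescaled after clipping, a trimmed entry can end up slightly below `eps`; the point is that the inverse weight stays bounded, not that the floor is exact.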
5) Orthogonal score for each contrast
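Define residuals and IPW representers:
$$u_k = Y - \hat g_k(X), \qquad h_k = \frac{D_k}{\hat m_k(X)}.$$
(If normalize_ipw=True, $h_k$ is column-normalized in Hajek style.)
For each active arm $k$ vs baseline $0$:
$$\psi_k(W; \theta_k) = \hat g_k(X) - \hat g_0(X) + u_k h_k - u_0 h_0 - \theta_k.$$
Moment condition:
$$\mathbb{E}[\psi_k(W; \theta_k)] = 0 \quad \Longrightarrow \quad \hat\theta_k = \frac{1}{n} \sum_{i=1}^{n} \left[ \hat g_k(X_i) - \hat g_0(X_i) + u_{k,i} h_{k,i} - u_{0,i} h_{0,i} \right].$$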
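A minimal sketch of the doubly robust (AIPW) score for each contrast, written in NumPy. It assumes the nuisance predictions are already available as `(n, K)` arrays; the function name `aipw_contrasts` is hypothetical, and the check below uses oracle nuisances on synthetic data rather than fitted models.

```python
import numpy as np


def aipw_contrasts(y, d_onehot, g_hat, m_hat, normalize_ipw=False):
    """Return per-unit signals (n, K-1) and estimates (K-1,) for arms 1..K-1 vs arm 0."""
    u = y[:, None] - g_hat                 # residuals u_k = Y - g_k(X)
    h = d_onehot / m_hat                   # IPW representers h_k = D_k / m_k(X)
    if normalize_ipw:
        h = h / h.mean(axis=0, keepdims=True)  # Hajek-style column normalization
    signal = g_hat + u * h                 # orthogonal signal for E[Y(k)]
    psi_b = signal[:, 1:] - signal[:, [0]]
    return psi_b, psi_b.mean(axis=0)


# Synthetic check with true effects -1 (arm 1) and +2 (arm 2).
rng = np.random.default_rng(0)
n, K = 20000, 3
x = rng.normal(size=n)
logits = np.column_stack([np.zeros(n), 0.5 * x, -0.5 * x])
m_true = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
# Inverse-CDF sampling of one arm label per row.
d = (rng.random(n)[:, None] > np.cumsum(m_true, axis=1)[:, :-1]).sum(axis=1)
tau = np.array([0.0, -1.0, 2.0])
y = x + tau[d] + rng.normal(size=n)
g_true = x[:, None] + tau[None, :]

psi_b, theta_hat = aipw_contrasts(y, np.eye(K)[d], g_true, m_true)
se = psi_b.std(axis=0, ddof=1) / np.sqrt(n)
print(theta_hat.round(3), se.round(3))
```

The score is doubly robust: the estimate stays consistent if either the outcome models or the propensity model is correct, which is why both nuisances are fitted.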
6) Inference
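Influence function per contrast (the orthogonal score from step 5 evaluated at the estimate):
$$\hat\psi_{k,i} = \hat g_k(X_i) - \hat g_0(X_i) + \frac{D_{ik}\,(Y_i - \hat g_k(X_i))}{\hat m_k(X_i)} - \frac{D_{i0}\,(Y_i - \hat g_0(X_i))}{\hat m_0(X_i)} - \hat\theta_k.$$
Then
$$\widehat{\mathrm{se}}_k = \sqrt{\frac{1}{n^2} \sum_{i=1}^{n} \hat\psi_{k,i}^2}$$
with Wald CI
$$\hat\theta_k \pm z_{1-\alpha/2}\, \widehat{\mathrm{se}}_k.$$
P-values use the normal approximation; the significance flag is Bonferroni-adjusted across contrasts.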
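The Wald intervals can be reproduced from the point estimates and standard errors that appear in the sensitivity output further down (`theta` and `se`). The sketch uses only the standard library. The Bonferroni rule shown, comparing each p-value with `alpha` divided by the number of contrasts, is an assumption about how the significance flag is computed.

```python
from statistics import NormalDist

# Point estimates and standard errors as reported in the sensitivity output.
theta = [-1.18321853, 2.52979027]
se = [0.02044409, 0.03741762]
alpha = 0.05

norm = NormalDist()
z_crit = norm.inv_cdf(1 - alpha / 2)  # two-sided critical value

results = []
for t, s in zip(theta, se):
    ci = (t - z_crit * s, t + z_crit * s)
    p_value = 2 * (1 - norm.cdf(abs(t / s)))
    # Assumed Bonferroni rule: split alpha evenly across the contrasts.
    significant = p_value < alpha / len(theta)
    results.append((ci, p_value, significant))

for (ci, p_value, significant) in results:
    print(f"CI=({ci[0]:.4f}, {ci[1]:.4f}), p={p_value:.4f}, significant={significant}")
```

The intervals match the reported `sampling_ci` values, and both contrasts remain significant at the adjusted level of 0.025.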
7) Relative effect reported by the model
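Baseline mean is estimated via orthogonal signal:
$$\hat\mu_0 = \frac{1}{n} \sum_{i=1}^{n} \left[ \hat g_0(X_i) + \frac{D_{i0}\,(Y_i - \hat g_0(X_i))}{\hat m_0(X_i)} \right].$$
Relative effect (%):
$$\hat\tau_k^{\mathrm{rel}} = 100 \cdot \frac{\hat\theta_k}{\hat\mu_0},$$
with CI from delta-method variance (as implemented in model.py).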
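As a quick consistency check, the relative-effect formula can be inverted to recover the baseline mean implied by the reported `value` and `value_relative` rows of the results table. The helper `relative_effect` is hypothetical, and the implied baseline is derived arithmetic, not a number reported by the model.

```python
def relative_effect(theta: float, mu_0: float) -> float:
    """Relative effect in percent: 100 * theta / mu_0."""
    return 100.0 * theta / mu_0


# (absolute ATE, relative effect in %) as reported for each contrast.
reported = {
    "d_1 vs d_0": (-1.1832, -29.9638),
    "d_2 vs d_0": (2.5298, 64.0643),
}

# Invert the formula to recover the baseline mean each row implies.
implied_mu_0 = {name: 100.0 * theta / rel for name, (theta, rel) in reported.items()}
for name, mu_0 in implied_mu_0.items():
    theta, _ = reported[name]
    print(f"{name}: implied mu_0={mu_0:.4f}, relative={relative_effect(theta, mu_0):.4f}%")
```

Both contrasts imply the same baseline of about 3.95 sessions, which is higher than the raw `control_mean` of 3.7584. That is what one would expect if the denominator is the adjusted baseline rather than the observed mean of the control arm.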
| field | d_1 vs d_0 | d_2 vs d_0 |
|---|---|---|
| estimand | ATE | ATE |
| model | MultiTreatmentIRM | MultiTreatmentIRM |
| value | -1.1832 (ci_abs: -1.2233, -1.1431) | 2.5298 (ci_abs: 2.4565, 2.6031) |
| value_relative | -29.9638 (ci_rel: -30.8175, -29.1101) | 64.0643 (ci_rel: 61.9680, 66.1606) |
| alpha | 0.0500 | 0.0500 |
| p_value | 0.0000 | 0.0000 |
| is_significant | True | True |
| n_treated | 24877 | 25008 |
| n_control | 50115 | 50115 |
| treatment_mean | 2.9808 | 6.5417 |
| control_mean | 3.7584 | 3.7584 |
| time | 2026-02-22 | 2026-02-22 |
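Because the data are simulated, the estimates can be compared both with the naive difference in arm means and with the ground truth stated in the Data section. The sketch below uses only numbers copied from this page; the dictionary layout is just for illustration.

```python
ground_truth = {"d_1 vs d_0": -1.1950325692907122, "d_2 vs d_0": 2.530398527003894}

# (estimate, CI lower, CI upper) from the results table.
estimates = {
    "d_1 vs d_0": (-1.1832, -1.2233, -1.1431),
    "d_2 vs d_0": (2.5298, 2.4565, 2.6031),
}
# Raw outcome means per arm (treatment_mean / control_mean rows).
arm_means = {"d_0": 3.7584, "d_1": 2.9808, "d_2": 6.5417}

summary = {}
for name, (theta, lower, upper) in estimates.items():
    arm = name.split(" vs ")[0]
    naive = arm_means[arm] - arm_means["d_0"]  # unadjusted difference in means
    truth = ground_truth[name]
    summary[name] = {
        "naive_error": abs(naive - truth),
        "adjusted_error": abs(theta - truth),
        "ci_covers_truth": lower <= truth <= upper,
    }
    print(name, {k: round(v, 4) if isinstance(v, float) else v
                 for k, v in summary[name].items()})
```

For both contrasts the confidence interval covers the true ATE, while the unadjusted difference in means is off by a much larger margin.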
Refutation
Unconfoundedness
| | comparison | metric | value | flag |
|---|---|---|---|---|
| 0 | d_0 vs d_1 | balance_max_smd | 0.013307 | GREEN |
| 1 | d_0 vs d_1 | balance_frac_violations | 0.0 | GREEN |
| 2 | d_0 vs d_1 | balance_pass | True | GREEN |
| 3 | d_0 vs d_2 | balance_max_smd | 0.004068 | GREEN |
| 4 | d_0 vs d_2 | balance_frac_violations | 0.0 | GREEN |
| 5 | d_0 vs d_2 | balance_pass | True | GREEN |
| 6 | overall | balance_max_smd | 0.013307 | GREEN |
| 7 | overall | balance_frac_violations | 0.0 | GREEN |
| 8 | overall | balance_pass | True | GREEN |
Sensitivity
| | cf_y | r2_y | r2_d | rho | theta_long | theta_short | delta |
|---|---|---|---|---|---|---|---|
| d_1 vs d_0 | 3.022036e-07 | 3.022035e-07 | 1.351745e-07 | -1.0 | -1.183219 | -1.146914 | -0.036304 |
| d_2 vs d_0 | 3.022036e-07 | 3.022035e-07 | 1.461506e-07 | -1.0 | 2.529790 | 2.470708 | 0.059082 |
{'theta': array([-1.18321853, 2.52979027]), 'se': array([0.02044409, 0.03741762]), 'alpha': 0.05, 'z': 1.959963984540054, 'H0': 0.0, 'sampling_ci': array([[-1.22328822, -1.14314885], [ 2.45645308, 2.60312746]]), 'theta_bounds_cofounding': array([[-1.61866521, -0.74777186], [ 2.1016666 , 2.95791394]]), 'bias_aware_ci': array([[-1.65985899, -0.70803275], [ 2.02990876, 3.03312867]]), 'max_bias_base': array([8.48841832, 8.34566666]), 'max_bias': array([0.43544667, 0.42812367]), 'bound_width': array([0.43544667, 0.42812367]), 'sigma2': 11.79593879649448, 'nu2': array([6.10830955, 5.90458744]), 'rv': array([0.27985187, 0.64760316]), 'rva': array([0.26617826, 0.63406235]), 'contrast_labels': ['d_1 vs d_0', 'd_2 vs d_0'], 'params': {'cf_y': 0.05, 'r2_d': array([0.05, 0.05]), 'rho': array([1., 1.]), 'use_signed_rr': False}}
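The bias bounds in this output can be recomputed from the other entries of the dictionary. The sketch below assumes the omitted-variable-bias form `max_bias = |rho| * sqrt(sigma2 * nu2) * sqrt(cf_y * r2_d / (1 - r2_d))`. That formula is an assumption inferred from the reported numbers, which it reproduces; it is not taken from the library source.

```python
import numpy as np

# Values copied from the sensitivity output above.
theta = np.array([-1.18321853, 2.52979027])
sigma2 = 11.79593879649448
nu2 = np.array([6.10830955, 5.90458744])
cf_y = 0.05
r2_d = np.array([0.05, 0.05])
rho = np.array([1.0, 1.0])

# Assumed formula: scale factor times the confounding strength.
max_bias_base = np.sqrt(sigma2 * nu2)
max_bias = np.abs(rho) * max_bias_base * np.sqrt(cf_y * r2_d / (1.0 - r2_d))

# Worst-case range for each point estimate under the assumed confounding.
theta_bounds = np.column_stack([theta - max_bias, theta + max_bias])

print(max_bias_base.round(6))
print(max_bias.round(6))
print(theta_bounds.round(6))
```

Under this scenario (a hidden confounder explaining 5% of residual outcome variation and 5% of treatment variation) neither interval of bounds crosses zero, so the sign of both effects is robust to confounding of that strength.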
Overlap

| | comparison | metric | value | flag |
|---|---|---|---|---|
| 0 | d_0 vs d_1 | edge_0.01_below | 0.0 | GREEN |
| 1 | d_0 vs d_1 | edge_0.01_above | 0.0 | GREEN |
| 2 | d_0 vs d_1 | KS | 0.141067 | GREEN |
| 3 | d_0 vs d_1 | AUC | 0.596814 | GREEN |
| 4 | d_0 vs d_1 | ESS_treated_ratio | 0.895799 | GREEN |
| 5 | d_0 vs d_1 | ESS_baseline_ratio | 0.954377 | GREEN |
| 6 | d_0 vs d_1 | clip_m_total | 0.0 | GREEN |
| 7 | d_0 vs d_1 | overlap_pass | True | GREEN |
| 8 | d_0 vs d_2 | edge_0.01_below | 0.0 | GREEN |
| 9 | d_0 vs d_2 | edge_0.01_above | 0.0 | GREEN |
| 10 | d_0 vs d_2 | KS | 0.074956 | GREEN |
| 11 | d_0 vs d_2 | AUC | 0.549007 | GREEN |
| 12 | d_0 vs d_2 | ESS_treated_ratio | 0.948692 | GREEN |
| 13 | d_0 vs d_2 | ESS_baseline_ratio | 0.954377 | GREEN |
| 14 | d_0 vs d_2 | clip_m_total | 0.0 | GREEN |
| 15 | d_0 vs d_2 | overlap_pass | True | GREEN |
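The ESS ratios above measure how much information survives inverse-propensity weighting. A common definition is Kish's effective sample size divided by the group size, sketched below. Whether Causalis uses exactly this definition is an assumption, and `ess_ratio` is a hypothetical helper.

```python
import numpy as np


def ess_ratio(weights) -> float:
    """Kish effective sample size divided by the number of units: (sum w)^2 / (n * sum w^2)."""
    w = np.asarray(weights, dtype=float)
    return float(w.sum() ** 2 / (w.size * np.sum(w ** 2)))


equal = ess_ratio([1.0, 1.0, 1.0, 1.0])    # equal weights keep the full sample
skewed = ess_ratio([1.0, 1.0, 1.0, 9.0])   # one dominant weight shrinks it
print(equal, round(skewed, 4))
```

Ratios close to 1, like the 0.90 to 0.95 reported here, indicate that no small group of units dominates the weighted estimate.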
SUTVA
1. Are your clients (units $i$) independent, so that the outcome of one client does not depend on other clients?
2. Do all clients have a full observation window for measuring the metrics?
3. Are confounders measured before treatment and the outcome after it?
4. Is the treatment label consistent, so that a client who does not receive a treatment is labeled 0?
Score
| | comparison | metric | value | flag |
|---|---|---|---|---|
| 0 | d_1 vs d_0 | se_plugin | 2.044409e-02 | NA |
| 1 | d_1 vs d_0 | psi_p99_over_med | 9.656081e+00 | GREEN |
| 2 | d_1 vs d_0 | psi_kurtosis | 6.103705e+01 | RED |
| 3 | d_1 vs d_0 | max_\|t\|_gk | 5.488800e+00 | RED |
| 4 | d_1 vs d_0 | max_\|t\|_g0 | 4.107007e+00 | RED |
| 5 | d_1 vs d_0 | max_\|t\|_mk | 1.723246e+00 | RED |
| 6 | d_1 vs d_0 | max_\|t\|_m0 | 9.548751e-01 | RED |
| 7 | d_1 vs d_0 | max_\|t\| | 5.488800e+00 | RED |
| 8 | d_1 vs d_0 | oos_tstat_fold | -1.028761e-15 | GREEN |
| 9 | d_1 vs d_0 | oos_tstat_strict | -1.779456e-15 | GREEN |
| 10 | d_2 vs d_0 | se_plugin | 3.741762e-02 | NA |
| 11 | d_2 vs d_0 | psi_p99_over_med | 1.610667e+01 | YELLOW |
| 12 | d_2 vs d_0 | psi_kurtosis | 5.568245e+01 | RED |
| 13 | d_2 vs d_0 | max_\|t\|_gk | 5.333558e+00 | RED |
| 14 | d_2 vs d_0 | max_\|t\|_g0 | 4.107007e+00 | RED |
| 15 | d_2 vs d_0 | max_\|t\|_mk | 1.431120e+00 | RED |
| 16 | d_2 vs d_0 | max_\|t\|_m0 | 9.548751e-01 | RED |
| 17 | d_2 vs d_0 | max_\|t\| | 5.333558e+00 | RED |
| 18 | d_2 vs d_0 | oos_tstat_fold | -4.861275e-16 | GREEN |
| 19 | d_2 vs d_0 | oos_tstat_strict | -6.684270e-16 | GREEN |

Conclusion
The set of mechanics labeled d_2 performed better than no mechanics, with an effect of 2.5298 (ci_abs: 2.4565, 2.6031) sessions per user per month. The set labeled d_1 performed worse than no mechanics, with an effect of -1.1832 (ci_abs: -1.2233, -1.1431) sessions per user per month, so we should turn it off.