Case Study · 14 min read

Estimation of CX with MultiTreatment DML

This notebook presents a customer experience (CX) estimation case with the MultiTreatmentIRM model.

The problem

When analyzing customer experience, we face overlapping impacts: a single customer may be exposed to several problems at once. Treatments can then no longer be encoded as a single binary flag, so we instead form mutually exclusive groups of unique clients. Let's look at an example from a large fintech company.
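To make this concrete, here is a minimal pandas sketch (the helper itself is hypothetical; the flag and arm names follow the ones used below) of collapsing two overlapping binary flags into four mutually exclusive treatment arms:

```python
import numpy as np
import pandas as pd

def combine_flags(df: pd.DataFrame) -> pd.Series:
    """Map two overlapping binary flags into four mutually exclusive arms."""
    conditions = [
        (df["neg_contact_flg"] == 1) & (df["error_flg"] == 1),  # joint arm first
        (df["neg_contact_flg"] == 1),
        (df["error_flg"] == 1),
    ]
    choices = ["neg_contact_flg_error_flg", "neg_contact_flg", "error_flg"]
    return pd.Series(np.select(conditions, choices, default="control"), index=df.index)

df = pd.DataFrame({"neg_contact_flg": [0, 1, 0, 1], "error_flg": [0, 0, 1, 1]})
print(combine_flags(df).tolist())
# → ['control', 'neg_contact_flg', 'error_flg', 'neg_contact_flg_error_flg']
```

The joint condition is checked first so that a client with both flags lands in the joint arm, not in one of the single-flag arms.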

We think the effect depends on how users were exposed to the product.

We segmented users and now want to estimate the effect for each group.

  • Target

    • y — binary target variable representing product utilization.
  • Treatments (problems)

    • neg_contact_flg — a negative contact with support.
    • error_flg — a repeated application / application error flag.
    • neg_contact_flg_error_flg — joint treatment where both problem flags are present.
    • control — baseline group with no problem flag.
  • Strong confounders (past applications/utilization history)

    • prev_apps — past applications.
    • prev_util — past utilization events.
  • Features (covariates)

    • age — age.
    • risk_latent — latent risk factor.
    • income — income.
    • sessions_30d — sessions in the last 30 days.
    • clicks_7d — clicks in the last 7 days.
    • n_products — number of products.
    • has_debt — debt flag.
    • csat_prev — previous CSAT score.
    • prev_contact — past contact history.
    • prev_repeat — past repeat-application history.
    • channel — application channel, represented in the generated dataset as one-hot columns: channel_callcenter, channel_partner, channel_web with app as the reference level.
    • region — region, represented in the generated dataset as one-hot columns: region_B, region_C, region_D with A as the reference level.
  • Noise features

    • product_emb_1, product_emb_2 — no-signal features.
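The channel and region encodings above can be reproduced with pd.get_dummies, dropping the stated reference levels (app and A) explicitly. This is an illustrative sketch, not necessarily how the dataset was actually generated:

```python
import pandas as pd

df = pd.DataFrame({"channel": ["app", "web", "partner", "callcenter"],
                   "region": ["A", "B", "C", "D"]})

# One-hot encode, then drop the reference categories ("app" and "A") so the
# remaining columns match channel_callcenter/..._partner/..._web and region_B/C/D.
dummies = pd.get_dummies(df[["channel", "region"]], prefix=["channel", "region"])
dummies = dummies.drop(columns=["channel_app", "region_A"])
print(sorted(dummies.columns))
# → ['channel_callcenter', 'channel_partner', 'channel_web', 'region_B', 'region_C', 'region_D']
```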

Data

Result
| | y | control | neg_contact_flg | error_flg | neg_contact_flg_error_flg | age | risk_latent | income | sessions_30d | clicks_7d | … | m_neg_contact_flg_error_flg | m_obs_neg_contact_flg_error_flg | tau_link_neg_contact_flg_error_flg | g_control | g_neg_contact_flg | g_error_flg | g_neg_contact_flg_error_flg | cate_neg_contact_flg | cate_error_flg | cate_neg_contact_flg_error_flg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 40.0 | 0.612945 | 127181.894679 | 3.0 | 18.0 | … | 0.470037 | 0.470037 | -0.65 | 0.800935 | 0.800935 | 0.677465 | 0.677465 | 0.0 | -0.123470 | -0.123470 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 26.0 | -1.690640 | 43362.759375 | 8.0 | 5.0 | … | 0.072174 | 0.072174 | -0.65 | 0.230696 | 0.230696 | 0.135359 | 0.135359 | 0.0 | -0.095337 | -0.095337 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 45.0 | -0.288110 | 117069.035626 | 11.0 | 18.0 | … | 0.089019 | 0.089019 | -0.65 | 0.362256 | 0.362256 | 0.228714 | 0.228714 | 0.0 | -0.133542 | -0.133542 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 47.0 | 0.441636 | 123821.130662 | 14.0 | 16.0 | … | 0.341129 | 0.341129 | -0.65 | 0.716481 | 0.716481 | 0.568828 | 0.568828 | 0.0 | -0.147653 | -0.147653 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 18.0 | -0.722611 | 36012.152571 | 8.0 | 18.0 | … | 0.067662 | 0.067662 | -0.65 | 0.212228 | 0.212228 | 0.123300 | 0.123300 | 0.0 | -0.088928 | -0.088928 |

5 rows × 44 columns

Result

Ground truth ATE for neg_contact_flg vs control: 0.0
Ground truth ATE for error_flg vs control: -0.12500559719513468
Ground truth ATE for neg_contact_flg_error_flg vs control: -0.12500559719513468
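Because the benchmark is synthetic, the generated frame carries oracle per-row effects (the cate_* columns visible in the data preview above); the ground-truth ATEs reported here are presumably just the means of those columns. A toy sketch of that reduction:

```python
import pandas as pd

# Toy stand-in for the oracle CATE columns carried by the generated frame.
df = pd.DataFrame({
    "cate_neg_contact_flg": [0.0, 0.0, 0.0, 0.0],
    "cate_error_flg": [-0.12, -0.13, -0.125, -0.125],
    "cate_neg_contact_flg_error_flg": [-0.12, -0.13, -0.125, -0.125],
})

oracle_ate = df.filter(like="cate_").mean()  # ground-truth ATE of each arm vs control
print(oracle_ate)
```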

Result

MultiCausalData(df=(100000, 25), treatment_names=['control', 'neg_contact_flg', 'error_flg', 'neg_contact_flg_error_flg'], control_treatment='control', outcome='y', confounders=['age', 'risk_latent', 'income', 'sessions_30d', 'clicks_7d', 'n_products', 'has_debt', 'csat_prev', 'prev_contact', 'prev_repeat', 'prev_apps', 'prev_util', 'product_emb_1', 'product_emb_2', 'channel_callcenter', 'channel_partner', 'channel_web', 'region_B', 'region_C', 'region_D'], user_id=None)
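The container printed above bundles the frame with treatment metadata. As an illustration only (the real MultiCausalData constructor may differ; the field names are read off the repr), a stand-in looks like:

```python
from dataclasses import dataclass
from typing import List, Optional
import pandas as pd

@dataclass
class CausalDataSketch:
    """Illustrative stand-in mirroring the fields shown in the MultiCausalData repr."""
    df: pd.DataFrame
    treatment_names: List[str]
    control_treatment: str
    outcome: str
    confounders: List[str]
    user_id: Optional[str] = None

data = CausalDataSketch(
    df=pd.DataFrame({"y": [0, 1], "control": [1, 0], "error_flg": [0, 1], "age": [30, 40]}),
    treatment_names=["control", "neg_contact_flg", "error_flg", "neg_contact_flg_error_flg"],
    control_treatment="control",
    outcome="y",
    confounders=["age"],
)
print(data.control_treatment, data.df.shape)
```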

EDA

Result
| | treatment | count | mean | std | min | p10 | p25 | median | p75 | p90 | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | error_flg | 17607 | 0.450162 | 0.497524 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 1 | neg_contact_flg_error_flg | 15259 | 0.519955 | 0.499618 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | control | 42244 | 0.501894 | 0.500002 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 3 | neg_contact_flg | 24890 | 0.579068 | 0.493719 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
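A per-arm outcome summary like the one above can be produced with a plain groupby (assuming the arms are stored in one categorical treatment column; the p10/p25/... headers are just renamed describe percentiles):

```python
import pandas as pd

# Toy frame standing in for the real data.
df = pd.DataFrame({
    "treatment": ["control", "control", "error_flg", "error_flg"],
    "y": [1.0, 0.0, 1.0, 0.0],
})

summary = (
    df.groupby("treatment")["y"]
      .describe(percentiles=[0.10, 0.25, 0.50, 0.75, 0.90])
      .reset_index()
)
print(summary[["treatment", "count", "mean", "std"]])
```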
Result

[figure]

Result
| | confounders | mean_d_0 | mean_d_1 | abs_diff | smd | ks_pvalue |
|---|---|---|---|---|---|---|
| 0 | channel_web | 0.793225 | 0.017947 | 0.775278 | -2.572501 | 0.00000 |
| 1 | channel_partner | 0.000166 | 0.427500 | 0.427335 | 1.221147 | 0.00000 |
| 2 | channel_callcenter | 0.000000 | 0.088147 | 0.088147 | 0.439687 | 0.00000 |
| 3 | prev_repeat | 0.125935 | 0.301585 | 0.175650 | 0.438630 | 0.00000 |
| 4 | age | 35.862111 | 39.984608 | 4.122498 | 0.392401 | 0.00000 |
| 5 | prev_apps | 0.251373 | 0.348441 | 0.097068 | 0.213032 | 0.00000 |
| 6 | risk_latent | -0.187654 | -0.008340 | 0.179315 | 0.182731 | 0.00000 |
| 7 | prev_util | 0.323312 | 0.389220 | 0.065908 | 0.137949 | 0.00000 |
| 8 | has_debt | 0.272559 | 0.308684 | 0.036125 | 0.079622 | 0.00000 |
| 9 | clicks_7d | 13.555440 | 13.775885 | 0.220445 | 0.058115 | 0.00000 |
| 10 | prev_contact | 0.109743 | 0.092293 | 0.017451 | -0.057931 | 0.00102 |
| 11 | income | 83694.312863 | 85077.144935 | 1382.832072 | 0.045845 | 0.00000 |
| 12 | sessions_30d | 9.108370 | 9.225138 | 0.116767 | 0.037725 | 0.00241 |
| 13 | n_products | 1.868218 | 1.919123 | 0.050905 | 0.036490 | 0.00296 |
| 14 | csat_prev | 3.992626 | 3.975596 | 0.017030 | -0.029550 | 0.00931 |
| 15 | product_emb_2 | 49.475357 | 49.001533 | 0.473824 | -0.016449 | 0.02320 |
| 16 | product_emb_1 | 0.002541 | -0.013103 | 0.015644 | -0.015678 | 0.05017 |
| 17 | region_C | 0.200194 | 0.202419 | 0.002225 | 0.005550 | 1.00000 |
| 18 | region_B | 0.300469 | 0.298177 | 0.002292 | -0.005004 | 1.00000 |
| 19 | region_D | 0.148944 | 0.149770 | 0.000826 | 0.002317 | 1.00000 |
Result
| | confounders | mean_d_0 | mean_d_1 | abs_diff | smd | ks_pvalue |
|---|---|---|---|---|---|---|
| 0 | channel_web | 0.793225 | 0.254560 | 0.538665 | -1.280740 | 0.00000 |
| 1 | prev_contact | 0.109743 | 0.487465 | 0.377721 | 0.906097 | 0.00000 |
| 2 | channel_partner | 0.000166 | 0.082523 | 0.082357 | 0.422814 | 0.00000 |
| 3 | risk_latent | -0.187654 | 0.124121 | 0.311775 | 0.317290 | 0.00000 |
| 4 | prev_apps | 0.251373 | 0.333548 | 0.082175 | 0.181385 | 0.00000 |
| 5 | sessions_30d | 9.108370 | 9.663761 | 0.555390 | 0.177521 | 0.00000 |
| 6 | prev_util | 0.323312 | 0.394777 | 0.071465 | 0.149384 | 0.00000 |
| 7 | csat_prev | 3.992626 | 3.913419 | 0.079206 | -0.135992 | 0.00000 |
| 8 | has_debt | 0.272559 | 0.313419 | 0.040860 | 0.089864 | 0.00000 |
| 9 | income | 83694.312863 | 86202.400216 | 2508.087352 | 0.083142 | 0.00000 |
| 10 | clicks_7d | 13.555440 | 13.855404 | 0.299964 | 0.079103 | 0.00000 |
| 11 | channel_callcenter | 0.000000 | 0.001326 | 0.001326 | 0.051527 | 1.00000 |
| 12 | n_products | 1.868218 | 1.913620 | 0.045402 | 0.032749 | 0.02248 |
| 13 | prev_repeat | 0.125935 | 0.116191 | 0.009744 | -0.029873 | 0.10152 |
| 14 | age | 35.862111 | 35.559301 | 0.302810 | -0.029333 | 0.01887 |
| 15 | region_B | 0.300469 | 0.296424 | 0.004044 | -0.008839 | 0.95896 |
| 16 | region_C | 0.200194 | 0.203094 | 0.002900 | 0.007227 | 0.99937 |
| 17 | product_emb_1 | 0.002541 | 0.009430 | 0.006889 | 0.006910 | 0.48216 |
| 18 | region_D | 0.148944 | 0.147690 | 0.001254 | -0.003529 | 1.00000 |
| 19 | product_emb_2 | 49.475357 | 49.430173 | 0.045185 | -0.001568 | 0.99499 |
Result
| | confounders | mean_d_0 | mean_d_1 | abs_diff | smd | ks_pvalue |
|---|---|---|---|---|---|---|
| 0 | channel_web | 0.793225 | 0.000983 | 0.792242 | -2.758190 | 0.00000 |
| 1 | channel_callcenter | 0.000000 | 0.553837 | 0.553837 | 1.575597 | 0.00000 |
| 2 | channel_partner | 0.000166 | 0.347139 | 0.346974 | 1.030330 | 0.00000 |
| 3 | prev_contact | 0.109743 | 0.422701 | 0.312958 | 0.757096 | 0.00000 |
| 4 | risk_latent | -0.187654 | 0.321338 | 0.508992 | 0.519183 | 0.00000 |
| 5 | prev_repeat | 0.125935 | 0.293138 | 0.167203 | 0.419783 | 0.00000 |
| 6 | prev_apps | 0.251373 | 0.444525 | 0.193152 | 0.414100 | 0.00000 |
| 7 | age | 35.862111 | 39.988793 | 4.126683 | 0.394090 | 0.00000 |
| 8 | prev_util | 0.323312 | 0.472049 | 0.148737 | 0.307469 | 0.00000 |
| 9 | sessions_30d | 9.108370 | 9.892260 | 0.783890 | 0.247583 | 0.00000 |
| 10 | csat_prev | 3.992626 | 3.888356 | 0.104270 | -0.179496 | 0.00000 |
| 11 | has_debt | 0.272559 | 0.355004 | 0.082444 | 0.178372 | 0.00000 |
| 12 | income | 83694.312863 | 88186.376248 | 4492.063384 | 0.147876 | 0.00000 |
| 13 | clicks_7d | 13.555440 | 14.052297 | 0.496857 | 0.130705 | 0.00000 |
| 14 | n_products | 1.868218 | 1.942788 | 0.074570 | 0.053441 | 0.00024 |
| 15 | region_C | 0.200194 | 0.204863 | 0.004669 | 0.011617 | 0.96653 |
| 16 | region_B | 0.300469 | 0.303624 | 0.003155 | 0.006872 | 0.99987 |
| 17 | product_emb_2 | 49.475357 | 49.635494 | 0.160136 | 0.005543 | 0.75707 |
| 18 | product_emb_1 | 0.002541 | 0.005789 | 0.003248 | 0.003255 | 0.80490 |
| 19 | region_D | 0.148944 | 0.149682 | 0.000738 | 0.002070 | 1.00000 |
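A balance row like the ones above combines raw arm means, a standardized mean difference, and a two-sample KS test (e.g. scipy.stats.ks_2samp for the ks_pvalue column). A minimal SMD sketch follows; the pooled-SD convention here is one common choice, and the library's exact formula may differ:

```python
import numpy as np

def smd(x0: np.ndarray, x1: np.ndarray) -> float:
    """Standardized mean difference with a pooled-SD denominator."""
    sd = np.sqrt((x0.var(ddof=1) + x1.var(ddof=1)) / 2)
    return float((x1.mean() - x0.mean()) / sd) if sd > 0 else 0.0

rng = np.random.default_rng(0)
x0 = rng.normal(0.0, 1.0, 50_000)   # control arm
x1 = rng.normal(0.5, 1.0, 50_000)   # treated arm, shifted by half an SD
print(round(smd(x0, x1), 2))        # close to 0.5 by construction
```

As a rule of thumb, |SMD| below 0.10 is usually read as balanced, which is the threshold the later diagnostics use.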

Training the Model and Reviewing the Results

Result
| field | neg_contact_flg vs control | error_flg vs control | neg_contact_flg_error_flg vs control |
|---|---|---|---|
| estimand | ATE | ATE | ATE |
| model | MultiTreatmentIRM | MultiTreatmentIRM | MultiTreatmentIRM |
| value | -0.0043 (ci_abs: -0.0075, -0.0012) | -0.1216 (ci_abs: -0.1254, -0.1179) | -0.1148 (ci_abs: -0.1182, -0.1114) |
| value_relative | -0.7748 (ci_rel: -1.3397, -0.2098) | -21.8139 (ci_rel: -22.4499, -21.1779) | -20.5895 (ci_rel: -21.1546, -20.0244) |
| alpha | 0.0500 | 0.0500 | 0.0500 |
| p_value | 0.0073 | 0.0000 | 0.0000 |
| is_significant | True | True | True |
| n_treated | 24890 | 17607 | 15259 |
| n_control | 42244 | 42244 | 42244 |
| treatment_mean | 0.5791 | 0.4502 | 0.5200 |
| control_mean | 0.5019 | 0.5019 | 0.5019 |
| time | 2026-04-11 | 2026-04-11 | 2026-04-11 |

"Error" or "Negative Contact + Error" decreases probability to utilization in product by 11 p.p

Refutation

Result

[figure]

Result
| | comparison | n_pair | n_treated | n_baseline | edge_0.01_below | edge_0.01_above | ks | auc | ess_ratio_treated | ess_ratio_baseline | … | clip_m_baseline | clip_m_total | flag_edge_001 | flag_ks | flag_auc | flag_ess_treated | flag_ess_baseline | flag_clip_m | overall_flag | pass |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | control vs neg_contact_flg | 67134 | 24890 | 42244 | 0.296914 | 0.125570 | 0.715761 | 0.945177 | 0.383785 | 0.268624 | … | 0.131513 | 0.429767 | RED | RED | RED | GREEN | YELLOW | RED | RED | False |
| 1 | control vs error_flg | 59851 | 17607 | 42244 | 0.607225 | 0.222235 | 0.951233 | 0.998065 | 0.303355 | 0.268624 | … | 0.228100 | 0.850195 | RED | RED | RED | GREEN | YELLOW | RED | RED | False |
| 2 | control vs neg_contact_flg_error_flg | 57503 | 15259 | 42244 | 0.726153 | 0.263864 | 0.999680 | 1.000000 | 0.248061 | 0.268624 | … | 0.265134 | 0.999461 | RED | RED | RED | YELLOW | YELLOW | RED | RED | False |

3 rows × 25 columns

Result
| | comparison | metric | value | flag |
|---|---|---|---|---|
| 0 | control vs neg_contact_flg | edge_0.01_below | 0.296914 | RED |
| 1 | control vs neg_contact_flg | edge_0.01_above | 0.12557 | RED |
| 2 | control vs neg_contact_flg | KS | 0.715761 | RED |
| 3 | control vs neg_contact_flg | AUC | 0.945177 | RED |
| 4 | control vs neg_contact_flg | ESS_treated_ratio | 0.383785 | GREEN |
| 5 | control vs neg_contact_flg | ESS_baseline_ratio | 0.268624 | YELLOW |
| 6 | control vs neg_contact_flg | clip_m_total | 0.429767 | RED |
| 7 | control vs neg_contact_flg | overlap_pass | False | RED |
| 8 | control vs error_flg | edge_0.01_below | 0.607225 | RED |
| 9 | control vs error_flg | edge_0.01_above | 0.222235 | RED |
| 10 | control vs error_flg | KS | 0.951233 | RED |
| 11 | control vs error_flg | AUC | 0.998065 | RED |
| 12 | control vs error_flg | ESS_treated_ratio | 0.303355 | GREEN |
| 13 | control vs error_flg | ESS_baseline_ratio | 0.268624 | YELLOW |
| 14 | control vs error_flg | clip_m_total | 0.850195 | RED |
| 15 | control vs error_flg | overlap_pass | False | RED |
| 16 | control vs neg_contact_flg_error_flg | edge_0.01_below | 0.726153 | RED |
| 17 | control vs neg_contact_flg_error_flg | edge_0.01_above | 0.263864 | RED |
| 18 | control vs neg_contact_flg_error_flg | KS | 0.99968 | RED |
| 19 | control vs neg_contact_flg_error_flg | AUC | 1.0 | RED |
| 20 | control vs neg_contact_flg_error_flg | ESS_treated_ratio | 0.248061 | YELLOW |
| 21 | control vs neg_contact_flg_error_flg | ESS_baseline_ratio | 0.268624 | YELLOW |
| 22 | control vs neg_contact_flg_error_flg | clip_m_total | 0.999461 | RED |
| 23 | control vs neg_contact_flg_error_flg | overlap_pass | False | RED |
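The ESS ratio metrics above presumably follow the standard Kish effective-sample-size formula applied to the weights, expressed as a fraction of the nominal sample size (an assumption about this library's definition; the formula itself is standard):

```python
import numpy as np

def ess_ratio(weights: np.ndarray) -> float:
    """Kish effective sample size as a fraction of the nominal sample size:
    ESS = (sum w)^2 / sum(w^2), returned here divided by n."""
    w = np.asarray(weights, dtype=float)
    return float((w.sum() ** 2) / (len(w) * (w ** 2).sum()))

print(ess_ratio(np.ones(100)))                         # uniform weights → 1.0
print(round(ess_ratio(np.r_[np.ones(99), 100.0]), 3))  # one dominant weight collapses ESS
```

A ratio of 0.25-0.38, as in the tables above, means that after weighting each arm behaves like only a quarter to a third of its nominal size.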

Why can the point estimate still be accurate? Because "bad overlap" and "accurate point estimate" are different things here.

  • Your overlap is genuinely bad, not a plotting bug. The notebook diagnostics are extreme: KS = 0.716 / 0.951 / 0.9997, AUC = 0.945 / 0.998 / 1.0, and clip_m_total = 0.43 / 0.85 / 0.999 for the three pairwise comparisons. That means the arms are almost perfectly separable in propensity space.
  • This is expected from the CX DGP. Treatment assignment is intentionally very predictable from observed history: prev_contact enters the contact logit with coefficient 2.20, prev_repeat enters the repeat logit with coefficient 1.30, and the joint arm literally uses contact_logit + repeat_logit as its score in dgp.py. So near-deterministic assignment is baked into the benchmark.
  • The outcome side is much easier than the treatment side. In this DGP the treatment effects are constant on the link scale: theta=[0.0, 0.0, -0.65, -0.65] and tau=[None, None, None, None] in dgp.py. So neg_contact_flg truly has zero effect, and error_flg and neg_contact_flg_error_flg have the same effect by construction.
  • There is also no hidden confounding in the generator defaults: u_strength_y=0.0 and u_strength_d=0.0 in base.py. So all the variables driving treatment are observed in X, which makes the synthetic problem much friendlier than real data.
  • MultiTreatmentIRM is not pure IPW. Its score is orthogonal / AIPW-style, combining outcome regression g_hat with residual weighting by 1 / m_hat in model.py. If g_hat is learned well, the ATE can still be accurate even when overlap is weak.
  • You also have a huge sample. Even with poor overlap, the ESS ratios still leave thousands of effective observations: roughly 9.6k, 5.3k, and 3.8k treated ESS, with about 11.3k baseline ESS in each comparison. That is plenty for a simple constant-effect synthetic DGP.
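The orthogonal / AIPW-style score mentioned above can be sketched in a few lines of numpy. This is the textbook multi-arm AIPW estimator, not necessarily the exact code path in model.py; it shows why a good g_hat carries the estimate when weighting is weak:

```python
import numpy as np

def aipw_ate(y, d_k, d_0, g_k, g_0, m_k, m_0):
    """Textbook AIPW / orthogonal score for arm k vs control:
    psi = (g_k - g_0) + 1{D=k}(y - g_k)/m_k - 1{D=0}(y - g_0)/m_0."""
    psi = (g_k - g_0) + d_k * (y - g_k) / m_k - d_0 * (y - g_0) / m_0
    return float(psi.mean())

# Toy data where the outcome model is exact, so the residual terms vanish and
# the estimate equals mean(g_k - g_0) = -0.1 regardless of the propensities.
d_k = np.array([1, 0, 1, 0]); d_0 = 1 - d_k
g_k = np.array([1.0, 0.5, 0.4, 0.9]); g_0 = g_k + 0.1
y   = np.where(d_k == 1, g_k, g_0)     # each unit's y matches its own arm's g
m_k = np.full(4, 0.5); m_0 = 1.0 - m_k
print(aipw_ate(y, d_k, d_0, g_k, g_0, m_k, m_0))  # ≈ -0.1
```

With a perfect g_hat the 1/m_hat terms multiply zero residuals, so even extreme weights cannot inject bias; that is the mechanism behind the accurate estimate here.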

One subtlety: the overlap plot and overlap diagnostics prefer m_hat_raw when available, while estimation uses trimmed m_hat with threshold 0.01 in overlap_plot.py, overlap_validation.py, and model.py. So the diagnostic picture is intentionally harsher than the exact weights used in the score.

The takeaway: this overlap is genuinely bad, but the estimate is still close to oracle because the benchmark is very favorable to outcome-model-based recovery. In real data, overlap this bad should make you treat the result as fragile and model-dependent, not reassuring. The sensitivity section below is consistent with that: the bias-aware intervals get much wider even though the sampling CI is tight.

Three follow-up questions worth answering precisely:

  1. whether the estimate is mostly coming from g_hat rather than IPW,
  2. how much trimming is rescuing the estimate,
  3. how to soften this DGP so overlap becomes realistic.

Unconfoundedness

Result
| | comparison | metric | value | flag |
|---|---|---|---|---|
| 0 | control vs neg_contact_flg | balance_max_smd | 0.75095 | RED |
| 1 | control vs neg_contact_flg | balance_frac_violations | 0.3 | RED |
| 2 | control vs neg_contact_flg | balance_pass | False | RED |
| 3 | control vs error_flg | balance_max_smd | 1.655942 | RED |
| 4 | control vs error_flg | balance_frac_violations | 0.45 | RED |
| 5 | control vs error_flg | balance_pass | False | RED |
| 6 | control vs neg_contact_flg_error_flg | balance_max_smd | 1.924802 | RED |
| 7 | control vs neg_contact_flg_error_flg | balance_frac_violations | 0.65 | RED |
| 8 | control vs neg_contact_flg_error_flg | balance_pass | False | RED |
| 9 | overall | balance_max_smd | 1.924802 | RED |
| 10 | overall | balance_frac_violations | 0.466667 | RED |
| 11 | overall | balance_pass | False | RED |
Result
| | control vs neg_contact_flg | control vs error_flg | control vs neg_contact_flg_error_flg |
|---|---|---|---|
| channel_web | 1.280761 | 2.572535 | 2.758223 |
| channel_callcenter | 0.051529 | 0.439699 | 1.575648 |
| channel_partner | 0.422822 | 1.221182 | 1.030364 |
| prev_contact | 0.906113 | 0.057932 | 0.757117 |
| risk_latent | 0.317295 | 0.182735 | 0.519195 |
| prev_repeat | 0.029874 | 0.438640 | 0.419794 |
| prev_apps | 0.181388 | 0.213037 | 0.414109 |
| age | 0.029333 | 0.392409 | 0.394099 |
| prev_util | 0.149387 | 0.137952 | 0.307476 |
| sessions_30d | 0.177523 | 0.037726 | 0.247589 |

What run_unconfoundedness_diagnostics() checks: weighted covariate balance after reweighting by the estimated multiclass propensities d_k / m_hat_k, followed by standardized mean differences (SMDs) between each treatment arm and control (unconfoundedness_validation.py). It uses the trimmed m_hat from the estimate, not m_hat_raw, so it is already judging the "safer" propensities actually used by the score (unconfoundedness_validation.py, model.py).

Your numbers are a hard fail:

  • control vs neg_contact_flg: balance_max_smd = 0.75095, balance_frac_violations = 0.30
  • control vs error_flg: balance_max_smd = 1.655942, balance_frac_violations = 0.45
  • control vs neg_contact_flg_error_flg: balance_max_smd = 1.924802, balance_frac_violations = 0.65
  • Overall: 46.7% of feature-comparison cells are still above the default SMD threshold of 0.10, and anything above 0.20 is already RED by the implementation in unconfoundedness_validation.py.
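What the diagnostic computes can be sketched directly: reweight each arm by d/m_hat and compare means on a standardized scale. A self-contained toy follows, with the true propensities standing in for m_hat so the reweighting succeeds; with near-deterministic assignment like this benchmark's, it would not:

```python
import numpy as np

def weighted_smd(x, d_k, d_0, m_k, m_0):
    """SMD between arm k and baseline after reweighting each arm by d/m_hat."""
    mu_k = np.average(x, weights=d_k / m_k)
    mu_0 = np.average(x, weights=d_0 / m_0)
    return float((mu_k - mu_0) / x.std(ddof=1))  # unweighted SD, for simplicity

rng = np.random.default_rng(1)
x = rng.normal(size=100_000)
p = 1.0 / (1.0 + np.exp(-2.0 * x))     # sharp, x-driven assignment
d_k = rng.binomial(1, p); d_0 = 1 - d_k

raw = (x[d_k == 1].mean() - x[d_0 == 1].mean()) / x.std(ddof=1)
print(round(raw, 2))                                  # large raw imbalance
print(round(weighted_smd(x, d_k, d_0, p, 1 - p), 2))  # near zero after weighting
```

When assignment is nearly deterministic, 1/m_hat weights explode on the few overlapping units, and the weighted means never converge to a common population: balance fails exactly as in the tables above.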

Why This Happens

This does not mean the DGP violates unconfoundedness. In this synthetic CX setup, unconfoundedness is true by construction: treatment and outcome depend only on observed X, with no hidden U effect by default in the shared generator (base.py). What fails is the practical balancing step induced by estimated propensities.

The reason is that treatment assignment is intentionally very sharp, so the arms live in different parts of covariate space. Poor overlap means IPW cannot create balanced pseudo-populations, because there simply are not enough comparable control and treated units to reweight into each other. That is exactly why overlap and balance fail together.

Why the ATE Is Still Close

The estimate can still be near oracle because MultiTreatmentIRM is doubly robust / outcome-augmented, not pure IPW. Its score combines outcome regression g_hat with weighted residual terms (model.py). In this benchmark the treatment-effect structure is very simple on the link scale, with theta=[0.0, 0.0, -0.65, -0.65] and no extra tau(X) (dgp.py). So g_hat can recover a lot even when weighting is weak.

That’s also why the pattern of errors makes sense:

  • neg_contact_flg has the least-bad overlap/balance and smallest absolute miss from oracle
  • neg_contact_flg_error_flg has the worst overlap/balance and the biggest miss

So the clean interpretation is:

  • overlap diagnostics: bad
  • balance diagnostics: bad even after trimming
  • oracle vs estimate: still close because this synthetic problem is easy for the outcome model, not because the design is well supported

Sensitivity Analysis

Result
| | cf_y | r2_y | r2_d | rho | theta_long | theta_short | delta |
|---|---|---|---|---|---|---|---|
| neg_contact_flg vs control | 0.000013 | 0.000013 | 0.000221 | 0.998355 | -0.004321 | 0.022369 | -0.026690 |
| error_flg vs control | 0.000013 | 0.000013 | 0.000105 | 0.945067 | -0.121649 | -0.090001 | -0.031648 |
| neg_contact_flg_error_flg vs control | 0.000013 | 0.000013 | 0.000644 | 0.955303 | -0.114820 | -0.047879 | -0.066942 |
Result

================== Bias-aware Interval ==================

------------------ Scenario ------------------
Significance Level: alpha=0.05
Null Hypothesis: H0=0.0
Sensitivity parameters: cf_y=0.05; r2_d=[0.05 0.05 0.05], rho=[1. 1. 1.], use_signed_rr=False

| comparison | statistic | value |
|---|---|---|
| neg_contact_flg vs control | bias_aware_ci | [-0.1589, 0.1503] |
| neg_contact_flg vs control | theta | [-0.1555, -0.0043, 0.1468] |
| neg_contact_flg vs control | sampling_ci | [-0.0075, -0.0012] |
| neg_contact_flg vs control | rv | 0.000000 |
| neg_contact_flg vs control | rva | 0.000000 |
| neg_contact_flg vs control | se | 0.001600 |
| neg_contact_flg vs control | max_bias | 0.151100 |
| neg_contact_flg vs control | max_bias_base | 2.946100 |
| neg_contact_flg vs control | bound_width | 0.151100 |
| neg_contact_flg vs control | sigma2 | 0.050800 |
| neg_contact_flg vs control | nu2 | 170.957700 |
| error_flg vs control | bias_aware_ci | [-0.2881, 0.0445] |
| error_flg vs control | theta | [-0.2839, -0.1216, 0.0406] |
| error_flg vs control | sampling_ci | [-0.1254, -0.1179] |
| error_flg vs control | rv | 0.028700 |
| error_flg vs control | rva | 0.027000 |
| error_flg vs control | se | 0.001900 |
| error_flg vs control | max_bias | 0.162200 |
| error_flg vs control | max_bias_base | 3.162400 |
| error_flg vs control | bound_width | 0.162200 |
| error_flg vs control | sigma2 | 0.050800 |
| error_flg vs control | nu2 | 196.987700 |
| neg_contact_flg_error_flg vs control | bias_aware_ci | [-0.2943, 0.0642] |
| neg_contact_flg_error_flg vs control | theta | [-0.2903, -0.1148, 0.0607] |
| neg_contact_flg_error_flg vs control | sampling_ci | [-0.1182, -0.1114] |
| neg_contact_flg_error_flg vs control | rv | 0.022000 |
| neg_contact_flg_error_flg vs control | rva | 0.020800 |
| neg_contact_flg_error_flg vs control | se | 0.001700 |
| neg_contact_flg_error_flg vs control | max_bias | 0.175500 |
| neg_contact_flg_error_flg vs control | max_bias_base | 3.420700 |
| neg_contact_flg_error_flg vs control | bound_width | 0.175500 |
| neg_contact_flg_error_flg vs control | sigma2 | 0.050800 |
| neg_contact_flg_error_flg vs control | nu2 | 230.474900 |
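The bias-aware interval can be roughly reconstructed from the printed components: inflate the point estimate by max_bias plus the usual sampling term. This is one standard construction, CI = theta ∓ (max_bias + z·se); the library's exact formula may differ slightly, which would explain the small gap from the printed interval:

```python
from statistics import NormalDist

def bias_aware_ci(theta: float, se: float, max_bias: float, alpha: float = 0.05):
    """Sampling CI widened by the worst-case confounding-bias bound."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = max_bias + z * se
    return theta - half, theta + half

# Using the neg_contact_flg vs control components from the report above:
lo, hi = bias_aware_ci(theta=-0.0043, se=0.0016, max_bias=0.1511)
print(round(lo, 4), round(hi, 4))  # close to the reported [-0.1589, 0.1503]
```

The width is dominated by max_bias, not by se, which is why the bias-aware intervals are so much wider than the tight sampling CIs.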

Summary

A regular contact/support request on its own is not a problem, while submitting the application multiple times (the error flag) leads to a decrease in product utilization of roughly 11-12 p.p.