
Unconfoundedness

We call 'unconfoundedness' a scenario where a treatment is not randomly assigned to participants, so confounders affect both treatment assignment and the outcome. We have client-level data; confounders were measured before treatment and the outcome after it.

Data

Let's look at the example:

In our ecosystem we have a product whose effect on LTV we want to estimate.

Treatment - first purchase in the product.

Outcome - LTV after the first purchase.

We will test the hypotheses:

$H_0$ - There is no difference in LTV between the treatment and control groups.

$H_a$ - There is a difference in LTV between the treatment and control groups.

We will use a DGP from Causalis. Read more at https://causalis.causalcraft.com/articles/generate_obs_hte_26_rich

Result
| user_id | y | d | tenure_months | avg_sessions_week | spend_last_month | age_years | income_monthly | prior_purchases_12m | support_tickets_90d | premium_user | mobile_user | urban_resident | referred_user | m | m_obs | tau_link | g0 | g1 | cate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10.000000 | 0.0 | 28.814654 | 1.0 | 77.936767 | 50.234101 | 1926.698301 | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.045453 | 0.045453 | 0.089095 | 8.137981 | 9.142395 | 1.004414 |
| 1 | 280.099611 | 1.0 | 25.913345 | 3.0 | 53.777740 | 28.115859 | 5104.271509 | 3.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.041514 | 0.041514 | 0.246679 | 60.459257 | 78.817307 | 18.358049 |
| 2 | 36.400482 | 1.0 | 24.969929 | 10.0 | 134.764322 | 22.907062 | 5267.938255 | 8.0 | 3.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.052593 | 0.052593 | 0.162968 | 7.712855 | 9.138577 | 1.425723 |
| 3 | 42.788238 | 0.0 | 40.655089 | 5.0 | 59.517074 | 31.970490 | 6597.327018 | 3.0 | 2.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.036221 | 0.036221 | 0.188755 | 25.386510 | 31.159932 | 5.773422 |
| 4 | 50.000000 | 0.0 | 18.560899 | 3.0 | 74.370930 | 39.237248 | 4930.009628 | 5.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.036343 | 0.036343 | 0.174757 | 15.359250 | 18.600227 | 3.240977 |
Result

Ground truth ATE is 19.4096. Ground truth ATTE is 10.9150.
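Since the DGP exposes the true individual effects, the ground-truth estimands can be recovered directly: the ATE is the mean of the individual effect over everyone, the ATTE its mean over the treated. A minimal pandas sketch (the toy frame below uses only the first five rows shown above, not the full 100k sample, so the numbers differ from the ground truth reported here):

```python
import pandas as pd

# Toy frame with the DGP columns: 'cate' is the true individual effect,
# 'd' the realized treatment indicator (first rows of the table above).
df = pd.DataFrame({
    "d":    [0.0, 1.0, 1.0, 0.0, 0.0],
    "cate": [1.004414, 18.358049, 1.425723, 5.773422, 3.240977],
})

ate = df["cate"].mean()                        # average effect over everyone
atte = df.loc[df["d"] == 1.0, "cate"].mean()   # average effect over the treated
print(round(ate, 4), round(atte, 4))           # → 5.9605 9.8919
```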

Result

CausalData(df=(100000, 14), treatment='d', outcome='y', confounders=['tenure_months', 'avg_sessions_week', 'spend_last_month', 'age_years', 'income_monthly', 'prior_purchases_12m', 'support_tickets_90d', 'premium_user', 'mobile_user', 'urban_resident', 'referred_user'], user_id='user_id')

Result
| treatment | count | mean | std | min | p10 | p25 | median | p75 | p90 | max |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 95051 | 76.087138 | 240.800713 | 0.0 | 0.0 | 0.0 | 8.39544 | 64.859278 | 190.227900 | 21396.007575 |
| 1.0 | 4949 | 58.506172 | 199.485625 | 0.0 | 0.0 | 0.0 | 0.00000 | 36.958280 | 148.837193 | 5143.642132 |

Our data has a strong treatment class imbalance: only about 5% of the sample received the treatment.

The treatment group has a lower mean LTV, but it is too early to draw conclusions from this raw difference.
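A per-group summary like the one above can be produced with a plain pandas groupby; a minimal sketch on synthetic data (the real table comes from the Causalis pipeline, so the frame and numbers here are illustrative only):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy stand-in for the client-level table: rare binary treatment 'd'
# (~5% treated, as in the real data) and a zero-inflated, heavy-tailed 'y'.
df = pd.DataFrame({
    "d": rng.binomial(1, 0.05, size=10_000),
    "y": rng.exponential(scale=70.0, size=10_000) * rng.binomial(1, 0.6, size=10_000),
})

# describe() with explicit percentiles yields the count/mean/std/quantile layout.
summary = (
    df.groupby("d")["y"]
      .describe(percentiles=[0.10, 0.25, 0.50, 0.75, 0.90])
)
print(summary[["count", "mean", "50%", "90%", "max"]])
```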

Result

png

We see a heavy right tail.

Result

png

Result
| treatment | n | outlier_count | outlier_rate | lower_bound | upper_bound | has_outliers | method | tail |
|---|---|---|---|---|---|---|---|---|
| 0.0 | 95051 | 11300 | 0.118884 | -97.288916 | 162.148194 | True | iqr | both |
| 1.0 | 4949 | 721 | 0.145686 | -55.437420 | 92.395699 | True | iqr | both |

We see many outliers. This is common for an LTV metric; dropping them would lead to a biased conclusion.
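The bounds above follow the IQR (Tukey fences) rule; a minimal numpy sketch of how such bounds are typically computed (the multiplier `k=1.5` is the usual default, assumed here):

```python
import numpy as np

def iqr_bounds(y, k=1.5):
    """Tukey fences: values outside [q1 - k*IQR, q3 + k*IQR] are flagged."""
    q1, q3 = np.percentile(y, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Small heavy-tailed sample: one extreme value far above the fence.
y = np.array([0.0, 0.0, 5.0, 8.0, 12.0, 20.0, 35.0, 500.0])
lo, hi = iqr_bounds(y)
outliers = (y < lo) | (y > hi)
print(lo, hi, int(outliers.sum()))  # → -26.25 53.75 1
```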

Result
| confounder | mean_d_0 | mean_d_1 | abs_diff | smd | ks_pvalue |
|---|---|---|---|---|---|
| premium_user | 0.751807 | 0.591837 | 0.159970 | -0.345721 | 0.00000 |
| income_monthly | 4549.385190 | 3918.058798 | 631.326392 | -0.277611 | 0.00000 |
| spend_last_month | 89.091801 | 67.375389 | 21.716412 | -0.268360 | 0.00000 |
| support_tickets_90d | 0.984545 | 1.259244 | 0.274699 | 0.253974 | 0.00000 |
| avg_sessions_week | 5.047753 | 4.230148 | 0.817606 | -0.201735 | 0.00000 |
| prior_purchases_12m | 3.904220 | 3.513639 | 0.390581 | -0.189372 | 0.00000 |
| tenure_months | 28.740100 | 25.559161 | 3.180939 | -0.184156 | 0.00000 |
| age_years | 36.435984 | 34.809083 | 1.626901 | -0.144142 | 0.00000 |
| referred_user | 0.271486 | 0.307133 | 0.035647 | 0.078671 | 0.00001 |
| urban_resident | 0.600793 | 0.568802 | 0.031991 | -0.064954 | 0.00013 |
| mobile_user | 0.874573 | 0.870075 | 0.004498 | -0.013477 | 0.99998 |

As we see, clients differ on these confounders. We need to control for them to make valid causal inference.
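The smd column is a standardized mean difference. A common definition (assumed here: mean difference scaled by the pooled standard deviation) can be sketched as:

```python
import numpy as np

def smd(x, d):
    """Standardized mean difference between treated (d=1) and control (d=0):
    (mean_1 - mean_0) / sqrt((var_1 + var_0) / 2)."""
    x1, x0 = x[d == 1], x[d == 0]
    pooled_sd = np.sqrt((x1.var(ddof=1) + x0.var(ddof=1)) / 2)
    return (x1.mean() - x0.mean()) / pooled_sd

rng = np.random.default_rng(42)
d = rng.binomial(1, 0.3, size=5000)
# Confounder shifted downward for the treated, as with income_monthly above.
x = rng.normal(loc=100 - 10 * d, scale=20)
print(round(float(smd(x, d)), 2))  # roughly -0.5 by construction
```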

Inference

ATTE is the right estimand here: we will estimate the effect on the clients that were actually treated, i.e. had a first purchase in our product.

Math Explanation of the IRM Model and ATTE Estimand

The Interactive Regression Model (IRM) is a flexible framework used in Double Machine Learning (DML) to estimate treatment effects. Unlike linear models, it allows the treatment effect to vary with confounders $X$ (interaction) and makes no parametric assumptions about the functional forms of the outcomes.

We write $W=(Y,D,X)$ for an observation, where $D\in\{0,1\}$ is the treatment and $Y$ is the observed outcome.

1. Nuisance Functions

The IRM framework relies on three "nuisance" components estimated from the data:

  • Outcome Regression (Control): $g_0(X) = \mathbb{E}[Y \mid X, D=0]$
  • Outcome Regression (Treated): $g_1(X) = \mathbb{E}[Y \mid X, D=1]$
  • Propensity Score: $m(X) = \mathbb{P}(D=1 \mid X)$

Let $p = \mathbb{P}(D=1) = \mathbb{E}[D]$ denote the overall treatment rate (estimated by the sample mean of $D$).

In the provided implementation (irm.py), these are estimated using cross-fitting (splitting data into folds) to avoid overfitting bias.
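The cross-fitting idea can be sketched in a few lines: each fold's nuisance predictions come from models trained on the other folds. This is a minimal illustration on synthetic data, not the irm.py code; the random-forest learners are an arbitrary choice, and only $g_0$ and $m$ are fitted since those are all the ATTE score needs (see the derivation below):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
d = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))   # confounded treatment
y = 2.0 * d + X[:, 1] + rng.normal(size=n)        # outcome

# Cross-fitting: out-of-fold predictions, so nuisance overfitting
# does not leak into the orthogonal score.
g0_hat = np.zeros(n)   # estimate of E[Y | X, D=0]
m_hat = np.zeros(n)    # estimate of P(D=1 | X)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    ctrl = train[d[train] == 0]   # g0 is fit on control observations only
    g0 = RandomForestRegressor(n_estimators=50, random_state=0).fit(X[ctrl], y[ctrl])
    m = RandomForestClassifier(n_estimators=50, random_state=0).fit(X[train], d[train])
    g0_hat[test] = g0.predict(X[test])
    m_hat[test] = m.predict_proba(X[test])[:, 1]

print(round(float(m_hat.mean()), 2))
```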

2. ATTE (Average Treatment Effect on the Treated)

The Average Treatment Effect on the Treated (ATTE) measures the impact of the treatment specifically on those individuals who received it: $\theta_{ATTE} = \mathbb{E}[Y(1) - Y(0) \mid D=1]$

Under unconfoundedness, $(Y(1),Y(0)) \perp D \mid X$, and overlap, $0 < m(X) < 1$, this is identified from the observed data.

3. The Orthogonal Score

DML uses a Neyman-orthogonal score $\psi$ to ensure the estimator is robust to small errors in the nuisance function estimates. The score for ATTE is defined as: $\psi(W; \theta, \eta) = \psi_b(W; \eta) + \psi_a(W; \eta)\,\theta$

To match the implementation in irm.py, define:

  • Residuals: $u_0 = Y - g_0(X)$, $u_1 = Y - g_1(X)$
  • IPW terms: $h_1 = \frac{D}{m(X)}$, $h_0 = \frac{1-D}{1-m(X)}$
  • Weights (ATTE): $w = \frac{D}{p}$ and $\bar{w} = \frac{m(X)}{p}$ (the normalized form with $\mathbb{E}[w]=1$)

Then:

$$\begin{aligned} \psi_a(W;\eta) &= -w = -\frac{D}{p} \\ \psi_b(W;\eta) &= w\,(g_1(X)-g_0(X)) + \bar{w}\,(u_1 h_1 - u_0 h_0) \end{aligned}$$

(If `normalize_ipw=True`, the code rescales $h_1$ and $h_0$ to have mean 1.)

4. Final Estimation (Step-by-step simplification)

For brevity, write $m = m(X)$, $g_0 = g_0(X)$, and $g_1 = g_1(X)$. Plug in $w$, $\bar{w}$, $h_1$, $h_0$:

$$\begin{aligned} \psi_b &= \frac{D}{p}(g_1-g_0) + \frac{m}{p}\left[\frac{D}{m}(Y-g_1) - \frac{1-D}{1-m}(Y-g_0)\right] \\ &= \frac{D}{p}(g_1-g_0) + \frac{D}{p}(Y-g_1) - \frac{m}{p}\,\frac{1-D}{1-m}(Y-g_0) \\ &= \frac{D}{p}(Y-g_0) - \frac{m}{p}\,\frac{1-D}{1-m}(Y-g_0). \end{aligned}$$

So the $g_1(X)$ terms cancel, and the ATTE score depends only on $g_0(X)$ and $m(X)$. The estimator solves $\mathbb{E}[\psi(W;\theta,\eta)]=0$:

$$\hat{\theta}_{ATTE} = \frac{\mathbb{E}[\psi_b]}{\mathbb{E}[-\psi_a]} = \frac{\mathbb{E}[\psi_b]}{\mathbb{E}[D/p]} = \mathbb{E}[\psi_b].$$

Equivalently, $\hat{\theta}_{ATTE} = \mathbb{E}\left[\frac{D}{p}(Y-g_0(X)) - \frac{m(X)}{p}\,\frac{1-D}{1-m(X)}(Y-g_0(X))\right].$
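The final formula is easy to turn into code. A minimal numpy sketch of the sample analogue (not the actual irm.py implementation; oracle nuisances are plugged in on synthetic data so the score visibly recovers a known effect):

```python
import numpy as np

def atte_hat(y, d, g0_hat, m_hat):
    """Sample analogue of the ATTE score derived above:
    mean( (d/p)*(y - g0) - (m/p)*((1-d)/(1-m))*(y - g0) ), with p = mean(d)."""
    p = d.mean()
    resid = y - g0_hat
    psi_b = (d / p) * resid - (m_hat / p) * ((1 - d) / (1 - m_hat)) * resid
    return psi_b.mean()

rng = np.random.default_rng(1)
n = 50_000
x = rng.normal(size=n)
m = 1 / (1 + np.exp(-x))              # true propensity
d = rng.binomial(1, m)
y = 3.0 * d + x + rng.normal(size=n)  # constant effect of 3, so ATTE = 3

# Oracle nuisances g0(X) = x and m(X), to show the score recovers the truth.
print(round(atte_hat(y, d, g0_hat=x, m_hat=m), 1))  # ≈ 3.0
```

In practice `g0_hat` and `m_hat` would be the cross-fitted predictions, not the oracle functions.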

Result
| field | value |
|---|---|
| estimand | ATTE |
| model | IRM |
| value | 12.9311 (ci_abs: 2.8182, 23.0440) |
| value_relative | 28.3732 (ci_rel: 0.9252, 55.8212) |
| alpha | 0.0500 |
| p_value | 0.0122 |
| is_significant | True |
| n_treated | 4949 |
| n_control | 95051 |
| treatment_mean | 58.5062 |
| control_mean | 76.0871 |
| time | 2026-04-11 |

Our estimate is 12.9311 dollars (ci_abs: 2.8182, 23.0440). The mean LTV in the treatment group is 58.5062 dollars, so without our product it would be 45.5751 dollars.

Refutation

Unconfoundedness

Result

png

balance_max_smd is 0.011635, so the DML specification dealt with controlling the confounders.

Sensitivity

Result
| benchmark_confounder | r2_y | r2_d | rho | theta_long | theta_short | delta |
|---|---|---|---|---|---|---|
| tenure_months | 0.000460 | 0.040377 | 1.0 | 12.931099 | 12.790841 | 0.140258 |
| premium_user | 0.000703 | 0.228005 | 1.0 | 12.931099 | 8.549200 | 4.381898 |
Result
| statistic | value |
|---|---|
| bias_aware_ci | [1.0396, 24.2366] |
| theta | [11.4505, 12.9311, 14.4117] |
| sampling_ci | [2.8182, 23.044] |
| rv | 0.0173 |
| rva | 0.0038 |
| se | 5.1597 |
| max_bias | 1.4806 |
| max_bias_base | 733.9425 |
| bound_width | 1.4806 |
| sigma2 | 33525.4744 |
| nu2 | 16.0675 |

Even if we had a latent confounder as strong as 'tenure_months', our estimate would stay > 0, with bias_aware_ci [1.0396, 24.2366].

SUTVA

Result

1.) Are your clients independent, i.e. the outcome of one does not depend on the others?
2.) Do all clients have a full window to measure the metrics?
3.) Do you measure confounders before treatment and the outcome after?
4.) Do you have a consistent treatment label, such that a person who does not receive the treatment has label 0?

SUTVA is untestable from data alone, so we accept it by design.

Score

Result
| metric | value | flag |
|---|---|---|
| psi_p99_over_med | 65.295592 | RED |
| psi_kurtosis | 35795.761151 | RED |
| max_\|t\|_g0 | 1.213944 | GREEN |
| max_\|t\|_m | 1.127975 | GREEN |
| oos_max_abs_t | 0.000239 | GREEN |
Result

png

Result

png

DML is specified correctly. There are many outliers in the data that affect the score.

Overlap

Result

png

Customers are not inclined to activate our product.

Result
| metric | value | flag |
|---|---|---|
| edge_0.01_below | 0.018420 | GREEN |
| edge_0.01_above | 0.000000 | GREEN |
| KS | 0.185908 | GREEN |
| ESS_treated_ratio | 0.532892 | GREEN |
| ESS_control_ratio | 0.986297 | GREEN |
| ATT_identity_relerr | 0.005106 | GREEN |
| calib_ECE | 0.005432 | GREEN |
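The ESS ratios compare an effective sample size to the nominal one. Assuming the standard Kish formula (an assumption; the report does not state the exact definition used), the metric can be sketched as:

```python
import numpy as np

def ess_ratio(w):
    """Kish effective sample size of weights w, as a fraction of len(w):
    ESS = (sum w)^2 / sum(w^2)."""
    return (w.sum() ** 2 / (w ** 2).sum()) / len(w)

# Uniform weights waste nothing; a few dominant weights shrink the ESS.
print(ess_ratio(np.ones(100)))                           # → 1.0
print(round(ess_ratio(np.array([10.0] + [0.1] * 99)), 3))  # → 0.039
```

A low ESS_treated_ratio, as above, says the propensity weights concentrate on a subset of observations.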

If calib_ECE is YELLOW/RED, look at plot_propensity_reliability(dml_result).

Result

png

Logistic recalibration trend is a fitted correction curve for your predicted propensities.

The plot takes your original propensity p and fits this model:

corrected_p = sigmoid(alpha + beta * logit(p))

So the orange line answers:

“If I recalibrate these predicted probabilities to better match observed treatment frequencies, what corrected probability would I use?”
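A minimal scikit-learn sketch of this Platt-style recalibration on synthetic, deliberately miscalibrated propensities (alpha and beta are the fitted intercept and slope; the data-generating choices are illustrative only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 20_000
true_p = rng.uniform(0.02, 0.5, size=n)
d = rng.binomial(1, true_p)
# Miscalibrated predictions: systematically too high.
p_hat = np.clip(true_p * 1.5, 1e-6, 1 - 1e-6)

# Logistic fit of d on logit(p_hat) gives
# corrected_p = sigmoid(alpha + beta * logit(p_hat)).
logit = np.log(p_hat / (1 - p_hat)).reshape(-1, 1)
lr = LogisticRegression().fit(logit, d)
alpha, beta = lr.intercept_[0], lr.coef_[0, 0]
corrected = lr.predict_proba(logit)[:, 1]

# The corrected mean moves down toward the observed treatment rate.
print(round(float(p_hat.mean()), 2), round(float(corrected.mean()), 2))
```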

How to read it:

  • If orange line matches the diagonal: predictions are already well calibrated
  • If orange line is below the diagonal: model tends to overpredict treatment
  • If orange line is above the diagonal: model tends to underpredict treatment

Rule of thumb:

  • ECE near 0 and slope near 1: great
  • ECE near 0 but slope far from 1: average calibration is fine, but probabilities are distorted in shape
  • Many tiny bad bins + one huge good bin: visually noisy, but often not a major practical issue

Conclusion

First purchase in our product increases LTV by 12.9311 dollars (ci_abs: 2.8182, 23.0440). The model is specified correctly and there is no evidence that the assumptions are violated.