CUPED

Automated conversion of cuped.ipynb

'Controlled-experiment Using Pre-Experiment Data' (CUPED) applies to a scenario where a treatment is randomly assigned to participants and pre-experiment data, such as the pre-treatment outcome, is available for each participant.

Treatment: a new product category for users.

We test the following hypotheses:

$H_0$: There is no difference in LTV between the treatment and control groups.

$H_a$: There is a difference in LTV between the treatment and control groups.

Data

We use a data-generating process (DGP) from Causalis. Read more at https://causalis.causalcraft.com/articles/make_cuped_tweedie_26

Result
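| | y | d | tenure_months | avg_sessions_week | spend_last_month | discount_rate | platform_ios | platform_web | m | m_obs | tau_link | g0 | g1 | cate | y_pre | _latent_A | y_pre_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.734763 | 0.0 | 14.187461 | 2.0 | 57.355300 | 0.158164 | 1.0 | 0.0 | 0.5 | 0.5 | 0.103825 | 8.487966 | 9.416605 | 0.928639 | 17.972138 | 0.304717 | 10.783283 |
| 1 | 0.746406 | 1.0 | 6.352893 | 3.0 | 46.700946 | 0.085722 | 0.0 | 0.0 | 0.5 | 0.5 | 0.030781 | 8.487966 | 8.753301 | 0.265335 | 0.000000 | -1.039984 | 0.000000 |
| 2 | 13.040584 | 1.0 | 18.910153 | 9.0 | 80.136187 | 0.175115 | 1.0 | 0.0 | 0.5 | 0.5 | 0.357355 | 8.487966 | 12.133915 | 3.645949 | 34.771837 | 0.750451 | 24.866330 |
| 3 | 34.582113 | 1.0 | 7.927627 | 4.0 | 33.718224 | 0.152718 | 1.0 | 0.0 | 0.5 | 0.5 | 0.065554 | 8.487966 | 9.063025 | 0.575059 | 349.163943 | 0.940565 | 209.498366 |
| 4 | 0.000000 | 1.0 | 11.106925 | 2.0 | 92.064518 | 0.077390 | 0.0 | 0.0 | 0.5 | 0.5 | 0.056036 | 8.487966 | 8.977172 | 0.489206 | 0.000000 | -1.951035 | 0.243980 |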
Result

Ground truth ATE is 1.2383515933360814
Ground truth ATTE is 1.231529853520104

Result

CausalData(df=(20000, 9), treatment='d', outcome='y', confounders=['avg_sessions_week', 'spend_last_month', 'discount_rate', 'platform_ios', 'platform_web', 'y_pre', 'y_pre_2'])

EDA

Result
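| | treatment | count | mean | std | min | p10 | p25 | median | p75 | p90 | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 10049 | 8.867136 | 21.097599 | 0.0 | 0.0 | 0.0 | 0.276901 | 9.266134 | 24.454852 | 347.095992 |
| 1 | 1.0 | 9951 | 9.870188 | 25.879015 | 0.0 | 0.0 | 0.0 | 0.000000 | 10.409916 | 27.439125 | 956.413897 |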

Result
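| | treatment | n | outlier_count | outlier_rate | lower_bound | upper_bound | has_outliers | method | tail |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 10049 | 1080 | 0.107473 | -13.899201 | 23.165335 | True | iqr | both |
| 1 | 1.0 | 9951 | 1063 | 0.106823 | -15.614875 | 26.024791 | True | iqr | both |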

The outcome has a heavy-tailed distribution: the IQR rule flags about 11% of observations in each group as outliers.
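The outlier bounds in the table can be cross-checked from the quartiles in the EDA summary. The sketch below assumes the `iqr` method is the usual Tukey rule with a 1.5 multiplier (the article does not state the multiplier); `iqr_bounds` is a hypothetical helper, not a Causalis function.

```python
def iqr_bounds(p25, p75, k=1.5):
    """Tukey fences: [p25 - k * IQR, p75 + k * IQR]. k=1.5 is an assumption."""
    iqr = p75 - p25
    return p25 - k * iqr, p75 + k * iqr

# (p25, p75) per treatment arm, copied from the EDA summary table.
quartiles = {0.0: (0.0, 9.266134), 1.0: (0.0, 10.409916)}
bounds = {arm: iqr_bounds(p25, p75) for arm, (p25, p75) in quartiles.items()}

for arm, (lower, upper) in bounds.items():
    print(f"treatment={arm}: lower={lower:.6f}, upper={upper:.6f}")
```

Under this assumption the bounds agree with the table up to rounding. Because the outcome is never negative, the negative lower bound cannot be crossed, so every flagged observation sits in the upper tail.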

Monitoring

An assignment system randomly splits users: half should receive the treatment and the other half should not. We monitor the split with a sample ratio mismatch (SRM) test. Read more at https://causalis.causalcraft.com/articles/srm

Result

SRMResult(status=no SRM, p_value=0.48833, chi2=0.4802)
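As a rough cross-check of the SRM result, the sketch below runs a chi-square goodness-of-fit test on the group sizes from the EDA table, assuming an intended 50/50 split. `srm_test` is a hypothetical helper written for illustration; it is not the Causalis implementation.

```python
import math

def srm_test(n_control, n_treated, expected_share_treated=0.5):
    """Chi-square goodness-of-fit test (1 degree of freedom) for a two-arm split."""
    n = n_control + n_treated
    expected_treated = n * expected_share_treated
    expected_control = n - expected_treated
    chi2 = ((n_treated - expected_treated) ** 2 / expected_treated
            + (n_control - expected_control) ** 2 / expected_control)
    # With 1 degree of freedom the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

chi2, p_value = srm_test(n_control=10049, n_treated=9951)
print(f"chi2={chi2:.4f}, p_value={p_value:.5f}")
```

The statistic and p-value should agree with the reported `SRMResult` up to rounding, which suggests the reported test is this two-cell chi-square test.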

Now let's check the balance of confounders

Result
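| | confounders | mean_d_0 | mean_d_1 | abs_diff | smd | ks_pvalue |
|---|---|---|---|---|---|---|
| 0 | spend_last_month | 73.787667 | 75.159017 | 1.371350 | 0.019726 | 0.20812 |
| 1 | platform_ios | 0.304309 | 0.297357 | 0.006952 | -0.015158 | 0.96771 |
| 2 | y_pre | 7928.516448 | 20967.672427 | 13039.155979 | 0.014621 | 0.36084 |
| 3 | y_pre_2 | 4760.192129 | 12583.709202 | 7823.517073 | 0.014621 | 0.38473 |
| 4 | avg_sessions_week | 4.963379 | 5.015677 | 0.052297 | 0.012405 | 0.49401 |
| 5 | discount_rate | 0.100420 | 0.099996 | 0.000424 | -0.006421 | 0.36146 |
| 6 | platform_web | 0.050950 | 0.050749 | 0.000202 | -0.000918 | 1.00000 |

  • The SRM test passes (p_value = 0.48833)
  • For every confounder, |SMD| < 0.1 and ks_pvalue > 0.05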

The split is random, so unconfoundedness holds.
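For reference, a standardized mean difference can be sketched as follows. This assumes the common definition (treated mean minus control mean, divided by the square root of the average of the two sample variances); the exact formula behind the `smd` column is not stated in the article, and `smd` below is a hypothetical helper run on toy data.

```python
import math
import statistics

def smd(x_control, x_treated):
    """Standardized mean difference: (mean_1 - mean_0) / sqrt((var_0 + var_1) / 2)."""
    mean_0 = statistics.fmean(x_control)
    mean_1 = statistics.fmean(x_treated)
    var_0 = statistics.variance(x_control)  # sample variance (n - 1 denominator)
    var_1 = statistics.variance(x_treated)
    return (mean_1 - mean_0) / math.sqrt((var_0 + var_1) / 2.0)

# Toy data: means differ by 1 and both sample variances equal 1.
example = smd([1, 2, 3], [2, 3, 4])
print(example)
```

Because the difference is scaled by the spread of the covariate, a heavy-tailed column such as `y_pre` can show a large `abs_diff` and still have a small SMD.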

Inference

Math of CUPEDModel

The CUPEDModel implements the Lin (2013) "interacted adjustment" for ATE (Average Treatment Effect) estimation in randomized controlled trials (RCTs). This method is a robust version of ANCOVA that remains valid even when the treatment effect is heterogeneous with respect to the covariates.

1. Specification

The model fits an Ordinary Least Squares (OLS) regression of the outcome $Y$ on the treatment indicator $D$ and centered pre-treatment covariates $X^c$. The specification includes full interactions between the treatment and the centered covariates:

$$Y_i = \alpha + \tau D_i + \beta^T X_i^c + \gamma^T (D_i \cdot X_i^c) + \epsilon_i$$

Where:

  • $Y_i$: Outcome for individual $i$.
  • $D_i$: Binary treatment indicator ($D_i \in \{0, 1\}$).
  • $X_i$: Vector of pre-treatment covariates.
  • $X_i^c = X_i - \bar{X}$: Centered covariates (where $\bar{X}$ is the sample mean).
  • $\alpha$: Intercept (represents the mean outcome of the control group when $X = \bar{X}$).
  • $\tau$: Average Treatment Effect (ATE) or Intent-to-Treat (ITT) effect.
  • $\beta$: Vector of coefficients for the main effects of the covariates.
  • $\gamma$: Vector of coefficients for the interaction terms between treatment and covariates.
  • $\epsilon_i$: Residual error term.

2. Why Centering and Interaction?

  • Centering ($X^c$): By centering the covariates, the coefficient $\tau$ directly represents the ATE at the average value of the covariates.
  • Interactions ($D \cdot X^c$): Including interactions (as proposed by Lin, 2013) ensures that $\tau$ is a consistent estimator of the population ATE even if the true treatment effect varies with $X$ (heterogeneity). In traditional ANCOVA without interactions, the estimator for $\tau$ can be biased or less efficient under heterogeneity.
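The specification above can be sketched with plain NumPy. This is a minimal illustration under stated assumptions, not the CUPEDModel source: the function name `lin_ate`, the simulated data, and the hand-rolled HC3 formula are all introduced here for illustration.

```python
import numpy as np

def lin_ate(y, d, X):
    """OLS of y on [1, d, Xc, d * Xc]; returns the coefficient on d and its HC3 SE."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=float)
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    Xc = X - X.mean(axis=0)  # center covariates at the sample mean
    Z = np.column_stack([np.ones_like(y), d, Xc, d[:, None] * Xc])
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    beta = ZtZ_inv @ Z.T @ y
    resid = y - Z @ beta
    leverage = np.einsum("ij,jk,ik->i", Z, ZtZ_inv, Z)  # diagonal of the hat matrix
    weights = (resid / (1.0 - leverage)) ** 2            # HC3 scaling of squared residuals
    cov = ZtZ_inv @ (Z.T * weights) @ Z @ ZtZ_inv        # sandwich estimator
    return beta[1], float(np.sqrt(cov[1, 1]))

# Simulated RCT (hypothetical data): the effect varies with x, and the true ATE is 2.0.
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
d = rng.integers(0, 2, size=n)
y = 1.0 + 2.0 * d + 3.0 * x + 1.5 * d * x + rng.normal(size=n)

tau_hat, se = lin_ate(y, d, x)
print(f"tau_hat={tau_hat:.3f}, se={se:.3f}")
```

Because the covariate is centered, the coefficient on `d` is read directly as the ATE estimate even though the simulated effect depends on `x`.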

3. Inference and Variance Reduction

The model uses robust covariance estimators (defaulting to HC3) to calculate standard errors, which accounts for potential heteroscedasticity:

  • Standard Error ($SE$): Derived from the robust covariance matrix of the OLS fit.
  • Variance Reduction: The main goal of CUPED is to reduce the variance of the ATE estimate by "soaking up" explainable variation in $Y$ using pre-treatment data $X$. The variance reduction percentage is calculated by comparing the variance of the adjusted model to a "naive" model ($Y \sim 1 + D$):

    $$\text{Variance Reduction \%} = 1 - \frac{Var(\hat{\tau}_{adjusted})}{Var(\hat{\tau}_{naive})}$$

4. Absolute and Relative Confidence Intervals

Absolute Confidence Interval

The $1-\alpha$ confidence interval for the ATE $\hat{\tau}$ is:

$$CI_{abs} = [\hat{\tau} - z_{crit} \cdot SE(\hat{\tau}), \hat{\tau} + z_{crit} \cdot SE(\hat{\tau})]$$

where $z_{crit}$ is the critical value for the significance level $\alpha$.

Relative Effect

The relative effect is the ATE expressed as a percentage of the control group mean:

$$\tau_{rel} = 100\% \cdot \frac{\hat{\tau}}{\bar{Y}_C}$$

where $\bar{Y}_C$ is the sample mean of the control group.

Relative Confidence Interval (Delta Method)

The variance of the relative effect is estimated using the Delta Method. Assuming $\hat{\tau}$ and $\bar{Y}_C$ are approximately independent, the variance is:

$$Var(\tau_{rel}) \approx \left( \frac{100}{\bar{Y}_C} \right)^2 Var(\hat{\tau}) + \left( \frac{-100 \cdot \hat{\tau}}{\bar{Y}_C^2} \right)^2 Var(\bar{Y}_C)$$

The confidence interval for the relative effect is then:

$$CI_{rel} = [\tau_{rel} - z_{crit} \cdot \sqrt{Var(\tau_{rel})}, \tau_{rel} + z_{crit} \cdot \sqrt{Var(\tau_{rel})}]$$
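The delta-method interval can be sketched numerically with the figures reported in this article. Several things here are assumptions: $Var(\bar{Y}_C)$ is taken as the control-group sample variance divided by the control-group size, the critical value is the standard normal quantile, and the standard error is backed out from the reported absolute interval (0.2724, 1.3587) because the article does not print it. `relative_effect_ci` is a hypothetical helper.

```python
import math
from statistics import NormalDist

def relative_effect_ci(tau_hat, se_tau, control_mean, control_std, n_control, alpha=0.05):
    """Relative effect in percent with a delta-method confidence interval."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    var_control_mean = control_std ** 2 / n_control  # assumed estimator of Var(mean of Y_C)
    tau_rel = 100.0 * tau_hat / control_mean
    var_rel = ((100.0 / control_mean) ** 2 * se_tau ** 2
               + (-100.0 * tau_hat / control_mean ** 2) ** 2 * var_control_mean)
    half_width = z_crit * math.sqrt(var_rel)
    return tau_rel, tau_rel - half_width, tau_rel + half_width

# Back out SE(tau_hat) from the reported 95% absolute interval.
z_crit = NormalDist().inv_cdf(0.975)
se_tau = (1.3587 - 0.2724) / (2.0 * z_crit)

# Point estimate from the result table; control mean, std and n from the EDA table.
tau_rel, ci_low, ci_high = relative_effect_ci(0.8155, se_tau, 8.867136, 21.097599, 10049)
print(f"tau_rel={tau_rel:.3f}, ci_rel=({ci_low:.3f}, {ci_high:.3f})")
```

Under these assumptions the interval lands close to the reported `ci_rel`; small differences in the last digits come from using rounded inputs.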

Result
| field | value |
|---|---|
| estimand | ATE |
| model | CUPEDModel |
| value | 0.8155 (ci_abs: 0.2724, 1.3587) |
| value_relative | 9.1973 (ci_rel: 3.0567, 15.3378) |
| alpha | 0.0500 |
| p_value | 0.0033 |
| is_significant | True |
| n_treated | 9951 |
| n_control | 10049 |
| treatment_mean | 9.8702 |
| control_mean | 8.8671 |
| time | 2026-02-20 |

Result

var reduction with CUPED %: 31.179433556787917
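The reported variance reduction can be roughly reproduced from numbers already shown in this article. The sketch assumes the naive standard error is well approximated by the unpooled difference-in-means formula applied to the group standard deviations in the EDA table, and that the adjusted standard error can be backed out from the reported 95% interval with a normal critical value. Both are approximations, not the library's exact computation.

```python
import math
from statistics import NormalDist

# Naive SE of the difference in means, from the EDA table (std and count per arm).
se_naive = math.sqrt(21.097599 ** 2 / 10049 + 25.879015 ** 2 / 9951)

# Adjusted SE backed out from the reported interval ci_abs = (0.2724, 1.3587).
z_crit = NormalDist().inv_cdf(0.975)
se_adjusted = (1.3587 - 0.2724) / (2.0 * z_crit)

# Variance reduction = 1 - Var(adjusted) / Var(naive).
var_reduction = 1.0 - (se_adjusted / se_naive) ** 2
print(f"naive SE={se_naive:.4f}, adjusted SE={se_adjusted:.4f}, "
      f"variance reduction={100.0 * var_reduction:.1f}%")
```

A variance reduction of about 31% corresponds to a standard error roughly 17% smaller than the naive one, since the standard error scales with the square root of the variance.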

Refutation

Unconfoundedness

Unconfoundedness holds by design because the treatment is randomly assigned. It is checked by proxy with the SRM test and the balance check in the Monitoring section.

SUTVA

Result

1.) Are your clients independent, i.e. does the outcome of one client not depend on the others?
2.) Do all clients have the full window to measure metrics?
3.) Do you measure confounders before treatment and the outcome after it?
4.) Do you have a consistent treatment label, e.g. a person who does not receive the treatment has label 0?

Overlap

Overlap holds by design, since the treatment is randomly assigned.

Regression specification

Result
| | test_id | test | flag | value | threshold | message |
|---|---|---|---|---|---|---|
| 0 | design_rank | Design rank | GREEN | rank=6, k=6 | rank == k | Design matrix is full rank. |
| 1 | condition_number | Condition number | GREEN | 3335417.561616 | <= 1.000e+08 | Condition number is within expected range. |
| 2 | near_duplicates | Near-duplicate covariates | YELLOW | 1 | red if >= 3 | Near-duplicate covariate pairs detected. |
| 3 | vif | Variance inflation factor | RED | 6484907465.751774 | yellow: > 20, red: > 40 | Very large VIF indicates severe multicollinearity. |
| 4 | ate_gap | Adjusted vs naive ATE | GREEN | 0.561334 | yellow: > 2.00, red: > 2.50 | Adjusted and naive ATE are reasonably aligned. |
| 5 | residual_tails | Residual extremes | RED | max\|std resid\|=27.8 | yellow > 7, red > 10 | Extremely large standardized residuals; outliers likely dominate. |
| 6 | leverage | Leverage | RED | max_h=0.91, n_high=653 | yellow if max_h > 5 × 0.0006, red if max_h > max(0.5, 10 × 0.0006) | Extreme leverage points detected. |
| 7 | cooks | Cook's distance | RED | max=1301, n_high=778 | yellow if max Cook's > 0.1, red if > 1 | Strongly influential observations detected. |
| 8 | hc23_stability | HC2/HC3 stability | GREEN | min(1-h)=9.004e-02, n_tiny=0 | min(1-h) >= 1.0e-06 | HC2/HC3 stability check passed. |
| 9 | winsor_sensitivity | Winsor sensitivity | GREEN | 0.304010 | yellow: > 1.00 SE, red: > 2.00 SE | Winsorized refit is close to baseline ATE. |
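To illustrate how a near-duplicate covariate pair can drive the VIF flag, the sketch below computes variance inflation factors on simulated data. It assumes the textbook definition VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing covariate j on the other covariates; the diagnostic's exact computation is not shown in the article. The columns `a`, `b`, `c` are hypothetical, with `b` built as a near copy of `a` to mimic a pair such as `y_pre` and `y_pre_2`.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (textbook definition)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        # 1 / (1 - R^2) equals total variance divided by residual variance.
        factors.append(target.var() / resid.var())
    return np.array(factors)

rng = np.random.default_rng(0)
n = 2000
a = rng.normal(size=n)
b = 0.6 * a + 1e-3 * rng.normal(size=n)  # near-duplicate of a
c = rng.normal(size=n)                   # unrelated covariate

vif_values = vif(np.column_stack([a, b, c]))
print(dict(zip(["a", "b", "c"], vif_values.round(2))))
```

In this simulation the two near-duplicate columns get very large factors while the unrelated column stays close to 1, the same pattern that would be expected when only one pair of covariates is nearly collinear.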