
generate_cuped_binary_26

Automated conversion of generate_cuped_binary.ipynb

The make_cuped_binary_26 data generating process (DGP) creates a synthetic binary-outcome dataset with randomized treatment, richer confounders, structured heterogeneous treatment effects (HTE), and a calibrated pre-period covariate (y_pre) for CUPED benchmarking.

1. Confounders

The DGP uses mixed-type confounders sampled from independent marginals:

  • tenure_months ($X_1$): $\text{Lognormal}(\mu=2.5, \sigma=0.5)$
  • spend_last_month ($X_2$): $\text{Lognormal}(\mu=4.0, \sigma=0.8)$
  • discount_rate ($X_3$): $\text{Beta}(\text{mean}=0.1, \kappa=20)$
  • support_tickets ($X_4$): $\text{Poisson}(\lambda=1.8)$
  • email_open_rate ($X_5$): $\text{Beta}(\text{mean}=0.45, \kappa=14)$
  • referral_count ($X_6$): $\text{Poisson}(\lambda=0.8)$
  • plan_tier ($X_7$): Categorical with levels free (55%), plus (30%), pro (15%), encoded as plan_tier_plus and plan_tier_pro.
  • region ($X_8$): Categorical with levels na (80%), eu (20%), encoded as region_eu.
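
The marginals above can be sketched in NumPy as follows. This is a minimal illustration, not the library's implementation: the helper name `sample_confounders` is hypothetical, and it assumes that $\kappa$ is the Beta concentration $a + b$, so that $a = \text{mean} \cdot \kappa$ and $b = (1 - \text{mean}) \cdot \kappa$.

```python
import numpy as np


def sample_confounders(n, seed=0):
    """Hypothetical helper: draw the eight confounders from independent marginals."""
    rng = np.random.default_rng(seed)

    def beta_mean_kappa(mean, kappa):
        # Assumption: kappa is the concentration a + b of the Beta distribution.
        return rng.beta(mean * kappa, (1.0 - mean) * kappa, size=n)

    plan = rng.choice(["free", "plus", "pro"], size=n, p=[0.55, 0.30, 0.15])
    region = rng.choice(["na", "eu"], size=n, p=[0.80, 0.20])

    return {
        "tenure_months": rng.lognormal(mean=2.5, sigma=0.5, size=n),
        "spend_last_month": rng.lognormal(mean=4.0, sigma=0.8, size=n),
        "discount_rate": beta_mean_kappa(0.10, 20.0),
        "support_tickets": rng.poisson(1.8, size=n).astype(float),
        "email_open_rate": beta_mean_kappa(0.45, 14.0),
        "referral_count": rng.poisson(0.8, size=n).astype(float),
        # Dummy encoding with "free" and "na" as the reference levels.
        "plan_tier_plus": (plan == "plus").astype(float),
        "plan_tier_pro": (plan == "pro").astype(float),
        "region_eu": (region == "eu").astype(float),
    }


X = sample_confounders(10_000)
```

Under this parameterization the sample means of discount_rate and email_open_rate should land near 0.10 and 0.45.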

2. Treatment Assignment

Treatment is randomized with constant propensity: $D \sim \text{Bernoulli}(0.5)$. This is implemented with $\alpha_d = \text{logit}(0.5) = 0$ and no $X$- or latent-driven treatment terms.

3. Outcome Model

The outcome is binary with a logistic link:

$$Y \sim \text{Bernoulli}(p), \quad \text{logit}(p)=\alpha_y + g_y(X) + \lambda U + D \cdot \tau(X)$$

  • $\alpha_y=-1.4$.
  • $U \sim \mathcal{N}(0,1)$ is a shared latent prognostic signal used only in the outcome equation (so treatment remains randomized/unconfounded).
  • $\lambda=1.2$ when add_pre=True, and $\lambda=0$ otherwise.
  • Baseline score $g_y(X)$ combines monotone transformations of confounders and segment indicators.
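
A minimal sketch of the treatment and outcome draws is shown below. The exact form of $g_y(X)$ is not given here, so the hypothetical helper `simulate_outcome` takes the baseline score and the treatment effect as precomputed arrays; everything else follows the equation and parameter values above.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def simulate_outcome(g_y, tau, add_pre=True, alpha_y=-1.4, seed=0):
    """Hypothetical helper: draw D and Y given baseline score g_y(X) and effect tau(X)."""
    rng = np.random.default_rng(seed)
    g_y = np.asarray(g_y, dtype=float)
    tau = np.asarray(tau, dtype=float)
    n = g_y.shape[0]

    # Constant propensity 0.5: no X- or U-driven terms enter the assignment.
    d = rng.binomial(1, 0.5, size=n)

    # Latent prognostic signal, used only in the outcome equation.
    u = rng.normal(size=n)
    lam = 1.2 if add_pre else 0.0

    p = sigmoid(alpha_y + g_y + lam * u + d * tau)
    y = rng.binomial(1, p)
    return d, y, p
```

With `g_y` and `tau` set to zero and `add_pre=False`, every unit has the same outcome probability $\sigma(-1.4) \approx 0.198$.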

4. Heterogeneous Treatment Effect

Treatment effect is defined on the log-odds scale:

$$\tau(X)=\text{clip}(\theta_{logit} \cdot g_t \cdot g_{sp} \cdot g_o \cdot g_{ticket} \cdot g_{ref} \cdot seg, 0, 2\theta_{logit})$$

where:

  • $g_t = \sigma(0.20 \cdot (\text{tenure} - 10))$
  • $g_{sp} = \sigma(0.55 \cdot (\log(1+\text{spend}) - 4))$
  • $g_o = 0.85 + 0.35 \cdot \text{email\_open\_rate}$
  • $g_{ticket} = \frac{1}{1+\exp(0.45 \cdot (\text{support\_tickets}-2.5))}$
  • $g_{ref} = 0.90 + 0.18 \cdot \tanh(0.8 \cdot \text{referral\_count})$
  • $seg = 1 + 0.12\,\mathbb{1}_{plan=plus} + 0.26\,\mathbb{1}_{plan=pro} + 0.06\,\mathbb{1}_{region=eu}$
  • $\theta_{logit}=0.38$ by default.
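
The effect function above translates directly into NumPy. The function name `tau_logit` and its argument names are illustrative choices, not part of the original code; the constants are the ones listed above.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def tau_logit(tenure, spend, email_open_rate, support_tickets, referral_count,
              plan_plus, plan_pro, region_eu, theta_logit=0.38):
    """Illustrative log-odds treatment effect tau(X); accepts scalars or arrays."""
    g_t = sigmoid(0.20 * (tenure - 10.0))
    g_sp = sigmoid(0.55 * (np.log1p(spend) - 4.0))
    g_o = 0.85 + 0.35 * email_open_rate
    # Decreasing in ticket count: more tickets shrink the effect.
    g_ticket = 1.0 / (1.0 + np.exp(0.45 * (support_tickets - 2.5)))
    g_ref = 0.90 + 0.18 * np.tanh(0.8 * referral_count)
    seg = 1.0 + 0.12 * plan_plus + 0.26 * plan_pro + 0.06 * region_eu
    raw = theta_logit * g_t * g_sp * g_o * g_ticket * g_ref * seg
    return np.clip(raw, 0.0, 2.0 * theta_logit)


# First row of the sample output shown in this document.
tau_row0 = tau_logit(14.187461, 62.867015, 0.412214, 2.0, 2.0, 0.0, 0.0, 0.0)
```

Evaluated on the first row of the sample output, this gives roughly 0.0815, in line with that row's tau_link value.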

5. Pre-period Covariate

If add_pre=True, first define a centered probability-scale baseline signal:

$$s(X,U)=\sigma\!\left(\alpha_y + g_y(X) + \lambda U\right)-\mathbb{E}\left[\sigma\!\left(\alpha_y + g_y(X) + \lambda U\right)\right]$$

Then construct:

$$y_{pre}=s(X,U)+\sigma_{noise}\epsilon, \quad \epsilon \sim \mathcal{N}(0,1)$$

The noise scale $\sigma_{noise}$ is calibrated numerically so that $\text{corr}(y_{pre}, y)$ in the control group matches the target correlation (default $0.65$).

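Result

| | y | d | tenure_months | spend_last_month | discount_rate | support_tickets | email_open_rate | referral_count | plan_tier_plus | plan_tier_pro | region_eu | m | m_obs | tau_link | g0 | g1 | cate | y_pre |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 14.187461 | 62.867015 | 0.087395 | 2.0 | 0.412214 | 2.0 | 0.0 | 0.0 | 0.0 | 0.5 | 0.5 | 0.081508 | 0.267699 | 0.280682 | 0.012983 | 0.009942 |
| 1 | 0.0 | 0.0 | 7.242801 | 112.089626 | 0.173939 | 2.0 | 0.131957 | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 | 0.5 | 0.037301 | 0.223180 | 0.228483 | 0.005303 | -0.220081 |
| 2 | 1.0 | 1.0 | 17.729423 | 16.813523 | 0.141900 | 2.0 | 0.376393 | 0.0 | 1.0 | 0.0 | 0.0 | 0.5 | 0.5 | 0.060442 | 0.245717 | 0.254861 | 0.009144 | 0.096021 |
| 3 | 0.0 | 0.0 | 19.497424 | 43.456545 | 0.146694 | 2.0 | 0.680602 | 1.0 | 0.0 | 0.0 | 0.0 | 0.5 | 0.5 | 0.096195 | 0.269024 | 0.284422 | 0.015398 | 0.188115 |
| 4 | 0.0 | 0.0 | 4.592766 | 105.987656 | 0.112326 | 6.0 | 0.642113 | 0.0 | 1.0 | 0.0 | 1.0 | 0.5 | 0.5 | 0.011141 | 0.264030 | 0.265771 | 0.001741 | -0.250511 |
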
Result

Ground truth ATE is 0.011857450138243646
Ground truth ATTE is 0.011877088659875997
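
In the sample rows, cate equals g1 - g0, so a plausible reading is that the ground-truth ATE is the mean of cate over all units and the ATTE is its mean over treated units. That reading is an assumption; the sketch below only illustrates it on the five sample rows shown earlier, which are far too few to reproduce the full-sample values printed above.

```python
import numpy as np

# g0, g1 and d copied from the five sample rows shown in this document.
g0 = np.array([0.267699, 0.223180, 0.245717, 0.269024, 0.264030])
g1 = np.array([0.280682, 0.228483, 0.254861, 0.284422, 0.265771])
d = np.array([0, 0, 1, 0, 0])

# Unit-level effect on the probability scale.
cate = g1 - g0

# Assumed definitions: average over everyone vs. average over treated units.
ate = cate.mean()
atte = cate[d == 1].mean()

print(f"ATE on sample rows:  {ate:.6f}")
print(f"ATTE on sample rows: {atte:.6f}")
```

Because treatment is randomized, the treated units are a random subset of the population, which is consistent with the reported ATE and ATTE being nearly identical.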

Result

CausalData(df=(10000, 12), treatment='d', outcome='y', confounders=['tenure_months', 'spend_last_month', 'discount_rate', 'support_tickets', 'email_open_rate', 'referral_count', 'plan_tier_plus', 'plan_tier_pro', 'region_eu', 'y_pre'])

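Result

| | treatment | count | mean | std | min | p10 | p25 | median | p75 | p90 | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 5034 | 0.276917 | 0.447520 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 1 | 1.0 | 4966 | 0.298027 | 0.457437 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
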
Result

| | confounders | mean_d_0 | mean_d_1 | abs_diff | smd | ks_pvalue |
|---|---|---|---|---|---|---|
| 0 | region_eu | 0.210369 | 0.189086 | 0.021284 | -0.053250 | 0.20331 |
| 1 | spend_last_month | 77.921167 | 75.111595 | 2.809573 | -0.039381 | 0.68313 |
| 2 | plan_tier_pro | 0.151569 | 0.161095 | 0.009526 | 0.026230 | 0.97481 |
| 3 | support_tickets | 1.817839 | 1.789569 | 0.028270 | -0.020930 | 0.72761 |
| 4 | referral_count | 0.797378 | 0.785139 | 0.012239 | -0.013696 | 1.00000 |
| 5 | tenure_months | 13.780500 | 13.732216 | 0.048284 | -0.006523 | 0.69284 |
| 6 | plan_tier_plus | 0.298967 | 0.296214 | 0.002753 | -0.006020 | 1.00000 |
| 7 | discount_rate | 0.100375 | 0.100514 | 0.000139 | 0.002123 | 0.34966 |
| 8 | y_pre | -0.000064 | 0.000065 | 0.000129 | 0.000612 | 0.64056 |
| 9 | email_open_rate | 0.447786 | 0.447805 | 0.000020 | 0.000155 | 0.95826 |

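
The smd column appears consistent with the usual standardized mean difference: treated mean minus control mean, divided by the square root of the average of the two group variances. The sketch below uses that definition as an assumption (the helper name `smd` is illustrative) and rebuilds the region_eu column from the group sizes and means reported in this document. The ks_pvalue column is not reproduced here.

```python
import numpy as np


def smd(x, d):
    """Standardized mean difference (treated minus control), pooled over both groups."""
    x = np.asarray(x, dtype=float)
    d = np.asarray(d)
    x0, x1 = x[d == 0], x[d == 1]
    pooled_sd = np.sqrt((x0.var(ddof=1) + x1.var(ddof=1)) / 2.0)
    return (x1.mean() - x0.mean()) / pooled_sd


# Rebuild region_eu from the reported counts: 5034 controls with mean 0.210369
# (1059 ones) and 4966 treated with mean 0.189086 (939 ones).
x_ctrl = np.r_[np.ones(1059), np.zeros(5034 - 1059)]
x_trt = np.r_[np.ones(939), np.zeros(4966 - 939)]
x = np.r_[x_ctrl, x_trt]
d = np.r_[np.zeros(5034, dtype=int), np.ones(4966, dtype=int)]

region_eu_smd = smd(x, d)
```

On the rebuilt region_eu column this gives about -0.053, close to the tabulated value; all absolute SMDs in the table are below 0.1, a commonly used balance threshold, as expected under randomization.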