
generate_cuped_tweedie_26

Automated conversion of generate_cuped_tweedie_26.ipynb


The generate_cuped_tweedie_26 data generating process (DGP) creates a synthetic dataset characterized by a Tweedie-like outcome (zero-inflated with a heavy right tail), correlated confounders, and structured heterogeneous treatment effects (HTE). It also includes pre-period covariates (y_pre and y_pre_2) calibrated for CUPED benchmarks.

1. Confounders

Five confounders are generated using a Gaussian Copula to induce specific correlations:

  • tenure_months ($X_1$): $\text{Lognormal}(\mu=2.5, \sigma=0.5)$
  • avg_sessions_week ($X_2$): $\text{NegativeBinomial}(\mu=5, \alpha=0.5)$
  • spend_last_month ($X_3$): $\text{Lognormal}(\mu=4.0, \sigma=0.8)$
  • discount_rate ($X_4$): $\text{Beta}(\text{mean}=0.1, \kappa=20)$
  • platform ($X_5$): Categorical with levels android (65%), ios (30%), web (5%).

Correlations: $\text{corr}(X_1, X_2) = 0.4$ and $\text{corr}(X_2, X_3) = 0.5$.
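A minimal sketch of the Gaussian-copula step for the three correlated confounders, using only the Python standard library (discount_rate and platform are left out). Several things here are assumptions rather than facts from the generator: the correlations are applied on the latent normal scale, the unspecified correlation between $X_1$ and $X_3$ is set to zero, the lognormal parameters are read as log-scale mean and standard deviation, and the negative binomial is read in mean/dispersion form with variance $\mu + \alpha\mu^2$. All function names are hypothetical.

```python
import math
import random


def cholesky(a):
    """Lower-triangular Cholesky factor of a small positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def negbin_quantile(u, mean, alpha, k_max=10_000):
    """Smallest k with CDF(k) >= u, assuming size r = 1/alpha and p = r/(r+mean)."""
    r = 1.0 / alpha
    p = r / (r + mean)
    pmf = p ** r          # P(K = 0)
    cdf = pmf
    k = 0
    while cdf < u and k < k_max:
        pmf *= (k + r) / (k + 1) * (1.0 - p)   # P(K = k + 1) from P(K = k)
        k += 1
        cdf += pmf
    return k


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def sample_confounders(n, seed=0):
    """Draw correlated latent normals, then push each through its marginal."""
    rng = random.Random(seed)
    # Assumed latent correlation matrix; corr(X1, X3) = 0 is not from the source.
    corr = [[1.0, 0.4, 0.0], [0.4, 1.0, 0.5], [0.0, 0.5, 1.0]]
    low = cholesky(corr)
    cols = {"z1": [], "z2": [], "z3": [],
            "tenure_months": [], "avg_sessions_week": [], "spend_last_month": []}
    for _ in range(n):
        e = [rng.gauss(0.0, 1.0) for _ in range(3)]
        z = [sum(low[i][k] * e[k] for k in range(i + 1)) for i in range(3)]
        cols["z1"].append(z[0])
        cols["z2"].append(z[1])
        cols["z3"].append(z[2])
        # Lognormal quantile of Phi(z) is simply exp(mu + sigma * z).
        cols["tenure_months"].append(math.exp(2.5 + 0.5 * z[0]))
        cols["avg_sessions_week"].append(negbin_quantile(norm_cdf(z[1]), 5.0, 0.5))
        cols["spend_last_month"].append(math.exp(4.0 + 0.8 * z[2]))
    return cols
```

Because the marginal transforms are nonlinear, the correlations of the observed columns will generally differ somewhat from the latent values of 0.4 and 0.5.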

2. Treatment Assignment

The treatment $D$ is assigned via a Bernoulli trial with a constant propensity score (RCT): $D \sim \text{Bernoulli}(0.5)$. (Note: while the generator supports complex propensity models, generate_cuped_tweedie_26 defaults to a balanced random assignment.)

3. Outcome Model

The outcome is generated as a two-part (hurdle) process: $Y = I \cdot Y_{pos}$

A. Binary Indicator of Non-zero Outcome ($I$)

$I \sim \text{Bernoulli}(\sigma(\alpha_{zi} + u_{zi} \cdot U))$

  • $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function.
  • $\alpha_{zi} = 0.0$ (about 50% baseline non-zero rate).
  • $u_{zi} = 1.0$ when add_pre=True, and $0$ otherwise.

B. Positive Outcome Value ($Y_{pos}$)

If $I=1$, the value is drawn from a Gamma distribution: $Y_{pos} \sim \text{Gamma}(\text{shape}=k, \text{scale}=\mu_{pos} / k)$

  • $k = 2.0$ (shape parameter).
  • $\mu_{pos} = \exp(\text{loc})$, where $\text{loc}$ is the log-mean linear predictor: $\text{loc} = \alpha_y + D \cdot \tau(X) + u_y \cdot U$
  • $\alpha_y = 2.0$.
  • $u_y = 1.0$ when add_pre=True, and $0$ otherwise.
  • $U \sim \mathcal{N}(0,1)$ is a shared latent prognostic driver used in the outcome/pre construction (treatment remains randomized because $u_d=0$).
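The two-part draw described above can be sketched in a few lines of standard-library Python. This is an illustration under the stated parameter values, not the generator's code; the function and constant names are hypothetical. Note that `random.gammavariate(alpha, beta)` takes shape and scale, so its mean is `alpha * beta`, which equals $\mu_{pos}$ here.

```python
import math
import random

ALPHA_ZI = 0.0   # intercept of the non-zero indicator
ALPHA_Y = 2.0    # intercept of the log-mean predictor
K_SHAPE = 2.0    # Gamma shape k


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def draw_outcome(d, tau, u, rng, add_pre=True):
    """Draw one Y = I * Y_pos given treatment d, log-scale effect tau, latent u."""
    u_zi = 1.0 if add_pre else 0.0
    u_y = 1.0 if add_pre else 0.0
    # Part A: hurdle. With probability 1 - sigmoid(...) the outcome is exactly zero.
    if rng.random() >= sigmoid(ALPHA_ZI + u_zi * u):
        return 0.0
    # Part B: positive value from a Gamma with mean mu_pos.
    mu_pos = math.exp(ALPHA_Y + d * tau + u_y * u)
    return rng.gammavariate(K_SHAPE, mu_pos / K_SHAPE)


def simulate_outcomes(n, d=0, tau=0.0, add_pre=True, seed=0):
    """Simulate n outcomes with a fresh latent U ~ N(0, 1) per unit."""
    rng = random.Random(seed)
    return [draw_outcome(d, tau, rng.gauss(0.0, 1.0), rng, add_pre) for _ in range(n)]
```

With add_pre=False the latent driver drops out of both parts, so every unit shares the same non-zero probability $\sigma(0) = 0.5$ and the same positive mean $e^{2}$.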

4. Heterogeneous Treatment Effect

The treatment effect $\tau(X)$ is defined on the log-mean scale and incorporates monotone effects, diminishing returns, and categorical modifiers:

$$\tau(X) = \text{clip}\left(\theta_{log} \cdot g_t(X_1) \cdot g_s(X_2) \cdot \text{seg}(X_5), 0, 2.5 \cdot \theta_{log}\right)$$

where:

  • $g_t(X_1) = \frac{1}{1 + \exp(-0.25 \cdot (X_1 - 12))}$ (saturating effect of tenure).
  • $g_s(X_2) = \frac{1}{1 + \exp(-0.35 \cdot (X_2 - 4))}$ (diminishing returns of sessions).
  • $\text{seg}(X_5) = 1 + 0.3 \cdot \mathbb{1}_{\text{platform}=\text{ios}}$ (premium segment modifier).
  • generate_cuped_tweedie_26 uses $\theta_{log}=0.38$ by default.
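The effect function is easy to check by hand, so here is a small sketch of it (function names are hypothetical). It also includes a conversion to the natural scale. In the outcome model, $D$ enters only through $\exp(D \cdot \tau(X))$ in the positive-part mean and the zero indicator does not depend on $D$, so the treated mean is $e^{\tau(X)}$ times the control mean. Reading the g0 column of the sample output as that control mean is an assumption.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def tau_log(tenure_months, avg_sessions_week, platform, theta_log=0.38):
    """Log-mean-scale treatment effect tau(X), clipped to [0, 2.5 * theta_log]."""
    g_t = sigmoid(0.25 * (tenure_months - 12.0))       # tenure component
    g_s = sigmoid(0.35 * (avg_sessions_week - 4.0))    # sessions component
    seg = 1.3 if platform == "ios" else 1.0            # iOS segment modifier
    raw = theta_log * g_t * g_s * seg
    return min(max(raw, 0.0), 2.5 * theta_log)


def cate_natural_scale(g0, tau):
    """Effect on the outcome scale, assuming treated mean = g0 * exp(tau)."""
    return g0 * (math.exp(tau) - 1.0)
```

For the first row of the sample output (tenure 14.187461, 2 sessions, iOS), this gives a value that agrees with the reported tau_link of 0.103825 to about four decimal places, and combining it with g0 = 8.487966 lands close to the reported cate of 0.928639.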

5. Pre-period Covariate

The first pre-period covariate uses the same latent driver $A \sim \mathcal{N}(0,1)$ (written $U$ in the outcome model above) that also drives outcome stochasticity when add_pre=True:

$$y_{pre} = base_1 + \sigma_{noise} \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0,1)$$

where $base_1$ is built from the Tweedie two-part mechanism, and $\sigma_{noise}$ is numerically calibrated so that $\mathrm{corr}(y_{pre}, y \mid d=0) \approx \rho_{target,1}$, with default $\rho_{target,1}=0.82$ in generate_cuped_tweedie_26.
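To make the calibration idea concrete, here is a sketch that swaps the numerical search for a closed form. If the added noise is independent of everything else, then $\mathrm{corr}(base + \sigma\epsilon, y) = \rho_{base} \cdot \mathrm{sd}(base) / \sqrt{\mathrm{var}(base) + \sigma^2}$, which can be solved for $\sigma$. The generator's actual procedure may differ; the function names are hypothetical and the inputs are assumed to be control-group rows only.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def variance(xs):
    """Population variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def calibrate_noise_sd(base, y, target_corr):
    """Noise SD that lowers corr(base, y) to target_corr (closed-form stand-in)."""
    rho_base = pearson(base, y)
    if rho_base <= target_corr:
        return 0.0   # adding noise can only reduce correlation; target unreachable
    return math.sqrt(variance(base) * ((rho_base / target_corr) ** 2 - 1.0))


def add_calibrated_noise(base, y, target_corr, seed=0):
    """Return base + sigma * eps with sigma chosen by calibrate_noise_sd."""
    rng = random.Random(seed)
    sd = calibrate_noise_sd(base, y, target_corr)
    return [b + sd * rng.gauss(0.0, 1.0) for b in base]
```

In a finite sample the realized correlation will miss the target slightly because the drawn noise is not exactly uncorrelated with $y$, which is one reason an implementation might prefer a numerical search over the same noise draws.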

6. Second Pre-period Covariate

The second pre-period covariate is constructed from a second noisy view of the same latent driver $A$ (not an orthogonalized independent latent):

A. Latent-driven two-part base

With standardized covariates $tenure_z, sessions_z, spend_z, discount_z$ and $a=z(A)$:

$$zi\_score = 0.20 + 1.05a + 0.10\,tenure_z - 0.08\,discount_z$$

$$p_{pos,2} = \sigma(zi\_score), \quad I_2 \sim \text{Bernoulli}(p_{pos,2})$$

$$loc_2 = 1.90 + 0.95a + 0.10\,sessions_z + 0.08\,spend_z$$

$$\mu_2 = e^{loc_2}, \quad Y_{2,+} \sim \text{Gamma}(shape=2.3, scale=\mu_2/2.3)$$

$$base_2 = I_2 \cdot Y_{2,+}$$
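The five equations above translate almost line for line into code. The sketch below is illustrative only; it assumes $z(\cdot)$ means subtracting the sample mean and dividing by the sample standard deviation, and the function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def standardize(xs):
    """z-score a sequence (assumed meaning of z(.) in the text)."""
    m = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))
    return [(x - m) / sd for x in xs]


def draw_base_2(a, tenure_z, sessions_z, spend_z, discount_z, rng):
    """One draw of base_2 = I_2 * Y_{2,+} for a single unit."""
    zi_score = 0.20 + 1.05 * a + 0.10 * tenure_z - 0.08 * discount_z
    if rng.random() >= sigmoid(zi_score):
        return 0.0                        # I_2 = 0
    loc_2 = 1.90 + 0.95 * a + 0.10 * sessions_z + 0.08 * spend_z
    mu_2 = math.exp(loc_2)
    # gammavariate takes (shape, scale), so the mean of the draw is mu_2.
    return rng.gammavariate(2.3, mu_2 / 2.3)
```

For a unit with all standardized inputs equal to zero, the expected value of the draw is $\sigma(0.20) \cdot e^{1.90}$, which gives a quick sanity check on any implementation.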

B. Feasibility blending and calibration

If $\mathrm{corr}(base_2, y \mid d=0)$ is below the second target, the base is blended with $y_{pre}$:

$$base_2^{mix} = (1-w)\,base_2 + w\,y_{pre}, \quad w \in [0.10,1.00]$$

using the first $w$ on a grid that reaches the target (or the best achievable one). Then Gaussian noise is calibrated as:

$$y_{pre,2} = base_2^{mix} + \sigma_2 \cdot \epsilon_2, \quad \epsilon_2 \sim \mathcal{N}(0,1)$$

to match the requested second control-group correlation. If pre_target_corr_2 is omitted, the default target is

$$\rho_{target,2}=\min(\rho_{target,1},\,\min(0.72,\rho_{target,1}-0.10)).$$
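A sketch of the default-target rule and the blending search follows. The grid spacing of 0.10 is an assumption (the text only gives the range of $w$), the function names are hypothetical, and all inputs are assumed to be control-group rows.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def default_target_corr_2(target_corr_1):
    """Default second target when pre_target_corr_2 is omitted."""
    return min(target_corr_1, min(0.72, target_corr_1 - 0.10))


def blend_base_2(base_2, y_pre, y, target_corr, step=0.10):
    """Return (w, mix); w = 0.0 means base_2 already met the target."""
    if pearson(base_2, y) >= target_corr:
        return 0.0, list(base_2)
    best_w, best_corr, best_mix = None, -math.inf, None
    n_steps = int(round((1.00 - 0.10) / step))
    for i in range(n_steps + 1):
        w = min(0.10 + i * step, 1.00)
        mix = [(1.0 - w) * b + w * p for b, p in zip(base_2, y_pre)]
        c = pearson(mix, y)
        if c >= target_corr:
            return w, mix                 # first grid point that reaches the target
        if c > best_corr:
            best_w, best_corr, best_mix = w, c, mix
    return best_w, best_mix               # target unreachable: best achievable blend
```

With the default $\rho_{target,1}=0.82$, both inner candidates equal 0.72 up to floating-point rounding, so the second target comes out at 0.72.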

Result
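|   | y | d | tenure_months | avg_sessions_week | spend_last_month | discount_rate | platform_ios | platform_web | m | m_obs | tau_link | g0 | g1 | cate | y_pre | _latent_A | y_pre_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.734763 | 0.0 | 14.187461 | 2.0 | 57.355300 | 0.158164 | 1.0 | 0.0 | 0.5 | 0.5 | 0.103825 | 8.487966 | 9.416605 | 0.928639 | 17.972138 | 0.304717 | 10.783283 |
| 1 | 0.746406 | 1.0 | 6.352893 | 3.0 | 46.700946 | 0.085722 | 0.0 | 0.0 | 0.5 | 0.5 | 0.030781 | 8.487966 | 8.753301 | 0.265335 | 0.000000 | -1.039984 | 0.000000 |
| 2 | 13.040584 | 1.0 | 18.910153 | 9.0 | 80.136187 | 0.175115 | 1.0 | 0.0 | 0.5 | 0.5 | 0.357355 | 8.487966 | 12.133915 | 3.645949 | 34.771837 | 0.750451 | 24.866330 |
| 3 | 34.582113 | 1.0 | 7.927627 | 4.0 | 33.718224 | 0.152718 | 1.0 | 0.0 | 0.5 | 0.5 | 0.065554 | 8.487966 | 9.063025 | 0.575059 | 349.163943 | 0.940565 | 209.498366 |
| 4 | 0.000000 | 1.0 | 11.106925 | 2.0 | 92.064518 | 0.077390 | 0.0 | 0.0 | 0.5 | 0.5 | 0.056036 | 8.487966 | 8.977172 | 0.489206 | 0.000000 | -1.951035 | 0.243980 |
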
Result

Ground truth ATE is 1.2383515933360814
Ground truth ATTE is 1.231529853520104

Result

CausalData(df=(20000, 10), treatment='d', outcome='y', confounders=['tenure_months', 'avg_sessions_week', 'spend_last_month', 'discount_rate', 'platform_ios', 'platform_web', 'y_pre', 'y_pre_2'])

Result
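|   | treatment | count | mean | std | min | p10 | p25 | median | p75 | p90 | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 10049 | 8.867136 | 21.097599 | 0.0 | 0.0 | 0.0 | 0.276901 | 9.266134 | 24.454852 | 347.095992 |
| 1 | 1.0 | 9951 | 9.870188 | 25.879015 | 0.0 | 0.0 | 0.0 | 0.000000 | 10.409916 | 27.439125 | 956.413897 |
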
Result
|   | confounders | mean_d_0 | mean_d_1 | abs_diff | smd | ks_pvalue |
|---|---|---|---|---|---|---|
| 0 | tenure_months | 13.842686 | 13.615906 | 0.226781 | -0.030833 | 0.01991 |
| 1 | spend_last_month | 73.787667 | 75.159017 | 1.371350 | 0.019726 | 0.20812 |
| 2 | platform_ios | 0.304309 | 0.297357 | 0.006952 | -0.015158 | 0.96771 |
| 3 | y_pre | 7928.516448 | 20967.672427 | 13039.155979 | 0.014621 | 0.36084 |
| 4 | y_pre_2 | 4760.192129 | 12583.709202 | 7823.517073 | 0.014621 | 0.38473 |
| 5 | avg_sessions_week | 4.963379 | 5.015677 | 0.052297 | 0.012405 | 0.49401 |
| 6 | discount_rate | 0.100420 | 0.099996 | 0.000424 | -0.006421 | 0.36146 |
| 7 | platform_web | 0.050950 | 0.050749 | 0.000202 | -0.000918 | 1.00000 |