
Make Cuped Tweedie 26


The `make_cuped_tweedie_26` data generating process (DGP) creates a synthetic dataset characterized by a Tweedie-like outcome (zero-inflated with a heavy right tail), correlated confounders, and structured heterogeneous treatment effects (HTE). It also includes a pre-period covariate (`y_pre`) calibrated for CUPED benchmarks.

1. Confounders ($X$)

Five confounders are generated using a Gaussian copula to induce specific correlations:

  • `tenure_months` ($X_1$): $\text{Lognormal}(\mu=2.5, \sigma=0.5)$
  • `avg_sessions_week` ($X_2$): $\text{NegativeBinomial}(\mu=5, \alpha=0.5)$
  • `spend_last_month` ($X_3$): $\text{Lognormal}(\mu=4.0, \sigma=0.8)$
  • `discount_rate` ($X_4$): $\text{Beta}(\text{mean}=0.1, \kappa=20)$
  • `platform` ($X_5$): Categorical with levels android (65%), ios (30%), web (5%).

Correlations: $\text{corr}(X_1, X_2) = 0.4$ and $\text{corr}(X_2, X_3) = 0.5$.
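A minimal sketch of how such a Gaussian copula draw could look, using only the Python standard library. It is not the generator's actual code. Several details are assumptions made for illustration: the correlations are applied on the latent normal scale, $\text{corr}(X_1, X_3)$ is set to 0, the negative binomial uses the NB2 form (variance $= \mu + \alpha\mu^2$), $\kappa$ is read as the Beta concentration $a + b$, and `discount_rate` and `platform` are drawn independently. The names `cholesky`, `negbin_quantile`, and `sample_confounders` are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a small positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def negbin_quantile(u, mean, alpha):
    """Inverse CDF of an NB2 negative binomial (assumed parametrization)."""
    r = 1.0 / alpha
    p = r / (r + mean)
    k = 0
    pmf = p ** r  # P(K = 0)
    cdf = pmf
    while cdf < u and k < 10_000:  # cap guards against u rounding to 1.0
        # Recurrence: pmf(k + 1) = pmf(k) * (k + r) / (k + 1) * (1 - p)
        pmf *= (k + r) / (k + 1) * (1.0 - p)
        k += 1
        cdf += pmf
    return k


def sample_confounders(n, seed=0):
    rng = random.Random(seed)
    # Latent correlations for (X1, X2, X3); the 0.0 entries are an assumption.
    corr = [[1.0, 0.4, 0.0],
            [0.4, 1.0, 0.5],
            [0.0, 0.5, 1.0]]
    lower = cholesky(corr)
    rows = []
    for _ in range(n):
        eps = [rng.gauss(0.0, 1.0) for _ in range(3)]
        # Correlated standard normals z = L @ eps
        z = [sum(lower[i][k] * eps[k] for k in range(i + 1)) for i in range(3)]
        rows.append({
            # Lognormal quantile of Phi(z) simplifies to exp(mu + sigma * z)
            "tenure_months": math.exp(2.5 + 0.5 * z[0]),
            "avg_sessions_week": negbin_quantile(STD_NORMAL.cdf(z[1]), 5.0, 0.5),
            "spend_last_month": math.exp(4.0 + 0.8 * z[2]),
            # Beta(mean=0.1, kappa=20) read as a = 2, b = 18 (assumption)
            "discount_rate": rng.betavariate(2.0, 18.0),
            "platform": rng.choices(["android", "ios", "web"],
                                    weights=[65, 30, 5])[0],
        })
    return rows


sample = sample_confounders(1000, seed=42)
print(sample[0])
```

The copula step is the middle of the loop: correlated normals are mapped to uniforms with the normal CDF and then pushed through each marginal's inverse CDF, so the marginals keep their stated distributions while the dependence comes from the latent normals.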

2. Treatment Assignment ($D$)

The treatment $D$ is assigned via a Bernoulli trial with a constant propensity score (RCT): $D \sim \text{Bernoulli}(0.5)$. (Note: while the generator supports complex propensity models, `make_cuped_tweedie_26` defaults to a balanced random assignment.)

3. Outcome Model ($Y$)

The outcome is generated as a two-part (hurdle) process: $Y = I \cdot Y_{pos}$

A. Binary Indicator of Non-zero Outcome ($I$)

$I \sim \text{Bernoulli}(\sigma(\alpha_{zi} + \text{u\_strength}_{zi} \cdot U))$

  • $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function.
  • $\alpha_{zi} = 0.0$ (resulting in ~50% baseline non-zero rate).
  • $\text{u\_strength}_{zi} = 1.0$ (if `add_pre=True`).
  • $U \sim \mathcal{N}(0, 1)$ is an unobserved confounder.

B. Positive Outcome Value ($Y_{pos}$)

If $I=1$, the value is drawn from a Gamma distribution: $Y_{pos} \sim \text{Gamma}(\text{shape}=k, \text{scale}=\mu_{pos} / k)$

  • $k = 2.0$ (shape parameter).
  • $\mu_{pos} = \exp(\text{loc})$, where $\text{loc}$ is the linear predictor on the log-mean scale: $\text{loc} = \alpha_y + D \cdot \tau(X) + \text{u\_strength}_y \cdot U$
  • $\alpha_y = 2.0$.
  • $\text{u\_strength}_y = 1.0$ (if `add_pre=True`).
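The two-part process above can be sketched as a single draw in plain Python. This is an illustration under the parameter values listed in this section, not the generator's own implementation; the function names `sigmoid` and `draw_outcome` are hypothetical, and the treatment effect is passed in as a precomputed number `tau_x`.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def draw_outcome(d, tau_x, rng, alpha_zi=0.0, u_strength_zi=1.0,
                 alpha_y=2.0, u_strength_y=1.0, shape_k=2.0):
    """One draw of Y = I * Y_pos for treatment d and log-scale effect tau_x."""
    u = rng.gauss(0.0, 1.0)  # unobserved confounder U, shared by both parts
    p_nonzero = sigmoid(alpha_zi + u_strength_zi * u)
    if rng.random() >= p_nonzero:  # I = 0: the outcome is exactly zero
        return 0.0
    loc = alpha_y + d * tau_x + u_strength_y * u
    mu_pos = math.exp(loc)
    # random.gammavariate takes (shape, scale), so the conditional mean is mu_pos
    return rng.gammavariate(shape_k, mu_pos / shape_k)


rng = random.Random(0)
d = int(rng.random() < 0.5)  # D ~ Bernoulli(0.5)
print(d, draw_outcome(d, tau_x=0.03, rng=rng))
```

Because the same $U$ enters both the zero indicator and the positive part, units that are more likely to be non-zero also tend to have larger positive values, which thickens the right tail.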

4. Heterogeneous Treatment Effect ($\tau(X)$)

The treatment effect $\tau(X)$ is defined on the log-mean scale and incorporates monotone effects, diminishing returns, and categorical modifiers: $\tau(X) = \text{clip}\left(\theta_{log} \cdot g_t(X_1) \cdot g_s(X_2) \cdot \text{seg}(X_5), 0, 2.5 \cdot \theta_{log}\right)$, where:

  • $g_t(X_1) = \frac{1}{1 + \exp(-0.25 \cdot (X_1 - 12))}$ (Saturating effect of tenure).
  • $g_s(X_2) = \frac{1}{1 + \exp(-0.35 \cdot (X_2 - 4))}$ (Diminishing returns of sessions).
  • $\text{seg}(X_5) = 1 + 0.3 \cdot \mathbb{1}_{\text{platform}=\text{ios}}$ (Premium segment modifier).
  • $\theta_{log} = 0.12$ (Default log-uplift parameter).
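The formula translates directly into a few lines of Python. The sketch below follows the definitions in this section term by term; the function name `tau` and its argument names are chosen for illustration and are not taken from the library.

```python
import math


def tau(tenure_months, avg_sessions_week, platform, theta_log=0.12):
    """Log-scale treatment effect tau(X) as defined in this section."""
    g_t = 1.0 / (1.0 + math.exp(-0.25 * (tenure_months - 12.0)))
    g_s = 1.0 / (1.0 + math.exp(-0.35 * (avg_sessions_week - 4.0)))
    seg = 1.3 if platform == "ios" else 1.0
    raw = theta_log * g_t * g_s * seg
    # clip(raw, 0, 2.5 * theta_log)
    return min(max(raw, 0.0), 2.5 * theta_log)


# At the sigmoid midpoints (tenure 12, sessions 4) both g terms equal 0.5.
effect = tau(12.0, 4.0, "android")
print(effect, math.exp(effect) - 1.0)  # log-scale effect and relative uplift
```

Since $\tau(X)$ enters the linear predictor $\text{loc}$ additively, treatment multiplies the positive-part mean $\mu_{pos}$ by $\exp(\tau(X))$, so $\exp(\tau(X)) - 1$ is the relative uplift for a unit with covariates $X$. The product $g_t \cdot g_s \cdot \text{seg}$ is bounded by $1.3$, so with these components the upper clip at $2.5 \cdot \theta_{log}$ never binds.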

5. Pre-period Covariate ($y_{pre}$)

The pre-period covariate is generated using the same two-part structure as the outcome but replaces the unobserved confounder $U$ with a shared latent driver $A \sim \mathcal{N}(0, 1)$: $y_{pre} = (I_{pre} \cdot Y_{pos, pre}) + \sigma_{noise} \cdot \epsilon$

  • $\epsilon \sim \mathcal{N}(0, 1)$.
  • The influence of the shared driver $A$ and the noise scale $\sigma_{noise}$ are calibrated (via numerical optimization) to ensure $y_{pre}$ achieves a target correlation (default $0.6$) with the post-period outcome $Y$ in the control group.
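To show why the target correlation matters for CUPED benchmarks, here is a hedged sketch of the standard CUPED adjustment, $Y_{cuped} = Y - \theta\,(y_{pre} - \bar{y}_{pre})$ with $\theta = \text{cov}(Y, y_{pre}) / \text{var}(y_{pre})$. With this $\theta$ the adjusted variance is $(1 - \rho^2)$ times the original, so a correlation of $0.6$ corresponds to roughly a 36% reduction. The helper names are hypothetical, and the toy data is a Gaussian stand-in with correlation near $0.6$, not output of `make_cuped_tweedie_26`.

```python
import random


def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def cuped_adjust(y, y_pre):
    """Return y - theta * (y_pre - mean(y_pre)) with theta = cov / var."""
    n = len(y)
    mean_y = sum(y) / n
    mean_pre = sum(y_pre) / n
    cov = sum((a - mean_y) * (b - mean_pre) for a, b in zip(y, y_pre)) / (n - 1)
    theta = cov / variance(y_pre)
    return [a - theta * (b - mean_pre) for a, b in zip(y, y_pre)]


# Toy stand-in data (assumption): y = 0.6 * y_pre + 0.8 * noise gives corr ~ 0.6
rng = random.Random(7)
y_pre = [rng.gauss(0.0, 1.0) for _ in range(20000)]
y = [0.6 * p + 0.8 * rng.gauss(0.0, 1.0) for p in y_pre]

y_cuped = cuped_adjust(y, y_pre)
print(variance(y_cuped) / variance(y))  # close to 1 - 0.6**2
```

The adjustment leaves the sample mean of the outcome unchanged, so the difference in means between arms targets the same effect with a smaller standard error.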
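
Result

| | y | d | tenure_months | avg_sessions_week | spend_last_month | discount_rate | platform_ios | platform_web | m | m_obs | tau_link | g0 | g1 | cate | y_pre |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.0 | 14.187461 | 2.0 | 57.355300 | 0.158164 | 0.0 | 0.0 | 0.5 | 0.5 | 0.042035 | 3.694528 | 3.853136 | 0.158608 | 0.000000 |
| 1 | 0.000000 | 1.0 | 6.352893 | 3.0 | 46.700946 | 0.085722 | 0.0 | 0.0 | 0.5 | 0.5 | 0.016201 | 3.694528 | 3.754870 | 0.060342 | 0.000000 |
| 2 | 12.918910 | 0.0 | 18.910153 | 9.0 | 80.136187 | 0.175115 | 1.0 | 0.0 | 0.5 | 0.5 | 0.188082 | 3.694528 | 4.459044 | 0.764516 | 219.374863 |
| 3 | 13.079312 | 1.0 | 7.927627 | 4.0 | 33.718224 | 0.152718 | 0.0 | 0.0 | 0.5 | 0.5 | 0.026540 | 3.694528 | 3.793893 | 0.099365 | 0.000000 |
| 4 | 0.000000 | 0.0 | 11.106925 | 2.0 | 92.064518 | 0.077390 | 0.0 | 0.0 | 0.5 | 0.5 | 0.029492 | 3.694528 | 3.805111 | 0.110583 | 0.000000 |

Result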

Ground truth ATE is 0.2700720823988466
Ground truth ATTE is 0.27094666120202526

Result

CausalData(df=(100000, 9), treatment='d', outcome='y', confounders=['tenure_months', 'avg_sessions_week', 'spend_last_month', 'discount_rate', 'platform_ios', 'platform_web', 'y_pre'])

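Result

| | treatment | count | mean | std | min | p10 | p25 | median | p75 | p90 | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 49868 | 8.496830 | 20.907555 | 0.0 | 0.0 | 0.0 | 0.0 | 8.903005 | 23.685134 | 637.127367 |
| 1 | 1.0 | 50132 | 9.056301 | 21.790102 | 0.0 | 0.0 | 0.0 | 0.0 | 9.523496 | 25.207978 | 764.333725 |

Result

| | confounders | mean_d_0 | mean_d_1 | abs_diff | smd | ks_pvalue |
|---|---|---|---|---|---|---|
| 0 | platform_web | 0.051356 | 0.048312 | 0.003043 | -0.013985 | 0.97420 |
| 1 | y_pre | 15420.530166 | 51673.005274 | 36252.475108 | 0.008842 | 0.22562 |
| 2 | tenure_months | 13.756301 | 13.794400 | 0.038099 | 0.005198 | 0.56587 |
| 3 | platform_ios | 0.300032 | 0.301444 | 0.001412 | 0.003079 | 1.00000 |
| 4 | avg_sessions_week | 4.995969 | 5.003610 | 0.007641 | 0.001816 | 0.88605 |
| 5 | spend_last_month | 75.176560 | 75.263792 | 0.087232 | 0.001222 | 0.22678 |
| 6 | discount_rate | 0.100197 | 0.100129 | 0.000068 | -0.001031 | 0.64488 |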