
DGP classic_rct_gamma_26

Automated conversion of classic_rct_gamma_26.ipynb

Math Explanation of the classic_rct_gamma_26 DGP

The classic_rct_gamma_26 function generates a synthetic dataset for a classic randomized controlled trial (RCT). Treatment is assigned completely at random, and the covariates $X$ affect the outcome (prognostic) but do not influence treatment assignment (no confounding).

By default, it simulates a revenue experiment (gamma outcome) with 10,000 samples and a 50/50 split.

Covariate Generation (Confounders)

Three binary covariates $X = [x_1, x_2, x_3]$ are generated independently:

  • platform_ios ($x_1$): $x_1 \sim \text{Bernoulli}(0.5)$
  • country_usa ($x_2$): $x_2 \sim \text{Bernoulli}(0.6)$
  • source_paid ($x_3$): $x_3 \sim \text{Bernoulli}(0.3)$
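
The covariate draws listed above can be sketched with NumPy as follows. This is a minimal illustration, not the library's code: the helper name sample_covariates and the seed handling are assumptions made for the example.

```python
import numpy as np

# Hypothetical helper (not part of the library): draws the three
# independent Bernoulli covariates with the probabilities listed above.
def sample_covariates(n, seed=0):
    rng = np.random.default_rng(seed)
    probs = {"platform_ios": 0.5, "country_usa": 0.6, "source_paid": 0.3}
    # A Binomial(1, p) draw is a Bernoulli(p) draw.
    return {name: rng.binomial(1, p, size=n) for name, p in probs.items()}

X = sample_covariates(10_000)
for name, column in X.items():
    # Sample shares should sit close to the Bernoulli probabilities.
    print(name, column.mean())
```

With 10,000 rows the sample shares typically land within about one percentage point of 0.5, 0.6, and 0.3.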

Treatment Assignment ($D$)

Because this is an RCT, the treatment $D$ is independent of $X$. It is assigned with probability $P(D=1) = 0.5$:

$$D \sim \text{Bernoulli}(0.5)$$

Equivalently, the log-odds of treatment (intercept $\alpha_d$) is $0$.
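
To see why a log-odds intercept of 0 corresponds to a 50/50 split, the inverse-logit can be evaluated directly. The sketch below uses only the standard library; the function names are hypothetical and chosen for illustration.

```python
import math
import random

def treatment_probability(alpha_d):
    """Inverse-logit: maps the log-odds intercept to P(D = 1)."""
    return 1.0 / (1.0 + math.exp(-alpha_d))

# Hypothetical helper: assigns treatment independently of any covariates.
def assign_treatment(n, alpha_d=0.0, seed=0):
    rng = random.Random(seed)
    p = treatment_probability(alpha_d)
    return [1 if rng.random() < p else 0 for _ in range(n)]

d = assign_treatment(10_000)
print(treatment_probability(0.0))  # log-odds 0 gives probability 0.5
print(sum(d) / len(d))             # realized share of treated units
```

The realized treated share varies around 0.5 from sample to sample, which is why the group counts in a simulated dataset are close to, but not exactly, equal.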

Outcome Generation (revenue)

The outcome is a positive, skewed metric (e.g., revenue) modeled with a Gamma distribution using a log-mean link:

$$Y \mid D, X \sim \text{Gamma}(k, \theta)$$

$$\log \mathbb{E}[Y \mid D, X] = \log(k \cdot \theta_A) + \beta_y^\top X + g_y(X) + D \cdot \log(\theta_B/\theta_A)$$

By default, $g_y(X) = 0$ because add_pre=False. The nonlinear term is only included when add_pre=True (or when g_y/use_prognostic is provided).

  • Shape ($k$): Default $k = 2.0$.
  • Scale (control, $\theta_A$): Default $15.0$.
  • Scale (treatment, $\theta_B$): Default $16.5$.
  • Prognostic Coefficients ($\beta_y$): Default $[0.25, 0.20, 0.45]$. These values shift the log-mean of revenue.
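
The outcome model with these defaults can be sketched in NumPy. This is an illustration under stated assumptions, not the library's implementation: the function name simulate_revenue is hypothetical, $g_y(X)$ is taken as 0, and the Gamma draw is assumed to use shape $k$ with a per-row scale equal to the conditional mean divided by $k$ (the source does not show how the draw is parametrized).

```python
import numpy as np

# Hypothetical sketch of the outcome model; not the library's function.
def simulate_revenue(n=10_000, k=2.0, theta_a=15.0, theta_b=16.5,
                     beta_y=(0.25, 0.20, 0.45), seed=0):
    rng = np.random.default_rng(seed)
    # Independent binary covariates and a 50/50 randomized treatment.
    X = np.column_stack([rng.binomial(1, p, size=n) for p in (0.5, 0.6, 0.3)])
    d = rng.binomial(1, 0.5, size=n)
    # Log-mean link: baseline + prognostic shift + treatment shift.
    log_mean = (np.log(k * theta_a)
                + X @ np.asarray(beta_y)
                + d * np.log(theta_b / theta_a))
    mean = np.exp(log_mean)
    # Assumption: shape k and scale mean / k, so that E[Y | D, X] = mean.
    y = rng.gamma(shape=k, scale=mean / k)
    return X, d, y, mean

X, d, y, mean = simulate_revenue()
print("control mean:", y[d == 0].mean())
print("treated mean:", y[d == 1].mean())
```

Holding the shape fixed and moving the scale keeps the coefficient of variation constant across rows, which is one common way to pair a Gamma outcome with a log link.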

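With the defaults, the baseline group means (at $X = 0$ and $g_y(X) = 0$) are:

$$\mathbb{E}[Y \mid D=0, X=0] = k \cdot \theta_A = 2.0 \times 15.0 = 30.0$$

$$\mathbb{E}[Y \mid D=1, X=0] = k \cdot \theta_B = 2.0 \times 16.5 = 33.0$$

This corresponds to a 10% uplift on the mean scale. Because the treatment term is additive on the log scale, the ratio $\theta_B/\theta_A = 1.1$ applies at every value of $X$; the covariates change the level of the mean, not the relative uplift.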
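
As a worked check, the population-level means and the average treatment effect implied by the defaults can be computed by averaging $\exp(\beta_y^\top X)$ over the eight covariate combinations. The sketch assumes $g_y(X) = 0$ and independent covariates with the probabilities given earlier; variable names are illustrative.

```python
import math
from itertools import product

k, theta_a, theta_b = 2.0, 15.0, 16.5
beta_y = (0.25, 0.20, 0.45)
p_x = (0.5, 0.6, 0.3)  # P(x_j = 1) for platform_ios, country_usa, source_paid

# E[exp(beta_y . X)]: sum over all 2^3 covariate cells, weighted by cell probability.
multiplier = 0.0
for x in product((0, 1), repeat=3):
    prob = math.prod(p if xj else 1.0 - p for p, xj in zip(p_x, x))
    multiplier += prob * math.exp(sum(b * xj for b, xj in zip(beta_y, x)))

control_mean = k * theta_a * multiplier
treated_mean = k * theta_b * multiplier
ate = treated_mean - control_mean

print("E[exp(beta . X)]:", multiplier)
print("control mean:", control_mean)
print("treated mean:", treated_mean)
print("ATE:", ate)
```

Under these assumptions the population ATE comes out near 4.54, with marginal means near 45.4 and 50.0. Sample-based quantities from a single simulated dataset will differ slightly because the realized covariate mix is not exactly at its expectation.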

Summary of Default Parameters

  • $N$: 10,000
  • Control Mean (baseline): $30.0$ at $X = 0$
  • Treatment Mean (baseline): $33.0$ at $X = 0$
  • Treatment Split: 50%
  • Confounders: 3 binary
  • Nonlinear $g_y(X)$: only when add_pre=True or g_y/use_prognostic is provided

DGP

Result
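| | user_id | revenue | d | platform_ios | country_usa | source_paid | age | cnt_trans | platform_Android | platform_iOS | invited_friend | m | m_obs | tau_link | g0 | g1 | cate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ae006 | 62.015001 | 0.0 | 1.0 | 0.0 | 1.0 | 39 | 0 | 0 | 1 | 0 | 0.5 | 0.5 | 0.09531 | 60.412581 | 66.453839 | 6.041258 |
| 1 | 6051e | 22.353186 | 1.0 | 0.0 | 0.0 | 1.0 | 46 | 4 | 0 | 1 | 0 | 0.5 | 0.5 | 0.09531 | 47.049366 | 51.754302 | 4.704937 |
| 2 | eb08c | 38.213100 | 0.0 | 1.0 | 1.0 | 0.0 | 36 | 1 | 1 | 0 | 0 | 0.5 | 0.5 | 0.09531 | 47.049366 | 51.754302 | 4.704937 |
| 3 | a947a | 77.927095 | 1.0 | 1.0 | 1.0 | 0.0 | 26 | 2 | 0 | 1 | 0 | 0.5 | 0.5 | 0.09531 | 47.049366 | 51.754302 | 4.704937 |
| 4 | 9bd42 | 24.936085 | 1.0 | 0.0 | 1.0 | 0.0 | 35 | 3 | 0 | 1 | 0 | 0.5 | 0.5 | 0.09531 | 36.642083 | 40.306291 | 3.664208 |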
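
The oracle columns in the five sample rows above can be cross-checked against the outcome formula. The sketch below assumes that g0 and g1 are the conditional means under control and treatment, that cate is their difference, and that tau_link is the treatment shift on the log scale; the source does not define these columns, so treat this reading as an assumption. The covariate values and reference numbers are copied from the rows shown.

```python
import math

k, theta_a, theta_b = 2.0, 15.0, 16.5
beta_y = (0.25, 0.20, 0.45)

# (platform_ios, country_usa, source_paid, g0, g1, cate) from the five rows shown.
rows = [
    (1, 0, 1, 60.412581, 66.453839, 6.041258),
    (0, 0, 1, 47.049366, 51.754302, 4.704937),
    (1, 1, 0, 47.049366, 51.754302, 4.704937),
    (1, 1, 0, 47.049366, 51.754302, 4.704937),
    (0, 1, 0, 36.642083, 40.306291, 3.664208),
]

# Treatment shift on the log scale; compare with the tau_link column.
tau_link = math.log(theta_b / theta_a)
print("tau_link:", round(tau_link, 5))

max_error = 0.0
for x1, x2, x3, g0_ref, g1_ref, cate_ref in rows:
    shift = math.exp(beta_y[0] * x1 + beta_y[1] * x2 + beta_y[2] * x3)
    g0 = k * theta_a * shift   # conditional mean under control
    g1 = k * theta_b * shift   # conditional mean under treatment
    errors = (abs(g0 - g0_ref), abs(g1 - g1_ref), abs((g1 - g0) - cate_ref))
    max_error = max(max_error, *errors)

print("largest absolute deviation from the displayed values:", max_error)
```

Rows 1 to 3 share the same g0 even though their covariates differ, because a paid source alone (0.45) shifts the log-mean by the same amount as iOS plus USA (0.25 + 0.20).
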
Result

Ground truth ATE is 4.547002938251698
Ground truth ATTE is 4.571328700457634

Result

CausalData(df=(10000, 5), treatment='d', outcome='revenue', confounders=['platform_ios', 'country_usa', 'source_paid'])

EDA

Result
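| | treatment | count | mean | std | min | p10 | p25 | median | p75 | p90 | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 4955 | 45.271228 | 36.185063 | 0.338380 | 11.081530 | 20.326602 | 35.784254 | 60.063958 | 89.576241 | 431.357219 |
| 1 | 1.0 | 5045 | 50.328883 | 38.802710 | 0.448255 | 12.335521 | 22.792900 | 41.149396 | 67.409299 | 100.121519 | 401.883422 |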

Balance check

Result
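| | confounders | mean_d_0 | mean_d_1 | abs_diff | smd | ks_pvalue |
|---|---|---|---|---|---|---|
| 0 | source_paid | 0.299092 | 0.313776 | 0.014684 | 0.031853 | 0.64592 |
| 1 | platform_ios | 0.494046 | 0.502874 | 0.008828 | 0.017654 | 0.98861 |
| 2 | country_usa | 0.586276 | 0.591873 | 0.005597 | 0.011374 | 1.00000 |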
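
For readers who want to reproduce a balance statistic like smd by hand, the sketch below uses the common pooled-standard-deviation definition of the standardized mean difference. The source does not state which formula the library uses, so this is an assumption, and the function name is hypothetical.

```python
import numpy as np

# Hypothetical helper: standardized mean difference with a pooled SD,
# using sample variances (ddof=1) from each treatment arm.
def standardized_mean_difference(x, d):
    x = np.asarray(x, dtype=float)
    d = np.asarray(d)
    x0, x1 = x[d == 0], x[d == 1]
    pooled_sd = np.sqrt((x0.var(ddof=1) + x1.var(ddof=1)) / 2.0)
    return (x1.mean() - x0.mean()) / pooled_sd

# Synthetic check: a covariate drawn independently of treatment should be balanced.
rng = np.random.default_rng(0)
d_demo = rng.binomial(1, 0.5, size=10_000)
x_demo = rng.binomial(1, 0.3, size=10_000)
smd_demo = standardized_mean_difference(x_demo, d_demo)
print("SMD on independent draws:", smd_demo)
```

A frequently used rule of thumb treats absolute SMD values below 0.1 as adequate balance; all three covariates in the table above are well under that threshold, as expected under randomization.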