DGP generate_classic_rct_26

Automated conversion of generate_classic_rct_26.ipynb

Math Explanation of the generate_classic_rct_26 DGP

The generate_classic_rct_26 function generates a synthetic dataset for a classic randomized controlled trial (RCT). In this scenario, treatment is assigned completely at random: the covariates $X$ affect the outcome (they are prognostic) but do not influence treatment assignment, so there is no confounding.

By default, it simulates a conversion experiment (binary outcome) with 10,000 samples and a 50/50 split.

Covariate Generation (Confounders)

Three binary covariates $X = [x_1, x_2, x_3]$ are generated independently:

  • platform_ios ($x_1$): $x_1 \sim \text{Bernoulli}(0.5)$
  • country_usa ($x_2$): $x_2 \sim \text{Bernoulli}(0.6)$
  • source_paid ($x_3$): $x_3 \sim \text{Bernoulli}(0.3)$
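The covariate draws can be sketched with the standard library alone. This is a minimal illustration, not the function's actual implementation: `COVARIATE_PROBS` and `sample_covariates` are hypothetical names, and the seed is arbitrary.

```python
import random

# Success probabilities as stated in the list above; keys mirror the column names.
COVARIATE_PROBS = {"platform_ios": 0.5, "country_usa": 0.6, "source_paid": 0.3}


def sample_covariates(n, seed=0):
    """Draw n rows of independent Bernoulli covariates (hypothetical helper)."""
    rng = random.Random(seed)
    return [
        {name: int(rng.random() < p) for name, p in COVARIATE_PROBS.items()}
        for _ in range(n)
    ]


rows = sample_covariates(10_000)

# Empirical share of ones per covariate; each should sit near its probability.
shares = {name: sum(r[name] for r in rows) / len(rows) for name in COVARIATE_PROBS}
print(shares)
```

With 10,000 rows, each empirical share should land within a percentage point or two of its target probability.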

Treatment Assignment ($D$)

Because this is an RCT, the treatment $D$ is independent of $X$. It is assigned with probability $P(D=1) = 0.5$:

$$D \sim \text{Bernoulli}(0.5)$$

The log-odds of treatment (the intercept $\alpha_d$) is $0$.
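To see why a treatment log-odds of zero corresponds to a 50/50 split, the following sketch maps the intercept through the sigmoid and draws assignments without ever consulting the covariates. Variable names and the seed are illustrative assumptions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


alpha_d = 0.0               # treatment log-odds from the text
p_treat = sigmoid(alpha_d)  # sigmoid(0) is exactly 0.5

rng = random.Random(1)
# No covariate enters this draw, which is what rules out confounding by design.
d = [int(rng.random() < p_treat) for _ in range(10_000)]
treated_share = sum(d) / len(d)
print(p_treat, treated_share)
```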

Outcome Generation (conversion)

The outcome is a binary variable representing conversion. The probability of conversion for an individual is modeled with a logistic link function:

$$P(\text{conversion}=1 \mid D, X) = \sigma(L)$$

where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function.

The latent linear predictor $L$ is defined as:

$$L = \alpha_y + \sum_{j=1}^3 \beta_{y,j} x_j + g_y(X) + D \cdot \theta$$

By default, $g_y(X) = 0$ because add_pre=False. The nonlinear term is included only when add_pre=True (or when g_y/use_prognostic is provided).

  • Baseline Intercept ($\alpha_y$): Derived from the target control conversion rate $p_A = 0.10$. $\alpha_y = \text{logit}(0.10) = \ln\left(\frac{0.10}{0.90}\right) \approx -2.197$. This sets the baseline rate at $X=0$; the marginal control rate can differ once $X$ shifts the log-odds.
  • Treatment Effect ($\theta$): Derived from the target treatment conversion rate $p_B = 0.11$. It represents the shift in log-odds. $\theta = \text{logit}(0.11) - \text{logit}(0.10) = \ln\left(\frac{0.11}{0.89}\right) - \ln\left(\frac{0.10}{0.90}\right) \approx 0.106$. This is a baseline log-odds shift; the marginal ATE on the probability scale is not exactly 1 percentage point once the effects of $X$ are present.
  • Prognostic Coefficients ($\beta_y$): By default, $\beta_y = [0.6, 0.4, 0.8]$. These values determine how much each covariate shifts the log-odds of conversion.
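Because the three covariates are binary and independent, the marginal control rate and the ATE implied by these defaults can be computed exactly by enumerating the eight covariate cells. The sketch below assumes the nonlinear term is zero (the default) and uses illustrative variable names; it is a check on the formulas above, not the function's own code.

```python
import math
from itertools import product


def logit(p):
    return math.log(p / (1.0 - p))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


p_a, p_b = 0.10, 0.11
alpha_y = logit(p_a)          # baseline intercept, about -2.197
theta = logit(p_b) - alpha_y  # log-odds treatment shift, about 0.106
beta_y = [0.6, 0.4, 0.8]      # prognostic coefficients
p_x = [0.5, 0.6, 0.3]         # P(x_j = 1) for each covariate

control_rate = 0.0
ate = 0.0
for x in product([0, 1], repeat=3):
    # Probability of this covariate cell under independence.
    weight = 1.0
    for xj, pj in zip(x, p_x):
        weight *= pj if xj == 1 else 1.0 - pj
    base = alpha_y + sum(b * xj for b, xj in zip(beta_y, x))
    g0 = sigmoid(base)          # P(conversion | D=0, X=x)
    g1 = sigmoid(base + theta)  # P(conversion | D=1, X=x)
    control_rate += weight * g0
    ate += weight * (g1 - g0)

print(f"alpha_y={alpha_y:.3f}, theta={theta:.3f}")
print(f"marginal control rate={control_rate:.4f}, ATE={ate:.4f}")
```

Under these assumptions the marginal control rate comes out near 0.21 rather than 0.10, and the ATE near 0.017 rather than 0.01, which is why the 10% and 11% targets hold only in the cell where all covariates are zero.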

Summary of Default Parameters

  • $N$: 10,000
  • Control Conversion (baseline $p_A$): 10% at $X=0$ (marginal rate can differ with $X$)
  • Treatment Conversion (baseline $p_B$): 11% at $X=0$ (marginal rate can differ with $X$)
  • Treatment Split: 50%
  • Confounders: 3 binary
  • Nonlinear $g_y(X)$: only when add_pre=True or g_y/use_prognostic is provided

DGP

Result
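| | user_id | conversion | d | platform_ios | country_usa | source_paid | m | m_obs | tau_link | g0 | g1 | cate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8826d | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.5 | 0.5 | 0.106483 | 0.310620 | 0.333868 | 0.023249 |
| 1 | 2416d | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.5 | 0.5 | 0.106483 | 0.198257 | 0.215727 | 0.017471 |
| 2 | eb819 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.5 | 0.5 | 0.106483 | 0.231969 | 0.251479 | 0.019509 |
| 3 | 71445 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.5 | 0.5 | 0.106483 | 0.231969 | 0.251479 | 0.019509 |
| 4 | 13d16 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.5 | 0.5 | 0.106483 | 0.142189 | 0.155678 | 0.013489 |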
Result

Ground truth ATE is 0.01719144406311028
Ground truth ATTE is 0.017278385179220486

Result

CausalData(df=(10000, 5), treatment='d', outcome='conversion', confounders=['platform_ios', 'country_usa', 'source_paid'])

EDA

Result
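| | treatment | count | mean | std | min | p10 | p25 | median | p75 | p90 | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 4955 | 0.198991 | 0.399281 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 1 | 1.0 | 5045 | 0.232904 | 0.422723 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |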
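The per-arm counts, means, and standard deviations in this summary are enough for a back-of-the-envelope difference-in-means estimate. The sketch below assumes the reported std values are sample standard deviations and uses a normal approximation for the interval; it is not necessarily the estimator the notebook itself applies.

```python
import math

# Values copied from the summary table above (control = 0, treated = 1).
n0, mean0, std0 = 4955, 0.198991, 0.399281
n1, mean1, std1 = 5045, 0.232904, 0.422723

diff = mean1 - mean0
# Unpooled standard error of the difference between two independent means.
se = math.sqrt(std0**2 / n0 + std1**2 / n1)
ci = (diff - 1.96 * se, diff + 1.96 * se)

print(f"diff={diff:.4f}, se={se:.4f}, 95% CI=({ci[0]:.4f}, {ci[1]:.4f})")
```

The observed difference in this draw is roughly 0.034, about twice the ground-truth ATE reported earlier. A gap of that size is a reminder of how much sampling noise remains in a single 10,000-row experiment with effects this small.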

Balance check

Result
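| | confounders | mean_d_0 | mean_d_1 | abs_diff | smd | ks_pvalue |
|---|---|---|---|---|---|---|
| 0 | source_paid | 0.299092 | 0.313776 | 0.014684 | 0.031853 | 0.64592 |
| 1 | platform_ios | 0.494046 | 0.502874 | 0.008828 | 0.017654 | 0.98861 |
| 2 | country_usa | 0.586276 | 0.591873 | 0.005597 | 0.011374 | 1.00000 |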
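The smd column can be approximately reproduced from the two group means alone, since a binary covariate with mean p has variance p(1 - p). The sketch below uses the common pooled-standard-deviation definition; whether the notebook uses exactly this formula is an assumption, and `smd_binary` is a hypothetical helper.

```python
import math


def smd_binary(mean_0, mean_1):
    """Standardized mean difference for a binary covariate (hypothetical helper)."""
    var_0 = mean_0 * (1.0 - mean_0)
    var_1 = mean_1 * (1.0 - mean_1)
    pooled_sd = math.sqrt((var_0 + var_1) / 2.0)
    return (mean_1 - mean_0) / pooled_sd


# (mean_d_0, mean_d_1) pairs copied from the balance table above.
balance = {
    "source_paid": (0.299092, 0.313776),
    "platform_ios": (0.494046, 0.502874),
    "country_usa": (0.586276, 0.591873),
}
smds = {name: smd_binary(m0, m1) for name, (m0, m1) in balance.items()}
for name, value in smds.items():
    print(f"{name}: {value:.4f}")
```

All three values fall well below 0.1, a widely used rule-of-thumb threshold for imbalance, which is what randomized assignment should produce.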