DGP generate_classic_rct_26

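This notebook walks through the `generate_classic_rct_26` data-generating process (DGP): the math behind it, the generated data, exploratory analysis of the outcome, and a covariate balance check.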

Math Explanation of the `generate_classic_rct_26` DGP

The `generate_classic_rct_26` function generates a synthetic dataset for a classic randomized controlled trial (RCT). Treatment is assigned completely at random, so the covariates $X$ are prognostic (they affect the outcome) but do not influence treatment assignment; there is no confounding.

By default, it simulates a conversion experiment (binary outcome) with 10,000 samples and a 50/50 split.

Covariate Generation (Confounders)

Three binary covariates $X = [x_1, x_2, x_3]$ are generated independently. The dataset labels them confounders, but in this RCT they only affect the outcome:

`platform_ios` ($x_1$): $x_1 \sim \text{Bernoulli}(0.5)$
`country_usa` ($x_2$): $x_2 \sim \text{Bernoulli}(0.6)$
`source_paid` ($x_3$): $x_3 \sim \text{Bernoulli}(0.3)$
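The three draws above can be sketched with NumPy. This is a minimal illustration, not the library's implementation: the seed, the variable names, and the dictionary layout are assumptions made for the example.

```python
import numpy as np

# Illustrative sketch only; names and seed are assumptions, not library internals.
rng = np.random.default_rng(42)
n = 10_000

# Success probability of each binary covariate, as stated in the text.
covariate_probs = {"platform_ios": 0.5, "country_usa": 0.6, "source_paid": 0.3}

# Draw each covariate independently from its own Bernoulli distribution.
X = {name: rng.binomial(1, p, size=n) for name, p in covariate_probs.items()}

for name, draws in X.items():
    print(f"{name}: sample mean = {draws.mean():.3f}")
```

With 10,000 rows, each sample mean should sit close to its target probability.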

Treatment Assignment ($D$)

Because this is an RCT, the treatment $D$ is independent of $X$. It is assigned with probability $P(D=1) = 0.5$: $D \sim \text{Bernoulli}(0.5)$. Equivalently, the log-odds of treatment (intercept $\alpha_d$) is $\text{logit}(0.5) = 0$.
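A short sketch of what independence from $X$ means in practice: the treatment draw never looks at the covariates, so the treated share should be about 0.5 within every covariate level. The seed and the single stand-in covariate are assumptions made for this example.

```python
import numpy as np

# Illustrative sketch; the seed and variable names are assumptions.
rng = np.random.default_rng(0)
n = 10_000

x = rng.binomial(1, 0.3, size=n)  # stand-in covariate, e.g. source_paid
d = rng.binomial(1, 0.5, size=n)  # treatment is drawn without reference to x

# Treatment intercept on the log-odds scale: logit(0.5) = 0.
alpha_d = np.log(0.5 / (1 - 0.5))

# Share of treated units within each covariate level.
share_treated_by_x = [d[x == level].mean() for level in (0, 1)]
print(alpha_d, share_treated_by_x)
```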

Outcome Generation (`conversion`)

The outcome is a binary variable representing conversion. The probability of conversion for an individual is modeled using a logistic link function: $P(\text{conversion}=1 \mid D, X) = \sigma(L)$ where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function.

The latent linear predictor $L$ is defined as: $L = \alpha_y + \sum_{j=1}^3 \beta_{y,j} x_j + g_y(X) + D \cdot \theta$

By default, $g_y(X)=0$ because `add_pre=False`. The nonlinear term is included only when `add_pre=True` (or when `g_y`/`use_prognostic` is provided).

Baseline Intercept ($\alpha_y$): Derived from the target control conversion rate $p_A = 0.10$: $\alpha_y = \text{logit}(0.10) = \ln\left(\frac{0.10}{0.90}\right) \approx -2.197$. This fixes the control rate at 10% only for $X=0$; the marginal control rate differs once $X$ shifts the log-odds.
Treatment Effect ($\theta$): Derived from the target treatment conversion rate $p_B = 0.11$ as a shift in log-odds: $\theta = \text{logit}(0.11) - \text{logit}(0.10) = \ln\left(\frac{0.11}{0.89}\right) - \ln\left(\frac{0.10}{0.90}\right) \approx 0.106$. The shift is constant on the log-odds scale, so its size on the probability scale depends on $X$; the marginal ATE is not exactly 1 percentage point once $X$ effects are present (the ground-truth ATE reported below is about 0.017).
Prognostic Coefficients ($\beta_y$): By default, $\beta_y = [0.6, 0.4, 0.8]$. Each value is the shift in the log-odds of conversion when the corresponding covariate equals 1.
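Because the covariates are binary, the population quantities implied by these formulas can be computed exactly by enumerating the eight covariate cells. The sketch below assumes the default $g_y(X)=0$ and the covariate probabilities given earlier; the helper names are illustrative and not part of the library.

```python
import itertools
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

p_a, p_b = 0.10, 0.11
alpha_y = logit(p_a)             # baseline intercept
theta = logit(p_b) - logit(p_a)  # treatment shift in log-odds
beta_y = (0.6, 0.4, 0.8)         # prognostic coefficients
x_probs = (0.5, 0.6, 0.3)        # P(x_j = 1) for each covariate

ate = 0.0
control_rate = 0.0
for x in itertools.product((0, 1), repeat=3):
    # Probability of this covariate cell under independent Bernoulli draws.
    weight = math.prod(p if xi else 1 - p for p, xi in zip(x_probs, x))
    base = alpha_y + sum(b * xi for b, xi in zip(beta_y, x))  # assumes g_y(X) = 0
    g0 = sigmoid(base)           # P(conversion | D=0, X=x)
    g1 = sigmoid(base + theta)   # P(conversion | D=1, X=x)
    control_rate += weight * g0
    ate += weight * (g1 - g0)

print(f"alpha_y = {alpha_y:.3f}, theta = {theta:.4f}")
print(f"marginal control rate = {control_rate:.4f}, ATE = {ate:.4f}")
```

Under these assumptions the marginal ATE comes out near 0.017 rather than 0.01, because the positive $\beta_y$ values move most covariate cells into a steeper region of the sigmoid.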

Summary of Default Parameters

$N$: 10,000
Control Conversion (baseline $p_A$): 10% at $X=0$ (marginal rate can differ with $X$)
Treatment Conversion (baseline $p_B$): 11% at $X=0$ (marginal rate can differ with $X$)
Treatment Split: 50%
Confounders: 3 binary
Nonlinear $g_y(X)$: only when `add_pre=True` or `g_y`/`use_prognostic` is provided

DGP

Result

	user_id	d	platform_ios	country_usa	source_paid	m	m_obs	tau_link	g0	g1	cate
0	8826d	0.0	1.0	0.0	1.0	0.5	0.5	0.106483	0.310620	0.333868	0.023249
1	2416d	1.0	0.0	0.0	1.0	0.5	0.5	0.106483	0.198257	0.215727	0.017471
2	eb819	0.0	1.0	1.0	0.0	0.5	0.5	0.106483	0.231969	0.251479	0.019509
3	71445	1.0	1.0	1.0	0.0	0.5	0.5	0.106483	0.231969	0.251479	0.019509
4	13d16	1.0	0.0	1.0	0.0	0.5	0.5	0.106483	0.142189	0.155678	0.013489

Result

Ground truth ATE is 0.01719144406311028
Ground truth ATTE is 0.017278385179220486

Result

CausalData(df=(10000, 5), treatment='d', outcome='conversion', confounders=['platform_ios', 'country_usa', 'source_paid'])

EDA

Result

	treatment	count	mean	std	min	p10	p25	median	p75	p90	max
0	0.0	4955	0.198991	0.399281	0.0	0.0	0.0	0.0	0.0	1.0	1.0
1	1.0	5045	0.232904	0.422723	0.0	0.0	0.0	0.0	0.0	1.0	1.0
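As a rough illustration, the summary rows above are enough to compute a naive difference in means and a normal-approximation standard error. This sketch uses only the numbers printed in the table; the 1.96 multiplier assumes a 95% normal interval, and the result is a single-sample estimate, so it need not match the ground-truth ATE.

```python
import math

# Values copied from the outcome summary table (count, mean, std per arm).
n0, mean0, std0 = 4955, 0.198991, 0.399281  # control, treatment == 0
n1, mean1, std1 = 5045, 0.232904, 0.422723  # treated, treatment == 1

# Naive estimate: difference in conversion rates between arms.
diff = mean1 - mean0

# Standard error of a difference between two independent sample means.
se = math.sqrt(std0**2 / n0 + std1**2 / n1)

# Normal-approximation 95% interval (an assumption of this sketch).
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"diff = {diff:.4f}, se = {se:.4f}, 95% CI = ({ci_low:.4f}, {ci_high:.4f})")
```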

Balance check

Result

	confounders	mean_d_0	mean_d_1	abs_diff	smd	ks_pvalue
0	source_paid	0.299092	0.313776	0.014684	0.031853	0.64592
1	platform_ios	0.494046	0.502874	0.008828	0.017654	0.98861
2	country_usa	0.586276	0.591873	0.005597	0.011374	1.00000
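The `smd` column can be approximately reproduced from the two group means. The sketch below assumes the common definition of the standardized mean difference (difference in means over the pooled standard deviation) and uses the Bernoulli variance $p(1-p)$ for each arm; the library's exact formula is not shown in this notebook and may differ slightly, for example by using sample variances.

```python
import math

def smd_binary(p0, p1):
    """Standardized mean difference for a binary covariate (assumed definition)."""
    pooled_sd = math.sqrt((p0 * (1 - p0) + p1 * (1 - p1)) / 2)
    return (p1 - p0) / pooled_sd

# Group means copied from the balance table: (mean_d_0, mean_d_1).
balance_means = {
    "source_paid": (0.299092, 0.313776),
    "platform_ios": (0.494046, 0.502874),
    "country_usa": (0.586276, 0.591873),
}

smd = {name: smd_binary(p0, p1) for name, (p0, p1) in balance_means.items()}
for name, value in smd.items():
    print(f"{name}: smd = {value:.4f}")
```

A widely used rule of thumb treats an absolute SMD below 0.1 as adequate balance; all three covariates here fall well under that threshold, as expected under randomization.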
