Estimand: Average Treatment Effect (ATE)

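The model assumes random assignment of treatment $D \in \{0, 1\}$. The ATE ($\tau$) is defined on potential outcomes as $\tau = E[Y(1) - Y(0)]$; under random assignment it is identified by the difference in conditional means:
$\tau = E[Y | D=1] - E[Y | D=0]$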
The sample estimator is the difference in group means:
$\hat{\tau} = \bar{Y}_1 - \bar{Y}_0$

Inference Methods

Welch's T-Test (ttest)
Used for continuous outcomes without assuming equal variances.

Standard Error: $SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_0^2}{n_0}}$, where $s_d^2$ is the sample variance and $n_d$ the sample size of group $d$.
Degrees of Freedom (Satterthwaite):
$\nu \approx \frac{\left(\frac{s_1^2}{n_1} + \frac{s_0^2}{n_0}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1-1} + \frac{(s_0^2/n_0)^2}{n_0-1}}$
Confidence Interval: $\hat{\tau} \pm t_{1-\alpha/2, \nu} \cdot SE$
P-value: Two-sided test of $H_0: \tau = 0$ using the statistic $t = \hat{\tau} / SE$ and the $t$-distribution with $\nu$ degrees of freedom.

Conversion Z-Test (conversion_ztest)
Intended for binary (0/1) conversion outcomes.

Proportions: $p_1 = \frac{X_1}{n_1}, p_0 = \frac{X_0}{n_0}$, where $X_i$ is the number of conversions and $n_i$ the number of units in group $i$.
Standard Error (Pooled): $SE_{pooled} = \sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_0}\right)}$ where $\hat{p} = \frac{X_1 + X_0}{n_1 + n_0}$
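Absolute CI (Newcombe-style): Combines the per-group Wilson score intervals by subtracting opposite bounds:
$CI = [L_1 - U_0, U_1 - L_0]$
where $[L_i, U_i]$ are the Wilson score bounds for proportion $p_i$. This interval is conservative: it is at least as wide as Newcombe's (1998) hybrid score interval, which places the squared distances $(p_1 - L_1)^2 + (U_0 - p_0)^2$ and $(U_1 - p_1)^2 + (p_0 - L_0)^2$ under a square root instead of adding the distances directly.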

Bootstrap (bootstrap)
Non-parametric estimation by resampling data with replacement $B$ times.

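Absolute CI: Percentile-based interval $[\hat{\tau}^*_{[\alpha/2]}, \hat{\tau}^*_{[1-\alpha/2]}]$, where $\hat{\tau}^*_{[q]}$ is the $q$-quantile of the $B$ bootstrap estimates.
P-value: Normal approximation using the bootstrap standard error $s_{boot}$ (the standard deviation of the bootstrap estimates):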
$Z = \frac{\hat{\tau}}{s_{boot}}, \quad p = 2 \cdot (1 - \Phi(|Z|))$

Relative Lift and Delta Method

The relative lift is calculated as:
$Lift (\%) = 100 \cdot \left(\frac{\bar{Y}_1}{\bar{Y}_0} - 1\right)$
Its variance is estimated via the Delta Method, treating the two group means as independent:
$Var(Lift/100) \approx \frac{1}{\bar{Y}_0^2} Var(\bar{Y}_1) + \frac{\bar{Y}_1^2}{\bar{Y}_0^4} Var(\bar{Y}_0)$
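
The lift and its Delta Method variance can be sketched in Python as follows. This is a minimal illustration, not the package's own code: the function name `relative_lift`, the `eps` threshold for the near-zero control-mean guard, and the normal-approximation interval built from the Delta Method standard error are all assumptions made for the example.

```python
import math
from statistics import NormalDist, fmean, variance


def relative_lift(y1, y0, alpha=0.05, eps=1e-12):
    """Relative lift (%) with a Delta Method standard error (hypothetical helper)."""
    m1, m0 = fmean(y1), fmean(y0)
    if abs(m0) < eps:
        # The ratio is unstable when the control mean is near zero.
        raise ValueError("control mean is too close to zero for a relative lift")
    var_m1 = variance(y1) / len(y1)  # Var of the treatment mean (sample variance / n)
    var_m0 = variance(y0) / len(y0)  # Var of the control mean
    # Delta Method variance of m1 / m0 for independent group means.
    var_ratio = var_m1 / m0**2 + (m1**2 / m0**4) * var_m0
    se_pct = 100 * math.sqrt(var_ratio)
    lift_pct = 100 * (m1 / m0 - 1)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # assumed normal-approximation CI
    return {
        "lift_pct": lift_pct,
        "se_pct": se_pct,
        "ci_pct": (lift_pct - z * se_pct, lift_pct + z * se_pct),
    }
```

With treatment values `[11.0, 12.0, 13.0, 12.0]` and control values `[9.0, 10.0, 11.0, 10.0]`, the group means are 12 and 10, so the sketch reports a lift of 20%.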

Pseudo-code

Model Wrapper
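
A minimal Python sketch of a wrapper that splits the data by treatment arm, computes $\hat{\tau}$, and dispatches to an inference routine by method name. The names `ATEModel`, `ATEResult`, and `estimate`, and the idea of passing inference functions in as a dictionary, are hypothetical choices for illustration; the actual wrapper may be organized differently.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Callable, Dict, Sequence


@dataclass
class ATEResult:
    estimate: float   # difference in group means, tau_hat
    method: str       # name of the inference method that was used
    inference: dict   # whatever the chosen inference function returned


class ATEModel:
    """Hypothetical wrapper: split by treatment, then dispatch by method name."""

    def __init__(self, methods: Dict[str, Callable[[list, list], dict]]):
        self.methods = methods

    def estimate(self, y: Sequence[float], d: Sequence[int],
                 method: str = "ttest") -> ATEResult:
        if len(y) != len(d):
            raise ValueError("y and d must have the same length")
        if method not in self.methods:
            raise ValueError(f"unknown method: {method!r}")
        y1 = [yi for yi, di in zip(y, d) if di == 1]  # treated outcomes
        y0 = [yi for yi, di in zip(y, d) if di == 0]  # control outcomes
        if not y1 or not y0:
            raise ValueError("both treatment groups must be non-empty")
        tau_hat = fmean(y1) - fmean(y0)
        return ATEResult(tau_hat, method, self.methods[method](y1, y0))


# Usage with a placeholder inference function standing in for a real one.
model = ATEModel({"ttest": lambda y1, y0: {"n1": len(y1), "n0": len(y0)}})
result = model.estimate([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
```

Keeping the inference routines behind a name-to-function mapping means the keys can mirror the method identifiers used above (`ttest`, `conversion_ztest`, `bootstrap`) without the wrapper knowing their internals.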

Welch's T-Test Inference
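
A minimal Python sketch of the Welch inference described above: standard error, Satterthwaite degrees of freedom, confidence interval, and two-sided p-value. It assumes SciPy is available for the $t$-distribution; the function name `welch_ttest` and the returned dictionary keys are illustrative, not the package's actual interface.

```python
import math
from statistics import fmean, variance

from scipy import stats


def welch_ttest(y1, y0, alpha=0.05):
    """Welch's unequal-variance t-test for the difference in means (sketch)."""
    n1, n0 = len(y1), len(y0)
    if n1 < 2 or n0 < 2:
        raise ValueError("each group needs at least two observations")
    v1 = variance(y1) / n1  # s_1^2 / n_1 (sample variance uses n - 1)
    v0 = variance(y0) / n0  # s_0^2 / n_0
    tau_hat = fmean(y1) - fmean(y0)
    se = math.sqrt(v1 + v0)
    if se == 0:
        raise ValueError("both groups have zero variance")
    # Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v0) ** 2 / (v1**2 / (n1 - 1) + v0**2 / (n0 - 1))
    t_crit = float(stats.t.ppf(1 - alpha / 2, df))
    t_stat = tau_hat / se
    p_value = float(2 * stats.t.sf(abs(t_stat), df))  # two-sided
    return {
        "estimate": tau_hat,
        "se": se,
        "df": df,
        "ci": (tau_hat - t_crit * se, tau_hat + t_crit * se),
        "p_value": p_value,
    }
```

The p-value should agree with `scipy.stats.ttest_ind(y1, y0, equal_var=False)`, which is a convenient cross-check.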

Conversion Z-Test
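
A minimal Python sketch of the conversion test using only the standard library. The Wilson bounds and the $[L_1 - U_0, U_1 - L_0]$ interval follow the formulas above. The test statistic $Z = (p_1 - p_0) / SE_{pooled}$ with a two-sided normal p-value is the usual two-proportion z-test and is assumed here, since the text does not spell it out; the function names are hypothetical.

```python
import math
from statistics import NormalDist


def wilson_interval(x, n, alpha=0.05):
    """Wilson score interval for a single binomial proportion x / n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = x / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def conversion_ztest(x1, n1, x0, n0, alpha=0.05):
    """Pooled two-proportion z-test with a Wilson-based absolute CI (sketch)."""
    p1, p0 = x1 / n1, x0 / n0
    p_pool = (x1 + x0) / (n1 + n0)
    se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n0))
    if se_pooled == 0:
        raise ValueError("pooled proportion is 0 or 1; the z-test is undefined")
    z_stat = (p1 - p0) / se_pooled  # assumed test statistic
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    l1, u1 = wilson_interval(x1, n1, alpha)
    l0, u0 = wilson_interval(x0, n0, alpha)
    return {
        "estimate": p1 - p0,
        "se_pooled": se_pooled,
        "z": z_stat,
        "p_value": p_value,
        "ci": (l1 - u0, u1 - l0),  # subtract opposite Wilson bounds
    }
```

For example, `conversion_ztest(120, 1000, 100, 1000)` compares a 12% treatment conversion rate with a 10% control rate.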

Bootstrap Inference
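
A minimal Python sketch of the bootstrap procedure using only the standard library. It assumes each arm is resampled independently with replacement, which the description above does not specify; the function name `bootstrap_ate`, the linear-interpolation quantile helper, and the default `n_boot` are likewise illustrative choices.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def _quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def bootstrap_ate(y1, y0, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI and normal-approximation p-value for the ATE (sketch)."""
    rng = random.Random(seed)  # local generator keeps results reproducible
    tau_hat = fmean(y1) - fmean(y0)
    draws = []
    for _ in range(n_boot):
        b1 = rng.choices(y1, k=len(y1))  # resample treatment arm (assumption)
        b0 = rng.choices(y0, k=len(y0))  # resample control arm (assumption)
        draws.append(fmean(b1) - fmean(b0))
    draws.sort()
    s_boot = stdev(draws)  # bootstrap standard error
    if s_boot == 0:
        raise ValueError("bootstrap estimates have zero spread")
    z_stat = tau_hat / s_boot
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    ci = (_quantile(draws, alpha / 2), _quantile(draws, 1 - alpha / 2))
    return {"estimate": tau_hat, "se": s_boot, "ci": ci, "p_value": p_value}
```

Fixing `seed` makes repeated calls return identical results, which helps when comparing the bootstrap interval against the analytic ones.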

References

Estimand / difference-in-means ATE under random assignment

Neyman (1923; English translation 1990). Potential outcomes framework for randomized experiments; difference-in-means and its sampling properties under randomization.
Rubin (1974). Formalizes causal effects via potential outcomes for randomized (and nonrandomized) studies; motivates the ATE as an estimand.
Freedman (2008). Uses the Neyman/Rubin potential-outcomes model to discuss inference and common adjustments in experiments (helpful for the “design-based” framing around the difference in means).
Imbens & Rubin (2015). Book; a standard citation for ATE estimands and sampling variances for average causal effects.

Welch’s t-test + Satterthwaite df

Student (1908). Original t distribution and the motivation for small-sample inference.
Welch (1947). The unequal-variance two-sample t-test (the modern “Welch’s t-test”). DOI: 10.1093/biomet/34.1-2.28.
Satterthwaite (1946). Effective degrees of freedom approximation (the “Welch–Satterthwaite” df used above).

Conversion z-test + “Newcombe/Wilson-style” absolute CI for two proportions

Wilson score interval (single proportion):

Wilson (1927). The score interval for a binomial proportion (better coverage than Wald).

Newcombe’s CI for the difference of independent proportions (the basis for the “Newcombe-style” interval above, which subtracts Wilson score bounds):

Newcombe (1998). “Interval estimation for the difference between independent proportions: comparison of eleven methods.” The standard citation for the Wilson-score-based risk-difference CI family. DOI: 10.1002/(SICI)1097-0258(19980430)17:8<873::AID-SIM779>3.0.CO;2-I.

On Wilson/score intervals generally outperforming “exact” and Wald-style intervals in practice:

Agresti & Coull (1998). “Approximate is Better than ‘Exact’…” (single-proportion intervals, but commonly cited in the same discussion).
Agresti & Caffo (2000). Simple adjusted intervals for proportions and differences (an alternative to Newcombe; useful as a related-work citation).

Bootstrap percentile CI + bootstrap SE / normal-approx p-value

Foundational bootstrap + CI methodology:

Efron (1979). Original bootstrap paper; standard citation for nonparametric bootstrap resampling logic. DOI: 10.1214/aos/1176344552.
Efron (1987). Improved bootstrap CIs (BC/BCa ideas; useful if BCa intervals are added later). DOI: 10.1080/01621459.1987.10478410.
DiCiccio & Efron (1996). Survey of bootstrap confidence intervals (a good umbrella reference). DOI: 10.1214/ss/1032280214.
Hall (1992). Theory and Edgeworth accuracy; useful for justifying why percentile, studentized, and BCa intervals differ in coverage.

A single practitioner-friendly reference for bootstrap standard errors and percentiles is Efron & Tibshirani’s book (commonly cited, though a book rather than a paper).

Relative lift (ratio) + Delta method variance

Oehlert (1992). Clean, standard citation for delta method approximations. DOI: 10.1080/00031305.1992.10475842.

Because the pseudo-code explicitly guards against a control mean near zero (ratio instability), ratio-CI alternatives are also commonly cited:

Fieller (1954). Fieller-type confidence intervals for ratios (the classic reference when denominators can be near 0).