Estimand: Average Treatment Effect (ATE)
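The model assumes random assignment of treatment $T \in \{0, 1\}$, with potential outcomes $Y(1)$ and $Y(0)$. The ATE ($\tau$) is defined as:
$$\tau = \mathbb{E}[Y(1) - Y(0)]$$
The sample estimator is the difference in group means:
$$\hat{\tau} = \bar{Y}_1 - \bar{Y}_0$$
where $\bar{Y}_t$ is the mean observed outcome among the $n_t$ units with $T = t$.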
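The following sketch illustrates why the difference in means targets the ATE under random assignment. It fixes a hypothetical set of potential outcomes with a constant effect of 2, re-randomizes treatment many times, and averages the estimates. The simulated data and the helper names `diff_in_means` and `randomization_mean_estimate` are illustrative assumptions, not part of the library.

```python
import random
import statistics


def diff_in_means(y1, y0):
    """Difference-in-means estimate: mean(treated) - mean(control)."""
    return statistics.fmean(y1) - statistics.fmean(y0)


def randomization_mean_estimate(n_units=200, n_draws=2000, effect=2.0, seed=0):
    """Average the estimator over many random assignments of the same units.

    Hypothetical potential outcomes: Y(0) ~ N(10, 1) and Y(1) = Y(0) + effect,
    so the true (finite-sample) ATE equals `effect`.
    """
    rng = random.Random(seed)
    y0 = [rng.gauss(10.0, 1.0) for _ in range(n_units)]
    y1 = [y + effect for y in y0]
    true_ate = statistics.fmean(y1) - statistics.fmean(y0)

    estimates = []
    for _ in range(n_draws):
        # Complete randomization: exactly half of the units are treated.
        treated = set(rng.sample(range(n_units), n_units // 2))
        obs_t = [y1[i] for i in range(n_units) if i in treated]
        obs_c = [y0[i] for i in range(n_units) if i not in treated]
        estimates.append(diff_in_means(obs_t, obs_c))
    return true_ate, statistics.fmean(estimates)


true_ate, avg_est = randomization_mean_estimate()
print(f"true ATE = {true_ate:.3f}, mean estimate over draws = {avg_est:.3f}")
```

Across re-randomizations the average estimate sits very close to the true ATE, which is the unbiasedness property the estimator relies on; any single draw can still miss by roughly its standard error.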
Inference Methods
Welch's T-Test (ttest)
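Used for continuous outcomes; does not assume equal variances in the two groups.
- Standard Error: $\widehat{SE} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_0^2}{n_0}}$, where $s_t^2$ is the sample variance in group $t$.
- Degrees of Freedom (Satterthwaite): $\nu = \dfrac{\left(\frac{s_1^2}{n_1} + \frac{s_0^2}{n_0}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_0^2/n_0)^2}{n_0 - 1}}$
- Confidence Interval: $\hat{\tau} \pm t_{1-\alpha/2,\,\nu} \cdot \widehat{SE}$
- P-value: Two-sided test based on the $t$-distribution with $\nu$ degrees of freedom, using the statistic $t = \hat{\tau} / \widehat{SE}$.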
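The Welch formulas above can be sketched in plain Python. Because the standard library has no Student $t$ distribution, this sketch approximates its CDF with Simpson's rule and inverts it by bisection; in practice a statistics library (for example `scipy.stats.t`, which provides `cdf` and `ppf`) would supply these. The helper names `t_pdf`, `t_cdf`, `t_ppf`, and `welch_ttest` are hypothetical and assume each group has at least two observations with nonzero variance.

```python
import math
import statistics


def t_pdf(x, df):
    """Density of Student's t distribution with `df` degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """Approximate CDF: Simpson's rule on [0, |x|] plus symmetry about 0."""
    a = abs(x)
    h = a / steps  # `steps` must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    mass = min(total * h / 3, 0.5)  # probability between 0 and |x|
    return 0.5 + mass if x >= 0 else 0.5 - mass


def t_ppf(q, df):
    """Upper quantile (0.5 < q < 1) found by bisection on t_cdf."""
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < q:  # widen the bracket until it contains the quantile
        hi *= 2.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def welch_ttest(y1, y0, alpha=0.05):
    """Welch's t-test for mean(y1) - mean(y0) with Satterthwaite df."""
    n1, n0 = len(y1), len(y0)
    v1 = statistics.variance(y1) / n1  # s_1^2 / n_1 (sample variance, n - 1 divisor)
    v0 = statistics.variance(y0) / n0  # s_0^2 / n_0
    est = statistics.fmean(y1) - statistics.fmean(y0)
    se = math.sqrt(v1 + v0)
    df = (v1 + v0) ** 2 / (v1 ** 2 / (n1 - 1) + v0 ** 2 / (n0 - 1))
    t_stat = est / se
    p_value = 2 * (1 - t_cdf(abs(t_stat), df))
    half_width = t_ppf(1 - alpha / 2, df) * se
    return {"estimate": est, "se": se, "df": df, "t": t_stat,
            "p_value": p_value, "ci": (est - half_width, est + half_width)}
```

For instance, `welch_ttest([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0])` has equal variances and equal group sizes, so the Satterthwaite degrees of freedom reduce to $n_1 + n_0 - 2 = 6$.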
Conversion Z-Test (conversion_ztest)
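Intended for binary (converted / not converted) outcomes.
- Proportions: $\hat{p}_t = x_t / n_t$, where $x_t$ is the number of conversions in group $t$.
- Standard Error (Pooled): $\widehat{SE} = \sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_0}\right)}$, where $\hat{p} = \frac{x_1 + x_0}{n_1 + n_0}$; the test statistic is $z = (\hat{p}_1 - \hat{p}_0) / \widehat{SE}$.
- Absolute CI (Newcombe-style): constructed from the Wilson score intervals of the two proportions, where $(l_t, u_t)$ are the Wilson score bounds for proportion $\hat{p}_t$.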
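A minimal sketch of the conversion test follows. It uses the pooled $z$-test above and, for the interval, Newcombe's (1998) hybrid score method, which combines each group's distance to its Wilson bounds in quadrature. This is one specific choice; the library's "Newcombe-style" interval may combine the Wilson bounds differently. The function names `wilson_interval` and `conversion_ztest` are hypothetical.

```python
import math
from statistics import NormalDist


def wilson_interval(x, n, alpha=0.05):
    """Wilson score interval (l, u) for a binomial proportion x / n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = x / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def conversion_ztest(x1, n1, x0, n0, alpha=0.05):
    """Pooled two-proportion z-test plus Newcombe's hybrid score CI."""
    p1, p0 = x1 / n1, x0 / n0
    diff = p1 - p0
    # Pooled proportion under H0: both groups share one conversion rate.
    p_pool = (x1 + x0) / (n1 + n0)
    se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n0))
    if se_pooled > 0:
        z_stat = diff / se_pooled
        p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    else:  # every unit converted, or none did, in both groups
        z_stat, p_value = 0.0, 1.0
    # Combine the two Wilson intervals (hybrid score method).
    l1, u1 = wilson_interval(x1, n1, alpha)
    l0, u0 = wilson_interval(x0, n0, alpha)
    lower = diff - math.sqrt((p1 - l1) ** 2 + (u0 - p0) ** 2)
    upper = diff + math.sqrt((u1 - p1) ** 2 + (p0 - l0) ** 2)
    return {"diff": diff, "se_pooled": se_pooled, "z": z_stat,
            "p_value": p_value, "ci": (lower, upper)}
```

With 56/70 versus 48/80 conversions this gives a difference of 0.2, $z \approx 2.65$, and an interval of roughly (0.052, 0.334). Unlike a Wald interval, the Wilson-based bounds stay inside [0, 1] for each group even when a proportion is 0 or 1.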
Bootstrap (bootstrap)
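Non-parametric estimation by resampling the data with replacement $B$ times.
- Absolute CI: Percentile-based interval $\left[\hat{\tau}^*_{(\alpha/2)},\ \hat{\tau}^*_{(1-\alpha/2)}\right]$, where $\hat{\tau}^*_{(q)}$ is the $q$-quantile of the $B$ bootstrap estimates.
- P-value: Normal approximation using the bootstrap standard error $\widehat{SE}_{\text{boot}}$ (the standard deviation of the $B$ bootstrap estimates): $z = \hat{\tau} / \widehat{SE}_{\text{boot}}$, $p = 2\,(1 - \Phi(|z|))$.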
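A sketch of the bootstrap procedure for the difference in means is shown below. It assumes resampling is done separately within each arm (keeping arm sizes fixed) and uses one simple order-statistic convention for the percentile bounds; both are illustrative choices, and `bootstrap_diff_in_means` is a hypothetical name.

```python
import math
import random
import statistics
from statistics import NormalDist


def bootstrap_diff_in_means(y1, y0, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI and normal-approximation p-value for mean(y1) - mean(y0)."""
    rng = random.Random(seed)
    est = statistics.fmean(y1) - statistics.fmean(y0)
    boot = []
    for _ in range(n_boot):
        # Resample each arm with replacement, keeping the arm sizes fixed.
        r1 = rng.choices(y1, k=len(y1))
        r0 = rng.choices(y0, k=len(y0))
        boot.append(statistics.fmean(r1) - statistics.fmean(r0))
    boot.sort()
    # Percentile interval from order statistics of the sorted replicates.
    lo_idx = math.floor(alpha / 2 * n_boot)
    hi_idx = math.ceil((1 - alpha / 2) * n_boot) - 1
    ci = (boot[lo_idx], boot[hi_idx])
    se_boot = statistics.stdev(boot)  # bootstrap standard error
    if se_boot == 0:
        raise ValueError("bootstrap SE is zero; the data show no variation")
    z = est / se_boot
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {"estimate": est, "se_boot": se_boot, "ci": ci, "p_value": p_value}
```

Note that the interval and the p-value come from different approximations: the percentile CI uses the empirical bootstrap quantiles directly, while the p-value assumes the bootstrap distribution is approximately normal, so the two can occasionally disagree near the significance threshold.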
Relative Lift and Delta Method
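The relative lift is calculated as:
$$\hat{L} = \frac{\bar{Y}_1 - \bar{Y}_0}{\bar{Y}_0} = \frac{\bar{Y}_1}{\bar{Y}_0} - 1$$
Its variance is estimated via the delta method, treating the two group means as independent:
$$\widehat{\mathrm{Var}}(\hat{L}) \approx \frac{s_1^2 / n_1}{\bar{Y}_0^2} + \frac{\bar{Y}_1^2 \, s_0^2 / n_0}{\bar{Y}_0^4}$$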
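The delta-method step can be sketched as follows: the gradient of $g(a, b) = a/b - 1$ is $(1/b,\ -a/b^2)$, which yields the two variance terms above. The near-zero guard threshold `eps` and the Wald-type normal interval are assumptions for illustration, and `relative_lift_delta` is a hypothetical name.

```python
import math
import statistics
from statistics import NormalDist


def relative_lift_delta(y1, y0, alpha=0.05, eps=1e-12):
    """Relative lift mean(y1) / mean(y0) - 1 with a delta-method standard error."""
    m1, m0 = statistics.fmean(y1), statistics.fmean(y0)
    if abs(m0) < eps:
        # The ratio is unstable when the control mean is (near) zero.
        raise ValueError("control mean is too close to zero for a relative lift")
    var_m1 = statistics.variance(y1) / len(y1)  # s_1^2 / n_1
    var_m0 = statistics.variance(y0) / len(y0)  # s_0^2 / n_0
    lift = m1 / m0 - 1
    # Gradient (1 / m0, -m1 / m0**2) applied to independent group means.
    var_lift = var_m1 / m0 ** 2 + m1 ** 2 * var_m0 / m0 ** 4
    se = math.sqrt(var_lift)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # assumed Wald-type interval
    return {"lift": lift, "se": se, "ci": (lift - z * se, lift + z * se)}
```

For treated values [11, 12, 13] and control values [9, 10, 11], the lift is 0.2 and the delta-method variance is $1/300 + 0.0048 \approx 0.0081$.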
Pseudo-code
Model Wrapper
Welch's T-Test Inference
Conversion Z-Test
Bootstrap Inference
References
Estimand: difference-in-means ATE under random assignment
- Neyman (1923; English translation 1990): potential outcomes framework for randomized experiments; difference-in-means and its sampling properties under randomization.
- Rubin (1974): formalizes causal effects via potential outcomes for randomized and nonrandomized studies; motivates the ATE as an estimand.
- Freedman (2008): uses the Neyman/Rubin potential-outcomes model to discuss inference and common adjustments in experiments; useful for the design-based framing of difference-in-means.
- Imbens & Rubin (2015): a book, but a standard reference for ATE estimands and sampling variances of average causal effects.
Welch's t-test and Satterthwaite degrees of freedom
- Student (1908): original t distribution and motivation for small-sample inference.
- Welch (1947): the unequal-variance two-sample t-test (the modern "Welch's t-test"). DOI: 10.1093/biomet/34.1-2.28.
- Satterthwaite (1946): effective degrees-of-freedom approximation (the "Welch–Satterthwaite" df used above).
Conversion z-test and Newcombe/Wilson-style absolute CI for two proportions
Wilson score interval (single proportion):
- Wilson (1927): the score interval for a binomial proportion, with better coverage than the Wald interval.
Newcombe's CI for the difference of independent proportions (the basis of the "Newcombe-style" interval built from Wilson score bounds):
- Newcombe (1998): "Interval estimation for the difference between independent proportions: comparison of eleven methods." The standard citation for the Wilson-score-based family of risk-difference CIs. DOI: 10.1002/(SICI)1097-0258(19980430)17:8<873::AID-SIM779>3.0.CO;2-I.
Related work on score intervals outperforming "exact" and Wald-style intervals in practice:
- Agresti & Coull (1998): "Approximate is Better than 'Exact'…" (single-proportion intervals, commonly cited in the same discussion).
- Agresti & Caffo (2000): simple adjusted intervals for proportions and their differences; an alternative to Newcombe's interval.
Bootstrap percentile CI, bootstrap SE, and normal-approximation p-value
Foundational bootstrap and CI methodology:
- Efron (1979): the original bootstrap paper; standard citation for nonparametric bootstrap resampling. DOI: 10.1214/aos/1176344552.
- Efron (1987): improved bootstrap CIs (bias-corrected and BCa intervals; relevant if BCa is added later). DOI: 10.1080/01621459.1987.10478410.
- DiCiccio & Efron (1996): survey of bootstrap confidence intervals. DOI: 10.1214/ss/1032280214.
- Hall (1992): bootstrap theory and Edgeworth-expansion accuracy; explains why percentile, studentized, and BCa intervals differ in coverage.
- Efron & Tibshirani's book: a practitioner-friendly reference for bootstrap standard errors and percentile intervals.
Relative lift (ratio) and delta-method variance
- Oehlert (1992): standard citation for delta-method approximations. DOI: 10.1080/00031305.1992.10475842.
Because the pseudo-code guards against a control mean near zero (where the ratio is unstable), ratio-specific confidence intervals are a common alternative:
- Fieller (1954): Fieller-type confidence intervals for ratios; the classic reference when the denominator can be near zero.