Estimand: Average Treatment Effect (ATE)
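The model assumes random assignment of treatment $W_i \in \{0, 1\}$. The ATE ($\tau$) is defined as:
$$\tau = \mathbb{E}[Y_i(1) - Y_i(0)]$$
where $Y_i(1)$ and $Y_i(0)$ are unit $i$'s potential outcomes under treatment and control.
The sample estimator is the difference in group means:
$$\hat{\tau} = \bar{Y}_T - \bar{Y}_C$$
where $\bar{Y}_T$ and $\bar{Y}_C$ are the observed mean outcomes in the treatment and control groups.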
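As a minimal sketch of the difference-in-means estimator, assuming outcomes and 0/1 assignment flags arrive as two parallel lists (the function name `difference_in_means` is hypothetical, not part of any documented API):

```python
from statistics import mean


def difference_in_means(outcomes, assignments):
    """Estimate the ATE as mean(treated) - mean(control).

    `assignments` holds 1 for treated units and 0 for control units.
    """
    treated = [y for y, w in zip(outcomes, assignments) if w == 1]
    control = [y for y, w in zip(outcomes, assignments) if w == 0]
    if not treated or not control:
        raise ValueError("both groups need at least one observation")
    return mean(treated) - mean(control)


ate_hat = difference_in_means([3.0, 1.0, 4.0, 2.0], [1, 0, 1, 0])
```

Under random assignment the two group means are unbiased for the mean potential outcomes, which is why this simple difference targets the ATE.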
Inference Methods
Welch's T-Test (ttest)
Used for continuous outcomes without assuming equal variances.
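- Standard Error: $\mathrm{SE} = \sqrt{\frac{s_T^2}{n_T} + \frac{s_C^2}{n_C}}$, where $s_g^2$ is the sample variance and $n_g$ the sample size of group $g \in \{T, C\}$.
- Degrees of Freedom (Satterthwaite): $\nu = \frac{\left(\frac{s_T^2}{n_T} + \frac{s_C^2}{n_C}\right)^2}{\frac{(s_T^2 / n_T)^2}{n_T - 1} + \frac{(s_C^2 / n_C)^2}{n_C - 1}}$
- Confidence Interval: $\hat{\tau} \pm t_{1 - \alpha/2,\, \nu} \cdot \mathrm{SE}$, where $\hat{\tau} = \bar{Y}_T - \bar{Y}_C$ is the difference in group means.
- P-value: Two-sided test based on the $t$-distribution with $\nu$ degrees of freedom.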
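The Welch steps listed above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: the function names are hypothetical, and the Student-t CDF is approximated by Simpson's rule only to keep the example dependency-free (a library t-distribution would normally be used instead).

```python
import math
from statistics import mean, variance


def t_cdf(x, df, steps=2000):
    """Student-t CDF via Simpson's rule on the density (stand-in for a library call)."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(u):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(u * u / df))

    upper = abs(x)
    h = upper / steps  # steps must be even for Simpson's rule
    area = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    area *= h / 3
    return 0.5 + math.copysign(area, x)


def t_ppf(q, df):
    """Student-t quantile by bisection on t_cdf."""
    lo, hi = -1000.0, 1000.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def welch_ttest(treatment, control, alpha=0.05):
    """Welch standard error, Satterthwaite df, two-sided p-value, and CI."""
    n_t, n_c = len(treatment), len(control)
    v_t = variance(treatment) / n_t  # s_T^2 / n_T
    v_c = variance(control) / n_c    # s_C^2 / n_C
    diff = mean(treatment) - mean(control)
    se = math.sqrt(v_t + v_c)
    df = (v_t + v_c) ** 2 / (v_t ** 2 / (n_t - 1) + v_c ** 2 / (n_c - 1))
    t_stat = diff / se
    p_value = 2 * (1 - t_cdf(abs(t_stat), df))
    half_width = t_ppf(1 - alpha / 2, df) * se
    return {"diff": diff, "se": se, "df": df, "t_stat": t_stat,
            "p_value": p_value, "ci": (diff - half_width, diff + half_width)}


result = welch_ttest([2.0, 3.0, 4.0, 5.0, 6.0], [1.0, 2.0, 3.0, 4.0, 5.0])
```

With equal group sizes and equal sample variances, as in this toy input, the Satterthwaite degrees of freedom reduce to $2(n - 1)$, which gives a quick sanity check on the formula.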
Conversion Z-Test (conversion_ztest)
Optimized for binary conversion outcomes.
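- Proportions: $\hat{p}_T = x_T / n_T$ and $\hat{p}_C = x_C / n_C$, where $x_g$ is the number of conversions and $n_g$ the sample size in group $g \in \{T, C\}$.
- Standard Error (Pooled): $\mathrm{SE} = \sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_T} + \frac{1}{n_C}\right)}$, where $\hat{p} = \frac{x_T + x_C}{n_T + n_C}$.
- Absolute CI (Newcombe-style): Uses the difference of Wilson score intervals:
$$\left[(\hat{p}_T - \hat{p}_C) - \sqrt{(\hat{p}_T - l_T)^2 + (u_C - \hat{p}_C)^2},\; (\hat{p}_T - \hat{p}_C) + \sqrt{(u_T - \hat{p}_T)^2 + (\hat{p}_C - l_C)^2}\right]$$
where $(l_g, u_g)$ are the Wilson score bounds for proportion $\hat{p}_g$.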
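The conversion z-test and the Newcombe-style interval can be sketched as follows. This is a hedged illustration using only the standard library; the function names and argument order are assumptions, and the pooled conversion rate is assumed to lie strictly between 0 and 1 so the standard error is non-zero.

```python
import math
from statistics import NormalDist


def wilson_bounds(successes, n, alpha=0.05):
    """Wilson score interval for a single binomial proportion."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def conversion_ztest(x_t, n_t, x_c, n_c, alpha=0.05):
    """Pooled two-proportion z-test plus a Newcombe-style CI for the difference."""
    p_t, p_c = x_t / n_t, x_c / n_c
    diff = p_t - p_c
    pooled = (x_t + x_c) / (n_t + n_c)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_c))
    z_stat = diff / se  # assumes 0 < pooled < 1
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    l_t, u_t = wilson_bounds(x_t, n_t, alpha)
    l_c, u_c = wilson_bounds(x_c, n_c, alpha)
    ci_low = diff - math.sqrt((p_t - l_t) ** 2 + (u_c - p_c) ** 2)
    ci_high = diff + math.sqrt((u_t - p_t) ** 2 + (p_c - l_c) ** 2)
    return {"diff": diff, "se": se, "z_stat": z_stat,
            "p_value": p_value, "ci": (ci_low, ci_high)}


result = conversion_ztest(x_t=120, n_t=1000, x_c=100, n_c=1000)
```

Note that the p-value uses the pooled standard error while the interval is built from per-group Wilson bounds, so the two can occasionally disagree near the significance threshold.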
Welch Permutation T-Test (welch_permutation_t_test)
Uses the Welch statistic, but estimates the p-value by repeatedly permuting treatment labels while preserving group sizes.
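- Statistic: $t = \hat{\tau} / \mathrm{SE}$, where $\hat{\tau}$ is the difference in group means and $\mathrm{SE}$ is the Welch standard error.
- Absolute CI: Welch-Satterthwaite interval for $\tau$.
- P-value: Monte Carlo permutation p-value (two-sided) with a +1 correction: $p = \frac{1 + \sum_{b=1}^{B} \mathbf{1}\{|t^{*}_b| \ge |t_{\mathrm{obs}}|\}}{B + 1}$, where $t^{*}_b$ is the statistic from the $b$-th of $B$ permutations and $t_{\mathrm{obs}}$ is the observed statistic.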
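A minimal sketch of the permutation p-value, assuming a two-sided comparison on the absolute Welch statistic and a seeded generator for reproducibility. The function names, the default of 2000 permutations, and the `seed` parameter are illustrative assumptions rather than documented defaults.

```python
import math
import random
from statistics import mean, variance


def welch_stat(a, b):
    """Welch t statistic: difference in means over the unpooled standard error."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se  # assumes the groups are not both constant


def welch_permutation_p_value(treatment, control, n_permutations=2000, seed=0):
    """Monte Carlo permutation p-value with the +1 correction."""
    rng = random.Random(seed)
    observed = abs(welch_stat(treatment, control))
    pooled = list(treatment) + list(control)
    n_t = len(treatment)  # group sizes are preserved in every permutation
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # relabel units at random
        if abs(welch_stat(pooled[:n_t], pooled[n_t:])) >= observed:
            extreme += 1
    return (extreme + 1) / (n_permutations + 1)


treated = [10.1, 10.4, 9.8, 10.9, 10.3, 10.6, 9.9, 10.2]
control = [5.2, 4.8, 5.5, 5.1, 4.9, 5.3, 5.0, 4.7]
p_value = welch_permutation_p_value(treated, control, n_permutations=500)
```

The +1 in numerator and denominator counts the observed labelling as one of the permutations, so the p-value can never be exactly zero and its smallest possible value is $1 / (B + 1)$.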
Relative Lift and Delta Method
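The relative lift is calculated as:
$$\hat{\rho} = \frac{\bar{Y}_T - \bar{Y}_C}{\bar{Y}_C} = \frac{\bar{Y}_T}{\bar{Y}_C} - 1$$
Its variance is estimated via the Delta Method:
$$\widehat{\mathrm{Var}}(\hat{\rho}) \approx \frac{s_T^2}{n_T \bar{Y}_C^2} + \frac{\bar{Y}_T^2 \, s_C^2}{n_C \bar{Y}_C^4}$$
where $\bar{Y}_g$, $s_g^2$, and $n_g$ are the sample mean, sample variance, and sample size of group $g \in \{T, C\}$, and the two groups are treated as independent.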
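A short sketch of the lift and its first-order delta-method standard error, assuming independent groups. The function name, the `eps` threshold, and the choice to return `None` when the control mean is near zero are illustrative assumptions; the source only indicates that such a guard exists.

```python
import math
from statistics import mean, variance


def relative_lift(treatment, control, eps=1e-12):
    """Return (lift, standard error) for mean(treatment) / mean(control) - 1.

    Returns None when the control mean is too close to zero for the ratio
    to be stable (hypothetical guard; the threshold is an assumption).
    """
    n_t, n_c = len(treatment), len(control)
    m_t, m_c = mean(treatment), mean(control)
    if abs(m_c) < eps:
        return None
    lift = m_t / m_c - 1
    # First-order delta method for a ratio of independent sample means.
    var = (variance(treatment) / (n_t * m_c ** 2)
           + m_t ** 2 * variance(control) / (n_c * m_c ** 4))
    return lift, math.sqrt(var)


lift, lift_se = relative_lift([2.0, 3.0, 4.0, 5.0, 6.0], [1.0, 2.0, 3.0, 4.0, 5.0])
```

The second variance term grows with $1 / \bar{Y}_C^4$, which is why the approximation degrades quickly as the control mean approaches zero.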
Pseudo-code
Model Wrapper
Welch's T-Test Inference
Conversion Z-Test
Welch Permutation T-Test Inference
References
Estimand / difference-in-means ATE under random assignment
- Neyman (1923; English translation 1990): potential outcomes framework for randomized experiments; difference-in-means and its sampling properties under randomization.
- Rubin (1974): formalizes causal effects via potential outcomes for randomized (and nonrandomized) studies; motivates ATE as an estimand.
- Freedman (2008): uses the Neyman/Rubin potential-outcomes model to discuss inference and common adjustments in experiments (helpful for “design-based” framing around diff-in-means).
- Imbens & Rubin (2015): book; the standard citation for ATE estimands and sampling variances for average causal effects.
Welch’s t-test + Satterthwaite df
- Student (1908): original t distribution and small-sample inference motivation.
- Welch (1947): the unequal-variance two-sample t-test (the modern “Welch’s t-test”). DOI: 10.1093/biomet/34.1-2.28.
- Satterthwaite (1946): effective degrees of freedom approximation (the “Welch–Satterthwaite” degrees of freedom used in the t-test above).
Conversion z-test + “Newcombe/Wilson-style” absolute CI for two proportions
Wilson score interval (single proportion):
- Wilson (1927): the score interval for a binomial proportion (better coverage than Wald).
Newcombe’s CI for the difference of independent proportions (the method used here as “Newcombe-style”, combining the Wilson score bounds of the two groups):
- Newcombe (1998): “Interval estimation for the difference between independent proportions: comparison of eleven methods.” The usual citation for the Wilson-score-based risk-difference CI family. DOI: 10.1002/(SICI)1097-0258(19980430)17:8<873::AID-SIM779>3.0.CO;2-I.
For the broader claim that Wilson/score intervals outperform “exact” and Wald-style intervals in practice:
- Agresti & Coull (1998): “Approximate is Better than ‘Exact’…” (single-proportion intervals, but commonly cited in the same discussion).
- Agresti & Caffo (2000): simple adjusted intervals for proportions and differences (an alternative to Newcombe; useful as a related-work citation).
Permutation tests
- Fisher (1935): randomization/permutation testing as exact design-based inference under random assignment.
- Good (2005): practical reference for permutation, parametric, and bootstrap tests, including Monte Carlo permutation p-values.
Relative lift (ratio) + Delta method variance
- Oehlert (1992): clean, standard citation for delta method approximations. DOI: 10.1080/00031305.1992.10475842.
Because the pseudo-code explicitly guards against a control mean near zero (ratio instability), it is also common to cite ratio-CI alternatives:
- Fieller (1954): Fieller-type confidence intervals for ratios (classic reference when denominators can be near 0).