Scenario

Estimand: Average Treatment Effect (ATE)

The model assumes random assignment of treatment $D \in \{0, 1\}$. The ATE ($\tau$) is defined as:

$$\tau = E[Y \mid D=1] - E[Y \mid D=0]$$
The sample estimator is the difference in group means:
$$\hat{\tau} = \bar{Y}_1 - \bar{Y}_0$$
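
As a minimal sketch of the estimator above, the following computes the difference in group means from an outcome list and a 0/1 assignment list. The function name `diff_in_means` and the toy data are illustrative assumptions, not part of the original model.

```python
from statistics import fmean


def diff_in_means(y, d):
    """Estimate the ATE as mean(Y | D=1) - mean(Y | D=0)."""
    treated = [yi for yi, di in zip(y, d) if di == 1]
    control = [yi for yi, di in zip(y, d) if di == 0]
    if not treated or not control:
        raise ValueError("both groups need at least one observation")
    return fmean(treated) - fmean(control)


# Toy data (hypothetical): first three units treated, last three control.
y = [3.0, 5.0, 4.0, 1.0, 2.0, 3.0]
d = [1, 1, 1, 0, 0, 0]
tau_hat = diff_in_means(y, d)
print(tau_hat)  # treated mean 4.0 minus control mean 2.0
```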

Inference Methods

Welch's T-Test (ttest)
Used for continuous outcomes without assuming equal variances.

  • Standard Error: $SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_0^2}{n_0}}$, where $s^2$ is the sample variance.
  • Degrees of Freedom (Satterthwaite):
    $\nu \approx \frac{\left(\frac{s_1^2}{n_1} + \frac{s_0^2}{n_0}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1-1} + \frac{(s_0^2/n_0)^2}{n_0-1}}$
  • Confidence Interval: $\hat{\tau} \pm t_{1-\alpha/2, \nu} \cdot SE$
  • P-value: Two-sided test based on the $t$-distribution with $\nu$ degrees of freedom.
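
The standard error and Satterthwaite degrees of freedom above can be sketched with the standard library alone. The helper name `welch_summary` and the sample data are assumptions made for illustration. The sketch stops at the test statistic because the Python standard library has no $t$-distribution.

```python
import math
from statistics import fmean, variance


def welch_summary(treated, control):
    """Return (tau_hat, se, df, t_stat) for Welch's unequal-variance test."""
    n1, n0 = len(treated), len(control)
    if n1 < 2 or n0 < 2:
        raise ValueError("each group needs at least two observations")
    v1 = variance(treated) / n1  # s_1^2 / n_1 (sample variance, n - 1 divisor)
    v0 = variance(control) / n0  # s_0^2 / n_0
    se = math.sqrt(v1 + v0)
    if se == 0.0:
        raise ValueError("both groups are constant; the statistic is undefined")
    tau_hat = fmean(treated) - fmean(control)
    # Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v0) ** 2 / (v1 ** 2 / (n1 - 1) + v0 ** 2 / (n0 - 1))
    return tau_hat, se, df, tau_hat / se


# Hypothetical samples with unequal variances.
treated = [2.0, 4.0, 6.0, 8.0]
control = [1.0, 2.0, 3.0, 4.0]
tau_hat, se, df, t_stat = welch_summary(treated, control)
print(tau_hat, se, df, t_stat)
```

If SciPy is available, the critical value and p-value can then be taken from `scipy.stats.t.ppf(1 - alpha / 2, df)` and `2 * scipy.stats.t.sf(abs(t_stat), df)`.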

Conversion Z-Test (conversion_ztest)
Optimized for binary conversion outcomes.

  • Proportions: $p_1 = \frac{X_1}{n_1}, \; p_0 = \frac{X_0}{n_0}$
  • Standard Error (Pooled): $SE_{pooled} = \sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_0}\right)}$ where $\hat{p} = \frac{X_1 + X_0}{n_1 + n_0}$
  • Absolute CI (Newcombe-style): Uses the difference of Wilson score intervals:
    $CI = [L_1 - U_0, \; U_1 - L_0]$
    where $[L_i, U_i]$ are the Wilson score bounds for proportion $p_i$.
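
The following is a hedged sketch of these steps, not the actual `conversion_ztest` implementation. The function names, the dictionary keys, and the example counts are assumptions. The z statistic and p-value use the usual pooled two-proportion form, which the text above implies but does not state. The interval follows the bounds exactly as written, $[L_1 - U_0, U_1 - L_0]$; Newcombe's (1998) hybrid score interval combines the same Wilson bounds by square-and-add instead, so this simpler form is generally somewhat wider.

```python
import math
from statistics import NormalDist


def wilson_interval(x, n, alpha=0.05):
    """Wilson score interval for a single binomial proportion x / n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = x / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def conversion_ztest_sketch(x1, n1, x0, n0, alpha=0.05):
    """Pooled z-test plus the difference-of-Wilson-bounds interval."""
    p1, p0 = x1 / n1, x0 / n0
    p_pool = (x1 + x0) / (n1 + n0)
    se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n0))
    if se_pooled == 0.0:
        raise ValueError("pooled rate is 0 or 1; the z statistic is undefined")
    z = (p1 - p0) / se_pooled  # assumption: standard pooled z statistic
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    l1, u1 = wilson_interval(x1, n1, alpha)
    l0, u0 = wilson_interval(x0, n0, alpha)
    return {"diff": p1 - p0, "z": z, "p_value": p_value,
            "ci": (l1 - u0, u1 - l0)}


# Hypothetical counts: 120/1000 conversions in treatment, 100/1000 in control.
result = conversion_ztest_sketch(120, 1000, 100, 1000)
print(result)
```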

Bootstrap (bootstrap)
Non-parametric estimation by resampling data with replacement $B$ times.

  • Absolute CI: Percentile-based interval $\left[\hat{\tau}^*_{[\alpha/2]}, \; \hat{\tau}^*_{[1-\alpha/2]}\right]$.
  • P-value: Normal approximation using the bootstrap standard error $s_{boot}$:
    $Z = \frac{\hat{\tau}}{s_{boot}}, \quad p = 2 \cdot (1 - \Phi(|Z|))$
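
A minimal sketch of the bootstrap procedure follows, under two assumptions the text does not settle: each arm is resampled separately, and the percentile bounds are read off as simple order statistics of the sorted draws. The function name `bootstrap_ate`, its parameters, and the sample data are illustrative.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def bootstrap_ate(treated, control, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI and normal-approximation p-value for the mean difference."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    tau_hat = fmean(treated) - fmean(control)
    draws = []
    for _ in range(n_boot):
        # Assumption: resample within each arm, with replacement.
        t_star = rng.choices(treated, k=len(treated))
        c_star = rng.choices(control, k=len(control))
        draws.append(fmean(t_star) - fmean(c_star))
    draws.sort()
    # Simple order-statistic rule for the alpha/2 and 1 - alpha/2 percentiles.
    lo = draws[math.floor((alpha / 2) * (n_boot - 1))]
    hi = draws[math.ceil((1 - alpha / 2) * (n_boot - 1))]
    s_boot = stdev(draws)  # bootstrap standard error
    if s_boot == 0.0:
        raise ValueError("bootstrap draws are constant; Z is undefined")
    z = tau_hat / s_boot
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {"tau_hat": tau_hat, "ci": (lo, hi),
            "s_boot": s_boot, "p_value": p_value}


# Hypothetical samples.
treated = [2.0, 4.0, 6.0, 8.0, 5.0, 7.0]
control = [1.0, 2.0, 3.0, 4.0, 2.0, 3.0]
boot = bootstrap_ate(treated, control)
print(boot)
```

The p-value here rests on the normal approximation, so it need not agree exactly with whether the percentile interval covers zero.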

Relative Lift and Delta Method

The relative lift is calculated as:
$$\text{Lift}(\%) = 100 \cdot \left(\frac{\bar{Y}_1}{\bar{Y}_0} - 1\right)$$
Its variance is estimated via the Delta Method:
$$\operatorname{Var}(\text{Lift}/100) \approx \frac{1}{\bar{Y}_0^2} \operatorname{Var}(\bar{Y}_1) + \frac{\bar{Y}_1^2}{\bar{Y}_0^4} \operatorname{Var}(\bar{Y}_0)$$
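
The lift and its delta-method standard error can be sketched as below. This assumes $Var(\bar{Y}_i)$ is estimated by $s_i^2 / n_i$ and that the two groups are independent, so no covariance term appears. The function name `relative_lift` and the `eps` threshold for the near-zero control mean guard are hypothetical choices.

```python
import math
from statistics import fmean, variance


def relative_lift(treated, control, eps=1e-12):
    """Return (lift_pct, se_pct): relative lift and its delta-method SE, in %."""
    m1, m0 = fmean(treated), fmean(control)
    if abs(m0) < eps:
        # The ratio is unstable when the control mean is near zero.
        raise ValueError("control mean is too close to zero for a ratio")
    var_m1 = variance(treated) / len(treated)  # estimate of Var(Ybar_1)
    var_m0 = variance(control) / len(control)  # estimate of Var(Ybar_0)
    lift = m1 / m0 - 1
    var_lift = var_m1 / m0 ** 2 + (m1 ** 2 / m0 ** 4) * var_m0
    return 100 * lift, 100 * math.sqrt(var_lift)


# Hypothetical samples: treated mean 5.0, control mean 2.5.
treated = [2.0, 4.0, 6.0, 8.0]
control = [1.0, 2.0, 3.0, 4.0]
lift_pct, se_pct = relative_lift(treated, control)
print(lift_pct, se_pct)
```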


Pseudo-code

Model Wrapper

Welch's T-Test Inference

Conversion Z-Test

Bootstrap Inference

References

Estimand / difference-in-means ATE under random assignment

  • Neyman (1923; English translation 1990) — potential outcomes framework for randomized experiments; difference-in-means and its sampling properties under randomization. (ics.uci.edu)

  • Rubin (1974) — formalizes causal effects via potential outcomes for randomized (and nonrandomized) studies; motivates ATE as an estimand. (Ovid)

  • Freedman (2008) — uses the Neyman/Rubin potential-outcomes model to discuss inference and common adjustments in experiments (helpful for “design-based” framing around diff-in-means). (causal.unc.edu)

  • Imbens & Rubin (2015) — book, but extremely standard citation for ATE estimands + sampling variances for average causal effects. (Cambridge University Press & Assessment)

Welch’s t-test + Satterthwaite df

  • Student (1908) — original t distribution and small-sample inference motivation. (JSTOR)

  • Welch (1947) — the unequal-variance two-sample t-test (the modern “Welch’s t-test”). DOI: 10.1093/biomet/34.1-2.28. (OUP Academic)

  • Satterthwaite (1946) — effective degrees of freedom approximation (the “Welch–Satterthwaite” df you wrote). (JSTOR)

Conversion z-test + “Newcombe/Wilson-style” absolute CI for two proportions

Wilson score interval (single proportion):

  • Wilson (1927) — the score interval for a binomial proportion (better coverage than Wald). (JSTOR)

Newcombe’s CI for difference of independent proportions (the one you’re implementing as “Newcombe-style / Wilson score bounds then subtract”):

  • Newcombe (1998) — “Interval estimation for the difference between independent proportions: comparison of eleven methods.” This is the go-to citation for the Wilson-score-based risk-difference CI family. DOI: 10.1002/(SICI)1097-0258(19980430)17:8<873::AID-SIM779>3.0.CO;2-I. (PubMed)

(If you also want a citation for Wilson/score intervals generally outperforming “exact”/Wald-style intervals in practice:)

  • Agresti & Coull (1998) — “Approximate is Better than ‘Exact’…” (single-proportion intervals, but commonly cited in the same discussion). (math.unm.edu)

  • Agresti & Caffo (2000) — simple adjusted intervals for proportions and differences (alternative to Newcombe; useful “related work” citation). (Statistics)

Bootstrap percentile CI + bootstrap SE / normal-approx p-value

Foundational bootstrap + CI methodology:

  • Efron (1979) — original bootstrap paper; standard citation for nonparametric bootstrap resampling logic. DOI: 10.1214/aos/1176344552. (Project Euclid)

  • Efron (1987) — improved bootstrap CIs (BC/BCa ideas; useful if you later add BCa). DOI: 10.1080/01621459.1987.10478410. (Taylor & Francis Online)

  • DiCiccio & Efron (1996) — survey of bootstrap confidence intervals (nice umbrella reference). DOI: 10.1214/ss/1032280214. (Project Euclid)

  • Hall (1992) — theory/Edgeworth accuracy; useful if you want to justify why percentile vs studentized/BCa differ in coverage. (liu.w.waseda.jp)

(If you want a single “practitioner-friendly” bootstrap reference for SE/percentiles, you can also cite Efron & Tibshirani’s book—common but a book, not a paper.) (Amazon)

Relative lift (ratio) + Delta method variance

  • Oehlert (1992) — clean, standard citation for delta method approximations. DOI: 10.1080/00031305.1992.10475842. (Taylor & Francis Online)

And since your pseudo-code explicitly guards when the control mean is near zero (ratio instability), it’s also common to cite ratio-CI alternatives:

  • Fieller (1954) — Fieller-type confidence intervals for ratios (classic reference when denominators can be near 0). (JSTOR)