
CUPEDModel


0) Assumptions

  • SUTVA / consistency
  • No interference: unit $i$'s outcome does not depend on other units' treatment assignments.

  • No hidden versions: “treatment” and “control” correspond to well-defined interventions.

  • Consistency: the observed outcome equals the potential outcome under the realized treatment: if $D_i=1$, then $Y_i=Y_i(1)$; if $D_i=0$, then $Y_i=Y_i(0)$.

  • Random assignment or unconfoundedness (ATE/ITT identification): in an RCT, treatment is independent of potential outcomes (and baseline covariates):

$$(Y_i(1),\,Y_i(0)) \perp D_i.$$

This is the key condition under which the estimand $\tau=\mathbb{E}[Y(1)-Y(0)]$ is identified as an average causal effect.

  • Overlap / positivity: Both arms occur with nonzero probability:
$$0<\Pr(D_i=1)<1,$$

(and within any randomization strata if stratified), ensuring the ATE/ITT is estimable from the observed design.

  • Regression is a working model (robust inference): The Lin specification is used as an adjustment; it need not be exactly correct. HC2 standard errors target valid large-sample inference under heteroskedasticity and potential misspecification of the conditional mean.

  • Finite second moments: Outcomes and regressors have finite second moments (so variances exist), which supports HC-type variance estimation and the delta-method approximation used later.

  • Design matrix regularity: The constructed design matrix

$$Z=[\mathbf{1},\ d,\ X^c,\ d\odot X^c]$$

has full column rank (no perfect multicollinearity), so $(Z^\top Z)^{-1}$ exists. In practice this also means you avoid near-duplicate covariates and handle zero-variance / constant columns.

  • Leverage not degenerate (HC2 well-defined): For HC2 you require $1-h_{ii}>0$ for all $i$ (true when $Z$ has full rank and no observation is perfectly fit), so the HC2 weights $\omega_i=\hat e_i^2/(1-h_{ii})$ are finite.

  • Relative-effect CI (delta method, "nocov"): For the relative CI you additionally assume a first-order Taylor approximation is adequate, that $\hat\mu_c$ is not near zero, and you ignore $\mathrm{Cov}(\hat\tau,\hat\mu_c)$ (your deliberate "nocov" rule).

1) Data + target estimand

You observe i.i.d. units $i=1,\dots,n$ with:

  • outcome $Y_i \in \mathbb{R}$ (post-period),
  • treatment $D_i \in \{0,1\}$,
  • pre-treatment covariates $X_i \in \mathbb{R}^p$ (chosen subset).

Target (ATE/ITT):

$$\tau = \mathbb{E}\!\left[Y_i(1)-Y_i(0)\right].$$

2) Global centering of covariates (full-sample centering)

Let $p^\star$ be the number of covariates you actually keep (after any variance / quality filtering). For each kept covariate $j=1,\dots,p^\star$, center over the entire sample:

$$X^c_{ij} := X_{ij}-\bar X_j, \qquad \bar X_j=\frac{1}{n}\sum_{i=1}^n X_{ij}.$$

In matrix form, with $\mathbf{1}\in\mathbb{R}^n$:

$$X^c = X-\mathbf{1}\bar X^\top, \qquad \bar X=\frac{1}{n}X^\top \mathbf{1}.$$

Key property (by construction):

$$\frac{1}{n}\sum_{i=1}^n X^c_{ij}=0 \quad \text{for all } j \qquad\Longleftrightarrow\qquad \mathbb{E}_n[X^c]=0.$$

Why this matters for interpretation (with interactions):

With global centering, the regression intercepts line up at the sample-mean covariate level. In particular, the "main" treatment coefficient becomes the difference in intercepts at $X=\bar X$, which is exactly what you want when you interpret $\hat\tau$ as the ATE/ITT under Lin's fully-interacted adjustment.

If you don’t center and estimate:

$$Y_i=\alpha+\tau D_i+X_i^\top\beta + D_i X_i^\top\gamma+\varepsilon_i,$$

then the interaction term vanishes at $X=0$, so

$$\tau = \text{treatment effect at } X=0,$$

which is often not a meaningful reference point (age $=0$, revenue $=0$, etc.). The regression can still be statistically valid, but $\tau$ no longer directly matches the "average-over-the-covariate-distribution" effect unless the covariates happen to have mean zero.
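As a concrete sketch, global centering is one line in numpy (the array `X`, its dimensions, and the distribution are hypothetical stand-ins for your kept pre-period covariates):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # hypothetical pre-period covariates

X_bar = X.mean(axis=0)   # column means over the *entire* sample (both arms pooled)
Xc = X - X_bar           # globally centered covariates

# Key property, by construction: every centered column has (numerically) zero mean.
assert np.allclose(Xc.mean(axis=0), 0.0)
```

Note the mean is taken over the pooled sample, not per arm, so the same reference point $\bar X$ applies to both groups.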


3) Build the Lin (2013) fully-interacted design matrix $Z$

Let

$$d=(D_1,\dots,D_n)^\top\in\mathbb{R}^n.$$

Define the elementwise (row-wise) interaction matrix $d\odot X^c\in\mathbb{R}^{n\times p^\star}$ by

$$(d\odot X^c)_{ij} = D_i X^c_{ij}.$$

Then the Lin (2013) fully-interacted design matrix is:

Z=[1dXc(dXc)]Rn×k,k=2+2p.Z = \begin{bmatrix} \mathbf{1} & d & X^c & (d\odot X^c) \end{bmatrix} \in\mathbb{R}^{n\times k}, \qquad k=2+2p^\star.

Partition the parameter vector as:

$$\theta= \begin{bmatrix} \alpha \\ \tau \\ \beta \\ \gamma \end{bmatrix}, \qquad \beta,\gamma\in\mathbb{R}^{p^\star}.$$

Row-wise model:

Yi=α+τDi+(Xic)β+Di(Xic)γ+εi.Y_i = \alpha +\tau D_i +(X_i^c)^\top\beta +D_i (X_i^c)^\top\gamma +\varepsilon_i.

4) Why the coefficient on $D$ is the ATE (with global centering)

The regression implies two group-specific conditional mean functions:

  • Control ($D=0$):
$$\mathbb{E}[Y\mid X^c, D=0]=\alpha+(X^c)^\top\beta.$$
  • Treated ($D=1$):
$$\mathbb{E}[Y\mid X^c, D=1]=(\alpha+\tau)+(X^c)^\top(\beta+\gamma).$$

So the conditional treatment effect as a function of covariates is:

$$\Delta(X^c) = \mathbb{E}[Y\mid X^c, D=1]-\mathbb{E}[Y\mid X^c, D=0] = \tau+(X^c)^\top\gamma.$$

Now average this conditional effect over the covariate distribution used by the estimator:

$$\mathbb{E}[\Delta(X^c)] = \tau+\mathbb{E}[X^c]^\top\gamma.$$

With global centering, $\mathbb{E}[X^c]=0$ (in sample: $\mathbb{E}_n[X^c]=0$), hence:

$$\mathbb{E}[\Delta(X^c)] = \tau.$$

So, under the globally-centered Lin specification:

  • $\tau$ is the average treatment effect over the (centered) covariate distribution (ATE/ITT),
  • $\gamma$ captures effect heterogeneity with respect to $X$.
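This identity can be checked numerically. Under a simulated DGP (all coefficients and the seed below are arbitrary), fitting the Lin regression and averaging the per-unit conditional effect $\hat\tau+(X_i^c)^\top\hat\gamma$ recovers $\hat\tau$ exactly, because the centered covariates have sample mean zero:

```python
import numpy as np

# Hypothetical DGP with genuine effect heterogeneity.
rng = np.random.default_rng(2)
n, p = 200, 2
d = rng.integers(0, 2, size=n).astype(float)
X = rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)                              # global centering
Y = (1.0 + 0.5 * d + Xc @ np.array([1.0, -2.0])
     + d * (Xc @ np.array([0.3, 0.1])) + rng.normal(size=n))

# Lin design and OLS fit
Z = np.column_stack([np.ones(n), d, Xc, d[:, None] * Xc])
theta, *_ = np.linalg.lstsq(Z, Y, rcond=None)
tau_hat, gamma_hat = theta[1], theta[2 + p:]

# Per-unit conditional effect Δ(X_i^c) = τ̂ + (X_i^c)'γ̂; its sample average is τ̂
delta_i = tau_hat + Xc @ gamma_hat
assert np.isclose(delta_i.mean(), tau_hat)
```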

5) OLS fit (point estimate)

Fit OLS on $Y, Z$:

$$\hat\theta = \arg\min_{\theta}\,\|Y-Z\theta\|_2^2 = (Z^\top Z)^{-1}Z^\top Y.$$

The reported ATE/ITT estimate is the coefficient on the treatment column:

$$\hat\tau = \hat\theta_{(\text{column of } d)}.$$

Residuals:

$$\hat e = Y-Z\hat\theta, \qquad \hat e_i = Y_i - Z_i^\top\hat\theta.$$

6) Robust covariance / standard error for $\hat\tau$ (HC2 only)

Let the OLS “bread” be:

$$B=(Z^\top Z)^{-1}.$$

Define the hat matrix and leverages:

$$H=Z(Z^\top Z)^{-1}Z^\top, \qquad h_{ii}=H_{ii}.$$

HC2 weights:

$$\omega_i=\frac{\hat e_i^2}{1-h_{ii}}.$$

Define the “meat”:

$$M = \sum_{i=1}^n \omega_i\, Z_i Z_i^\top = Z^\top \operatorname{diag}(\omega_1,\dots,\omega_n)\, Z.$$

Robust covariance:

$$\widehat{\mathrm{Var}}(\hat\theta) = BMB.$$

Extract the treatment-component variance and standard error:

$$\widehat{\mathrm{Var}}(\hat\tau) = \left[\widehat{\mathrm{Var}}(\hat\theta)\right]_{\tau\tau}, \qquad \widehat{\mathrm{SE}}(\hat\tau)=\sqrt{\widehat{\mathrm{Var}}(\hat\tau)}.$$

A generic test statistic is:

$$t=\frac{\hat\tau}{\widehat{\mathrm{SE}}(\hat\tau)}.$$

Absolute CI (using some critical value $c_\alpha$):

$$\mathrm{CI}_{\mathrm{abs}} = \hat\tau \pm c_\alpha\,\widehat{\mathrm{SE}}(\hat\tau).$$

If your implementation later "recovers" the effective $c_\alpha$ from the computed CI bounds, one way to express it is:

$$c_\alpha = \frac{\max\left\{\,|\mathrm{CI}_{hi}-\hat\tau|,\ |\hat\tau-\mathrm{CI}_{lo}|\,\right\}}{\widehat{\mathrm{SE}}(\hat\tau)}.$$
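Sections 5–6 fit together in a few lines of numpy. This is a minimal sketch of the HC2 sandwich, not a production estimator, and the heteroskedastic DGP below is hypothetical:

```python
import numpy as np

def hc2_cov(Z, e_hat):
    """HC2 sandwich: B (Z' diag(w) Z) B with w_i = e_i^2 / (1 - h_ii)."""
    B = np.linalg.inv(Z.T @ Z)                 # "bread"
    h = np.einsum('ij,jk,ik->i', Z, B, Z)      # leverages h_ii without forming H
    w = e_hat ** 2 / (1.0 - h)                 # HC2 weights (finite iff h_ii < 1)
    M = Z.T @ (w[:, None] * Z)                 # "meat"
    return B @ M @ B

# Hypothetical heteroskedastic DGP with one centered covariate.
rng = np.random.default_rng(3)
n = 120
d = rng.integers(0, 2, size=n).astype(float)
xc = rng.normal(size=n)
xc = xc - xc.mean()
Z = np.column_stack([np.ones(n), d, xc, d * xc])
Y = 2.0 + 0.4 * d + 0.8 * xc + (1.0 + d) * rng.normal(size=n)

theta, *_ = np.linalg.lstsq(Z, Y, rcond=None)
e_hat = Y - Z @ theta
V = hc2_cov(Z, e_hat)
se_tau = np.sqrt(V[1, 1])                      # SE of the coefficient on d
```

The `einsum` computes only the diagonal of $H$, avoiding the $n\times n$ hat matrix; this matches the textbook definition but scales to large $n$.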

7) Relative CI (delta method, "nocov")

Define the relative effect (percent) using the control mean:

$$\hat\mu_c=\frac{1}{n_c}\sum_{i:D_i=0}Y_i, \qquad \hat\tau_{\mathrm{rel}}=100\cdot\frac{\hat\tau}{\hat\mu_c}.$$

Let

$$g(\tau,\mu)=100\cdot\frac{\tau}{\mu}.$$

Gradient:

$$\frac{\partial g}{\partial \tau}=\frac{100}{\mu}, \qquad \frac{\partial g}{\partial \mu}=-100\cdot\frac{\tau}{\mu^2}.$$

You already have $\widehat{\mathrm{Var}}(\hat\tau)$ from HC2:

$$\widehat{\mathrm{Var}}(\hat\tau)=\left[\widehat{\mathrm{Var}}(\hat\theta)\right]_{\tau\tau}.$$

Estimate the variance of the control mean with the usual sample-mean formula:

$$\widehat{\mathrm{Var}}(\hat\mu_c)=\frac{s_c^2}{n_c}, \qquad s_c^2=\frac{1}{n_c-1}\sum_{i:D_i=0}(Y_i-\hat\mu_c)^2.$$

Delta method, ignoring the covariance by setting $\mathrm{Cov}(\hat\tau,\hat\mu_c)=0$ (the "nocov" rule):

$$\widehat{\mathrm{Var}}(\hat\tau_{\mathrm{rel}}) \approx \left(\frac{100}{\hat\mu_c}\right)^2\widehat{\mathrm{Var}}(\hat\tau) + \left(100\cdot\frac{\hat\tau}{\hat\mu_c^2}\right)^2\widehat{\mathrm{Var}}(\hat\mu_c).$$

So:

$$\widehat{\mathrm{SE}}(\hat\tau_{\mathrm{rel}}) = \sqrt{\widehat{\mathrm{Var}}(\hat\tau_{\mathrm{rel}})}.$$

Use the same critical value $c_\alpha$ as for the absolute CI:

$$\mathrm{CI}_{\mathrm{rel}} = \hat\tau_{\mathrm{rel}} \pm c_\alpha\,\widehat{\mathrm{SE}}(\hat\tau_{\mathrm{rel}}).$$
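A sketch of the "nocov" relative CI as a helper (the function name `relative_ci` and the example inputs are made up for illustration):

```python
import numpy as np

def relative_ci(tau_hat, var_tau, y_control, c_alpha=1.96):
    """Relative effect (%) and CI via the delta method, dropping Cov(tau_hat, mu_c)."""
    mu_c = y_control.mean()
    n_c = y_control.size
    var_mu = y_control.var(ddof=1) / n_c                 # Var-hat of the control mean
    tau_rel = 100.0 * tau_hat / mu_c
    var_rel = ((100.0 / mu_c) ** 2 * var_tau
               + (100.0 * tau_hat / mu_c ** 2) ** 2 * var_mu)
    se_rel = np.sqrt(var_rel)
    return tau_rel, (tau_rel - c_alpha * se_rel, tau_rel + c_alpha * se_rel)

# Hypothetical inputs: tau_hat and its HC2 variance from the regression step.
tau_rel, ci = relative_ci(tau_hat=0.4, var_tau=0.01,
                          y_control=np.array([2.1, 1.9, 2.0, 2.2, 1.8]))
```

The approximation degrades when `mu_c` is near zero (the Taylor expansion is around $\hat\mu_c$), which is why the assumptions section requires $\hat\mu_c$ bounded away from zero.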

8) Math pseudocode (only math)

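A math-only summary of the full pipeline, assembled from Sections 2–7 (a reconstructed sketch, not the original block):

```latex
\begin{align*}
&\textbf{Input: } Y\in\mathbb{R}^n,\ d\in\{0,1\}^n,\ X\in\mathbb{R}^{n\times p^\star},\ c_\alpha\\
&1.\ X^c \gets X - \mathbf{1}\bar X^\top\\
&2.\ Z \gets [\mathbf{1},\ d,\ X^c,\ d\odot X^c],\qquad B \gets (Z^\top Z)^{-1}\\
&3.\ \hat\theta \gets B Z^\top Y,\qquad \hat\tau \gets \hat\theta_{(\text{column of } d)}\\
&4.\ \hat e \gets Y - Z\hat\theta,\qquad h_{ii} \gets [ZBZ^\top]_{ii},\qquad
     \omega_i \gets \hat e_i^2/(1-h_{ii})\\
&5.\ \widehat{\mathrm{Var}}(\hat\theta) \gets B\,Z^\top\mathrm{diag}(\omega)Z\,B,\qquad
     \widehat{\mathrm{SE}}(\hat\tau) \gets \sqrt{[\widehat{\mathrm{Var}}(\hat\theta)]_{\tau\tau}}\\
&6.\ \mathrm{CI}_{\mathrm{abs}} \gets \hat\tau \pm c_\alpha\,\widehat{\mathrm{SE}}(\hat\tau)\\
&7.\ \hat\mu_c \gets \tfrac{1}{n_c}\textstyle\sum_{i:D_i=0} Y_i,\qquad
     \hat\tau_{\mathrm{rel}} \gets 100\,\hat\tau/\hat\mu_c\\
&8.\ \widehat{\mathrm{Var}}(\hat\tau_{\mathrm{rel}}) \gets
     (100/\hat\mu_c)^2\,\widehat{\mathrm{Var}}(\hat\tau)
     + (100\,\hat\tau/\hat\mu_c^2)^2\, s_c^2/n_c\\
&9.\ \mathrm{CI}_{\mathrm{rel}} \gets \hat\tau_{\mathrm{rel}} \pm
     c_\alpha\sqrt{\widehat{\mathrm{Var}}(\hat\tau_{\mathrm{rel}})}
\end{align*}
```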

References

1) Data + target estimand (potential outcomes, ATE/ITT)

  • Neyman (1923; English translation 1990) — foundational "Neyman model" for randomized experiments; potential outcomes framing and unbiased difference-in-means for average effects.

  • Rubin (1974) — formal potential outcomes / causal effects language for randomized and nonrandomized studies; ATE as a target estimand. (Journal of Educational Psychology; DOI: 10.1037/h0037350.)

  • Holland (1986) — classic "Statistics and Causal Inference"; clarifies potential outcomes notation and estimands like ATE.

  • Imbens (2004) — explicit discussion of average treatment effects as estimands (broader than RCTs, but standard for ATE notation/targets).


2) Global centering of covariates (full-sample centering) + why it matters with interactions

  • Lin (2013) — uses the fully-interacted regression adjustment (Lin estimator) and discusses centering (often stated "without loss of generality, center covariates") to interpret the main treatment coefficient as an average effect.

  • Brambor, Clark & Golder (2006) — clear explanation of interaction models: main effects are evaluated at the moderator equal to 0, and mean-centering shifts the reference point to the mean (interpretation changes, fitted values don't). Great citation for your "$\tau$ is the effect at $X=0$ if uncentered" statement.

  • Lei & Ding (2021) — explicitly notes centering covariates (wlog) in the Lin regression-adjustment setup and studies its properties.


3) Build the Lin (2013) fully-interacted design matrix $Z=[\mathbf{1},\ D,\ X^c,\ D\odot X^c]$

  • Lin (2013) — the canonical reference for the "fully interacted" OLS adjustment in experiments (treatment, covariates, and treatment×covariate interactions).

  • Lei & Ding (2021) — modern formalization/extension of Lin's regression adjustment; repeats the same design structure in a randomized-experiment framework.


4) Why the coefficient on $D$ is the ATE/ITT (with global centering + interactions)

  • Lin (2013) — main theoretical justification: with treatment×covariate interactions, the OLS coefficient on treatment targets the average treatment effect (and can't hurt asymptotic precision under the Neyman model).

  • Freedman (2008) — motivates why naive regression adjustment can misbehave and why one should be careful; this is exactly the critique Lin reexamines (useful for your "why we do the Lin spec" story).

  • Lei & Ding (2021) — provides additional asymptotic results/guarantees for the Lin adjustment under regimes with many covariates (supporting your "$\hat\tau$ is ATE/ITT under this spec" claim).


5) OLS fit (point estimate) for $\hat\tau$ under this regression-adjustment estimator

You typically don't need a separate "OLS paper" citation if you already cite the experimental regression-adjustment papers that define the estimator as OLS on that $Z$.

  • Lin (2013) — defines the estimator via OLS on $[\mathbf{1},\ D,\ X,\ D\cdot X]$.

  • Freedman (2008) — discusses regression adjustment in experiments (OLS adjustment) and what it is / isn't justified by randomization.


6) Robust covariance / standard error for $\hat\tau$ (HC2 only)

  • White (1980) — original heteroskedasticity-consistent "sandwich" covariance for OLS (foundation for HC estimators).

  • MacKinnon & White (1985) — introduces the HC family including the HC2 leverage adjustment $\hat e_i^2/(1-h_{ii})$; this is your key HC2 citation. (J. Econometrics; DOI: 10.1016/0304-4076(85)90158-7.)

  • Zeileis (2004) — widely cited computational/econometrics reference that summarizes HC estimators and their implementations (nice for "HC2 definition in practice").

  • Long & Ervin (2000) — practical discussion of using heteroskedasticity-consistent SEs and finite-sample considerations (optional, but commonly cited).


7) Relative CI via delta method (your “nocov” delta option)

  • Oehlert (1992) — short classic note reviewing the delta method and when it works well (perfect for citing your Taylor/gradient variance approximation).

  • Deng et al. (2018) — applied "metric analytics / online experiments" reference that explicitly uses the delta method for percent change / relative lift style estimands (very aligned with your $\hat\tau_{\mathrm{rel}}=100\,\hat\tau/\hat\mu_c$).
  • Deng et al. (2018) — applied “metric analytics / online experiments” reference that explicitly uses the delta method for percent change / relative lift style estimands (very aligned with your (\hat\tau_{rel}=100\hat\tau/\hat\mu_c)). (Alex Deng)