0. Introduction to Causal Inference

Why do we explore the world through data?

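The world is complex: many factors can move your metrics at the same time. To make good decisions, we need to separate the effect of our own actions from everything else, and that is what modern causal inference techniques are for. The net value they add to decisions can be written as: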

\Delta \mathrm{VoD} = \mathbb{E}\!\left[V(\text{decision}\mid \text{use trustworthy CI})\right] - \mathbb{E}\!\left[V(\text{decision}\mid \text{no trustworthy CI})\right] - Cost_{\text{CI}}.

$\Delta \mathrm{VoD}$ - Delta Value of Decisions

$Cost_{\text{CI}}$ - costs of implementing causal inference
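As a minimal sketch of the arithmetic behind the formula above, the snippet below evaluates $\Delta \mathrm{VoD}$ for made-up inputs. The function name `delta_vod` and all three figures are hypothetical and chosen only for illustration; they are not estimates for any real company.

```python
def delta_vod(value_with_ci: float, value_without_ci: float, cost_ci: float) -> float:
    """Delta VoD = E[V | trustworthy CI] - E[V | no trustworthy CI] - Cost_CI."""
    return value_with_ci - value_without_ci - cost_ci


# Hypothetical figures (in thousands of dollars), for illustration only.
example = delta_vod(value_with_ci=1200, value_without_ci=900, cost_ci=100)
print(example)  # 200
```

A positive result means the better decisions more than pay for the cost of the causal inference tooling.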

To capture this $\Delta \mathrm{VoD}$, large companies such as Google, Meta, and Microsoft invest hundreds of millions of dollars in data infrastructure for decision-making. Their A/B testing platforms are widely regarded as a major competitive advantage behind their growth.

What is causality?

There are many ways to explain what causality is. We will explain it from the viewpoint of statistics. Data scientists and machine learning engineers usually reason about causality within the Rubin Causal Model (RCM), also called the potential outcomes framework. It defines causal effects by comparing what would happen to the same unit under different treatment conditions.

$Y(1)$ : what would happen with the treatment
$Y(0)$ : what would happen without the treatment

We only ever observe one of them. This “missing counterfactual” is the fundamental challenge of causal inference. Everything we do is about approximating the missing outcome in a principled way.

Units

Clients, patients, companies, cities, clusters, all of them can be your units.

For each unit we record measured data: the outcome, the treatment, and pre-treatment confounders.
The sample you study should be representative of the population your decision will apply to; otherwise the estimated effect may not carry over.

Treatment

A treatment is a potential intervention or change in exposure that could affect unit behavior: a new onboarding flow, a price change, an ad campaign, a policy, a feature rollout, an eligibility rule, or a recommendation model update.

$D_i = 1$ if unit $i$ received treatment
$D_i = 0$ if unit $i$ did not receive treatment

Outcome

An outcome is the metric you want to causally affect and measure: conversion, revenue, retention, churn, NPS, cost, latency, clinical response, or any operational KPI. We denote the observed outcome by $Y_i$.

Good outcomes for causal work typically have:

A clear measurement window (e.g., “7-day revenue after exposure”).
Stable definition across time and groups (no metric drift).

Confounders

Confounders are pre-treatment variables that influence both treatment assignment and the outcome. We denote them by $X_i$. Confounding is exactly what breaks naive comparisons: treated and untreated units differ systematically, so outcome differences may reflect selection rather than impact.

Examples:

Users with higher intent are more likely to see a feature and also more likely to convert.
Wealthier regions are more likely to adopt a policy and also have different baseline outcomes.
High-value customers are targeted by promotions and also spend more regardless.

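A critical rule: only include pre-treatment variables in $X_i$. Variables measured after treatment can themselves be affected by it, and adjusting for them can bias the estimate.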
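To make the first example concrete, here is a small simulation sketch of confounding. Everything in it is an assumption made for illustration: the data-generating process, the probabilities 0.8 and 0.2, the true effect of 1.0, and the helper `diff_in_means`. High-intent users are both more likely to be treated and more likely to have a high outcome, so the naive comparison overstates the effect, while comparing within levels of the confounder recovers it.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is reproducible
TRUE_EFFECT = 1.0       # hypothetical true effect of the treatment

rows = []
for _ in range(20_000):
    x = 1 if rng.random() < 0.5 else 0   # confounder: high intent (1) or not (0)
    p_treat = 0.8 if x == 1 else 0.2     # high-intent users are treated more often
    d = 1 if rng.random() < p_treat else 0
    y = 2.0 * x + TRUE_EFFECT * d + rng.gauss(0, 1)  # intent also raises the outcome
    rows.append((x, d, y))


def diff_in_means(data):
    """Treated mean minus control mean of the outcome."""
    treated = [y for _, d, y in data if d == 1]
    control = [y for _, d, y in data if d == 0]
    return mean(treated) - mean(control)


naive = diff_in_means(rows)

# Adjust for X: estimate the effect within each stratum, then weight by stratum size.
adjusted = 0.0
for x in (0, 1):
    stratum = [r for r in rows if r[0] == x]
    adjusted += diff_in_means(stratum) * len(stratum) / len(rows)

print(f"naive={naive:.2f}, adjusted={adjusted:.2f}, truth={TRUE_EFFECT:.2f}")
```

Under this setup the naive difference is about 2.2 in expectation (the true effect plus the intent gap between groups), whereas the stratified estimate should land close to 1.0.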

Notation

Units: $i=1,\dots,n$
Treatment: $D_i\in\{0,1\}$
Outcome: $Y_i$ (observed outcome)
Potential outcomes: $Y_i(1), Y_i(0)$
Confounders: $X_i$

Observed data relates to potential outcomes via:

Y_i = D_i Y_i(1) + (1-D_i) Y_i(0).
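The relation above can be sketched in a few lines of Python. The three units and both of their potential outcomes are hypothetical; in real data we would never have both columns, which is exactly the point of the `missing` list.

```python
# Hypothetical potential outcomes for three units; real data never reveals both.
units = [
    {"i": 1, "D": 1, "Y1": 12, "Y0": 7},
    {"i": 2, "D": 0, "Y1": 10, "Y0": 6},
    {"i": 3, "D": 1, "Y1": 9, "Y0": 8},
]


def observed_outcome(d: int, y1: float, y0: float) -> float:
    """Switching equation: Y = D * Y(1) + (1 - D) * Y(0)."""
    return d * y1 + (1 - d) * y0


observed = [observed_outcome(u["D"], u["Y1"], u["Y0"]) for u in units]
# The potential outcome that stays unobserved for each unit.
missing = ["Y0" if u["D"] == 1 else "Y1" for u in units]

print(observed)  # [12, 6, 9]
print(missing)   # ['Y0', 'Y1', 'Y0']
```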

Estimands

i	Segment $X$	Group $G$	Treatment $D$	Outcome $Y$
1	New	0	1	12
2	New	1	1	10
3	New	0	0	7
4	New	1	0	6
5	Returning	0	1	9
6	Returning	1	1	11
7	Returning	0	0	8
8	Returning	1	0	7

ITE — Individual Treatment Effect

Definition

\tau_i = Y_i(1) - Y_i(0).

Key point: for each unit, we observe only one of $Y_i(1)$ or $Y_i(0)$.

Unit $i$	$D_i$	Observed $Y_i$	Observed potential outcome	Missing potential outcome
1	1	12	$Y_1(1)=12$	$Y_1(0)=\ ?$

So the ITE for unit 1 would be:

\tau_1 = 12 - Y_1(0),

but $Y_1(0)$ is not observed. Any ITE estimate requires a model/assumption to impute the missing counterfactual.

CATE — Conditional Average Treatment Effect

Definition

\tau(x) = \mathbb{E}[Y(1)-Y(0)\mid X=x].

Table split by segment $X$

Segment: New

i	$D$	$Y$
1	1	12
2	1	10
3	0	7
4	0	6

Compute (difference-in-means within segment):

Treated mean: $(12+10)/2 = 11$
Control mean: $(7+6)/2 = 6.5$

\widehat{\tau}(\text{New}) = 11 - 6.5 = 4.5

Segment: Returning

i	$D$	$Y$
5	1	9
6	1	11
7	0	8
8	0	7

Treated mean: $(9+11)/2 = 10$
Control mean: $(8+7)/2 = 7.5$

\widehat{\tau}(\text{Returning}) = 10 - 7.5 = 2.5
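The two segment estimates can be reproduced with a short sketch. The tuple layout and the helper name `diff_in_means` are illustrative choices, and the difference in means is a reasonable CATE estimate only if treatment is as good as randomly assigned within each segment, which this toy example assumes.

```python
from statistics import mean

# Toy dataset from the Estimands table: (i, segment X, group G, treatment D, outcome Y).
DATA = [
    (1, "New", 0, 1, 12), (2, "New", 1, 1, 10), (3, "New", 0, 0, 7), (4, "New", 1, 0, 6),
    (5, "Returning", 0, 1, 9), (6, "Returning", 1, 1, 11),
    (7, "Returning", 0, 0, 8), (8, "Returning", 1, 0, 7),
]


def diff_in_means(rows):
    """Treated mean minus control mean of Y (D and Y are the last two fields)."""
    treated = [y for *_, d, y in rows if d == 1]
    control = [y for *_, d, y in rows if d == 0]
    return mean(treated) - mean(control)


cate_hat = {
    segment: diff_in_means([r for r in DATA if r[1] == segment])
    for segment in ("New", "Returning")
}
print(cate_hat)  # {'New': 4.5, 'Returning': 2.5}
```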

GATE — Group Average Treatment Effect

Definition

\tau_{G=g} = \mathbb{E}[Y(1)-Y(0)\mid G=g], \quad g\in\{0,1\}.

Table split by group $G$

Group: $G=1$

i	$G$	$D$	$Y$
2	1	1	10
4	1	0	6
6	1	1	11
8	1	0	7

Treated mean (within $G=1$): $(10+11)/2 = 10.5$
Control mean (within $G=1$): $(6+7)/2 = 6.5$

\widehat{\tau}_{G=1} = 10.5 - 6.5 = 4.0

Group: $G=0$

i	$G$	$D$	$Y$
1	0	1	12
3	0	0	7
5	0	1	9
7	0	0	8

Treated mean (within $G=0$): $(12+9)/2 = 10.5$
Control mean (within $G=0$): $(7+8)/2 = 7.5$

\widehat{\tau}_{G=0} = 10.5 - 7.5 = 3.0
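A short sketch of the group calculations follows, together with a check that may help intuition: because each group here has the same number of treated and control units, the size-weighted average of the two group estimates equals the overall treated-minus-control difference. The variable names (`gate_hat`, `share`, `weighted`, `overall`) are illustrative, and this equality is a property of this balanced toy table rather than a general rule.

```python
from statistics import mean

# (group G, treatment D, outcome Y) for units 1..8 of the toy table.
ROWS = [(0, 1, 12), (1, 1, 10), (0, 0, 7), (1, 0, 6),
        (0, 1, 9), (1, 1, 11), (0, 0, 8), (1, 0, 7)]


def diff_in_means(rows):
    """Treated mean minus control mean of Y."""
    treated = [y for _, d, y in rows if d == 1]
    control = [y for _, d, y in rows if d == 0]
    return mean(treated) - mean(control)


# Effect estimate within each group, and each group's share of the sample.
gate_hat = {g: diff_in_means([r for r in ROWS if r[0] == g]) for g in (0, 1)}
share = {g: sum(1 for r in ROWS if r[0] == g) / len(ROWS) for g in (0, 1)}

weighted = sum(share[g] * gate_hat[g] for g in (0, 1))
overall = diff_in_means(ROWS)

print(gate_hat)           # {0: 3.0, 1: 4.0}
print(weighted, overall)  # 3.5 3.5
```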

ATE — Average Treatment Effect (the most common estimand)

Definition

\mathrm{ATE} = \mathbb{E}[Y(1)-Y(0)].

Visualization: treated vs control overall

Treated ($D=1$): units 1, 2, 5, 6 → outcomes $12, 10, 9, 11$. Mean: $(12+10+9+11)/4 = 42/4 = 10.5$

Control ($D=0$): units 3, 4, 7, 8 → outcomes $7, 6, 8, 7$. Mean: $(7+6+8+7)/4 = 28/4 = 7.0$

\widehat{\mathrm{ATE}} = 10.5 - 7.0 = 3.5

ATTE — Average Treatment Effect on the Treated

Definition

\mathrm{ATTE} = \mathbb{E}[Y(1)-Y(0)\mid D=1].

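To visualize the ATTE, we need a counterfactual $Y(0)$ for each treated unit. We impute it with a model that we treat as a black box; in this example the imputed value is simply the control mean of the unit's segment (6.5 for New, 7.5 for Returning).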

Treated unit $i$	$D$	Segment $X$	Observed $Y_i$	Imputed $\widehat{Y_i(0)}$	Difference $Y_i - \widehat{Y_i(0)}$
1	1	New	12	6.5	5.5
2	1	New	10	6.5	3.5
5	1	Returning	9	7.5	1.5
6	1	Returning	11	7.5	3.5

Average over treated units:

\widehat{\mathrm{ATTE}} = (5.5+3.5+1.5+3.5)/4 = 14/4 = 3.5

(Here it matches the ATE; that will not generally happen.)
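The imputation table can be reproduced with the sketch below. The stand-in imputation model, the control mean of each segment, is inferred from the imputed values in the table (6.5 and 7.5); any other predictive model for $Y(0)$ could be swapped in, and the names `control_mean` and `effects_on_treated` are illustrative.

```python
from statistics import mean

# (i, segment X, treatment D, outcome Y) from the toy table.
DATA = [(1, "New", 1, 12), (2, "New", 1, 10), (3, "New", 0, 7), (4, "New", 0, 6),
        (5, "Returning", 1, 9), (6, "Returning", 1, 11),
        (7, "Returning", 0, 8), (8, "Returning", 0, 7)]

# Stand-in imputation model: predict Y(0) by the control mean of the unit's segment.
control_mean = {
    seg: mean([y for _, s, d, y in DATA if s == seg and d == 0])
    for seg in ("New", "Returning")
}

# Observed Y minus imputed Y(0), for treated units only.
effects_on_treated = [y - control_mean[s] for _, s, d, y in DATA if d == 1]
atte_hat = mean(effects_on_treated)

print(control_mean)        # {'New': 6.5, 'Returning': 7.5}
print(effects_on_treated)  # [5.5, 3.5, 1.5, 3.5]
print(atte_hat)            # 3.5
```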

LATE — Local Average Treatment Effect

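Definition

Let $Z\in\{0,1\}$ be an instrument (e.g., randomized encouragement), $D$ the actual treatment uptake, and $Y$ the outcome. Under the standard instrumental-variable assumptions (relevance, independence, exclusion, and monotonicity), the LATE is identified by the Wald ratio: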

\mathrm{LATE} = \frac{\mathbb{E}[Y\mid Z=1]-\mathbb{E}[Y\mid Z=0]} {\mathbb{E}[D\mid Z=1]-\mathbb{E}[D\mid Z=0]}.

Let's add an instrument

i	Instrument $Z$	Treatment $D$	Outcome $Y$
1	1	1	10
2	1	1	11
3	1	0	7
4	1	1	12
5	0	0	7
6	0	1	9
7	0	0	6
8	0	0	8

Step-by-step calculation

First stage (effect of $Z$ on $D$):

$\mathbb{E}[D\mid Z=1] = (1+1+0+1)/4 = 3/4 = 0.75$
$\mathbb{E}[D\mid Z=0] = (0+1+0+0)/4 = 1/4 = 0.25$
Difference: $0.75 - 0.25 = 0.50$

Reduced form (effect of $Z$ on $Y$):

$\mathbb{E}[Y\mid Z=1] = (10+11+7+12)/4 = 40/4 = 10$
$\mathbb{E}[Y\mid Z=0] = (7+9+6+8)/4 = 30/4 = 7.5$
Difference: $10 - 7.5 = 2.5$

LATE:

\widehat{\mathrm{LATE}} = \frac{2.5}{0.5} = 5.0

Interpretation: the estimated causal effect applies to compliers—units whose treatment status is changed by the instrument.
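The step-by-step calculation can be written as a small function. This is a sketch of the Wald ratio on the instrument table only; the name `wald_late` and the tuple layout are illustrative, and the result has a causal reading only if the instrument assumptions actually hold.

```python
from statistics import mean

# (instrument Z, treatment D, outcome Y) for units 1..8 of the instrument table.
IV_DATA = [(1, 1, 10), (1, 1, 11), (1, 0, 7), (1, 1, 12),
           (0, 0, 7), (0, 1, 9), (0, 0, 6), (0, 0, 8)]


def wald_late(rows):
    """Return (first stage, reduced form, Wald ratio) for a binary instrument."""
    z1 = [r for r in rows if r[0] == 1]
    z0 = [r for r in rows if r[0] == 0]
    first_stage = mean([d for _, d, _ in z1]) - mean([d for _, d, _ in z0])
    reduced_form = mean([y for _, _, y in z1]) - mean([y for _, _, y in z0])
    if first_stage == 0:
        raise ValueError("Instrument does not move treatment uptake; ratio undefined.")
    return first_stage, reduced_form, reduced_form / first_stage


first_stage, reduced_form, late_hat = wald_late(IV_DATA)
print(first_stage, reduced_form, late_hat)  # 0.5 2.5 5.0
```

The guard on a zero first stage reflects the relevance requirement: if the instrument does not change uptake, there are no compliers to estimate an effect for.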
