0. Introduction to Causal Inference

Why do we explore the world through data?

The world is complex. Numerous factors can impact your metrics. To make decisions correctly, we should use modern causal inference techniques.

$$\Delta \mathrm{VoD} = \mathbb{E}\!\left[V(\text{decision}\mid \text{use trustworthy CI})\right] - \mathbb{E}\!\left[V(\text{decision}\mid \text{no trustworthy CI})\right] - Cost_{\text{CI}}.$$

  • $\Delta \mathrm{VoD}$: the change (delta) in the value of decisions
  • $Cost_{\text{CI}}$: the cost of implementing causal inference (CI)

To capture this $\Delta \mathrm{VoD}$, large companies such as Google, Meta, and Microsoft invest hundreds of millions of dollars in data infrastructure for decision making. Their growth owes much to the competitive advantage of their A/B testing platforms.

What is causality?

There are many ways to explain what causality is. We will explain it from a statistical viewpoint. Data scientists and machine learning engineers usually reason about causality within the Rubin Causal Model (RCM), also called the potential outcomes framework. This framework defines causal effects by comparing what would happen to the same unit under different treatment conditions:

  • $Y(1)$: what would happen with the treatment
  • $Y(0)$: what would happen without the treatment

We only ever observe one of them. This “missing counterfactual” is the fundamental challenge of causal inference. Everything we do is about approximating the missing outcome in a principled way.

Units

Clients, patients, companies, cities, clusters: any of these can be your units.

  • For each unit you need measured data: outcomes, treatment, and pre-treatment confounders.
  • The sample you study should be representative of the population your decision applies to.

Treatment

A treatment is a potential intervention or change in exposure that could affect unit behavior: a new onboarding flow, a price change, an ad campaign, a policy, a feature rollout, an eligibility rule, or a recommendation model update.

  • $D_i = 1$ if unit $i$ received treatment
  • $D_i = 0$ if unit $i$ did not receive treatment

Outcome

An outcome is the metric you want to causally affect and measure: conversion, revenue, retention, churn, NPS, cost, latency, clinical response, or any operational KPI. We denote the observed outcome by $Y_i$.

Good outcomes for causal work typically have:

  • A clear measurement window (e.g., “7-day revenue after exposure”).
  • Stable definition across time and groups (no metric drift).

Confounders

Confounders are pre-treatment variables that influence both treatment assignment and the outcome. We denote them by $X_i$. Confounding is exactly what breaks naive comparisons: treated and untreated units differ systematically, so outcome differences may reflect selection rather than impact.

Examples:

  • Users with higher intent are more likely to see a feature and also more likely to convert.
  • Wealthier regions are more likely to adopt a policy and also have different baseline outcomes.
  • High-value customers are targeted by promotions and also spend more regardless.

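A critical rule: include only pre-treatment variables in $X_i$. Variables measured after treatment can themselves be affected by it, so adjusting for them can bias the estimate.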

Notation

  • Units: $i=1,\dots,n$
  • Treatment: $D_i\in\{0,1\}$
  • Outcome: $Y_i$ (observed outcome)
  • Potential outcomes: $Y_i(1), Y_i(0)$
  • Confounders: $X_i$

Observed data relates to potential outcomes via:

$$Y_i = D_i Y_i(1) + (1-D_i) Y_i(0).$$
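The relation above can be sketched in a few lines of Python. The function name `observed_outcome` and the two potential-outcome values are assumptions made purely for illustration; in real data only one of the two potential outcomes is ever available for a unit.

```python
def observed_outcome(d: int, y1: float, y0: float) -> float:
    """Switching equation: Y = D * Y(1) + (1 - D) * Y(0)."""
    if d not in (0, 1):
        raise ValueError("treatment indicator must be 0 or 1")
    return d * y1 + (1 - d) * y0


# Hypothetical unit whose two potential outcomes we pretend to know.
y1, y0 = 12.0, 7.0

print(observed_outcome(1, y1, y0))  # 12.0 -> treated: we see Y(1), Y(0) stays hidden
print(observed_outcome(0, y1, y0))  # 7.0  -> untreated: we see Y(0), Y(1) stays hidden
```

Whichever value of `d` is realized, the other potential outcome never enters the observed data, which is why it has to be imputed.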

Estimands

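| $i$ | Segment $X$ | Group $G$ | Treatment $D$ | Outcome $Y$ |
| --- | --- | --- | --- | --- |
| 1 | New | 0 | 1 | 12 |
| 2 | New | 1 | 1 | 10 |
| 3 | New | 0 | 0 | 7 |
| 4 | New | 1 | 0 | 6 |
| 5 | Returning | 0 | 1 | 9 |
| 6 | Returning | 1 | 1 | 11 |
| 7 | Returning | 0 | 0 | 8 |
| 8 | Returning | 1 | 0 | 7 |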

ITE — Individual Treatment Effect

Definition

$$\tau_i = Y_i(1) - Y_i(0).$$

Key point: for each unit, we observe only one of $Y_i(1)$ or $Y_i(0)$.

| Unit $i$ | $D_i$ | Observed $Y_i$ | Observed potential outcome | Missing potential outcome |
| --- | --- | --- | --- | --- |
| 1 | 1 | 12 | $Y_1(1)=12$ | $Y_1(0)=\ ?$ |

So the ITE for unit 1 would be:

$$\tau_1 = 12 - Y_1(0),$$

but $Y_1(0)$ is not observed. Any ITE estimate requires a model or assumption to impute the missing counterfactual.


CATE — Conditional Average Treatment Effect

Definition

$$\tau(x) = \mathbb{E}[Y(1)-Y(0)\mid X=x].$$

Table split by segment $X$

Segment: New

| $i$ | $D$ | $Y$ |
| --- | --- | --- |
| 1 | 1 | 12 |
| 2 | 1 | 10 |
| 3 | 0 | 7 |
| 4 | 0 | 6 |

Compute (difference-in-means within segment):

  • Treated mean: $(12+10)/2 = 11$
  • Control mean: $(7+6)/2 = 6.5$

$$\widehat{\tau}(\text{New}) = 11 - 6.5 = 4.5$$

Segment: Returning

| $i$ | $D$ | $Y$ |
| --- | --- | --- |
| 5 | 1 | 9 |
| 6 | 1 | 11 |
| 7 | 0 | 8 |
| 8 | 0 | 7 |

  • Treated mean: $(9+11)/2 = 10$
  • Control mean: $(8+7)/2 = 7.5$

$$\widehat{\tau}(\text{Returning}) = 10 - 7.5 = 2.5$$
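The within-segment arithmetic above can be reproduced with a short Python sketch. The rows are copied from the eight-unit table in the Estimands section; the names `ROWS`, `diff_in_means`, and `cate_hat` are illustrative choices, not part of any library.

```python
from collections import defaultdict

# (i, segment X, group G, treatment D, outcome Y), copied from the estimands table
ROWS = [
    (1, "New", 0, 1, 12),
    (2, "New", 1, 1, 10),
    (3, "New", 0, 0, 7),
    (4, "New", 1, 0, 6),
    (5, "Returning", 0, 1, 9),
    (6, "Returning", 1, 1, 11),
    (7, "Returning", 0, 0, 8),
    (8, "Returning", 1, 0, 7),
]


def diff_in_means(rows):
    """Mean outcome of treated rows minus mean outcome of control rows."""
    treated = [y for (_, _, _, d, y) in rows if d == 1]
    control = [y for (_, _, _, d, y) in rows if d == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


# Split the units by segment, then take the difference in means inside each one.
by_segment = defaultdict(list)
for row in ROWS:
    by_segment[row[1]].append(row)

cate_hat = {segment: diff_in_means(rows) for segment, rows in by_segment.items()}
print(cate_hat)  # {'New': 4.5, 'Returning': 2.5}
```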

GATE — Group Average Treatment Effect

Definition

$$\tau_G = \mathbb{E}[Y(1)-Y(0)\mid G=1].$$

Table split by group $G$

Group: $G=1$

| $i$ | $G$ | $D$ | $Y$ |
| --- | --- | --- | --- |
| 2 | 1 | 1 | 10 |
| 4 | 1 | 0 | 6 |
| 6 | 1 | 1 | 11 |
| 8 | 1 | 0 | 7 |

  • Treated mean (within $G=1$): $(10+11)/2 = 10.5$
  • Control mean (within $G=1$): $(6+7)/2 = 6.5$

$$\widehat{\tau}_{G=1} = 10.5 - 6.5 = 4.0$$

Group: $G=0$

| $i$ | $G$ | $D$ | $Y$ |
| --- | --- | --- | --- |
| 1 | 0 | 1 | 12 |
| 3 | 0 | 0 | 7 |
| 5 | 0 | 1 | 9 |
| 7 | 0 | 0 | 8 |

  • Treated mean (within $G=0$): $(12+9)/2 = 10.5$
  • Control mean (within $G=0$): $(7+8)/2 = 7.5$

$$\widehat{\tau}_{G=0} = 10.5 - 7.5 = 3.0$$
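As a quick check of the group-level numbers, here is a minimal Python sketch that filters the same eight units by $G$ and takes the difference in means. The helper name `gate` is an assumption for illustration only.

```python
# (i, group G, treatment D, outcome Y), copied from the estimands table
ROWS = [
    (1, 0, 1, 12),
    (2, 1, 1, 10),
    (3, 0, 0, 7),
    (4, 1, 0, 6),
    (5, 0, 1, 9),
    (6, 1, 1, 11),
    (7, 0, 0, 8),
    (8, 1, 0, 7),
]


def mean(values):
    return sum(values) / len(values)


def gate(rows, g):
    """Difference in means between treated and control units inside group g."""
    treated = [y for (_, grp, d, y) in rows if grp == g and d == 1]
    control = [y for (_, grp, d, y) in rows if grp == g and d == 0]
    return mean(treated) - mean(control)


for g in (1, 0):
    print(f"G={g}: {gate(ROWS, g)}")
# G=1: 4.0
# G=0: 3.0
```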

ATE — Average Treatment Effect (the most common estimand)

Definition

$$\mathrm{ATE} = \mathbb{E}[Y(1)-Y(0)].$$

Visualization: treated vs control overall

Treated ($D=1$): units 1, 2, 5, 6 → outcomes $12, 10, 9, 11$. Mean: $(12+10+9+11)/4 = 42/4 = 10.5$

Control ($D=0$): units 3, 4, 7, 8 → outcomes $7, 6, 8, 7$. Mean: $(7+6+8+7)/4 = 28/4 = 7.0$

$$\widehat{\mathrm{ATE}} = 10.5 - 7.0 = 3.5$$
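The overall difference in means can be sketched in Python as follows. As an extra check, the sketch also averages the within-segment differences, weighted by segment size. The two numbers agree in this toy table because each segment has the same share of treated units (2 of 4), which should not be expected in general. All names here are illustrative.

```python
# (segment X, treatment D, outcome Y), copied from the estimands table
ROWS = [
    ("New", 1, 12), ("New", 1, 10), ("New", 0, 7), ("New", 0, 6),
    ("Returning", 1, 9), ("Returning", 1, 11), ("Returning", 0, 8), ("Returning", 0, 7),
]


def mean(values):
    return sum(values) / len(values)


def diff_in_means(rows):
    """Mean outcome of treated rows minus mean outcome of control rows."""
    treated = [y for (_, d, y) in rows if d == 1]
    control = [y for (_, d, y) in rows if d == 0]
    return mean(treated) - mean(control)


# Overall treated-vs-control comparison.
ate_hat = diff_in_means(ROWS)

# Size-weighted average of the within-segment differences.
weighted = 0.0
for segment in sorted({seg for (seg, _, _) in ROWS}):
    segment_rows = [row for row in ROWS if row[0] == segment]
    weighted += (len(segment_rows) / len(ROWS)) * diff_in_means(segment_rows)

print(ate_hat, weighted)  # 3.5 3.5
```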

ATT — Average Treatment Effect on the Treated (also written ATTE)

Definition

$$\mathrm{ATT} = \mathbb{E}[Y(1)-Y(0)\mid D=1].$$

To visualize the ATT, we need a counterfactual $Y(0)$ for treated units. We will impute $Y(0)$ for each treated unit using a "black-box" model. In the table below, the imputed values coincide with the control mean of each unit's segment (6.5 for New, 7.5 for Returning).

| Treated unit $i$ | $D$ | Segment $X$ | Observed $Y_i$ | Imputed $\widehat{Y_i(0)}$ | Difference $Y_i - \widehat{Y_i(0)}$ |
| --- | --- | --- | --- | --- | --- |
| 1 | 1 | New | 12 | 6.5 | 5.5 |
| 2 | 1 | New | 10 | 6.5 | 3.5 |
| 5 | 1 | Returning | 9 | 7.5 | 1.5 |
| 6 | 1 | Returning | 11 | 7.5 | 3.5 |

Average over treated units:

$$\widehat{\mathrm{ATT}} = (5.5+3.5+1.5+3.5)/4 = 14/4 = 3.5$$

(Here it matches the ATE; that will not generally happen.)
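The imputation step can be sketched in Python. The text does not specify the black-box model, so this sketch swaps in a simple stand-in: the control mean of each treated unit's segment, which reproduces the imputed values 6.5 and 7.5 shown in the table. That choice, and every name below, is an assumption for illustration rather than the article's actual model.

```python
# (i, segment X, treatment D, outcome Y), copied from the estimands table
ROWS = [
    (1, "New", 1, 12), (2, "New", 1, 10), (3, "New", 0, 7), (4, "New", 0, 6),
    (5, "Returning", 1, 9), (6, "Returning", 1, 11),
    (7, "Returning", 0, 8), (8, "Returning", 0, 7),
]


def mean(values):
    return sum(values) / len(values)


# Stand-in for the "black-box" model: the control mean within each segment.
control_mean = {
    segment: mean([y for (_, seg, d, y) in ROWS if seg == segment and d == 0])
    for segment in {seg for (_, seg, _, _) in ROWS}
}

# For every treated unit: observed outcome minus imputed Y(0).
differences = [y - control_mean[seg] for (_, seg, d, y) in ROWS if d == 1]
att_hat = mean(differences)

print(differences)  # [5.5, 3.5, 1.5, 3.5]
print(att_hat)      # 3.5
```

Swapping `control_mean` for predictions from a fitted outcome model would leave the averaging step unchanged.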


LATE — Local Average Treatment Effect

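Definition

Let $Z\in\{0,1\}$ be an instrument (e.g., randomized encouragement), $D$ the actual treatment uptake, and $Y$ the outcome. Then:

$$\mathrm{LATE} = \frac{\mathbb{E}[Y\mid Z=1]-\mathbb{E}[Y\mid Z=0]}{\mathbb{E}[D\mid Z=1]-\mathbb{E}[D\mid Z=0]}.$$

Let's add an instrument. The table below is a separate illustrative dataset, so its treatment and outcome values differ from the earlier table.

| $i$ | Instrument $Z$ | Treatment $D$ | Outcome $Y$ |
| --- | --- | --- | --- |
| 1 | 1 | 1 | 10 |
| 2 | 1 | 1 | 11 |
| 3 | 1 | 0 | 7 |
| 4 | 1 | 1 | 12 |
| 5 | 0 | 0 | 7 |
| 6 | 0 | 1 | 9 |
| 7 | 0 | 0 | 6 |
| 8 | 0 | 0 | 8 |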

Step-by-step calculation

First stage (effect of $Z$ on $D$):

  • $\mathbb{E}[D\mid Z=1] = (1+1+0+1)/4 = 3/4 = 0.75$
  • $\mathbb{E}[D\mid Z=0] = (0+1+0+0)/4 = 1/4 = 0.25$
  • Difference: $0.75 - 0.25 = 0.50$

Reduced form (effect of $Z$ on $Y$):

  • $\mathbb{E}[Y\mid Z=1] = (10+11+7+12)/4 = 40/4 = 10$
  • $\mathbb{E}[Y\mid Z=0] = (7+9+6+8)/4 = 30/4 = 7.5$
  • Difference: $10 - 7.5 = 2.5$

LATE:

$$\widehat{\mathrm{LATE}} = \frac{2.5}{0.5} = 5.0$$

Interpretation: the estimated causal effect applies to compliers—units whose treatment status is changed by the instrument.
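The ratio above (reduced form divided by first stage) can be sketched in Python using the instrument table. The function names `mean_by_z` and `late_estimate` are hypothetical, and the guard against a zero first stage is an added safety check, not something the text prescribes.

```python
# (instrument Z, treatment D, outcome Y), copied from the instrument table
ROWS = [
    (1, 1, 10), (1, 1, 11), (1, 0, 7), (1, 1, 12),
    (0, 0, 7), (0, 1, 9), (0, 0, 6), (0, 0, 8),
]


def mean_by_z(rows, z, column):
    """Mean of column 1 (D) or column 2 (Y) among rows with instrument value z."""
    values = [row[column] for row in rows if row[0] == z]
    return sum(values) / len(values)


def late_estimate(rows):
    # First stage: how much the instrument shifts treatment uptake.
    first_stage = mean_by_z(rows, 1, 1) - mean_by_z(rows, 0, 1)
    # Reduced form: how much the instrument shifts the outcome.
    reduced_form = mean_by_z(rows, 1, 2) - mean_by_z(rows, 0, 2)
    if first_stage == 0:
        raise ValueError("instrument does not change uptake; the ratio is undefined")
    return first_stage, reduced_form, reduced_form / first_stage


first_stage, reduced_form, late_hat = late_estimate(ROWS)
print(first_stage, reduced_form, late_hat)  # 0.5 2.5 5.0
```

A first stage close to zero makes the ratio unstable, which is one reason to inspect it before reading the final number.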