0. Introduction to Causal Inference
Why do we explore the world through data?
The world is complex. Numerous factors can impact your metrics. To make decisions correctly, we should use modern causal inference techniques.
- Delta value of decisions
- Costs of implementing causal inference
To gain this value, big companies such as Google, Meta, and Microsoft invest hundreds of millions of dollars in data infrastructure for decision making. Their growth is largely due to the competitive advantage of their A/B platforms.
What is causality?
There are many ways to explain what causality is. We will explain it from the viewpoint of statistics. Data scientists and machine learning engineers think about causality in terms of the Rubin Causal Model (RCM), also called the potential outcomes framework. It defines causal effects by comparing what would happen to the same unit under different treatment conditions. For each unit $i$ there are two potential outcomes:
- $Y_i(1)$: what would happen with the treatment
- $Y_i(0)$: what would happen without the treatment
We only ever observe one of them. This “missing counterfactual” is the fundamental challenge of causal inference. Everything we do is about approximating the missing outcome in a principled way.
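To make the missing counterfactual concrete, here is a minimal sketch with simulated units. All numbers are made up for illustration, and the names `observed_outcome`, `y0`, `y1`, `d`, and `y_obs` are our own labels, not part of any library. In a simulation we can write down both potential outcomes; in real data only one of them would ever be recorded.

```python
import random


def observed_outcome(d, y1, y0):
    """Return the one potential outcome that treatment status d reveals."""
    return d * y1 + (1 - d) * y0


random.seed(0)  # fixed seed, only so the sketch is reproducible

units = []
for i in range(1, 6):
    y0 = random.randint(5, 10)      # hypothetical outcome without treatment
    y1 = y0 + random.randint(0, 5)  # hypothetical outcome with treatment
    d = random.randint(0, 1)        # treatment actually received
    units.append({"i": i, "d": d, "y0": y0, "y1": y1,
                  "y_obs": observed_outcome(d, y1, y0)})

for u in units:
    # A real dataset contains only d and y_obs; the other outcome is missing.
    missing = "y0" if u["d"] == 1 else "y1"
    print(f"unit {u['i']}: d={u['d']}, observed={u['y_obs']}, missing={missing}")
```

Every estimator in the rest of this section can be read as a different way of filling in the column reported as missing.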
Units
Units are the entities you study: clients, patients, companies, cities, or clusters.
- For each unit we have measured data: outcome, treatment, and pre-treatment confounders.
- The research result on your sample should be representative of the population on which you are making your decision.
Treatment
A treatment is a potential intervention or change in exposure that could affect unit behavior: a new onboarding flow, a price change, an ad campaign, a policy, a feature rollout, an eligibility rule, or a recommendation model update.
- $D_i = 1$ if unit $i$ received the treatment
- $D_i = 0$ if unit $i$ did not receive the treatment
Outcome
An outcome is the metric you want to causally affect and measure: conversion, revenue, retention, churn, NPS, cost, latency, clinical response, or any operational KPI. We denote the observed outcome by $Y_i$.
Good outcomes for causal work typically have:
- A clear measurement window (e.g., “7-day revenue after exposure”).
- Stable definition across time and groups (no metric drift).
Confounders
Confounders are pre-treatment variables that influence both treatment assignment and the outcome. We denote them by $X_i$. Confounding is exactly what breaks naive comparisons: treated and untreated units differ systematically, so outcome differences may reflect selection rather than impact.
Examples:
- Users with higher intent are more likely to see a feature and also more likely to convert.
- Wealthier regions are more likely to adopt a policy and also have different baseline outcomes.
- High-value customers are targeted by promotions and also spend more regardless.
A critical rule: only include pre-treatment variables in $X_i$.
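The first example above (high-intent users) can be sketched as a small simulation. Everything here is an assumption made for illustration: a single binary confounder `x`, treatment probabilities of 0.8 and 0.2, a baseline shift of 4, and a true effect of 2. The sketch compares a naive difference in means with one that is computed within each level of the confounder and then averaged.

```python
import random
from statistics import mean

random.seed(42)
TRUE_EFFECT = 2.0  # assumed constant treatment effect

rows = []
for _ in range(20_000):
    x = 1 if random.random() < 0.5 else 0      # pre-treatment confounder
    p_treat = 0.8 if x == 1 else 0.2           # x drives treatment assignment
    d = 1 if random.random() < p_treat else 0
    y0 = 5.0 + 4.0 * x + random.gauss(0, 1)    # x also drives the outcome
    rows.append((x, d, y0 + TRUE_EFFECT * d))


def diff_in_means(data):
    """Mean outcome of treated rows minus mean outcome of control rows."""
    treated = [y for _, d, y in data if d == 1]
    control = [y for _, d, y in data if d == 0]
    return mean(treated) - mean(control)


naive = diff_in_means(rows)

# Compare like with like: estimate within each level of x, then weight
# each stratum by its share of the sample.
adjusted = 0.0
for level in (0, 1):
    stratum = [r for r in rows if r[0] == level]
    adjusted += len(stratum) / len(rows) * diff_in_means(stratum)

print(f"naive={naive:.2f}  adjusted={adjusted:.2f}  truth={TRUE_EFFECT}")
```

Under these assumptions the naive comparison lands well above the true effect (roughly 4.4 in expectation), because treated units are mostly high-intent; the stratified estimate should sit close to 2.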
Notation
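- Units: $i = 1, \dots, N$
- Treatment: $D_i \in \{0, 1\}$
- Outcome: $Y_i$ (observed outcome)
- Potential outcomes: $Y_i(1)$, $Y_i(0)$
- Confounders: $X_i$
Observed data relates to potential outcomes via:
$$Y_i = D_i \, Y_i(1) + (1 - D_i) \, Y_i(0)$$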
Estimands
| i | Segment | Group | Treatment | Outcome |
|---|---|---|---|---|
| 1 | New | 0 | 1 | 12 |
| 2 | New | 1 | 1 | 10 |
| 3 | New | 0 | 0 | 7 |
| 4 | New | 1 | 0 | 6 |
| 5 | Returning | 0 | 1 | 9 |
| 6 | Returning | 1 | 1 | 11 |
| 7 | Returning | 0 | 0 | 8 |
| 8 | Returning | 1 | 0 | 7 |
ITE — Individual Treatment Effect
Definition
$$\tau_i = Y_i(1) - Y_i(0)$$
Key point: for each unit, we observe only one of $Y_i(1)$ or $Y_i(0)$.
| Unit $i$ | Observed $D_i$ | Observed potential outcome | Missing potential outcome |
|---|---|---|---|
| 1 | 1 | $Y_1(1) = 12$ | $Y_1(0) = {?}$ |
So the ITE for unit 1 would be:
$$\tau_1 = Y_1(1) - Y_1(0) = 12 - Y_1(0)$$
but $Y_1(0)$ is not observed. Any ITE estimate requires a model/assumption to impute the missing counterfactual.
CATE — Conditional Average Treatment Effect
Definition
$$\tau(x) = \mathbb{E}[Y_i(1) - Y_i(0) \mid X_i = x]$$
Table split by segment
Segment: New
| $i$ | $D_i$ | $Y_i$ |
|---|---|---|
| 1 | 1 | 12 |
| 2 | 1 | 10 |
| 3 | 0 | 7 |
| 4 | 0 | 6 |
Compute (difference-in-means within segment):
- Treated mean: $(12 + 10) / 2 = 11$
- Control mean: $(7 + 6) / 2 = 6.5$
- Estimate: $\widehat{\tau}(\text{New}) = 11 - 6.5 = 4.5$
Segment: Returning
| $i$ | $D_i$ | $Y_i$ |
|---|---|---|
| 5 | 1 | 9 |
| 6 | 1 | 11 |
| 7 | 0 | 8 |
| 8 | 0 | 7 |
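- Treated mean: $(9 + 11) / 2 = 10$
- Control mean: $(8 + 7) / 2 = 7.5$
- Estimate: $\widehat{\tau}(\text{Returning}) = 10 - 7.5 = 2.5$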
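The per-segment arithmetic can be sketched in a few lines of Python. The tuple layout and the function name `cate_by_segment` are our own choices for this illustration. The sketch treats the within-segment difference in means as the estimate, which is only reasonable if treatment is as good as randomly assigned inside each segment.

```python
from statistics import mean

# Toy data from the table above: (unit, segment, group, treatment, outcome)
data = [
    (1, "New", 0, 1, 12), (2, "New", 1, 1, 10),
    (3, "New", 0, 0, 7), (4, "New", 1, 0, 6),
    (5, "Returning", 0, 1, 9), (6, "Returning", 1, 1, 11),
    (7, "Returning", 0, 0, 8), (8, "Returning", 1, 0, 7),
]


def cate_by_segment(rows):
    """Treated mean minus control mean, computed separately per segment."""
    estimates = {}
    for segment in sorted({r[1] for r in rows}):
        treated = [y for _, s, _, d, y in rows if s == segment and d == 1]
        control = [y for _, s, _, d, y in rows if s == segment and d == 0]
        estimates[segment] = mean(treated) - mean(control)
    return estimates


print(cate_by_segment(data))  # {'New': 4.5, 'Returning': 2.5}
```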
GATE — Group Average Treatment Effect
Definition
$$\tau(g) = \mathbb{E}[Y_i(1) - Y_i(0) \mid G_i = g]$$
Table split by group
Group: $G_i = 1$
| $i$ | $G_i$ | $D_i$ | $Y_i$ |
|---|---|---|---|
| 2 | 1 | 1 | 10 |
| 4 | 1 | 0 | 6 |
| 6 | 1 | 1 | 11 |
| 8 | 1 | 0 | 7 |
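- Treated mean (within $G_i = 1$): $(10 + 11) / 2 = 10.5$
- Control mean (within $G_i = 1$): $(6 + 7) / 2 = 6.5$
- Estimate: $\widehat{\tau}(G = 1) = 10.5 - 6.5 = 4$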
Group: $G_i = 0$
| $i$ | $G_i$ | $D_i$ | $Y_i$ |
|---|---|---|---|
| 1 | 0 | 1 | 12 |
| 3 | 0 | 0 | 7 |
| 5 | 0 | 1 | 9 |
| 7 | 0 | 0 | 8 |
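- Treated mean (within $G_i = 0$): $(12 + 9) / 2 = 10.5$
- Control mean (within $G_i = 0$): $(7 + 8) / 2 = 7.5$
- Estimate: $\widehat{\tau}(G = 0) = 10.5 - 7.5 = 3$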
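As a sketch, the group-level estimates can be computed the same way, and then combined into one number by weighting each group by its share of the sample. The names `gate_by_group`, `shares`, and `weighted_average` are ours, chosen for illustration.

```python
from statistics import mean

# Toy data from the table above: (unit, group, treatment, outcome)
data = [
    (1, 0, 1, 12), (2, 1, 1, 10), (3, 0, 0, 7), (4, 1, 0, 6),
    (5, 0, 1, 9), (6, 1, 1, 11), (7, 0, 0, 8), (8, 1, 0, 7),
]


def gate_by_group(rows):
    """Treated mean minus control mean, computed separately per group."""
    estimates = {}
    for g in sorted({r[1] for r in rows}):
        treated = [y for _, grp, d, y in rows if grp == g and d == 1]
        control = [y for _, grp, d, y in rows if grp == g and d == 0]
        estimates[g] = mean(treated) - mean(control)
    return estimates


gates = gate_by_group(data)

# Share of the sample that falls into each group.
shares = {g: sum(1 for r in data if r[1] == g) / len(data) for g in gates}
weighted_average = sum(shares[g] * gates[g] for g in gates)

print(gates)             # {0: 3.0, 1: 4.0}
print(weighted_average)  # 3.5
```

In this toy table the weighted average coincides with the overall treated-minus-control difference because both groups have the same share of treated units; with unequal shares the two numbers would generally differ.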
ATE — Average Treatment Effect the most common estimand
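Definition
$$\tau_{\text{ATE}} = \mathbb{E}[Y_i(1) - Y_i(0)]$$
Visualization: treated vs control overall
Treated ($D_i = 1$): units 1, 2, 5, 6 → outcomes $12, 10, 9, 11$. Mean: $10.5$
Control ($D_i = 0$): units 3, 4, 7, 8 → outcomes $7, 6, 8, 7$. Mean: $7$
Difference in means: $\widehat{\tau}_{\text{ATE}} = 10.5 - 7 = 3.5$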
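A minimal sketch of the overall comparison, using the treatment and outcome columns of the toy table. The function name `difference_in_means` is ours. The result is a credible ATE estimate only when treatment assignment is unrelated to the potential outcomes, for example in a randomized experiment.

```python
from statistics import mean

# Treatment and outcome columns of the toy table, in unit order 1..8.
treatment = [1, 1, 0, 0, 1, 1, 0, 0]
outcome = [12, 10, 7, 6, 9, 11, 8, 7]


def difference_in_means(d, y):
    """Mean outcome of treated units minus mean outcome of control units."""
    treated = [yi for di, yi in zip(d, y) if di == 1]
    control = [yi for di, yi in zip(d, y) if di == 0]
    if not treated or not control:
        raise ValueError("need at least one treated and one control unit")
    return mean(treated) - mean(control)


ate_hat = difference_in_means(treatment, outcome)
print(ate_hat)  # 3.5
```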
ATTE — Average Treatment Effect on the Treated
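Definition
$$\tau_{\text{ATTE}} = \mathbb{E}[Y_i(1) - Y_i(0) \mid D_i = 1]$$
To visualize the ATTE, we need a counterfactual for treated units. We will impute $Y_i(0)$ for each treated unit using a 'black-box' model; in this example its predictions equal the control mean of the unit's segment (6.5 for New, 7.5 for Returning).
| Treated unit $i$ | $D_i$ | Segment | Observed $Y_i(1)$ | Imputed $\widehat{Y}_i(0)$ | Difference |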
|---|---|---|---|---|---|
| 1 | 1 | New | 12 | 6.5 | 5.5 |
| 2 | 1 | New | 10 | 6.5 | 3.5 |
| 5 | 1 | Returning | 9 | 7.5 | 1.5 |
| 6 | 1 | Returning | 11 | 7.5 | 3.5 |
Average over treated units:
$$\widehat{\tau}_{\text{ATTE}} = \frac{5.5 + 3.5 + 1.5 + 3.5}{4} = 3.5$$
(Here it matches the ATE; that will not generally happen.)
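The imputation table can be reproduced with a short sketch. The text does not say what the black-box model is, so as a stand-in we assume it predicts the mean control outcome of the unit's segment, which happens to give the imputed values 6.5 and 7.5 shown above. The function names `fit_control_means` and `atte` are hypothetical.

```python
from statistics import mean

# Toy data from the estimands table: (unit, segment, treatment, outcome)
data = [
    (1, "New", 1, 12), (2, "New", 1, 10), (3, "New", 0, 7), (4, "New", 0, 6),
    (5, "Returning", 1, 9), (6, "Returning", 1, 11),
    (7, "Returning", 0, 8), (8, "Returning", 0, 7),
]


def fit_control_means(rows):
    """Stand-in for the black-box model: mean control outcome per segment."""
    model = {}
    for segment in {r[1] for r in rows}:
        model[segment] = mean(
            [y for _, s, d, y in rows if s == segment and d == 0]
        )
    return model


def atte(rows):
    """Average of (observed outcome - imputed untreated outcome) over treated units."""
    y0_hat = fit_control_means(rows)
    diffs = [y - y0_hat[s] for _, s, d, y in rows if d == 1]
    return mean(diffs)


print(fit_control_means(data))  # imputed untreated outcome per segment
print(atte(data))               # 3.5
```

Any model that predicts untreated outcomes from pre-treatment features could be swapped in for `fit_control_means`; the averaging step stays the same.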
LATE — Local Average Treatment Effect
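Definition
Let $Z_i$ be an instrument (e.g., randomized encouragement), $D_i$ the actual treatment uptake, and $Y_i$ the outcome. Then:
$$\tau_{\text{LATE}} = \frac{\mathbb{E}[Y_i \mid Z_i = 1] - \mathbb{E}[Y_i \mid Z_i = 0]}{\mathbb{E}[D_i \mid Z_i = 1] - \mathbb{E}[D_i \mid Z_i = 0]}$$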
Let's add an instrument
| i | Instrument | Treatment | Outcome |
|---|---|---|---|
| 1 | 1 | 1 | 10 |
| 2 | 1 | 1 | 11 |
| 3 | 1 | 0 | 7 |
| 4 | 1 | 1 | 12 |
| 5 | 0 | 0 | 7 |
| 6 | 0 | 1 | 9 |
| 7 | 0 | 0 | 6 |
| 8 | 0 | 0 | 8 |
Step-by-step calculation
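First stage (effect of $Z$ on $D$):
- Mean of $D_i$ for $Z_i = 1$: $(1 + 1 + 0 + 1) / 4 = 0.75$
- Mean of $D_i$ for $Z_i = 0$: $(0 + 1 + 0 + 0) / 4 = 0.25$
- Difference: $0.75 - 0.25 = 0.5$
Reduced form (effect of $Z$ on $Y$):
- Mean of $Y_i$ for $Z_i = 1$: $(10 + 11 + 7 + 12) / 4 = 10$
- Mean of $Y_i$ for $Z_i = 0$: $(7 + 9 + 6 + 8) / 4 = 7.5$
- Difference: $10 - 7.5 = 2.5$
LATE:
$$\widehat{\tau}_{\text{LATE}} = \frac{2.5}{0.5} = 5$$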
Interpretation: the estimated causal effect applies to compliers—units whose treatment status is changed by the instrument.
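The step-by-step calculation can be sketched as a ratio of two differences in means, often called the Wald estimator. The data are the rows of the instrument table; the function name `wald_estimate` and the tuple layout are our own choices for this illustration.

```python
from statistics import mean

# Rows of the instrument table: (unit, instrument z, treatment d, outcome y)
iv_data = [
    (1, 1, 1, 10), (2, 1, 1, 11), (3, 1, 0, 7), (4, 1, 1, 12),
    (5, 0, 0, 7), (6, 0, 1, 9), (7, 0, 0, 6), (8, 0, 0, 8),
]


def wald_estimate(rows):
    """Return (first stage, reduced form, ratio) for a binary instrument."""
    z1 = [r for r in rows if r[1] == 1]
    z0 = [r for r in rows if r[1] == 0]
    # First stage: how much the instrument shifts treatment uptake.
    first_stage = mean([r[2] for r in z1]) - mean([r[2] for r in z0])
    # Reduced form: how much the instrument shifts the outcome.
    reduced_form = mean([r[3] for r in z1]) - mean([r[3] for r in z0])
    if first_stage == 0:
        raise ValueError("instrument does not change treatment uptake")
    return first_stage, reduced_form, reduced_form / first_stage


first_stage, reduced_form, late_hat = wald_estimate(iv_data)
print(first_stage, reduced_form, late_hat)  # 0.5 2.5 5.0
```

The guard on a zero first stage reflects a practical point: when the instrument barely moves uptake, the ratio becomes unstable, so the first stage is worth inspecting before trusting the estimate.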