Linear and Nonlinear Data Generating Process benchmarking
Linear Data Generating Process (DGP)
Let $i=1,\dots,n$. Draw confounders:

$$\text{tenure}_i \sim \mathcal N(24,\ 12^2)$$

$$\text{sessions}_i \sim \mathcal N(5,\ 2^2)$$

$$\text{spend}_i \sim \mathrm{Unif}(0,200)$$

$$\text{prem}_i \sim \mathrm{Bernoulli}(0.25)$$

$$\text{urban}_i \sim \mathrm{Bernoulli}(0.60)$$

Stack them as
$$X_i := \big[\ \text{tenure}_i,\ \text{sessions}_i,\ \text{spend}_i,\ \text{prem}_i,\ \text{urban}_i\ \big]^\top \in \mathbb{R}^5$$

Treatment model

$$m(x) \equiv \Pr(D=1\mid X=x) = \sigma\big(\alpha_d + x^\top \beta_t\big),
\qquad
\sigma(z)=\frac{1}{1+e^{-z}},$$

with $\alpha_d$ calibrated (by bisection) so that $\mathbb{E}[D]\approx 0.20$, and

$$\beta_t^\top = \big[0.08,\ 0.12,\ 0.004,\ 0.25,\ 0.10\big]$$

Then

$$D_i \mid X_i \sim \mathrm{Bernoulli}\big(m(X_i)\big)$$
Outcome model

$$Y_i = \alpha_y + X_i^\top \beta_y + \Theta D_i + \varepsilon_i,
\qquad
\varepsilon_i \sim \mathcal N(0,\sigma_y^2),$$

with $\alpha_y=0,\ \sigma_y=1,\ \Theta=0.80$, and

$$\beta_y^\top = \big[0.05,\ 0.60,\ 0.005,\ 0.80,\ 0.20\big]$$

Oracle nuisances (IRM) and CATE

$$m(x) = \sigma\big(\alpha_d + x^\top \beta_t\big)$$

$$g_0(x) = \mathbb{E}[Y \mid X=x, D=0] = \alpha_y + x^\top \beta_y$$

$$g_1(x) = \mathbb{E}[Y \mid X=x, D=1] = \alpha_y + x^\top \beta_y + \Theta$$

$$\mathrm{CATE}(x) = g_1(x) - g_0(x) = \Theta \quad \text{(constant)}$$

Targets

$$\Theta_{\text{ATE}} = \mathbb{E}\big[Y(1)-Y(0)\big] = \Theta,
\qquad
\Theta_{\text{ATTE}} = \mathbb{E}\big[Y(1)-Y(0)\mid D=1\big] = \Theta.$$

So under this constant-effect DGP:

$$\Theta_{\text{ATE}} = \Theta_{\text{ATTE}} = 0.80.$$
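To make the definitions above concrete, here is a minimal standard-library sketch that draws from the linear DGP, calibrates $\alpha_d$ by bisection, and compares a naive difference in means with an AIPW estimate built from the oracle nuisances $m$, $g_0$, $g_1$. It is not the notebook's own generator or estimator: the function names (`draw_x`, `calibrate_alpha`, `simulate`, `naive_diff`, `oracle_aipw`), the sample size, and the seed are all assumptions made for illustration, so the numbers it prints will differ from the notebook outputs.

```python
import math
import random

BETA_T = [0.08, 0.12, 0.004, 0.25, 0.10]   # treatment coefficients from the text
BETA_Y = [0.05, 0.60, 0.005, 0.80, 0.20]   # outcome coefficients from the text
THETA, ALPHA_Y, SIGMA_Y = 0.80, 0.0, 1.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def draw_x(rng):
    """One confounder vector [tenure, sessions, spend, prem, urban]."""
    return [
        rng.gauss(24, 12),
        rng.gauss(5, 2),
        rng.uniform(0, 200),
        1.0 if rng.random() < 0.25 else 0.0,
        1.0 if rng.random() < 0.60 else 0.0,
    ]


def calibrate_alpha(scores, target=0.20, lo=-20.0, hi=20.0, iters=50):
    """Bisection on alpha so that the mean propensity equals `target`.

    The mean of sigmoid(alpha + score) is increasing in alpha, so
    halving the bracket [lo, hi] converges to the unique root.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        share = sum(sigmoid(mid + s) for s in scores) / len(scores)
        if share < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def simulate(n=20000, seed=0):
    """Return rows (y, d, m, g0, g1), including the oracle nuisances."""
    rng = random.Random(seed)
    xs = [draw_x(rng) for _ in range(n)]
    scores = [dot(x, BETA_T) for x in xs]
    alpha_d = calibrate_alpha(scores)
    rows = []
    for x, s in zip(xs, scores):
        m = sigmoid(alpha_d + s)                 # oracle propensity m(x)
        d = 1 if rng.random() < m else 0
        g0 = ALPHA_Y + dot(x, BETA_Y)            # oracle g_0(x)
        y = g0 + THETA * d + rng.gauss(0, SIGMA_Y)
        rows.append((y, d, m, g0, g0 + THETA))
    return rows


def naive_diff(rows):
    """Difference in mean outcomes, ignoring confounding."""
    treated = [r[0] for r in rows if r[1] == 1]
    control = [r[0] for r in rows if r[1] == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def oracle_aipw(rows):
    """AIPW (doubly robust) ATE score averaged with the true nuisances."""
    psi = [
        g1 - g0 + d * (y - g1) / m - (1 - d) * (y - g0) / (1 - m)
        for y, d, m, g0, g1 in rows
    ]
    return sum(psi) / len(psi)


rows = simulate()
print("treatment share:", sum(r[1] for r in rows) / len(rows))
print("naive difference in means:", naive_diff(rows))
print("oracle AIPW ATE:", oracle_aipw(rows))
```

Because tenure and sessions raise both the treatment probability and the outcome, the naive difference should overshoot 0.80, while the oracle AIPW average should sit close to it. In practice the nuisances are unknown and must be estimated, which is the step this sketch deliberately skips.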
Let's generate the data
Result
Treatment share ≈ 0.2052
Ground-truth ATE from the DGP: 0.800
Ground-truth ATT from the DGP: 0.800
Wrap it in a CausalData object
Result
| | y | d | tenure_months | avg_sessions_week | spend_last_month | premium_user | urban_resident |
|---|---|---|---|---|---|---|---|
| 0 | 1.903910 | 0.0 | 12.130544 | 4.056687 | 181.570607 | 0.0 | 0.0 |
| 1 | 3.388144 | 0.0 | 19.586560 | 1.671561 | 182.793598 | 0.0 | 0.0 |
| 2 | 8.456512 | 1.0 | 39.455103 | 5.452889 | 125.185708 | 1.0 | 1.0 |
| 3 | 5.535970 | 1.0 | 26.327693 | 5.051629 | 4.932905 | 0.0 | 1.0 |
| 4 | 4.965140 | 1.0 | 35.042771 | 4.933996 | 23.577407 | 0.0 | 0.0 |
Estimate ATE and ATTE
Result
Real ATE = 0.8 VS Estimated = 0.7506313322797796 in (0.6668061755942251, 0.8344564889653342)
Result
Real ATTE = 0.8 VS Estimated = 0.8238669785039335 in (0.7558326760362799, 0.8919012809715872)
The estimated ATE and ATTE are close to the ground-truth values, and both confidence intervals contain the true value of 0.80, but the intervals are fairly wide.
More data narrows them; you can check this by increasing n in the DGP.
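As a rough guide to how much narrower the intervals get, assume the estimator is asymptotically normal with a standard error that shrinks at the usual $\sqrt{n}$ rate (an assumption about the estimator, not something the outputs above establish). The interval is then

$$\hat\Theta \pm z_{1-\alpha/2}\,\frac{\hat\sigma}{\sqrt{n}},
\qquad
\text{width}(n') \approx \text{width}(n)\,\sqrt{\frac{n}{n'}}.$$

Applied to the ATE interval reported above, whose width is $0.8345 - 0.6668 \approx 0.168$, quadrupling the sample size would give a width of roughly $0.168 / 2 \approx 0.084$.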
Nonlinear Data Generating Process (DGP)
Let $i=1,\dots,n$. Draw confounders:

$$\text{tenure}_i \sim \mathcal N(24,\ 12^2)$$

$$\text{sessions}_i \sim \mathcal N(5,\ 2^2)$$

$$\text{spend}_i \sim \mathrm{Unif}(0,200)$$

$$\text{prem}_i \sim \mathrm{Bernoulli}(0.25)$$

$$\text{urban}_i \sim \mathrm{Bernoulli}(0.60)$$

Stack them as

$$X_i := \big[\ \text{tenure}_i,\ \text{sessions}_i,\ \text{spend}_i,\ \text{prem}_i,\ \text{urban}_i\ \big]^\top \in \mathbb{R}^5$$

Treatment model

Define

$$m(x) \equiv \Pr(D=1\mid X=x) = \sigma\big(\alpha_d + g_d(x)\big),
\qquad
\sigma(z)=\frac{1}{1+e^{-z}},$$

with $\alpha_d$ calibrated (by bisection) so that $\mathbb{E}[D]\approx 0.20$.
The nonlinear score includes alignment with the treatment effect $\tau(x)$:

$$
\boxed{
\begin{aligned}
g_d(x) &= 1.10\tanh\big(0.06(\text{spend}-100)\big)
+ 1.00\,\sigma\big(0.60(\text{sessions}-5)\big)
+ 0.50\log\big(1+\text{tenure}\big) \\
&\quad + 0.50\,\text{prem}
+ 0.25\,\text{urban}
+ 0.90\,\text{prem}\cdot\mathbb{1}\{\text{spend}>120\} \\
&\quad + 0.30\,\text{urban}\cdot\mathbb{1}\{\text{tenure}<12\}
+ 0.80\,\tau(x)
\end{aligned}
}
$$

where $z_+ = \max(z,0)$ and $\mathbb{1}\{\cdot\}$ is the indicator.
Then

$$D_i \mid X_i \sim \mathrm{Bernoulli}\big(m(X_i)\big)$$

Outcome model

$$Y_i = \alpha_y + g_y(X_i) + D_i\,\tau(X_i) + \varepsilon_i,
\qquad
\varepsilon_i \sim \mathcal N(0,\sigma_y^2),$$

with $\alpha_y=0,\ \sigma_y=1$.
The baseline outcome component is nonlinear:
$$
\boxed{
\begin{aligned}
g_y(x) &= 0.70\tanh\big(0.03(\text{spend}-80)\big)
+ 0.50\sqrt{\text{sessions}}
+ 0.40\log\big(1+\text{tenure}\big) \\
&\quad + 0.30\,\text{prem}
+ 0.10\,\text{urban}
- 0.10\,\mathbb{1}\{\text{spend}<20\}
\end{aligned}
}
$$

The heterogeneous treatment effect (CATE) is
$$
\boxed{
\begin{aligned}
\tau(x) &= 0.40
+ 0.60\,\sigma\big(-0.40(\text{sessions}-5)\big)
+ 2.00\,\text{prem}\cdot\mathbb{1}\{\text{spend}>120\} \\
&\quad + 0.10\,\text{urban}\cdot\mathbb{1}\{\text{tenure}<12\}
\end{aligned}
}
$$

Oracle nuisances (IRM) and CATE

$$m(x) = \sigma\big(\alpha_d + g_d(x)\big)$$

$$g_0(x) = \mathbb{E}[Y \mid X=x, D=0] = \alpha_y + g_y(x)$$

$$g_1(x) = \mathbb{E}[Y \mid X=x, D=1] = \alpha_y + g_y(x) + \tau(x)$$

$$\mathrm{CATE}(x) = g_1(x) - g_0(x) = \tau(x)$$

Targets

$$\Theta_{\text{ATE}} = \mathbb{E}\big[\tau(X)\big],
\qquad
\Theta_{\text{ATTE}} = \mathbb{E}\big[\tau(X)\mid D=1\big].$$
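The ATE target can be worked out by hand as a sanity check, assuming the five confounders are drawn independently and exactly as stated (no clipping or truncation; the text does not say otherwise, so this is an assumption). Taking expectations of $\tau(X)$ term by term:

$$
\begin{aligned}
\Theta_{\text{ATE}}
&= 0.40
+ 0.60\,\mathbb{E}\big[\sigma\big(-0.40(\text{sessions}-5)\big)\big]
+ 2.00\,\Pr(\text{prem}=1)\Pr(\text{spend}>120)
+ 0.10\,\Pr(\text{urban}=1)\Pr(\text{tenure}<12) \\
&= 0.40 + 0.60\cdot\tfrac12 + 2.00\cdot 0.25\cdot 0.40 + 0.10\cdot 0.60\cdot\Phi(-1) \\
&\approx 0.40 + 0.30 + 0.20 + 0.0095 = 0.9095 .
\end{aligned}
$$

The second term uses $\sigma(-z) = 1-\sigma(z)$ together with the symmetry of $\mathcal N(5, 2^2)$ around 5, which gives an expectation of exactly $1/2$. The last term uses $\Pr(\text{tenure}<12) = \Phi\big((12-24)/12\big) = \Phi(-1) \approx 0.1587$. The result agrees with the sample-based ground-truth ATE of 0.913 reported for this DGP up to sampling noise. The ATTE has no comparably simple closed form, because $m(x)$ depends on $\tau(x)$ through $g_d$.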
Result
Treatment share ≈ 0.2036
Ground-truth ATE from the DGP: 0.913
Ground-truth ATT from the DGP: 1.567
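The ground-truth values above can be cross-checked with a small Monte Carlo sketch that uses only the formulas for $\tau(x)$ and $g_d(x)$. It is not the notebook's generator: the function names, sample size, and seed are assumptions, and so is the clipping of negative tenure draws to zero, which is needed for $\log(1+\text{tenure})$ to be defined and which the text does not specify. The ATTE is computed through the identity $\mathbb{E}[\tau(X)\mid D=1] = \mathbb{E}[\tau(X)\,m(X)] / \mathbb{E}[m(X)]$ rather than by averaging over realized treatments.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def draw_x(rng):
    """One confounder tuple (tenure, sessions, spend, prem, urban)."""
    return (
        rng.gauss(24, 12),
        rng.gauss(5, 2),
        rng.uniform(0, 200),
        float(rng.random() < 0.25),
        float(rng.random() < 0.60),
    )


def tau(tenure, sessions, spend, prem, urban):
    """Heterogeneous treatment effect tau(x) as written in the text."""
    return (0.40
            + 0.60 * sigmoid(-0.40 * (sessions - 5))
            + 2.00 * prem * (spend > 120)
            + 0.10 * urban * (tenure < 12))


def g_d(tenure, sessions, spend, prem, urban):
    """Nonlinear treatment score g_d(x) as written in the text."""
    # ASSUMPTION: negative tenure draws are clipped to 0 before the log.
    return (1.10 * math.tanh(0.06 * (spend - 100))
            + 1.00 * sigmoid(0.60 * (sessions - 5))
            + 0.50 * math.log1p(max(tenure, 0.0))
            + 0.50 * prem
            + 0.25 * urban
            + 0.90 * prem * (spend > 120)
            + 0.30 * urban * (tenure < 12)
            + 0.80 * tau(tenure, sessions, spend, prem, urban))


def calibrate_alpha(scores, target=0.20, lo=-20.0, hi=20.0, iters=50):
    """Bisection on alpha so that the mean propensity equals `target`."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        share = sum(sigmoid(mid + s) for s in scores) / len(scores)
        if share < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def ground_truth(n=20000, seed=0):
    """Monte Carlo estimates of (ATE, ATTE, mean propensity)."""
    rng = random.Random(seed)
    xs = [draw_x(rng) for _ in range(n)]
    taus = [tau(*x) for x in xs]
    scores = [g_d(*x) for x in xs]
    alpha_d = calibrate_alpha(scores)
    ms = [sigmoid(alpha_d + s) for s in scores]
    ate = sum(taus) / n
    # E[tau | D=1] = E[tau * m] / E[m], so no treatment draw is needed.
    atte = sum(t * m for t, m in zip(taus, ms)) / sum(ms)
    return ate, atte, sum(ms) / n


ate, atte, share = ground_truth()
print("mean propensity:", share)
print("Monte Carlo ATE:", ate)
print("Monte Carlo ATTE:", atte)
```

If the formulas as written match the generator, the printed values should land in the neighborhood of the reported ones; a persistent gap would point to details, such as how negative draws are handled, that the text does not spell out. The ATTE exceeds the ATE because $g_d$ loads positively on $\tau(x)$, so units with larger effects are more likely to be treated.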
Result
| | y | d | tenure_months | avg_sessions_week | spend_last_month | premium_user | urban_resident |
|---|---|---|---|---|---|---|---|
| 0 | 0.689404 | 0.0 | 12.130544 | 4.056687 | 181.570607 | 0.0 | 0.0 |
| 1 | 3.045282 | 0.0 | 19.586560 | 1.671561 | 182.793598 | 0.0 | 0.0 |
| 2 | 7.173595 | 1.0 | 39.455103 | 5.452889 | 125.185708 | 1.0 | 1.0 |
| 3 | 1.926216 | 0.0 | 26.327693 | 5.051629 | 4.932905 | 0.0 | 1.0 |
| 4 | 1.225088 | 0.0 | 35.042771 | 4.933996 | 23.577407 | 0.0 | 0.0 |
Result
Real ATE = 0.913 VS Estimated = 0.9917276396749556 in (0.869543879249174, 1.1139114001007373)
Result
Real ATTE = 1.567 VS Estimated = 1.6433239376940343 in (1.4972790851652928, 1.7893687902227757)
The estimated ATE and ATTE are close to the ground-truth values, and both confidence intervals contain the corresponding true values (0.913 and 1.567), but the intervals are fairly wide.
More data narrows them; you can check this by increasing n in the DGP.