Linear and Nonlinear Data Generating Process benchmarking

Linear Data Generating Process (DGP)

Let $i=1,\dots,n$. Draw confounders:

\text{tenure}_i \sim \mathcal N(24,\ 12^2)

\text{sessions}_i \sim \mathcal N(5,\ 2^2)

\text{spend}_i \sim \mathrm{Unif}(0,200)

\text{prem}_i \sim \mathrm{Bernoulli}(0.25)

\text{urban}_i \sim \mathrm{Bernoulli}(0.60)

Stack them as

X_i := \big[\ \text{tenure}_i,\ \text{sessions}_i,\ \text{spend}_i,\ \text{prem}_i,\ \text{urban}_i\ \big]^\top \in \mathbb{R}^5
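The draws above can be sketched in a few lines of standard-library Python. This is only an illustration: the function name `draw_confounders`, the seed, and the sample size are hypothetical choices, and the confounders are assumed to be drawn independently of each other.

```python
import random

def draw_confounders(n, seed=0):
    """Draw n rows of [tenure, sessions, spend, prem, urban] (hypothetical helper)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        tenure = rng.gauss(24.0, 12.0)                  # N(24, 12^2)
        sessions = rng.gauss(5.0, 2.0)                  # N(5, 2^2)
        spend = rng.uniform(0.0, 200.0)                 # Unif(0, 200)
        prem = 1.0 if rng.random() < 0.25 else 0.0      # Bernoulli(0.25)
        urban = 1.0 if rng.random() < 0.60 else 0.0     # Bernoulli(0.60)
        rows.append([tenure, sessions, spend, prem, urban])
    return rows

X = draw_confounders(10_000)

# Column means should be near the theoretical means [24, 5, 100, 0.25, 0.60].
col_means = [sum(col) / len(X) for col in zip(*X)]
print([round(m, 2) for m in col_means])
```

In practice a vectorised array library would be the natural choice; plain Python is used here only to keep the sketch dependency-free.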

Treatment model

m(x) \equiv \Pr(D=1\mid X=x) = \sigma\big(\alpha_d + x^\top \beta_t\big), \qquad \sigma(z)=\frac{1}{1+e^{-z}},

with $\alpha_d$ calibrated (by bisection) so that $\mathbb{E}[D]\approx 0.20$, and

\beta_t^\top = \big[0.08,\ 0.12,\ 0.004,\ 0.25,\ 0.10 \big]

Then $D_i \mid X_i \sim \mathrm{Bernoulli}\big(m(X_i)\big)$
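One way the bisection step might look is sketched below. The helper `calibrate_intercept`, its bracket `[-50, 50]`, and the tolerance are assumptions made for illustration, and the calibration targets the sample mean of the propensities as a stand-in for $\mathbb{E}[D]$. The search works because the mean propensity is strictly increasing in $\alpha_d$.

```python
import math
import random

BETA_T = [0.08, 0.12, 0.004, 0.25, 0.10]  # coefficients from the treatment model

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def calibrate_intercept(scores, target=0.20, lo=-50.0, hi=50.0, tol=1e-8):
    """Bisection on alpha so that mean(sigmoid(alpha + score)) is close to target."""
    def mean_propensity(alpha):
        return sum(sigmoid(alpha + s) for s in scores) / len(scores)

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_propensity(mid) < target:
            lo = mid   # intercept too small: treatment share below target
        else:
            hi = mid   # intercept too large: treatment share at or above target
    return 0.5 * (lo + hi)

rng = random.Random(0)
n = 5_000  # arbitrary sample size for the sketch
X = [[rng.gauss(24.0, 12.0), rng.gauss(5.0, 2.0), rng.uniform(0.0, 200.0),
      float(rng.random() < 0.25), float(rng.random() < 0.60)] for _ in range(n)]

scores = [sum(b * v for b, v in zip(BETA_T, row)) for row in X]   # x' beta_t
alpha_d = calibrate_intercept(scores)
m = [sigmoid(alpha_d + s) for s in scores]                        # propensities m(X_i)
D = [1.0 if rng.random() < p else 0.0 for p in m]                 # D_i ~ Bernoulli(m(X_i))

print(f"alpha_d = {alpha_d:.3f}, mean propensity = {sum(m) / n:.4f}, "
      f"treatment share = {sum(D) / n:.4f}")
```

The mean propensity matches the target almost exactly, while the realised treatment share fluctuates around it because $D_i$ is itself random.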

Outcome model

Y_i = \alpha_y + X_i^\top \beta_y + \Theta D_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal N(0,\sigma_y^2),

with $\alpha_y=0,\ \sigma_y=1,\ \Theta=0.80$, and

\beta_y^\top = \big[0.05,\ 0.60,\ 0.005,\ 0.80,\ 0.20 \big]

Oracle nuisances (IRM) and CATE

m(x) = \sigma\!\big(\alpha_d + x^\top \beta_t\big)

g_0(x) = \mathbb{E}[Y \mid X=x, D=0] = \alpha_y + x^\top \beta_y

g_1(x) = \mathbb{E}[Y \mid X=x, D=1] = \alpha_y + x^\top \beta_y + \Theta

\mathrm{CATE}(x) = g_1(x) - g_0(x) = \Theta \quad \text{(constant)}
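The outcome model and its oracle nuisances translate directly into code. The sketch below uses hypothetical helper names (`g0`, `g1`, `draw_outcome`) and an arbitrary example covariate vector; it checks that the oracle contrast equals $\Theta$ and that simulated outcomes at a fixed $x$ differ by roughly $\Theta$ on average.

```python
import random

BETA_Y = [0.05, 0.60, 0.005, 0.80, 0.20]   # coefficients from the outcome model
ALPHA_Y, SIGMA_Y, THETA = 0.0, 1.0, 0.80

def g0(x):
    """Oracle E[Y | X=x, D=0] = alpha_y + x' beta_y."""
    return ALPHA_Y + sum(b * v for b, v in zip(BETA_Y, x))

def g1(x):
    """Oracle E[Y | X=x, D=1] = g0(x) + Theta."""
    return g0(x) + THETA

def draw_outcome(x, d, rng):
    """One draw of Y = alpha_y + x' beta_y + Theta * d + eps, eps ~ N(0, sigma_y^2)."""
    return g0(x) + THETA * d + rng.gauss(0.0, SIGMA_Y)

# Arbitrary example unit: tenure=24, sessions=5, spend=100, premium, non-urban.
x = [24.0, 5.0, 100.0, 1.0, 0.0]
print(g0(x), g1(x), g1(x) - g0(x))   # g0 is about 5.5, g1 about 6.3, CATE = 0.8

# Simulated check: average outcomes under d=1 and d=0 at the same x.
rng = random.Random(0)
reps = 20_000
mean_y1 = sum(draw_outcome(x, 1.0, rng) for _ in range(reps)) / reps
mean_y0 = sum(draw_outcome(x, 0.0, rng) for _ in range(reps)) / reps
print(f"simulated contrast at x: {mean_y1 - mean_y0:.3f}")
```

Because the contrast does not depend on $x$, averaging it over any subpopulation (all units, or treated units only) returns the same value.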

Targets

\Theta_{\text{ATE}} = \mathbb{E}\big[Y(1)-Y(0)\big] = \Theta, \qquad \Theta_{\text{ATTE}} = \mathbb{E}\big[Y(1)-Y(0)\mid D=1\big] = \Theta.

So under this constant-effect DGP:

\Theta_{\text{ATE}} = \Theta_{\text{ATTE}} = 0.80.

Let's generate the data

Result

Treatment share ≈ 0.2052
Ground-truth ATE from the DGP: 0.800
Ground-truth ATT from the DGP: 0.800

Wrap it in a CausalData object

Result

	y	d	tenure_months	avg_sessions_week	spend_last_month	premium_user	urban_resident
0	1.903910	0.0	12.130544	4.056687	181.570607	0.0	0.0
1	3.388144	0.0	19.586560	1.671561	182.793598	0.0	0.0
2	8.456512	1.0	39.455103	5.452889	125.185708	1.0	1.0
3	5.535970	1.0	26.327693	5.051629	4.932905	0.0	1.0
4	4.965140	1.0	35.042771	4.933996	23.577407	0.0	0.0

Estimate ATE and ATTE

Result

Real ATE = 0.8 VS Estimated = 0.7506313322797796 in (0.6668061755942251, 0.8344564889653342)

Result

Real ATTE = 0.8 VS Estimated = 0.8238669785039335 in (0.7558326760362799, 0.8919012809715872)

The estimated ATE and ATTE are close to the ground-truth values, and both confidence intervals contain the true effect of 0.80. The intervals are fairly wide, however. Increasing the sample size narrows them; you can verify this by changing n in the DGP.
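As a rough guide to how much narrower, assume the estimator is asymptotically normal at the usual $\sqrt{n}$ rate (a common property of estimators of this kind, but an assumption here, since the text does not state it). A $(1-\alpha)$ interval then has the form

\hat\Theta \pm z_{1-\alpha/2}\,\frac{\hat\sigma}{\sqrt{n}}, \qquad \text{width} \propto \frac{1}{\sqrt{n}},

so quadrupling $n$ roughly halves the width. For example, the ATE interval above has half-width $(0.8345-0.6668)/2 \approx 0.084$; with four times as many observations one would expect a half-width near $0.042$.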

Nonlinear Data Generating Process (DGP)

Let $i=1,\dots,n$. Draw confounders:

\text{tenure}_i \sim \mathcal N(24,\ 12^2)

\text{sessions}_i \sim \mathcal N(5,\ 2^2)

\text{spend}_i \sim \mathrm{Unif}(0,200)

\text{prem}_i \sim \mathrm{Bernoulli}(0.25)

\text{urban}_i \sim \mathrm{Bernoulli}(0.60)

Stack them as

X_i := \big[\ \text{tenure}_i,\ \text{sessions}_i,\ \text{spend}_i,\ \text{prem}_i,\ \text{urban}_i\ \big]^\top \in \mathbb{R}^5

Treatment model

Define

m(x) \equiv \Pr(D=1\mid X=x) = \sigma\big(\alpha_d + g_d(x)\big), \qquad \sigma(z)=\frac{1}{1+e^{-z}},

with $\alpha_d$ calibrated (by bisection) so that $\mathbb{E}[D]\approx 0.20$.

The nonlinear score $g_d(x)$ includes a term proportional to the treatment effect $\tau(x)$ (defined below), so units with larger effects are more likely to be treated:

\boxed{ \begin{aligned} g_d(x) &= 1.10\tanh\big(0.06(\text{spend}-100)\big) + 1.00\,\sigma\big(0.60(\text{sessions}-5)\big) + 0.50\log\big(1+\text{tenure}\big) \\ &\quad + 0.50\,\text{prem} + 0.25\,\text{urban} + 0.90\,\text{prem}\cdot\mathbb{1}\{\text{spend}>120\} \\ &\quad + 0.30\,\text{urban}\cdot\mathbb{1}\{\text{tenure}<12\} + 0.80\,\tau(x) \end{aligned} }

where $\mathbb{1}\{\cdot\}$ is the indicator function.

Then

D_i \mid X_i \sim \mathrm{Bernoulli}\big(m(X_i)\big)

Outcome model

Y_i = \alpha_y + g_y(X_i) + D_i\tau(X_i) + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal N(0,\sigma_y^2),

with $\alpha_y=0,\ \sigma_y=1$.

The baseline outcome component is nonlinear:

\boxed{ \begin{aligned} g_y(x) &= 0.70\tanh\big(0.03(\text{spend}-80)\big) + 0.50\sqrt{\text{sessions}} + 0.40\log\big(1+\text{tenure}\big) \\ &\quad + 0.30\,\text{prem} + 0.10\,\text{urban} - 0.10\,\mathbb{1}\{\text{spend}<20\} \end{aligned} }

The heterogeneous treatment effect (CATE) is

\boxed{ \begin{aligned} \tau(x) &= 0.40 + 0.60\,\sigma\big(-0.40(\text{sessions}-5)\big) + 2.00\,\text{prem}\cdot\mathbb{1}\{\text{spend}>120\} \\ &\quad + 0.10\,\text{urban}\cdot\mathbb{1}\{\text{tenure}<12\} \end{aligned} }
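The baseline $g_y$ and the effect $\tau$ can be written as plain functions, as in the sketch below. The function names and argument order are illustrative choices. One point is an assumption rather than something stated above: the normal draws for tenure and sessions can be negative, where $\log(1+\text{tenure})$ and $\sqrt{\text{sessions}}$ may be undefined, so this sketch clips both at zero inside $g_y$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tau(tenure, sessions, spend, prem, urban):
    """Heterogeneous treatment effect tau(x)."""
    return (0.40
            + 0.60 * sigmoid(-0.40 * (sessions - 5.0))
            + 2.00 * prem * (spend > 120.0)      # large uplift: premium AND high spend
            + 0.10 * urban * (tenure < 12.0))    # small uplift: urban AND short tenure

def g_y(tenure, sessions, spend, prem, urban):
    """Baseline outcome component g_y(x)."""
    # ASSUMPTION: clip at zero so sqrt and log are defined for negative draws.
    tenure_c = max(tenure, 0.0)
    sessions_c = max(sessions, 0.0)
    return (0.70 * math.tanh(0.03 * (spend - 80.0))
            + 0.50 * math.sqrt(sessions_c)
            + 0.40 * math.log1p(tenure_c)
            + 0.30 * prem + 0.10 * urban
            - 0.10 * (spend < 20.0))

# A few hand-checkable values of tau:
print(tau(24.0, 5.0, 100.0, 0.0, 0.0))   # baseline user: 0.40 + 0.60 * 0.5, about 0.70
print(tau(24.0, 5.0, 150.0, 1.0, 0.0))   # premium with spend > 120: about 2.70
print(tau(6.0, 5.0, 100.0, 0.0, 1.0))    # urban with tenure < 12: about 0.80
```

The premium, high-spend segment dominates the heterogeneity: its effect is almost four times the baseline effect.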

Oracle nuisances (IRM) and CATE

m(x) = \sigma\big(\alpha_d + g_d(x)\big)

g_0(x) = \mathbb{E}[Y \mid X=x, D=0] = \alpha_y + g_y(x)

g_1(x) = \mathbb{E}[Y \mid X=x, D=1] = \alpha_y + g_y(x) + \tau(x)

\mathrm{CATE}(x) = g_1(x) - g_0(x) = \tau(x)

Targets

\Theta_{\text{ATE}} = \mathbb{E}\big[\tau(X)\big], \qquad \Theta_{\text{ATTE}} = \mathbb{E}\big[\tau(X)\mid D=1\big].
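Neither target is a single constant any more, so one way to approximate them is by simulation. The sketch below is an assumption-laden illustration, not the notebook's code: helper names, seed, and sample size are hypothetical, tenure is clipped at zero inside the logarithm, and $\alpha_d$ is calibrated on the simulated sample. Instead of drawing $D$, it uses the identity $\mathbb{E}[\tau(X)\mid D=1] = \mathbb{E}[m(X)\tau(X)]/\mathbb{E}[m(X)]$, which avoids the extra noise from the treatment draw.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def tau(tenure, sessions, spend, prem, urban):
    return (0.40 + 0.60 * sigmoid(-0.40 * (sessions - 5.0))
            + 2.00 * prem * (spend > 120.0)
            + 0.10 * urban * (tenure < 12.0))

def g_d(tenure, sessions, spend, prem, urban):
    # ASSUMPTION: clip tenure at zero so log(1 + tenure) is defined.
    log_tenure = math.log1p(max(tenure, 0.0))
    return (1.10 * math.tanh(0.06 * (spend - 100.0))
            + 1.00 * sigmoid(0.60 * (sessions - 5.0))
            + 0.50 * log_tenure
            + 0.50 * prem + 0.25 * urban
            + 0.90 * prem * (spend > 120.0)
            + 0.30 * urban * (tenure < 12.0)
            + 0.80 * tau(tenure, sessions, spend, prem, urban))

def calibrate_intercept(scores, target=0.20, lo=-50.0, hi=50.0, tol=1e-8):
    """Bisection on alpha so that the mean propensity is close to target."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        share = sum(sigmoid(mid + s) for s in scores) / len(scores)
        if share < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = random.Random(0)
n = 20_000  # arbitrary Monte Carlo size
X = [(rng.gauss(24.0, 12.0), rng.gauss(5.0, 2.0), rng.uniform(0.0, 200.0),
      float(rng.random() < 0.25), float(rng.random() < 0.60)) for _ in range(n)]

taus = [tau(*x) for x in X]
scores = [g_d(*x) for x in X]
alpha_d = calibrate_intercept(scores)
m = [sigmoid(alpha_d + s) for s in scores]

ate = sum(taus) / n                                          # E[tau(X)]
atte = sum(mi * ti for mi, ti in zip(m, taus)) / sum(m)      # E[m tau] / E[m]
print(f"ATE ~ {ate:.3f}, ATTE ~ {atte:.3f}")
```

The ATTE exceeds the ATE because $g_d$ loads positively on $\tau(x)$ and on the premium, high-spend indicator, so the treated group is enriched with high-effect units.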

Result

Treatment share ≈ 0.2036
Ground-truth ATE from the DGP: 0.913
Ground-truth ATT from the DGP: 1.567
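As a sanity check on the reported ATE, $\mathbb{E}[\tau(X)]$ can be worked out in closed form, assuming the five confounders are drawn independently exactly as specified, with no clipping or truncation (the implementation may differ on this point):

\begin{aligned} \mathbb{E}[\tau(X)] &= 0.40 + 0.60\,\mathbb{E}\big[\sigma\big(-0.40(\text{sessions}-5)\big)\big] + 2.00\,\Pr(\text{prem}=1)\Pr(\text{spend}>120) + 0.10\,\Pr(\text{urban}=1)\Pr(\text{tenure}<12) \\ &= 0.40 + 0.60\,(0.5) + 2.00\,(0.25)(0.40) + 0.10\,(0.60)\,\Phi(-1) \\ &\approx 0.40 + 0.30 + 0.20 + 0.0095 = 0.9095 \end{aligned}

Here $\mathbb{E}[\sigma(\cdot)] = 0.5$ because sessions is symmetric around 5 and $\sigma(z)+\sigma(-z)=1$, and $\Phi(-1)\approx 0.1587$ is $\Pr(\text{tenure}<12)$ for $\mathcal N(24, 12^2)$. The result is consistent with the reported 0.913, which would differ slightly if it is computed as a sample average over the generated rows. The ATTE has no comparably simple closed form, because it weights $\tau(X)$ by the propensity $m(X)$.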

Result

	y	d	tenure_months	avg_sessions_week	spend_last_month	premium_user	urban_resident
0	0.689404	0.0	12.130544	4.056687	181.570607	0.0	0.0
1	3.045282	0.0	19.586560	1.671561	182.793598	0.0	0.0
2	7.173595	1.0	39.455103	5.452889	125.185708	1.0	1.0
3	1.926216	0.0	26.327693	5.051629	4.932905	0.0	1.0
4	1.225088	0.0	35.042771	4.933996	23.577407	0.0	0.0

Result

Real ATE = 0.913 VS Estimated = 0.9917276396749556 in (0.869543879249174, 1.1139114001007373)

Result

Real ATTE = 1.567 VS Estimated = 1.6433239376940343 in (1.4972790851652928, 1.7893687902227757)

Again, the estimated ATE and ATTE are close to the ground-truth values (0.913 and 1.567), and both confidence intervals contain them. The intervals are still fairly wide; increasing n in the DGP narrows them.