Linear and Nonlinear Data Generating Process benchmarking

Linear Data Generating Process (DGP)

Let $i=1,\dots,n$. Draw confounders:

$$
\begin{aligned}
\text{tenure}_i &\sim \mathcal N(24,\ 12^2) \\
\text{sessions}_i &\sim \mathcal N(5,\ 2^2) \\
\text{spend}_i &\sim \mathrm{Unif}(0,200) \\
\text{prem}_i &\sim \mathrm{Bernoulli}(0.25) \\
\text{urban}_i &\sim \mathrm{Bernoulli}(0.60)
\end{aligned}
$$

Stack them as

$$
X_i := \big[\ \text{tenure}_i,\ \text{sessions}_i,\ \text{spend}_i,\ \text{prem}_i,\ \text{urban}_i\ \big]^\top \in \mathbb{R}^5
$$

Treatment model

$$
m(x) \equiv \Pr(D=1\mid X=x) = \sigma\big(\alpha_d + x^\top \beta_t\big), \qquad \sigma(z)=\frac{1}{1+e^{-z}},
$$

with $\alpha_d$ calibrated (by bisection) so that $\mathbb{E}[D]\approx 0.20$, and

$$
\beta_t^\top = \big[\,0.08,\ 0.12,\ 0.004,\ 0.25,\ 0.10\,\big].
$$

Then $D_i \mid X_i \sim \mathrm{Bernoulli}\big(m(X_i)\big)$.
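The bisection step can be sketched as follows. This is a minimal standard-library illustration, not the code that produced the results in this document: the function names, the search bracket `[-50, 50]`, the seed, and the sample size are all assumptions. It relies only on the fact that the mean propensity is strictly increasing in $\alpha_d$.

```python
import math
import random

BETA_T = [0.08, 0.12, 0.004, 0.25, 0.10]  # treatment coefficients from the text


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def draw_confounders(rng):
    """One draw of X = [tenure, sessions, spend, prem, urban]."""
    return [
        rng.gauss(24, 12),
        rng.gauss(5, 2),
        rng.uniform(0, 200),
        float(rng.random() < 0.25),
        float(rng.random() < 0.60),
    ]


def calibrate_alpha_d(scores, target=0.20, lo=-50.0, hi=50.0, iters=60):
    """Bisection on alpha so that mean(sigmoid(alpha + score)) hits `target`."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        mean_p = sum(sigmoid(mid + s) for s in scores) / len(scores)
        if mean_p < target:
            lo = mid  # propensities too small: raise the intercept
        else:
            hi = mid  # propensities too large: lower the intercept
    return 0.5 * (lo + hi)


rng = random.Random(0)  # assumed seed, for reproducibility only
X = [draw_confounders(rng) for _ in range(5_000)]
scores = [sum(b * x for b, x in zip(BETA_T, row)) for row in X]

alpha_d = calibrate_alpha_d(scores)
propensity = [sigmoid(alpha_d + s) for s in scores]
D = [1 if rng.random() < p else 0 for p in propensity]

mean_propensity = sum(propensity) / len(propensity)
treated_share = sum(D) / len(D)
print(f"alpha_d={alpha_d:.3f}  mean m(X)={mean_propensity:.4f}  share treated={treated_share:.4f}")
```

The calibration pins the average propensity at 0.20 on the drawn sample; the realized treatment share then differs from 0.20 only by Bernoulli sampling noise, which is why the document reports shares such as 0.2052 rather than exactly 0.20.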


Outcome model

$$
Y_i = \alpha_y + X_i^\top \beta_y + \Theta D_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal N(0,\sigma_y^2),
$$

with $\alpha_y=0,\ \sigma_y=1,\ \Theta=0.80$, and

$$
\beta_y^\top = \big[\,0.05,\ 0.60,\ 0.005,\ 0.80,\ 0.20\,\big].
$$

Oracle nuisances (IRM) and CATE

$$
\begin{aligned}
m(x) &= \sigma\big(\alpha_d + x^\top \beta_t\big) \\
g_0(x) &= \mathbb{E}[Y \mid X=x, D=0] = \alpha_y + x^\top \beta_y \\
g_1(x) &= \mathbb{E}[Y \mid X=x, D=1] = \alpha_y + x^\top \beta_y + \Theta \\
\mathrm{CATE}(x) &= g_1(x) - g_0(x) = \Theta \quad \text{(constant)}
\end{aligned}
$$

Targets

$$
\Theta_{\text{ATE}} = \mathbb{E}\big[Y(1)-Y(0)\big] = \Theta, \qquad \Theta_{\text{ATTE}} = \mathbb{E}\big[Y(1)-Y(0)\mid D=1\big] = \Theta.
$$

So under this constant-effect DGP:

$$
\Theta_{\text{ATE}} = \Theta_{\text{ATTE}} = 0.80.
$$
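To illustrate why the confounders matter even though the effect is constant, the sketch below simulates the linear DGP and compares a naive difference in means with the doubly robust (AIPW) score evaluated at the oracle nuisances $m$, $g_0$, $g_1$ defined above. It uses only the standard library; the seed, sample size, and variable names are assumptions, and it is not the estimator code used for the results in this document.

```python
import math
import random

BETA_T = [0.08, 0.12, 0.004, 0.25, 0.10]
BETA_Y = [0.05, 0.60, 0.005, 0.80, 0.20]
THETA, ALPHA_Y, SIGMA_Y = 0.80, 0.0, 1.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def dot(coefs, row):
    return sum(c * v for c, v in zip(coefs, row))


rng = random.Random(42)  # assumed seed
n = 20_000               # assumed sample size
X = [[rng.gauss(24, 12), rng.gauss(5, 2), rng.uniform(0, 200),
      float(rng.random() < 0.25), float(rng.random() < 0.60)] for _ in range(n)]

# Calibrate alpha_d by bisection so that the mean propensity is 0.20.
scores = [dot(BETA_T, x) for x in X]
lo, hi = -50.0, 50.0
for _ in range(50):
    mid = 0.5 * (lo + hi)
    if sum(sigmoid(mid + s) for s in scores) / n < 0.20:
        lo = mid
    else:
        hi = mid
alpha_d = 0.5 * (lo + hi)

# Oracle nuisances and simulated data.
m = [sigmoid(alpha_d + s) for s in scores]
g0 = [ALPHA_Y + dot(BETA_Y, x) for x in X]
g1 = [v + THETA for v in g0]
D = [1 if rng.random() < p else 0 for p in m]
Y = [g0[i] + THETA * D[i] + rng.gauss(0, SIGMA_Y) for i in range(n)]

# Naive contrast: ignores that treated units differ in X.
n1 = sum(D)
naive = (sum(y for y, d in zip(Y, D) if d) / n1
         - sum(y for y, d in zip(Y, D) if not d) / (n - n1))

# AIPW score with oracle nuisances: g1 - g0 + D(Y-g1)/m - (1-D)(Y-g0)/(1-m).
psi = [g1[i] - g0[i]
       + D[i] * (Y[i] - g1[i]) / m[i]
       - (1 - D[i]) * (Y[i] - g0[i]) / (1 - m[i]) for i in range(n)]
aipw = sum(psi) / n

print(f"true ATE={THETA:.2f}  naive={naive:.3f}  oracle AIPW={aipw:.3f}")
```

Because $\beta_t$ and $\beta_y$ are both positive in every coordinate, treated units tend to have higher baseline outcomes, so the naive contrast should overshoot $\Theta$, while the oracle AIPW average should land near 0.80 up to sampling noise.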

Let's generate the data

Result

Treatment share ≈ 0.2052
Ground-truth ATE from the DGP: 0.800
Ground-truth ATT from the DGP: 0.800

Wrap it in a CausalData object

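Result

| | y | d | tenure_months | avg_sessions_week | spend_last_month | premium_user | urban_resident |
|---|---|---|---|---|---|---|---|
| 0 | 1.903910 | 0.0 | 12.130544 | 4.056687 | 181.570607 | 0.0 | 0.0 |
| 1 | 3.388144 | 0.0 | 19.586560 | 1.671561 | 182.793598 | 0.0 | 0.0 |
| 2 | 8.456512 | 1.0 | 39.455103 | 5.452889 | 125.185708 | 1.0 | 1.0 |
| 3 | 5.535970 | 1.0 | 26.327693 | 5.051629 | 4.932905 | 0.0 | 1.0 |
| 4 | 4.965140 | 1.0 | 35.042771 | 4.933996 | 23.577407 | 0.0 | 0.0 |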

Estimate ATE and ATTE

Result

Real ATE = 0.8 VS Estimated = 0.7506313322797796 in (0.6668061755942251, 0.8344564889653342)

Result

Real ATTE = 0.8 VS Estimated = 0.8238669785039335 in (0.7558326760362799, 0.8919012809715872)

The estimated ATE and ATTE are close to the ground-truth values, and both confidence intervals cover 0.80. The intervals are fairly wide, however (half-widths of roughly 0.08 for the ATE and 0.07 for the ATTE). Adding more data narrows them; you can verify this by increasing n in the DGP.
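As a rough guide to how much narrower the intervals get, the half-width of a confidence interval from a root-n-consistent estimator shrinks approximately like $1/\sqrt{n}$. The helper below (a hypothetical name, not part of any library) applies that rule of thumb to the ATE interval reported above; since n is not stated in the text, only the ratio of sample sizes is used.

```python
import math


def projected_half_width(half_width, n_old, n_new):
    """Rule of thumb: CI half-width scales like 1/sqrt(n).

    Assumes the per-observation variance of the estimator's score
    stays the same when the sample grows.
    """
    return half_width * math.sqrt(n_old / n_new)


# ATE confidence interval reported above for the linear DGP.
ci_low, ci_high = 0.6668061755942251, 0.8344564889653342
half_width = (ci_high - ci_low) / 2

for factor in (1, 4, 16):
    projected = projected_half_width(half_width, 1, factor)
    print(f"n x {factor:>2}: projected half-width = {projected:.4f}")
```

Under this approximation, halving the interval width requires about four times as many observations.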

Nonlinear Data Generating Process (DGP)

Let $i=1,\dots,n$. Draw confounders:

$$
\begin{aligned}
\text{tenure}_i &\sim \mathcal N(24,\ 12^2) \\
\text{sessions}_i &\sim \mathcal N(5,\ 2^2) \\
\text{spend}_i &\sim \mathrm{Unif}(0,200) \\
\text{prem}_i &\sim \mathrm{Bernoulli}(0.25) \\
\text{urban}_i &\sim \mathrm{Bernoulli}(0.60)
\end{aligned}
$$

Stack them as

$$
X_i := \big[\ \text{tenure}_i,\ \text{sessions}_i,\ \text{spend}_i,\ \text{prem}_i,\ \text{urban}_i\ \big]^\top \in \mathbb{R}^5
$$

Treatment model

Define

$$
m(x) \equiv \Pr(D=1\mid X=x) = \sigma\big(\alpha_d + g_d(x)\big), \qquad \sigma(z)=\frac{1}{1+e^{-z}},
$$

with $\alpha_d$ calibrated (by bisection) so that $\mathbb{E}[D]\approx 0.20$.

The nonlinear score includes alignment with the treatment effect $\tau(x)$:

$$
\boxed{
\begin{aligned}
g_d(x) ={}& 1.10\tanh\big(0.06(\text{spend}-100)\big) + 1.00\,\sigma\big(0.60(\text{sessions}-5)\big) + 0.50\log\big(1+\text{tenure}\big) \\
&+ 0.50\,\text{prem} + 0.25\,\text{urban} + 0.90\,\text{prem}\cdot\mathbb{1}\{\text{spend}>120\} \\
&+ 0.30\,\text{urban}\cdot\mathbb{1}\{\text{tenure}<12\} + 0.80\,\tau(x)
\end{aligned}
}
$$

where $z_+ = \max(z,0)$ and $\mathbb{1}\{\cdot\}$ is the indicator function.

Then

$$
D_i \mid X_i \sim \mathrm{Bernoulli}\big(m(X_i)\big)
$$

Outcome model

$$
Y_i = \alpha_y + g_y(X_i) + D_i\,\tau(X_i) + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal N(0,\sigma_y^2),
$$

with $\alpha_y=0,\ \sigma_y=1$.

The baseline outcome component is nonlinear:

$$
\boxed{
\begin{aligned}
g_y(x) ={}& 0.70\tanh\big(0.03(\text{spend}-80)\big) + 0.50\sqrt{\text{sessions}} + 0.40\log\big(1+\text{tenure}\big) \\
&+ 0.30\,\text{prem} + 0.10\,\text{urban} - 0.10\,\mathbb{1}\{\text{spend}<20\}
\end{aligned}
}
$$

The heterogeneous treatment effect (CATE) is

$$
\boxed{
\begin{aligned}
\tau(x) ={}& 0.40 + 0.60\,\sigma\big(-0.40(\text{sessions}-5)\big) + 2.00\,\text{prem}\cdot\mathbb{1}\{\text{spend}>120\} \\
&+ 0.10\,\text{urban}\cdot\mathbb{1}\{\text{tenure}<12\}
\end{aligned}
}
$$
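The three nonlinear components translate directly into plain functions. The sketch below is one possible transcription, not the generator used for the results in this document. One point is an explicit assumption: normal draws of tenure and sessions can be negative, and the text does not say how that case is handled, so the sketch clips them at zero inside `log` and `sqrt` to keep the functions defined.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def tau(tenure, sessions, spend, prem, urban):
    """Heterogeneous treatment effect (CATE) from the text."""
    return (0.40
            + 0.60 * sigmoid(-0.40 * (sessions - 5))
            + 2.00 * prem * (spend > 120)
            + 0.10 * urban * (tenure < 12))


def g_d(tenure, sessions, spend, prem, urban):
    """Nonlinear treatment score; enters the propensity as sigmoid(alpha_d + g_d)."""
    return (1.10 * math.tanh(0.06 * (spend - 100))
            + 1.00 * sigmoid(0.60 * (sessions - 5))
            + 0.50 * math.log1p(max(tenure, 0.0))  # assumption: clip tenure at 0
            + 0.50 * prem
            + 0.25 * urban
            + 0.90 * prem * (spend > 120)
            + 0.30 * urban * (tenure < 12)
            + 0.80 * tau(tenure, sessions, spend, prem, urban))


def g_y(tenure, sessions, spend, prem, urban):
    """Nonlinear baseline outcome component."""
    return (0.70 * math.tanh(0.03 * (spend - 80))
            + 0.50 * math.sqrt(max(sessions, 0.0))  # assumption: clip sessions at 0
            + 0.40 * math.log1p(max(tenure, 0.0))   # assumption: clip tenure at 0
            + 0.30 * prem
            + 0.10 * urban
            - 0.10 * (spend < 20))


# A premium user with high spend versus a non-premium user at average activity.
high = tau(tenure=24, sessions=5, spend=150, prem=1, urban=0)
base = tau(tenure=24, sessions=5, spend=100, prem=0, urban=0)
print(f"tau(premium, spend=150) = {high:.2f}, tau(baseline) = {base:.2f}")
```

The `prem * (spend > 120)` term dominates the heterogeneity: it adds 2.00 to the effect, and because the same indicator also raises `g_d`, the units with the largest effects are also the most likely to be treated.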

Oracle nuisances (IRM) and CATE

$$
\begin{aligned}
m(x) &= \sigma\big(\alpha_d + g_d(x)\big) \\
g_0(x) &= \mathbb{E}[Y \mid X=x, D=0] = \alpha_y + g_y(x) \\
g_1(x) &= \mathbb{E}[Y \mid X=x, D=1] = \alpha_y + g_y(x) + \tau(x) \\
\mathrm{CATE}(x) &= g_1(x) - g_0(x) = \tau(x)
\end{aligned}
$$

Targets

$$
\Theta_{\text{ATE}} = \mathbb{E}\big[\tau(X)\big], \qquad \Theta_{\text{ATTE}} = \mathbb{E}\big[\tau(X)\mid D=1\big].
$$

Result

Treatment share ≈ 0.2036
Ground-truth ATE from the DGP: 0.913
Ground-truth ATT from the DGP: 1.567
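These ground-truth values can be cross-checked independently. Because the confounders are drawn independently, the population ATE has a closed form: $\mathbb{E}[\sigma(-0.40(\text{sessions}-5))] = 0.5$ by symmetry of sessions around 5, $\Pr(\text{spend}>120)=0.4$, and $\Pr(\text{tenure}<12)=\Phi(-1)$. The ATTE has no simple closed form, but it equals $\mathbb{E}[m(X)\tau(X)]/\mathbb{E}[m(X)]$, which the sketch below approximates by Monte Carlo. The seed, sample size, and the clipping of negative tenure inside the logarithm are assumptions, so the numbers need not match the reported sample values exactly.

```python
import math
import random
from statistics import NormalDist


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def tau(tenure, sessions, spend, prem, urban):
    return (0.40 + 0.60 * sigmoid(-0.40 * (sessions - 5))
            + 2.00 * prem * (spend > 120) + 0.10 * urban * (tenure < 12))


def g_d(tenure, sessions, spend, prem, urban):
    return (1.10 * math.tanh(0.06 * (spend - 100))
            + 1.00 * sigmoid(0.60 * (sessions - 5))
            + 0.50 * math.log1p(max(tenure, 0.0))  # assumption: clip tenure at 0
            + 0.50 * prem + 0.25 * urban
            + 0.90 * prem * (spend > 120) + 0.30 * urban * (tenure < 12)
            + 0.80 * tau(tenure, sessions, spend, prem, urban))


# Closed-form population ATE, term by term.
p_high_spend = (200 - 120) / 200                  # P(spend > 120) under Unif(0, 200)
p_short_tenure = NormalDist(24, 12).cdf(12)       # P(tenure < 12) = Phi(-1)
ate_closed = (0.40 + 0.60 * 0.5
              + 2.00 * 0.25 * p_high_spend
              + 0.10 * 0.60 * p_short_tenure)

# Monte Carlo over the confounder distribution.
rng = random.Random(7)  # assumed seed
n = 20_000              # assumed sample size
X = [(rng.gauss(24, 12), rng.gauss(5, 2), rng.uniform(0, 200),
      float(rng.random() < 0.25), float(rng.random() < 0.60)) for _ in range(n)]
taus = [tau(*x) for x in X]
scores = [g_d(*x) for x in X]

# Calibrate alpha_d by bisection so that the mean propensity is 0.20.
lo, hi = -50.0, 50.0
for _ in range(50):
    mid = 0.5 * (lo + hi)
    if sum(sigmoid(mid + s) for s in scores) / n < 0.20:
        lo = mid
    else:
        hi = mid
alpha_d = 0.5 * (lo + hi)
m = [sigmoid(alpha_d + s) for s in scores]

ate_mc = sum(taus) / n
atte_mc = sum(p * t for p, t in zip(m, taus)) / sum(m)  # E[m * tau] / E[m]
print(f"ATE closed form={ate_closed:.4f}  ATE MC={ate_mc:.4f}  ATTE MC={atte_mc:.4f}")
```

The closed-form ATE is about 0.910, in line with the sample-based 0.913 reported above. The ATTE is much larger than the ATE because the treatment score $g_d$ loads on $\tau(x)$ and on the same premium-and-high-spend indicator that drives the largest effects.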

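Result

| | y | d | tenure_months | avg_sessions_week | spend_last_month | premium_user | urban_resident |
|---|---|---|---|---|---|---|---|
| 0 | 0.689404 | 0.0 | 12.130544 | 4.056687 | 181.570607 | 0.0 | 0.0 |
| 1 | 3.045282 | 0.0 | 19.586560 | 1.671561 | 182.793598 | 0.0 | 0.0 |
| 2 | 7.173595 | 1.0 | 39.455103 | 5.452889 | 125.185708 | 1.0 | 1.0 |
| 3 | 1.926216 | 0.0 | 26.327693 | 5.051629 | 4.932905 | 0.0 | 1.0 |
| 4 | 1.225088 | 0.0 | 35.042771 | 4.933996 | 23.577407 | 0.0 | 0.0 |

Result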

Real ATE = 0.913 VS Estimated = 0.9917276396749556 in (0.869543879249174, 1.1139114001007373)

Result

Real ATTE = 1.567 VS Estimated = 1.6433239376940343 in (1.4972790851652928, 1.7893687902227757)

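The estimated ATE and ATTE are again close to the ground-truth values, and both confidence intervals cover the truth (0.913 and 1.567). The intervals are fairly wide, however (half-widths of roughly 0.12 for the ATE and 0.15 for the ATTE). Adding more data narrows them; you can verify this by increasing n in the DGP.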