generate_obs_hte_26()

Mathematical Specification of `generate_obs_hte_26()`

The `generate_obs_hte_26()` function generates an observational dataset with nonlinear outcome and treatment assignment mechanisms and heterogeneous treatment effects. The data-generating process (DGP) is defined as follows:

1. Confounders ( $X$ )

The dataset contains five confounders $X = (X_1, X_2, X_3, X_4, X_5)^T$ :

$X_1$ : tenure_months
$X_2$ : avg_sessions_week
$X_3$ : spend_last_month
$X_4$ : premium_user
$X_5$ : urban_resident

Base Features Sampling: The base features $X_1, X_2, X_3, X_5$ are sampled using a Gaussian copula, which introduces correlations while preserving the specified marginal distributions. The correlation matrix for the underlying Gaussian variables (rows and columns in the order $X_1, X_2, X_3, X_5$ ) is: $\Sigma = \begin{pmatrix} 1.0 & 0.3 & 0.2 & 0.0 \\ 0.3 & 1.0 & 0.4 & 0.0 \\ 0.2 & 0.4 & 1.0 & 0.0 \\ 0.0 & 0.0 & 0.0 & 1.0 \end{pmatrix}$

The marginal distributions are:

$X_1 \sim \text{Lognormal}(\mu = \ln 24, \sigma = 0.6)$ , clipped at $[0, 120]$ .
$X_2 \sim \text{NegativeBinomial}(\text{mean} = 5, \text{dispersion} = 0.5)$ , clipped at $[0, 40]$ .
$X_3 \sim \text{Lognormal}(\mu = \ln 60, \sigma = 0.9)$ , clipped at $[0, 500]$ .
$X_5 \sim \text{Bernoulli}(p = 0.60)$ .
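The copula step above can be sketched as follows. This is a minimal illustration, not the function's actual implementation: the name `sample_base_features` is hypothetical, SciPy is assumed to be available, and the negative binomial is assumed to be parameterized so that $\text{Var} = \mu + \alpha \mu^2$ with dispersion $\alpha = 0.5$ (the specification does not say which convention is meant).

```python
import numpy as np
from scipy import stats

# Correlation matrix of the latent Gaussians, ordered (X1, X2, X3, X5).
SIGMA = np.array([
    [1.0, 0.3, 0.2, 0.0],
    [0.3, 1.0, 0.4, 0.0],
    [0.2, 0.4, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])


def sample_base_features(n, seed=0, nb_mean=5.0, nb_dispersion=0.5):
    """Hypothetical helper: draw (X1, X2, X3, X5) through a Gaussian copula."""
    rng = np.random.default_rng(seed)
    # 1) Correlated standard normals.
    z = rng.multivariate_normal(np.zeros(4), SIGMA, size=n)
    # 2) Map each column to Uniform(0, 1) with the normal CDF.
    u = stats.norm.cdf(z)
    # 3) Push the uniforms through each marginal's inverse CDF, then clip.
    x1 = np.clip(stats.lognorm.ppf(u[:, 0], s=0.6, scale=24.0), 0, 120)
    # ASSUMPTION: Var = mean + dispersion * mean**2, so size = 1 / dispersion.
    size = 1.0 / nb_dispersion
    p = size / (size + nb_mean)
    x2 = np.clip(stats.nbinom.ppf(u[:, 1], size, p), 0, 40)
    x3 = np.clip(stats.lognorm.ppf(u[:, 2], s=0.9, scale=60.0), 0, 500)
    x5 = stats.bernoulli.ppf(u[:, 3], 0.60)
    return x1, x2, x3, x5
```

Because the fourth row and column of $\Sigma$ are zero off the diagonal, $X_5$ comes out independent of the other base features in this sketch.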

Derived Feature ( $X_4$ ): The feature $X_4$ (premium_user) is drawn from a logistic model of the other features:

$\text{logit}(P(X_4 = 1 \mid X_1, X_2, X_3)) = -5.0 + 0.7 \ln(1 + X_2) + 0.5 \ln(1 + X_3) + 0.01 X_1$

$X_4 \sim \text{Bernoulli}(P(X_4 = 1 \mid X_1, X_2, X_3))$
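As an illustration of the logistic model for $X_4$, the sketch below evaluates the probability and draws the indicator. The function names `premium_probability` and `sample_premium_user` are hypothetical and chosen only for this example.

```python
import numpy as np


def premium_probability(x1, x2, x3):
    """P(X4 = 1) from the logistic model: sigmoid of the linear predictor."""
    logit = -5.0 + 0.7 * np.log1p(x2) + 0.5 * np.log1p(x3) + 0.01 * x1
    return 1.0 / (1.0 + np.exp(-logit))


def sample_premium_user(x1, x2, x3, seed=0):
    """Draw X4 ~ Bernoulli(p) elementwise (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    return rng.binomial(1, premium_probability(x1, x2, x3))
```

For a user near the marginal medians ( $X_1 = 24$, $X_2 = 5$, $X_3 = 60$ ) the linear predictor is about $-1.45$, which gives a premium probability of roughly 0.19.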

2. Treatment Assignment ( $D$ )

The treatment $D \in \{0, 1\}$ is assigned with probability given by the propensity score $e(X) = P(D = 1 \mid X)$ : $e(X) = \sigma\left( \alpha_d + \text{bound}\left( f_d(X), 2.0 \right) \right)$ where:

$\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function.
$\text{bound}(z, c) = c \cdot \tanh(z/c)$ smoothly squashes the score into $(-c, c)$ , which keeps the propensity score away from 0 and 1 and so supports positivity (here $c=2.0$ ).
$\alpha_d$ is a calibration constant chosen such that the overall treatment rate is approximately 35%.
The score $f_d(X)$ is defined as: $f_d(X) = 0.005 X_1 + 0.8 X_4 + 0.25 X_5 + 0.8 \tanh\left(\ln\frac{1+X_3}{61}\right) + 0.2 \ln\frac{1+X_2}{6} \tanh\left(\frac{X_1}{24} - 1\right) + 0.4 X_4 (X_5 - 0.5)$
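The propensity construction can be sketched as below. The specification does not say how $\alpha_d$ is calibrated, so the bisection search in `calibrate_alpha` is an assumption made for illustration; all function names here are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bound(z, c=2.0):
    """Smoothly squash z into (-c, c)."""
    return c * np.tanh(z / c)


def f_d(x1, x2, x3, x4, x5):
    """Treatment score f_d(X) as written in the specification."""
    return (
        0.005 * x1
        + 0.8 * x4
        + 0.25 * x5
        + 0.8 * np.tanh(np.log((1 + x3) / 61))
        + 0.2 * np.log((1 + x2) / 6) * np.tanh(x1 / 24 - 1)
        + 0.4 * x4 * (x5 - 0.5)
    )


def calibrate_alpha(bounded_score, target=0.35, lo=-10.0, hi=10.0, iters=60):
    """ASSUMPTION: find alpha_d by bisection so the mean propensity hits target.

    The mean of sigmoid(alpha + score) is increasing in alpha, so bisection works.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sigmoid(mid + bounded_score).mean() < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Given covariate arrays, one would compute `s = bound(f_d(x1, x2, x3, x4, x5))`, then `alpha = calibrate_alpha(s)`, and finally draw the treatment with probability `sigmoid(alpha + s)`.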

3. Heterogeneous Treatment Effect ( $\tau(X)$ )

The treatment effect (CATE) is nonlinear and depends on the confounders: $\tau(X) = \text{clip}\left( 1.2 + 0.6 \tanh\left(\ln\frac{1+X_2}{6}\right) + 0.4 X_4 - 0.5 \tanh\left(\frac{X_1}{48}\right) + 0.2 X_5 \tanh\left(\ln\frac{1+X_3}{61}\right), 0.1, 3.0 \right)$
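A direct transcription of the CATE formula into NumPy is shown below as a minimal sketch; the function name `cate` is hypothetical.

```python
import numpy as np


def cate(x1, x2, x3, x4, x5):
    """tau(X): nonlinear treatment effect, clipped to [0.1, 3.0]."""
    raw = (
        1.2
        + 0.6 * np.tanh(np.log((1 + x2) / 6))
        + 0.4 * x4
        - 0.5 * np.tanh(x1 / 48)
        + 0.2 * x5 * np.tanh(np.log((1 + x3) / 61))
    )
    return np.clip(raw, 0.1, 3.0)
```

Since $X_1 \ge 0$, $X_4, X_5 \in \{0, 1\}$, and $|\tanh(\cdot)| < 1$, the unclipped expression already lies strictly between 0.1 and 2.4, so with these coefficients the clip acts only as a safeguard.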

4. Outcome Model ( $Y$ )

The outcome is a continuous variable generated as:

$Y = f_y(X) + D \cdot \tau(X) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, 3.5^2)$

The baseline outcome function $f_y(X)$ includes linear and nonlinear terms:

$f_y(X) = 0.01 X_1 + 1.2 X_4 + 0.6 X_5 + g_y(X)$

where:

$g_y(X) = 1.5 \tanh\left(\frac{X_1}{24}\right) + 0.5 \left(\ln\frac{1+X_2}{6}\right)^2 + 0.2 \ln\frac{1+X_3}{61} \ln\frac{1+X_2}{6} + 0.5 X_4 \ln\frac{1+X_2}{6} + 0.8 X_5 \tanh\left(\frac{1}{2} \ln\frac{1+X_3}{61}\right)$
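The outcome equations can be sketched in NumPy as follows. The names `baseline` and `sample_outcome` are hypothetical, and the treatment indicator and effect are passed in as plain arrays so the block stays self-contained.

```python
import numpy as np


def baseline(x1, x2, x3, x4, x5):
    """f_y(X) = linear part + g_y(X)."""
    l2 = np.log((1 + x2) / 6)    # centered log of sessions
    l3 = np.log((1 + x3) / 61)   # centered log of spend
    g_y = (
        1.5 * np.tanh(x1 / 24)
        + 0.5 * l2 ** 2
        + 0.2 * l3 * l2
        + 0.5 * x4 * l2
        + 0.8 * x5 * np.tanh(0.5 * l3)
    )
    return 0.01 * x1 + 1.2 * x4 + 0.6 * x5 + g_y


def sample_outcome(f_y, tau, d, seed=0, noise_sd=3.5):
    """Y = f_y(X) + D * tau(X) + eps, with eps ~ N(0, noise_sd**2)."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, noise_sd, size=np.shape(f_y))
    return f_y + d * tau + eps
```

Note that the noise standard deviation of 3.5 is large relative to treatment effects in $[0.1, 3.0]$, so individual outcomes are dominated by noise and effects are only recoverable in aggregate.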
