
generate_obs_hte_26()

Automated conversion of generate_obs_hte_26.ipynb

Mathematical Specification of generate_obs_hte_26()

The generate_obs_hte_26() function generates an observational dataset with nonlinear outcome and treatment assignment mechanisms, along with heterogeneous treatment effects. The data generation process (DGP) is defined as follows:

1. Confounders ($X$)

The dataset contains five confounders $X = (X_1, X_2, X_3, X_4, X_5)^T$:

  • $X_1$: tenure_months
  • $X_2$: avg_sessions_week
  • $X_3$: spend_last_month
  • $X_4$: premium_user
  • $X_5$: urban_resident

Base Features Sampling: The base features $X_1, X_2, X_3, X_5$ are sampled using a Gaussian copula to introduce correlations while preserving specific marginal distributions. The correlation matrix for the underlying Gaussian variables (rows and columns ordered $X_1, X_2, X_3, X_5$) is:

$$\Sigma = \begin{pmatrix} 1.0 & 0.3 & 0.2 & 0.0 \\ 0.3 & 1.0 & 0.4 & 0.0 \\ 0.2 & 0.4 & 1.0 & 0.0 \\ 0.0 & 0.0 & 0.0 & 1.0 \end{pmatrix}$$

The marginal distributions are:

  • $X_1 \sim \text{Lognormal}(\mu = \ln 24, \sigma = 0.6)$, clipped to $[0, 120]$.
  • $X_2 \sim \text{NegativeBinomial}(\text{mean} = 5, \text{dispersion} = 0.5)$, clipped to $[0, 40]$.
  • $X_3 \sim \text{Lognormal}(\mu = \ln 60, \sigma = 0.9)$, clipped to $[0, 500]$.
  • $X_5 \sim \text{Bernoulli}(p = 0.60)$.
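The copula step above can be sketched with NumPy alone. This is an illustrative sketch, not the notebook's code: the names `sample_base_features`, `normal_cdf`, and `nbinom_ppf` are hypothetical, and the negative binomial parameterization is an assumption (dispersion is read as $\alpha$ in $\text{Var} = \text{mean} + \alpha \cdot \text{mean}^2$, which may differ from the original).

```python
import math
import numpy as np

# Correlation matrix of the latent Gaussians, ordered (X1, X2, X3, X5).
SIGMA = np.array([
    [1.0, 0.3, 0.2, 0.0],
    [0.3, 1.0, 0.4, 0.0],
    [0.2, 0.4, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])

_erf = np.vectorize(math.erf)


def normal_cdf(z):
    """Standard normal CDF, mapping latent Gaussians to uniforms."""
    return 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))


def nbinom_ppf(u, mean, alpha, k_max=2000):
    """Inverse CDF of a negative binomial.

    ASSUMPTION: Var = mean + alpha * mean**2, so size r = 1 / alpha.
    """
    r = 1.0 / alpha
    p = r / (r + mean)
    k = np.arange(k_max - 1)
    # pmf(0) = p**r and pmf(k + 1) = pmf(k) * (k + r) / (k + 1) * (1 - p)
    ratios = (k + r) / (k + 1.0) * (1.0 - p)
    pmf = np.concatenate(([p ** r], p ** r * np.cumprod(ratios)))
    cdf = np.cumsum(pmf)
    # Smallest count whose CDF value reaches u.
    return np.searchsorted(cdf, u, side="left")


def sample_base_features(n, seed=0):
    """Draw (X1, X2, X3, X5) through a Gaussian copula."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(4), SIGMA, size=n)
    u = normal_cdf(z)
    # A lognormal is the exponential of a Gaussian, so the latent
    # variable can be transformed directly without going through u.
    x1 = np.clip(np.exp(math.log(24) + 0.6 * z[:, 0]), 0, 120)
    x2 = np.clip(nbinom_ppf(u[:, 1], mean=5.0, alpha=0.5), 0, 40)
    x3 = np.clip(np.exp(math.log(60) + 0.9 * z[:, 2]), 0, 500)
    x5 = (u[:, 3] < 0.60).astype(int)
    return x1, x2, x3, x5
```

Because each marginal is obtained by a monotone transform of its latent Gaussian, the dependence structure encoded in $\Sigma$ carries over to the features while the marginals stay as specified (up to clipping).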

Derived Feature ($X_4$): The feature $X_4$ (premium_user) is generated from a logistic model of the other features:

$$\text{logit}(P(X_4 = 1 \mid X)) = -5.0 + 0.7 \ln(1 + X_2) + 0.5 \ln(1 + X_3) + 0.01 X_1$$

$$X_4 \sim \text{Bernoulli}(P(X_4 = 1 \mid X))$$
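As a rough illustration, the logistic model for $X_4$ could be coded as follows. The function name `sample_premium_user` and its signature are hypothetical; only the coefficients come from the specification.

```python
import numpy as np


def sample_premium_user(x1, x2, x3, rng):
    """Draw X4 ~ Bernoulli(p), with p from the logistic model in the text."""
    logit = -5.0 + 0.7 * np.log1p(x2) + 0.5 * np.log1p(x3) + 0.01 * x1
    p = 1.0 / (1.0 + np.exp(-logit))
    x4 = (rng.random(p.shape) < p).astype(int)
    return x4, p


rng = np.random.default_rng(0)
x4, p = sample_premium_user(
    np.array([24.0]), np.array([5.0]), np.array([60.0]), rng
)
```

For a user at the reference values $X_1 = 24$, $X_2 = 5$, $X_3 = 60$, the premium probability comes out at about 0.19.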

2. Treatment Assignment ($D$)

The treatment $D \in \{0, 1\}$ is assigned using a propensity score $e(X)$:

$$P(D = 1 \mid X) = \sigma\left( \alpha_d + \text{bound}\left( f_d(X), 2.0 \right) \right)$$

where:

  • $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function.
  • $\text{bound}(z, c) = c \cdot \tanh(z/c)$ smoothly limits the score to $(-c, c)$, which keeps the propensity away from 0 and 1 and so ensures positivity (here $c = 2.0$).
  • $\alpha_d$ is a calibration constant chosen such that the overall treatment rate is approximately 35%.
  • The score $f_d(X)$ is defined as:

$$f_d(X) = 0.005 X_1 + 0.8 X_4 + 0.25 X_5 + 0.8 \tanh\left(\ln\frac{1+X_3}{61}\right) + 0.2 \ln\frac{1+X_2}{6} \tanh\left(\frac{X_1}{24} - 1\right) + 0.4 X_4 (X_5 - 0.5)$$
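The propensity model can be sketched as below. The specification does not say how $\alpha_d$ is calibrated, so the bisection in `calibrate_alpha` is an assumption, as are all function names; any root finder on the mean propensity would serve the same purpose.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bound(z, c=2.0):
    """Smoothly squash z into (-c, c)."""
    return c * np.tanh(z / c)


def f_d(x1, x2, x3, x4, x5):
    """Treatment score f_d(X) from the specification."""
    return (
        0.005 * x1 + 0.8 * x4 + 0.25 * x5
        + 0.8 * np.tanh(np.log((1 + x3) / 61))
        + 0.2 * np.log((1 + x2) / 6) * np.tanh(x1 / 24 - 1)
        + 0.4 * x4 * (x5 - 0.5)
    )


def calibrate_alpha(score, target=0.35, lo=-10.0, hi=10.0, iters=60):
    """ASSUMED method: bisection on alpha so the mean propensity hits target.

    Works because the mean propensity is increasing in alpha.
    """
    b = bound(score)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sigmoid(mid + b).mean() < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def propensity(score, alpha):
    return sigmoid(alpha + bound(score))
```

Since the bounded score lies in $(-2, 2)$, every propensity falls strictly between $\sigma(\alpha_d - 2)$ and $\sigma(\alpha_d + 2)$. Treatment can then be drawn as `rng.random(n) < propensity(score, alpha)`.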

3. Heterogeneous Treatment Effect ($\tau(X)$)

The treatment effect (CATE) is nonlinear and depends on the confounders:

$$\tau(X) = \text{clip}\left( 1.2 + 0.6 \tanh\left(\ln\frac{1+X_2}{6}\right) + 0.4 X_4 - 0.5 \tanh\left(\frac{X_1}{48}\right) + 0.2 X_5 \tanh\left(\ln\frac{1+X_3}{61}\right), 0.1, 3.0 \right)$$
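A direct transcription of the CATE formula might look like this. The function name `tau` and the argument order are illustrative choices, not taken from the notebook.

```python
import numpy as np


def tau(x1, x2, x3, x4, x5):
    """Conditional average treatment effect, clipped to [0.1, 3.0]."""
    raw = (
        1.2
        + 0.6 * np.tanh(np.log((1 + x2) / 6))
        + 0.4 * x4
        - 0.5 * np.tanh(x1 / 48)
        + 0.2 * x5 * np.tanh(np.log((1 + x3) / 61))
    )
    return np.clip(raw, 0.1, 3.0)


# Reference user: X1 = 24, X2 = 5, X3 = 60, non-premium, non-urban.
effect = tau(24.0, 5.0, 60.0, 0.0, 0.0)
```

The unclipped expression lies in $(-0.1, 2.4)$, so the upper limit of 3.0 never binds; the lower limit of 0.1 applies only to extreme profiles such as a long-tenured urban non-premium user with no sessions and no spend.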

4. Outcome Model ($Y$)

The outcome is a continuous variable generated as:

$$Y = f_y(X) + D \cdot \tau(X) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, 3.5^2)$$

The baseline outcome function $f_y(X)$ includes linear and nonlinear terms:

$$f_y(X) = 0.01 X_1 + 1.2 X_4 + 0.6 X_5 + g_y(X)$$

where:

$$g_y(X) = 1.5 \tanh\left(\frac{X_1}{24}\right) + 0.5 \left(\ln\frac{1+X_2}{6}\right)^2 + 0.2 \ln\frac{1+X_3}{61} \ln\frac{1+X_2}{6} + 0.5 X_4 \ln\frac{1+X_2}{6} + 0.8 X_5 \tanh\left(\frac{1}{2} \ln\frac{1+X_3}{61}\right)$$
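The outcome equations could be put together along these lines. The helper names `f_y` and `simulate_outcome` are hypothetical, and the treatment indicator and CATE values are passed in as plain arguments so that the sketch stays self-contained.

```python
import numpy as np


def f_y(x1, x2, x3, x4, x5):
    """Baseline outcome: linear terms plus the nonlinear part g_y(X)."""
    l2 = np.log((1 + x2) / 6)
    l3 = np.log((1 + x3) / 61)
    g = (
        1.5 * np.tanh(x1 / 24)
        + 0.5 * l2 ** 2
        + 0.2 * l3 * l2
        + 0.5 * x4 * l2
        + 0.8 * x5 * np.tanh(0.5 * l3)
    )
    return 0.01 * x1 + 1.2 * x4 + 0.6 * x5 + g


def simulate_outcome(baseline, d, cate, rng, noise_sd=3.5):
    """Y = f_y(X) + D * tau(X) + eps, with eps ~ N(0, noise_sd**2)."""
    eps = rng.normal(0.0, noise_sd, size=np.shape(baseline))
    return baseline + d * cate + eps
```

With a noise standard deviation of 3.5 and treatment effects capped at 3.0, individual outcomes are dominated by noise, so recovering $\tau(X)$ from such data requires averaging over many units.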