generate_obs_hte_binary_26()

Automated conversion of generate_obs_hte_binary_26.ipynb

The generate_obs_hte_binary_26() function generates an observational dataset with binary outcomes, nonlinear confounding, and heterogeneous treatment effects. It uses 11 confounders, six of which are derived from behavior-linked models, and a target treatment rate of 15%.

1. Confounders ($X$)

The dataset contains 11 confounders $X = (X_1, \dots, X_{11})^T$:

  • $X_1$: tenure_months
  • $X_2$: avg_sessions_week
  • $X_3$: spend_last_month
  • $X_4$: age_years
  • $X_5$: prior_purchases_12m
  • $X_6$: support_tickets_90d
  • $X_7$: premium_user
  • $X_8$: mobile_user
  • $X_9$: weekend_user
  • $X_{10}$: email_opt_in
  • $X_{11}$: referred_user

Base Features Sampling: The base features $X_1, X_2, X_3, X_4, X_9$ are sampled using a Gaussian copula with correlation matrix:

$$
\Sigma = \begin{pmatrix}
1.00 & 0.25 & 0.20 & 0.25 & 0.05 \\
0.25 & 1.00 & 0.45 & -0.20 & 0.30 \\
0.20 & 0.45 & 1.00 & -0.10 & 0.20 \\
0.25 & -0.20 & -0.10 & 1.00 & -0.15 \\
0.05 & 0.30 & 0.20 & -0.15 & 1.00
\end{pmatrix}
$$
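As a minimal sketch of the copula step, the snippet below draws correlated standard normals with the matrix above and maps them to uniforms through the normal CDF. It assumes the rows and columns follow the order in which the base features are listed. The marginal distributions that turn these uniforms into tenure, sessions, spend, age, and the weekend flag are not given here, so they are left out. The function names and the seed are illustrative, not the library's API.

```python
import math
import random

# Correlation matrix from the text; ordering (X1, X2, X3, X4, X9) is assumed.
SIGMA = [
    [1.00, 0.25, 0.20, 0.25, 0.05],
    [0.25, 1.00, 0.45, -0.20, 0.30],
    [0.20, 0.45, 1.00, -0.10, 0.20],
    [0.25, -0.20, -0.10, 1.00, -0.15],
    [0.05, 0.30, 0.20, -0.15, 1.00],
]


def cholesky(a):
    """Return lower-triangular L with L L^T = a (a must be positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def sample_copula_uniforms(n, seed=0):
    """Draw n rows of correlated uniforms from a Gaussian copula with SIGMA."""
    rng = random.Random(seed)
    low = cholesky(SIGMA)
    dim = len(SIGMA)
    draws = []
    for _ in range(n):
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # z = L @ eps has correlation matrix SIGMA
        z = [sum(low[i][k] * eps[k] for k in range(i + 1)) for i in range(dim)]
        draws.append([normal_cdf(v) for v in z])
    return draws


uniforms = sample_copula_uniforms(1000)
```

Each row of `uniforms` would then be passed through the inverse CDF of the chosen marginal for the corresponding feature.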

Derived Features: The remaining confounders are generated from behavior-linked models:

  • Premium User ($X_7$): $\text{logit}(P(X_7=1)) = -4.6 + 0.70 \ln(1+X_2) + 0.45 \ln(1+X_3) + 0.012 X_1 + 0.20 X_9$
  • Mobile User ($X_8$): $\text{logit}(P(X_8=1)) = 1.4 - 0.03(X_4 - 35) + 0.25 X_9 + 0.22 \ln(1+X_2)$
  • Referred User ($X_{11}$): $\text{logit}(P(X_{11}=1)) = -1.3 - 0.02(X_1 - 12) + 0.42 X_8 + 0.25 X_9$
  • Email Opt-in ($X_{10}$): $\text{logit}(P(X_{10}=1)) = -0.6 + 0.50 X_8 + 0.35 X_7 + 0.20 X_9 + 0.12 \ln(1+X_2)$
  • Prior Purchases ($X_5$): $X_5 \sim \text{Poisson}(\lambda = \text{clip}(0.9 + 0.52 \ln(1+X_2) + 0.40 \ln(1+X_3) + 0.24 X_7 + 0.12 X_{10}, 0.05, 25.0))$
  • Support Tickets ($X_6$): $X_6 \sim \text{Poisson}(\lambda = \text{clip}(0.55 + 0.22 \ln(1+X_2) + 0.32(1-X_7) + 0.18(1-X_{10}) + 0.12 \tanh(\frac{X_4 - 45}{12}), 0.05, 10.0))$
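The derived features can be drawn in the order listed, since each model only uses features defined before it. The following is a rough sketch for a single row, not the library's implementation: the helper names are hypothetical, and the Poisson sampler (Knuth's multiplication method) is a stand-in chosen to keep the example free of dependencies.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def clip(v, lo, hi):
    return min(max(v, lo), hi)


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def derived_features(x1, x2, x3, x4, x9, rng):
    """Draw X7, X8, X11, X10, X5, X6 for one row from the base features."""
    l2, l3 = math.log1p(x2), math.log1p(x3)
    x7 = int(rng.random() < sigmoid(-4.6 + 0.70 * l2 + 0.45 * l3 + 0.012 * x1 + 0.20 * x9))
    x8 = int(rng.random() < sigmoid(1.4 - 0.03 * (x4 - 35) + 0.25 * x9 + 0.22 * l2))
    x11 = int(rng.random() < sigmoid(-1.3 - 0.02 * (x1 - 12) + 0.42 * x8 + 0.25 * x9))
    x10 = int(rng.random() < sigmoid(-0.6 + 0.50 * x8 + 0.35 * x7 + 0.20 * x9 + 0.12 * l2))
    lam5 = clip(0.9 + 0.52 * l2 + 0.40 * l3 + 0.24 * x7 + 0.12 * x10, 0.05, 25.0)
    lam6 = clip(
        0.55 + 0.22 * l2 + 0.32 * (1 - x7) + 0.18 * (1 - x10)
        + 0.12 * math.tanh((x4 - 45) / 12),
        0.05, 10.0,
    )
    return {
        "premium_user": x7,
        "mobile_user": x8,
        "referred_user": x11,
        "email_opt_in": x10,
        "prior_purchases_12m": sample_poisson(lam5, rng),
        "support_tickets_90d": sample_poisson(lam6, rng),
    }


# Illustrative base-feature values; the seed is arbitrary.
row = derived_features(10.99, 3.0, 38.65, 31.65, 1, random.Random(0))
```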

2. Treatment Assignment ($D$)

The treatment $D \in \{0, 1\}$ is assigned with a target rate of 15%:

$$P(D=1 \mid X) = \sigma\left( \alpha_d + \text{bound}(f_d(X), 2.0) \right)$$

where

$$f_d(X) = \sum_{j=1}^{11} \beta_{d,j} X_j + g_d(X)$$

and

$$
\begin{aligned}
g_d(X) ={} & 0.85\tanh\left(\ln\frac{1+X_3}{61}\right) + 0.24\ln\frac{1+X_2}{6}\tanh\left(\frac{X_1}{24}-1\right) + 0.30X_7(X_9-0.5) \\
& + 0.22X_{10}\tanh\left(\ln\frac{1+X_2}{4}\right) + 0.25X_{11}\left(1-\tanh\frac{X_1}{36}\right) \\
& + 0.14\tanh\left(\ln\frac{1+X_6}{2.5}\right) - 0.10\tanh\left(\frac{X_4-45}{12}\right).
\end{aligned}
$$
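To make the structure of the propensity model concrete, here is a hedged sketch that evaluates $g_d(X)$ and the resulting $P(D=1 \mid X)$ for one row. Several pieces are assumptions because they are not stated here: `bound` is taken to be hard clipping to $[-2, 2]$, the linear coefficients $\beta_{d,j}$ are left at zero, and $\alpha_d$ is set to the logit of the 15% target purely as a placeholder. The example row uses the confounder values of the row with index 1 in the sample output.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bound(v, limit):
    # Assumption: hard clipping to [-limit, limit]; the source does not define bound().
    return max(-limit, min(limit, v))


def g_d(x):
    """Nonlinear part of the treatment score; x maps index j to X_j."""
    s2 = math.log((1 + x[2]) / 6)
    return (
        0.85 * math.tanh(math.log((1 + x[3]) / 61))
        + 0.24 * s2 * math.tanh(x[1] / 24 - 1)
        + 0.30 * x[7] * (x[9] - 0.5)
        + 0.22 * x[10] * math.tanh(math.log((1 + x[2]) / 4))
        + 0.25 * x[11] * (1 - math.tanh(x[1] / 36))
        + 0.14 * math.tanh(math.log((1 + x[6]) / 2.5))
        - 0.10 * math.tanh((x[4] - 45) / 12)
    )


def propensity(x, alpha_d, beta_d):
    """P(D=1 | X) with hypothetical alpha_d and beta_d (dict: index -> coefficient)."""
    linear = sum(beta_d.get(j, 0.0) * x[j] for j in range(1, 12))
    return sigmoid(alpha_d + bound(linear + g_d(x), 2.0))


# Confounders of the row with index 1 in the sample output, keyed by index j.
x = {1: 10.987367, 2: 3.0, 3: 38.652698, 4: 31.652666, 5: 3.0, 6: 0.0,
     7: 1.0, 8: 1.0, 9: 1.0, 10: 0.0, 11: 0.0}

placeholder_alpha = math.log(0.15 / 0.85)  # placeholder, not the library's value
p = propensity(x, alpha_d=placeholder_alpha, beta_d={})
```

Because the score is bounded before the sigmoid, the propensity can move at most 2 log-odds units away from $\alpha_d$ in either direction under this reading of `bound`.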

3. Heterogeneous Treatment Effect ($\tau(X)$)

For this binary-outcome scenario, treatment effect is on the log-odds scale:

$$
\begin{aligned}
\tau(X) ={} & 0.30 + 0.40\tanh\left(\ln\frac{1+X_2}{6}\right) + 0.45X_7 + 0.12X_8 + 0.10X_{10} + 0.12X_{11} \\
& - 0.32\tanh\frac{X_1}{48} - 0.18\tanh\frac{X_4-40}{15} \\
& + 0.10X_9\tanh\left(\ln\frac{1+X_3}{61}\right) - 0.12\tanh\left(\ln\frac{1+X_6}{2.5}\right).
\end{aligned}
$$

The treatment effect is clipped to $[-0.35, 1.4]$.
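The formula can be checked directly against the sample output. The sketch below evaluates $\tau(X)$ for the rows with index 0 and 1, using their confounder values keyed by index $j$; the function name is illustrative. The results should land close to the reported `tau_link` values of -0.075690 and 0.781429, up to rounding of the printed inputs.

```python
import math


def tau_link(x):
    """Log-odds treatment effect for one row; x maps index j to X_j."""
    raw = (
        0.30
        + 0.40 * math.tanh(math.log((1 + x[2]) / 6))
        + 0.45 * x[7] + 0.12 * x[8] + 0.10 * x[10] + 0.12 * x[11]
        - 0.32 * math.tanh(x[1] / 48)
        - 0.18 * math.tanh((x[4] - 40) / 15)
        + 0.10 * x[9] * math.tanh(math.log((1 + x[3]) / 61))
        - 0.12 * math.tanh(math.log((1 + x[6]) / 2.5))
    )
    # Clip to the stated range [-0.35, 1.4].
    return min(max(raw, -0.35), 1.4)


# Rows 0 and 1 of the sample output (X5 is not used by tau and is omitted).
row0 = {1: 28.814654, 2: 1.0, 3: 78.459423, 4: 50.392490, 6: 2.0,
        7: 0.0, 8: 1.0, 9: 1.0, 10: 1.0, 11: 0.0}
row1 = {1: 10.987367, 2: 3.0, 3: 38.652698, 4: 31.652666, 6: 0.0,
        7: 1.0, 8: 1.0, 9: 1.0, 10: 0.0, 11: 0.0}

print(round(tau_link(row0), 4), round(tau_link(row1), 4))
```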

4. Outcome Model ($Y$)

The outcome is binary with logistic link:

$$P(Y=1\mid X,D) = \sigma\left(\alpha_y + \sum_{j=1}^{11}\beta_{y,j}X_j + g_y(X) + D\cdot\tau(X)\right),$$

with $\alpha_y = -1.7$.

The nonlinear baseline component is:

$$
\begin{aligned}
g_y(X) ={} & 1.1\tanh\frac{X_1}{24} + 0.50\left(\ln\frac{1+X_2}{6}\right)^2 + 0.22\ln\frac{1+X_3}{61}\ln\frac{1+X_2}{6} - 0.35\ln(1+X_6) \\
& + 0.18\ln(1+X_5)\tanh\frac{X_1}{18} + 0.22X_8\ln\frac{1+X_2}{6} + 0.20X_7(X_9-0.5) \\
& + 0.18X_{10}\tanh\left(\frac{1}{2}\ln\frac{1+X_3}{61}\right) + 0.14X_{11}\tanh\left(\frac{X_1}{12}-1\right) - 0.20\tanh\frac{X_4-40}{15}.
\end{aligned}
$$

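Oracle CATE is reported on the natural (probability) scale as a risk difference, g1 - g0, where g1 and g0 are the outcome probabilities $P(Y=1\mid X, D=1)$ and $P(Y=1\mid X, D=0)$.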
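Because $\tau(X)$ lives on the log-odds scale, the same $\tau$ translates into a different risk difference depending on the baseline probability. A small sketch of that conversion follows; the function names are illustrative. It takes the baseline log-odds and $\tau$, and returns g0, g1, and their difference. The worked example uses the row with index 1 in the sample output (g0 = 0.592325, tau_link = 0.781429), for which the table reports g1 = 0.760425 and cate = 0.168101.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    return math.log(p / (1.0 - p))


def oracle_cate(eta0, tau):
    """eta0: control log-odds (alpha_y + linear part + g_y); tau: log-odds effect."""
    g0 = sigmoid(eta0)
    g1 = sigmoid(eta0 + tau)
    return g0, g1, g1 - g0


# Recover eta0 from the reported g0, then apply the reported tau_link.
g0, g1, cate = oracle_cate(logit(0.592325), 0.781429)
print(round(g1, 4), round(cate, 4))
```

A row with a baseline probability near 0 or 1 would show a much smaller risk difference for the same $\tau$, since the sigmoid is flat in its tails.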

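Result

| | y | d | tenure_months | avg_sessions_week | spend_last_month | age_years | prior_purchases_12m | support_tickets_90d | premium_user | mobile_user | weekend_user | email_opt_in | referred_user | m | m_obs | tau_link | g0 | g1 | cate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 28.814654 | 1.0 | 78.459423 | 50.392490 | 4.0 | 2.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.136804 | 0.136804 | -0.075690 | 0.259586 | 0.245305 | -0.014281 |
| 1 | 1.0 | 1.0 | 10.987367 | 3.0 | 38.652698 | 31.652666 | 3.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.157599 | 0.157599 | 0.781429 | 0.592325 | 0.760425 | 0.168101 |
| 2 | 0.0 | 1.0 | 40.678212 | 9.0 | 98.950760 | 48.634055 | 4.0 | 5.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.165401 | 0.165401 | 0.209518 | 0.043862 | 0.053538 | 0.009676 |
| 3 | 0.0 | 1.0 | 14.331764 | 5.0 | 27.386588 | 42.502641 | 3.0 | 3.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.158897 | 0.158897 | 0.630457 | 0.148391 | 0.246602 | 0.098211 |
| 4 | 0.0 | 1.0 | 21.480304 | 2.0 | 119.753960 | 35.311382 | 3.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.169943 | 0.169943 | 0.346384 | 0.527043 | 0.611748 | 0.084704 |

Result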

Ground truth ATE is 0.08155183943650529
Ground truth ATTE is 0.10123794590017934
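One plausible reading of these two numbers, stated here as an assumption rather than as the library's method, is that the ATE is the mean of the oracle `cate` column over all rows and the ATTE is the same mean restricted to treated rows (d = 1). The toy sketch below applies that definition to the five preview rows only, so it will not reproduce the full-sample values above.

```python
# (d, cate) pairs copied from the five preview rows of the generated data.
preview = [
    (0.0, -0.014281),
    (1.0, 0.168101),
    (1.0, 0.009676),
    (1.0, 0.098211),
    (1.0, 0.084704),
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Assumed definitions: average oracle CATE over all rows / over treated rows.
ate_preview = mean(c for _, c in preview)
atte_preview = mean(c for d, c in preview if d == 1.0)
print(round(ate_preview, 6), round(atte_preview, 6))
```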

Result

CausalData(df=(100000, 13), treatment='d', outcome='y', confounders=['tenure_months', 'avg_sessions_week', 'spend_last_month', 'age_years', 'prior_purchases_12m', 'support_tickets_90d', 'premium_user', 'mobile_user', 'weekend_user', 'email_opt_in', 'referred_user'])

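Result

| | treatment | count | mean | std | min | p10 | p25 | median | p75 | p90 | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 85067 | 0.437726 | 0.496110 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 1 | 1.0 | 14933 | 0.579388 | 0.493674 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |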
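The outcome summary by treatment group makes the confounding visible with simple arithmetic: the raw difference in outcome means between treated and control rows can be compared with the ground truth ATE reported earlier. The snippet below only restates numbers from the outputs; the variable names are illustrative.

```python
# Outcome means and group sizes from the summary by treatment group.
mean_y_control = 0.437726   # treatment == 0, n = 85067
mean_y_treated = 0.579388   # treatment == 1, n = 14933
true_ate = 0.08155183943650529  # ground truth ATE from the earlier output

naive_diff = mean_y_treated - mean_y_control
gap = naive_diff - true_ate
treated_share = 14933 / (85067 + 14933)

print(f"naive difference in means: {naive_diff:.6f}")
print(f"gap versus ground truth ATE: {gap:.6f}")
print(f"observed treatment rate: {treated_share:.5f}")
```

The naive difference overstates the true effect here, and the observed treatment rate sits close to the 15% target.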

Result

| | confounders | mean_d_0 | mean_d_1 | abs_diff | smd | ks_pvalue |
|---|---|---|---|---|---|---|
| 0 | spend_last_month | 83.486832 | 114.093736 | 30.606905 | 0.336140 | 0.00000 |
| 1 | premium_user | 0.237378 | 0.371325 | 0.133948 | 0.294234 | 0.00000 |
| 2 | avg_sessions_week | 4.823410 | 6.005089 | 1.181680 | 0.271347 | 0.00000 |
| 3 | prior_purchases_12m | 3.420633 | 3.802317 | 0.381684 | 0.190848 | 0.00000 |
| 4 | referred_user | 0.243197 | 0.310052 | 0.066855 | 0.149870 | 0.00000 |
| 5 | age_years | 36.594622 | 35.069304 | 1.525318 | -0.136350 | 0.00000 |
| 6 | email_opt_in | 0.542490 | 0.593652 | 0.051162 | 0.103421 | 0.00000 |
| 7 | mobile_user | 0.851270 | 0.882676 | 0.031406 | 0.092575 | 0.00000 |
| 8 | support_tickets_90d | 1.141982 | 1.243555 | 0.101572 | 0.091742 | 0.00000 |
| 9 | tenure_months | 28.338387 | 29.848652 | 1.510265 | 0.081531 | 0.00000 |
| 10 | weekend_user | 0.545370 | 0.575772 | 0.030402 | 0.061282 | 0.00000 |
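The `smd` column is consistent with the usual standardized mean difference, $(\bar{x}_1 - \bar{x}_0) / \sqrt{(s_0^2 + s_1^2)/2}$. That formula is an assumption about how the table was produced, but for binary confounders, where the group variance is approximately $p(1-p)$, it reproduces the reported values to about three decimals. A minimal check with an illustrative function name:

```python
import math


def smd_binary(p0, p1):
    """SMD for a 0/1 column, using p(1 - p) as each group's variance."""
    pooled_sd = math.sqrt((p0 * (1 - p0) + p1 * (1 - p1)) / 2)
    return (p1 - p0) / pooled_sd


# Group means (mean_d_0, mean_d_1) taken from the balance table.
premium = smd_binary(0.237378, 0.371325)   # table reports 0.294234
referred = smd_binary(0.243197, 0.310052)  # table reports 0.149870
print(round(premium, 3), round(referred, 3))
```

Continuous confounders such as spend_last_month need the group standard deviations, which the table does not report, so they cannot be checked the same way.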