Causalis

`CausalDatasetGenerator`

Generate synthetic causal inference datasets with controllable confounding, treatment prevalence, noise, and (optionally) heterogeneous treatment effects.

Data model (high level)

confounders X ∈ R^k are drawn from user-specified distributions.
Binary treatment D is assigned by a logistic model: D ~ Bernoulli( sigmoid(alpha_d + f_d(X) + u_strength_d * U) ), where f_d(X) = (X @ beta_d + g_d(X)) * propensity_sharpness, and U ~ N(0,1) is an optional unobserved confounder.
Outcome Y depends on treatment and confounders with link determined by outcome_type: outcome_type = "continuous": Y = alpha_y + f_y(X) + u_strength_y * U + T * tau(X) + ε, ε ~ N(0, sigma_y^2) outcome_type = "binary": logit P(Y=1|T,X,U) = alpha_y + f_y(X) + u_strength_y * U + T * tau(X) outcome_type = "poisson": log E[Y|T,X,U] = alpha_y + f_y(X) + u_strength_y * U + T * tau(X) outcome_type = "gamma": log E[Y|T,X,U] = alpha_y + f_y(X) + u_strength_y * U + T * tau(X) where f_y(X) = X @ beta_y + g_y(X), and tau(X) is either constant theta or a user function.

Returned columns

y: outcome
d: binary treatment (0/1)
x1..xk (or user-provided names)
m: true propensity P(T=1 | X) marginalized over U
m_obs: realized propensity P(T=1 | X, U)
tau_link: tau(X) on the structural (link) scale
g0: E[Y | X, T=0] on the natural outcome scale marginalized over U .,9
g1: E[Y | X, T=1] on the natural outcome scale marginalized over U
cate: g1 - g0 (conditional average treatment effect on the natural outcome scale)

Notes on effect scale:

For "continuous", theta (or tau(X)) is an additive mean difference, so tau_link == cate.
For "binary", tau acts on the log-odds scale. cate is reported as a risk difference.
For "poisson" and "gamma", tau acts on the log-mean scale. cate is reported on the mean scale.

Parameters

theta (float) – Constant treatment effect used if tau is None.
tau (callable) – Function tau(X) -> array-like shape (n,) for heterogeneous effects.
beta_y (array - like) – Linear coefficients of confounders in the outcome baseline f_y(X).
beta_d (array - like) – Linear coefficients of confounders in the treatment score f_d(X).
g_y (callable) – Nonlinear/additive function g_y(X) -> (n,) added to the outcome baseline.
g_d (callable) – Nonlinear/additive function g_d(X) -> (n,) added to the treatment score.
alpha_y (float) – Outcome intercept (natural scale for continuous; log-odds for binary; log-mean for Poisson/Gamma).
alpha_d (float) – Treatment intercept (log-odds). If target_d_rate is set, alpha_d is auto-calibrated.
sigma_y (float) – Std. dev. of the Gaussian noise for continuous outcomes.
outcome_type (('continuous', 'binary', 'poisson', 'gamma', 'tweedie')) – Outcome family and link as defined above.
confounder_specs (list of dict) – Schema for generating confounders. See _gaussian_copula for details.
k (int) – Number of confounders when confounder_specs is None. Defaults to independent N(0,1).
x_sampler (callable) – Custom sampler (n, k, seed) -> X ndarray of shape (n,k). Overrides confounder_specs.
use_copula (bool) – If True and confounder_specs provided, use Gaussian copula for X.
copula_corr (array - like) – Correlation matrix for copula.
target_d_rate (float) – Target treatment prevalence (propensity mean). Calibrates alpha_d.
u_strength_d (float) – Strength of the unobserved confounder U in treatment assignment.
u_strength_y (float) – Strength of the unobserved confounder U in the outcome.
propensity_sharpness (float) – Scales the X-driven treatment score to adjust positivity difficulty.
seed (int) – Random seed for reproducibility.

Attributes

rng (Generator) – Internal RNG seeded from seed.

Functions

generate – Draw a synthetic dataset of size n.
oracle_nuisance – Return nuisance functions (m(x), g0(x), g1(x)) compatible with IRM.
to_causal_data – Generate a dataset and convert it to a CausalData object.

`alpha_d`

`alpha_y`

`alpha_zi`

`beta_d`

`beta_y`

`beta_zi`

`confounder_specs`

`copula_corr`

`g_d`

`g_y`

`g_zi`

`gamma_shape`

`generate`

Draw a synthetic dataset of size n.

Parameters

n (int) – Number of samples to generate.
U (ndarray) – Unobserved confounder. If None, generated from N(0,1).

Returns

DataFrame – The generated dataset with outcome 'y', treatment 'd', confounders, and oracle ground-truth columns.

`include_oracle`

`k`

`lognormal_sigma`

`oracle_nuisance`

Return nuisance functions (m(x), g0(x), g1(x)) compatible with IRM.

Parameters

num_quad (int) – Number of quadrature points for marginalizing over U.

Returns

dict – Dictionary of callables mapping X to nuisance values.

`outcome_type`

`pos_dist`

`propensity_sharpness`

`rng`

`score_bounding`

`seed`

`sigma_y`

`target_d_rate`

`tau`

`tau_zi`

`theta`

`to_causal_data`

Generate a dataset and convert it to a CausalData object.

Parameters

n (int) – Number of samples to generate.
confounders (str or list of str) – List of confounder column names to include. If None, automatically detects numeric confounders.

Returns

CausalData – A CausalData object containing the generated dataset.

CausalDatasetGenerator

CausalDatasetGenerator

Parameters

Attributes

Functions

alpha_d

alpha_y

alpha_zi

beta_d

beta_y

beta_zi

confounder_specs

copula_corr

g_d

g_y

g_zi

gamma_shape

generate

Parameters

Returns

include_oracle

k

lognormal_sigma

oracle_nuisance

Parameters

Returns

outcome_type

pos_dist

propensity_sharpness

rng

score_bounding

seed

sigma_y

target_d_rate

tau

tau_zi

theta

to_causal_data

Parameters

Returns

u_strength_d

u_strength_y

u_strength_zi

use_copula

x_sampler

`CausalDatasetGenerator`

`alpha_d`

`alpha_y`

`alpha_zi`

`beta_d`

`beta_y`

`beta_zi`

`confounder_specs`

`copula_corr`

`g_d`

`g_y`

`g_zi`

`gamma_shape`

`generate`

`include_oracle`

`k`

`lognormal_sigma`

`oracle_nuisance`

`outcome_type`

`pos_dist`

`propensity_sharpness`

`rng`

`score_bounding`

`seed`

`sigma_y`

`target_d_rate`

`tau`

`tau_zi`

`theta`

`to_causal_data`

`u_strength_d`

`u_strength_y`

`u_strength_zi`

`use_copula`

`x_sampler`