DML IRM vs Propensity Score Weighting (PSW)

Causal inference consists of two main parts: identification assumptions and model specification. SUTVA, unconfoundedness, and overlap are strong assumptions that must hold before an estimate can be called causal. Observational studies and quasi-experiments often struggle with these identification assumptions, so in practice most of the effort goes into justifying them rather than into model specification.

However, in this notebook I will focus on model specification. Propensity Score Weighting (PSW) is a classical baseline for observational studies: it estimates the average treatment effect on the treated (ATTE) using only an estimated propensity model. It should perform worse than the DML IRM approach because it:

- uses only the propensity model and does not learn a separate outcome regression
- is more sensitive to extreme or mis-scaled propensities, because errors in the propensity turn directly into large weights
- does not use orthogonalization, so nuisance-model misspecification shows up directly in the estimate
- does not use cross-fitting to control overfitting bias from flexible ML models
- often has a much smaller effective sample size than the nominal row count when overlap is weak
- provides a weaker default benchmark than a doubly robust IRM when both the treatment and outcome models are nonlinear
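
To make the points about outcome regression and orthogonalization concrete, here is a minimal sketch of the doubly robust ATT score on a noiseless toy dataset. Everything in it is an assumption for illustration: the function name att_doubly_robust, the toy data, and the deliberately wrong nuisance predictions are hypothetical, and the score is the standard textbook form, which may differ in detail from what Causalis implements. It uses no cross-fitting, because the nuisance predictions are supplied by hand.

```python
def att_doubly_robust(y, d, g0_hat, m_hat):
    """ATT from the doubly robust score, given nuisance predictions.

    y: outcomes, d: 0/1 treatment indicators,
    g0_hat: predicted E[Y | X, D=0], m_hat: predicted P(D=1 | X).
    """
    n_treated = sum(d)
    total = 0.0
    for yi, di, gi, mi in zip(y, d, g0_hat, m_hat):
        residual = yi - gi
        if di == 1:
            total += residual  # treated: outcome minus predicted control outcome
        else:
            total -= mi / (1.0 - mi) * residual  # controls: odds-weighted residual
    return total / n_treated


def make_toy_data():
    """Noiseless data with one binary covariate and a true ATT of 1 (hypothetical)."""
    x, d, y = [], [], []
    # (covariate value, number of treated units, number of control units)
    for xv, n_treated, n_control in [(0, 20, 80), (1, 60, 40)]:
        for di, count in [(1, n_treated), (0, n_control)]:
            for _ in range(count):
                x.append(xv)
                d.append(di)
                y.append(1.0 + 2.0 * xv + 1.0 * di)
    return x, d, y


x, d, y = make_toy_data()
true_m = [0.2 if xv == 0 else 0.6 for xv in x]  # true propensity by cell
true_g0 = [1.0 + 2.0 * xv for xv in x]          # true control outcome by cell
wrong_m = [0.4] * len(x)                        # ignores the covariate
wrong_g0 = [0.0] * len(x)                       # ignores the covariate

att_good_outcome = att_doubly_robust(y, d, true_g0, wrong_m)
att_good_propensity = att_doubly_robust(y, d, wrong_g0, true_m)
att_both_wrong = att_doubly_robust(y, d, wrong_g0, wrong_m)

print("outcome model right, propensity wrong:", round(att_good_outcome, 4))
print("propensity right, outcome model wrong:", round(att_good_propensity, 4))
print("both wrong:", round(att_both_wrong, 4))
```

In this toy setting the estimate stays at the true value when either nuisance model is correct and is biased only when both are wrong. A weighting-only estimator corresponds to the case with no outcome model, so it depends entirely on the propensity being right.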

We will compare point estimates and their absolute errors against the ground-truth ATTE on DGPs from Causalis, using the IRM DML model implemented in Causalis and a hand-written PSW estimator.
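
For orientation, a hand-written PSW estimator of the ATT can be as short as the sketch below. This is not the notebook's actual code: the function name psw_att, the normalized (Hajek-style) weighting, and the toy inputs are assumptions, and the propensities are passed in directly instead of being fitted by a model.

```python
def psw_att(y, d, propensity):
    """Propensity score weighting estimate of the ATT with normalized weights.

    Treated units get weight 1. Each control unit gets the odds e / (1 - e),
    which reweights the controls to resemble the treated population.
    """
    treated_y = [yi for yi, di in zip(y, d) if di == 1]
    controls = [
        (yi, ei / (1.0 - ei))
        for yi, di, ei in zip(y, d, propensity)
        if di == 0
    ]
    treated_mean = sum(treated_y) / len(treated_y)
    weight_sum = sum(w for _, w in controls)
    control_mean = sum(yi * w for yi, w in controls) / weight_sum
    return treated_mean - control_mean


# Tiny noiseless example (hypothetical): three treated and four control units.
d = [1, 1, 1, 0, 0, 0, 0]
y = [2.0, 4.0, 4.0, 1.0, 1.0, 1.0, 3.0]
propensity = [0.2, 0.6, 0.6, 0.2, 0.2, 0.2, 0.6]  # assumed known, not estimated

ground_truth_att = 1.0  # by construction, treatment adds 1 to every outcome
psw_estimate = psw_att(y, d, propensity)
psw_abs_error = abs(psw_estimate - ground_truth_att)

print("PSW ATT:", round(psw_estimate, 4))
print("absolute error:", round(psw_abs_error, 4))
```

In a real comparison the propensity would come from a fitted classifier, so any error in that fit passes straight into the weights and from there into the estimate.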

generate_obs_hte_26_rich()

Result

Running n=10,000 ...
Running n=100,000 ...
Running n=1,000,000 ...

	n	ground_truth_atte	irm_atte	psw_atte	irm_abs_error	psw_abs_error	irm_runtime_sec	psw_runtime_sec	psw_control_ess	psw_max_weight
0	10000	11.454404	6.256087	5.451640	5.198317	6.002764	7.732994	3.222296	5339.475134	0.388524
1	100000	10.914991	12.106856	11.189699	1.191864	0.274708	47.719494	8.852210	61532.484911	0.546033
2	1000000	11.028129	10.340542	10.062838	0.687587	0.965291	339.667114	36.398603	622717.546229	1.203364

Result

n=10,000: ground truth ATTE=11.454404, IRM ATTE=6.256087, PSW ATTE=5.451640, PSW control ESS=5339.5
n=100,000: ground truth ATTE=10.914991, IRM ATTE=12.106856, PSW ATTE=11.189699, PSW control ESS=61532.5
n=1,000,000: ground truth ATTE=11.028129, IRM ATTE=10.340542, PSW ATTE=10.062838, PSW control ESS=622717.5

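DML IRM should be more stable than PSW on the rich nonlinear DGP, especially when PSW loses effective sample size through large ATT weights. In this run, IRM has the smaller absolute error at n=10,000 (5.20 vs 6.00) and at n=1,000,000 (0.69 vs 0.97), while PSW is closer to the ground truth at n=100,000 (0.27 vs 1.19).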
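
The effective sample size mentioned above can be illustrated with the Kish formula, ESS = (sum of weights)^2 / (sum of squared weights), applied to the ATT weights of the control units. This is a sketch under assumptions: the helper names are hypothetical, and the notebook's exact definition of psw_control_ess and psw_max_weight (including any weight scaling or clipping) is not shown, so it may differ.

```python
def att_control_weights(propensity_controls):
    """ATT weights for control units: the propensity odds e / (1 - e)."""
    return [e / (1.0 - e) for e in propensity_controls]


def kish_ess(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


# Four controls with identical propensities: every weight is equal.
balanced = att_control_weights([0.5, 0.5, 0.5, 0.5])

# Four controls where one looks almost certain to be treated: one weight dominates.
skewed = att_control_weights([0.1, 0.1, 0.1, 0.9])

ess_balanced = kish_ess(balanced)
ess_skewed = kish_ess(skewed)
max_weight_skewed = max(skewed)

print("ESS with equal weights:", round(ess_balanced, 3))
print("ESS with one dominant weight:", round(ess_skewed, 3))
print("largest weight:", round(max_weight_skewed, 3))
```

With equal weights the ESS equals the number of controls. When a single control carries most of the weight, the ESS falls toward 1 even though the row count is unchanged, which is the mechanism behind a PSW estimate becoming noisy under weak overlap.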

generate_obs_hte_binary_26()

Result

Running n=10,000 ...
Running n=100,000 ...
Running n=1,000,000 ...

	n	ground_truth_atte	irm_atte	psw_atte	irm_abs_error	psw_abs_error	irm_runtime_sec	psw_runtime_sec	psw_control_ess	psw_max_weight
0	10000	0.103885	0.102344	0.113561	0.001541	0.009676	4.270595	1.746638	5777.856137	0.838468
1	100000	0.101238	0.103547	0.106411	0.002309	0.005173	24.387916	4.752561	61172.870093	1.230113
2	1000000	0.101411	0.103282	0.104150	0.001871	0.002739	215.242940	33.692571	623116.282826	1.637482

Result

n=10,000: ground truth ATTE=0.103885, IRM ATTE=0.102344, PSW ATTE=0.113561, PSW control ESS=5777.9
n=100,000: ground truth ATTE=0.101238, IRM ATTE=0.103547, PSW ATTE=0.106411, PSW control ESS=61172.9
n=1,000,000: ground truth ATTE=0.101411, IRM ATTE=0.103282, PSW ATTE=0.104150, PSW control ESS=623116.3

Conclusion

I recommend using DML IRM as the default model specification for the unconfoundedness scenario, and treating PSW as a simple benchmark rather than the default estimator.