DML IRM vs Propensity Score Weighting (PSW)


Causal inference has two main parts: identification assumptions and model specification. SUTVA, unconfoundedness, and overlap are strong assumptions that must hold before an estimate can be called causal. Observational studies and quasi-experiments often struggle with these identification assumptions, so in practice most of the effort goes into defending them rather than into model specification.
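For reference, the three assumptions are commonly written as follows in potential-outcomes notation. The symbols Y(d), D, X and m(X) are notation introduced here for illustration and are not taken from the notebook.

```latex
\begin{align*}
\text{SUTVA (consistency, no interference):}\quad & Y = D\,Y(1) + (1 - D)\,Y(0) \\
\text{Unconfoundedness:}\quad & \bigl(Y(0),\, Y(1)\bigr) \perp\!\!\!\perp D \mid X \\
\text{Overlap:}\quad & 0 < m(X) < 1, \qquad m(X) = P(D = 1 \mid X) \\
\text{Target parameter:}\quad & \mathrm{ATTE} = E\bigl[\,Y(1) - Y(0) \mid D = 1\,\bigr]
\end{align*}
```

For the ATTE specifically, the weaker one-sided versions are sufficient: Y(0) independent of D given X, and m(X) < 1.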

However, in this notebook I will focus on model specification. Propensity Score Weighting (PSW) is a classical baseline for observational studies: it estimates the ATTE (average treatment effect on the treated) using only an estimated propensity model. I expect it to perform worse than the DML IRM (Double Machine Learning, Interactive Regression Model) approach because it:

  • uses only the propensity model and does not learn a separate outcome regression
  • is more sensitive to extreme or mis-scaled propensities, because propensity errors turn into large weights
  • does not use orthogonalization, so nuisance-model misspecification shows up directly in the estimate
  • does not use cross-fitting to control overfitting bias from flexible ML models
  • often has much worse effective sample size than the nominal row count when overlap is weak
  • provides a weaker default benchmark than a doubly robust IRM when both treatment and outcome are nonlinear
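To make the first two points concrete, here is a minimal sketch of a PSW estimator for the ATTE. It is not the notebook's hand-written estimator, which is not shown in the source; the function name `psw_att`, the clipping threshold, and the choice of the normalized (Hajek) form are assumptions made for illustration. Estimated propensities are passed in as an array, so any classifier could have produced them.

```python
import numpy as np


def psw_att(y, d, e_hat, clip=1e-3):
    """Normalized (Hajek) propensity score weighting estimate of the ATTE.

    y      : observed outcomes
    d      : binary treatment indicator (1 = treated, 0 = control)
    e_hat  : estimated propensities P(D=1 | X), one per row
    clip   : hypothetical safeguard that keeps propensities inside (clip, 1 - clip)
    """
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=int)
    e = np.clip(np.asarray(e_hat, dtype=float), clip, 1.0 - clip)

    treated = d == 1
    control = ~treated

    # Controls are reweighted by the propensity odds e / (1 - e) so that their
    # covariate distribution resembles that of the treated group.
    w = e[control] / (1.0 - e[control])

    treated_mean = y[treated].mean()
    control_mean = np.sum(w * y[control]) / np.sum(w)
    return treated_mean - control_mean


# Toy usage: two treated rows, two control rows, flat propensities.
estimate = psw_att(y=[3, 5, 1, 2], d=[1, 1, 0, 0], e_hat=[0.5, 0.5, 0.5, 0.5])
```

Only the propensity model enters the estimate, so a control row with a propensity near 1 receives a very large odds weight and can dominate the weighted control mean.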

We will compare the estimates and their absolute errors against the ground-truth ATTE on DGPs (data-generating processes) from Causalis, using the DML IRM model implemented in Causalis and a hand-written PSW estimator.
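As a rough illustration of what the doubly robust side of this comparison computes, the sketch below evaluates the standard orthogonal ATTE score from nuisance predictions. It is not the Causalis API or implementation; `dr_att_score` and its arguments are hypothetical names. It assumes `g0_hat` (predicted outcome under control) and `m_hat` (predicted propensity) were produced out-of-fold, that is, each row was predicted by models that did not see that row during training.

```python
import numpy as np


def dr_att_score(y, d, g0_hat, m_hat, clip=1e-3):
    """Doubly robust ATTE estimate and standard error from nuisance predictions.

    y       : observed outcomes
    d       : binary treatment indicator (1 = treated, 0 = control)
    g0_hat  : out-of-fold predictions of E[Y | D=0, X]
    m_hat   : out-of-fold predictions of P(D=1 | X)
    clip    : hypothetical safeguard that keeps propensities inside (clip, 1 - clip)
    """
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=float)
    g0 = np.asarray(g0_hat, dtype=float)
    m = np.clip(np.asarray(m_hat, dtype=float), clip, 1.0 - clip)

    p = d.mean()       # share of treated rows
    resid = y - g0     # outcome residual under the control regression

    # Treated residuals minus odds-weighted control residuals.
    psi_b = d * resid / p - (1.0 - d) * (m / (1.0 - m)) * resid / p
    theta = psi_b.mean()   # mean(d / p) equals 1, so no further normalization

    # Plug-in standard error from the estimated influence function.
    psi = psi_b - (d / p) * theta
    se = psi.std(ddof=1) / np.sqrt(len(y))
    return theta, se


# Toy usage: the control regression fits the control rows exactly.
theta, se = dr_att_score(
    y=[3, 5, 1, 2], d=[1, 1, 0, 0], g0_hat=[1, 1, 1, 2], m_hat=[0.5, 0.5, 0.5, 0.5]
)
```

When the outcome regression is accurate, the control residuals are near zero and the propensity weights have little left to multiply. This is why errors in one nuisance model are partly absorbed by the other.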

generate_obs_hte_26_rich()

Read more about the DGP at https://causalis.causalcraft.com/articles/generate_obs_hte_26_rich

Result

Running n=10,000 ...
Running n=100,000 ...
Running n=1,000,000 ...

n        ground_truth_atte   irm_atte   psw_atte  irm_abs_error  psw_abs_error  irm_runtime_sec  psw_runtime_sec  psw_control_ess  psw_max_weight
10000            11.454404   6.256087   5.451640       5.198317       6.002764         7.732994         3.222296      5339.475134        0.388524
100000           10.914991  12.106856  11.189699       1.191864       0.274708        47.719494         8.852210     61532.484911        0.546033
1000000          11.028129  10.340542  10.062838       0.687587       0.965291       339.667114        36.398603    622717.546229        1.203364

Result

n=10,000: ground truth ATTE=11.454404, IRM ATTE=6.256087, PSW ATTE=5.451640, PSW control ESS=5339.5
n=100,000: ground truth ATTE=10.914991, IRM ATTE=12.106856, PSW ATTE=11.189699, PSW control ESS=61532.5
n=1,000,000: ground truth ATTE=11.028129, IRM ATTE=10.340542, PSW ATTE=10.062838, PSW control ESS=622717.5

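DML IRM is expected to be more stable than PSW on the rich nonlinear DGP, especially when PSW loses effective sample size through large ATT weights. In these runs IRM has the smaller absolute error at n=10,000 (5.20 vs 6.00) and at n=1,000,000 (0.69 vs 0.97), while PSW is closer to the ground truth at n=100,000 (0.27 vs 1.19).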
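The loss of effective sample size mentioned above is commonly measured with Kish's formula, (sum of weights)^2 / (sum of squared weights). The sketch below assumes that definition and the odds weights e / (1 - e) for control rows. The source does not show how the notebook computes psw_control_ess, so `att_control_weights` and `kish_ess` are hypothetical helpers rather than the author's code.

```python
import numpy as np


def att_control_weights(e_hat_control, clip=1e-3):
    """Odds weights e / (1 - e) for control rows, with clipped propensities."""
    e = np.clip(np.asarray(e_hat_control, dtype=float), clip, 1.0 - clip)
    return e / (1.0 - e)


def kish_ess(weights):
    """Kish effective sample size; invariant to rescaling the weights."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)


# Four control rows with identical propensities: every row counts equally.
balanced = att_control_weights([0.3, 0.3, 0.3, 0.3])

# One control row looks almost certainly treated: its weight dominates.
skewed = att_control_weights([0.05, 0.05, 0.05, 0.95])

ess_balanced = kish_ess(balanced)
ess_skewed = kish_ess(skewed)
```

With equal weights the ESS equals the number of control rows (4 in this toy case). In the skewed case it falls to just above 1, meaning the weighted control mean is driven almost entirely by a single row.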

generate_obs_hte_binary_26()

Read more about the DGP at https://causalis.causalcraft.com/articles/generate_obs_hte_binary_26

Result

Running n=10,000 ...
Running n=100,000 ...
Running n=1,000,000 ...

n        ground_truth_atte  irm_atte  psw_atte  irm_abs_error  psw_abs_error  irm_runtime_sec  psw_runtime_sec  psw_control_ess  psw_max_weight
10000             0.103885  0.102344  0.113561       0.001541       0.009676         4.270595         1.746638      5777.856137        0.838468
100000            0.101238  0.103547  0.106411       0.002309       0.005173        24.387916         4.752561     61172.870093        1.230113
1000000           0.101411  0.103282  0.104150       0.001871       0.002739       215.242940        33.692571    623116.282826        1.637482

Result

n=10,000: ground truth ATTE=0.103885, IRM ATTE=0.102344, PSW ATTE=0.113561, PSW control ESS=5777.9
n=100,000: ground truth ATTE=0.101238, IRM ATTE=0.103547, PSW ATTE=0.106411, PSW control ESS=61172.9
n=1,000,000: ground truth ATTE=0.101411, IRM ATTE=0.103282, PSW ATTE=0.104150, PSW control ESS=623116.3

Conclusion

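I recommend using DML IRM as the default model specification under unconfoundedness, and treating PSW as a simple benchmark rather than the default estimator. Across the six runs above, IRM had the lower absolute error in five, at roughly 2 to 9 times the runtime of PSW.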