Benchmarking: DiffInMeans vs Scipy implementation

This notebook presents the benchmarking diff in means research workflow and key analysis steps.

Goal: verify that DiffInMeans implementations of ttest, conversion_ztest, and bootstrap match raw reference calculations from SciPy/Statsmodels.

Result

n_control=4955, n_treated=5045 control mean=0.198991, treated mean=0.232904

1) Welch t-test: model vs SciPy/Statsmodels

Result

	metric	DiffInMeans	External
0	p_value	0.00003740	0.00003740
1	absolute_ci_low	0.01779692	0.01779692
2	absolute_ci_high	0.05002897	0.05002897

ttest validation passed

2) Conversion z-test: model vs Statsmodels two-proportion tools

Result

	metric	DiffInMeans	External	abs_diff
0	p_value	0.00003794	0.00003794	6.78575035e-17
1	absolute_ci_low	0.01110763	0.01110763	0.00000000e+00
2	absolute_ci_high	0.05665834	0.05665834	0.00000000e+00

conversion_ztest validation passed

3) Bootstrap diff-in-means: model vs SciPy bootstrap

Result

	metric	DiffInMeans	External	abs_diff
0	p_value	0.00003671	0.00004721	0.00001050
1	absolute_ci_low	0.01772064	0.01779606	0.00007543
2	absolute_ci_high	0.04989633	0.05045165	0.00055531

bootstrap validation passed

4) Confidence intervals side-by-side

Result

	method	source	p_value	abs_low	abs_high
0	bootstrap	DiffInMeans	0.00003671	0.01772064	0.04989633
1	bootstrap	External	0.00004721	0.01779606	0.05045165
2	conversion_ztest	DiffInMeans	0.00003794	0.01110763	0.05665834
3	conversion_ztest	External	0.00003794	0.01778697	0.05001138
4	ttest	DiffInMeans	0.00003740	0.01779692	0.05002897
5	ttest	External	0.00003740	0.01779692	0.05002897

In conclusion: Diff_in_Means model is implemented correctly.