Benchmarking: DiffInMeans vs Scipy implementation
Goal: verify that DiffInMeans implementations of ttest, conversion_ztest, and bootstrap
match raw reference calculations from SciPy/Statsmodels.
Result
n_control=4955, n_treated=5045 control mean=0.198991, treated mean=0.232904
1) Welch t-test: model vs SciPy/Statsmodels
Result
| metric | DiffInMeans | External | abs_diff | |
|---|---|---|---|---|
| 0 | p_value | 0.00003740 | 0.00003740 | 0.0 |
| 1 | absolute_ci_low | 0.01779692 | 0.01779692 | 0.0 |
| 2 | absolute_ci_high | 0.05002897 | 0.05002897 | 0.0 |
ttest validation passed
2) Conversion z-test: model vs Statsmodels two-proportion tools
Result
| metric | DiffInMeans | External | abs_diff | |
|---|---|---|---|---|
| 0 | p_value | 0.00003794 | 0.00003794 | 6.78575035e-17 |
| 1 | absolute_ci_low | 0.01110763 | 0.01110763 | 0.00000000e+00 |
| 2 | absolute_ci_high | 0.05665834 | 0.05665834 | 0.00000000e+00 |
conversion_ztest validation passed
3) Bootstrap diff-in-means: model vs SciPy bootstrap
Result
| metric | DiffInMeans | External | abs_diff | |
|---|---|---|---|---|
| 0 | p_value | 0.00003671 | 0.00004721 | 0.00001050 |
| 1 | absolute_ci_low | 0.01772064 | 0.01779606 | 0.00007543 |
| 2 | absolute_ci_high | 0.04989633 | 0.05045165 | 0.00055531 |
bootstrap validation passed
4) Confidence intervals side-by-side
Result
| method | source | p_value | abs_low | abs_high | |
|---|---|---|---|---|---|
| 0 | bootstrap | DiffInMeans | 0.00003671 | 0.01772064 | 0.04989633 |
| 1 | bootstrap | External | 0.00004721 | 0.01779606 | 0.05045165 |
| 2 | conversion_ztest | DiffInMeans | 0.00003794 | 0.01110763 | 0.05665834 |
| 3 | conversion_ztest | External | 0.00003794 | 0.01778697 | 0.05001138 |
| 4 | ttest | DiffInMeans | 0.00003740 | 0.01779692 | 0.05002897 |
| 5 | ttest | External | 0.00003740 | 0.01779692 | 0.05002897 |
In conclusion: Diff_in_Means model is implemented correctly.