Ad-hoc A/B test evaluation using Ep-Stats
This is a simplified version of the general manual Using Ep-Stats in Jupyter. Here we assume a simple DataFrame at the input, containing aggregated data of an A/B test in a wide format.
Next we define the metrics and checks we are interested in. Finally, we evaluate the experiment and nicely format the results.
Input DataFrame Example
Mind that you need to prepare the experiment data on your own; the following example is only illustrative.
You should be aware of the following assumptions:
- The first two columns must contain the name of the experiment and the variant. The names of these columns may vary.
- For continuous metrics like Revenue per Mille (RPM), it is necessary to also download squared values. If you forget to do so, the results will be wrong and misleading, and Ep-Stats will not warn you about this issue!
```python
# This is only an example to show the required format of the input DataFrame.
# You have to prepare the aggregated data on your own, e.g. using SQL.
from epstats.toolkit.testing import TestData

goals = TestData.load_goals_simple_agg()
goals
```
| | experiment | variant | views | clicks | conversions | bookings | bookings_squared |
|---|---|---|---|---|---|---|---|
| 0 | test-simple-metric | a | 473661 | 48194 | 413 | 17152 | 803105 |
| 1 | test-simple-metric | b | 471485 | 47184 | 360 | 14503 | 677178 |
For continuous metrics like RPM (RPM = bookings / views * 1000), it is necessary to prepare squared values; in this case we have the columns `bookings` and `bookings_squared`.
Let's assume we have $K$ purchases. The exact definitions of the columns `bookings` and `bookings_squared` are then the following:
$$\text{bookings} = \sum_{i=1}^{K} \text{purchase_value}_{i}$$ $$\text{bookings_squared} = \sum_{i=1}^{K} (\text{purchase_value}_{i})^2$$
This is not necessary for binary metrics like Click-through Rate or Conversion Rate.
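As an illustration of how these two columns can be produced, here is a minimal pandas sketch. The purchase-level `purchases` DataFrame and its column names are hypothetical; in practice you would typically run this aggregation in SQL on your own data.

```python
import pandas as pd

# Hypothetical purchase-level data; in practice you aggregate on your own, e.g. in SQL.
purchases = pd.DataFrame({
    'experiment': ['test-simple-metric'] * 4,
    'variant': ['a', 'a', 'b', 'b'],
    'purchase_value': [10.0, 20.0, 5.0, 15.0],
})

# bookings = sum of purchase values, bookings_squared = sum of squared purchase values
agg = (
    purchases
    .assign(purchase_value_squared=lambda df: df['purchase_value'] ** 2)
    .groupby(['experiment', 'variant'], as_index=False)
    .agg(
        bookings=('purchase_value', 'sum'),
        bookings_squared=('purchase_value_squared', 'sum'),
    )
)
# variant a: bookings = 30.0, bookings_squared = 500.0
# variant b: bookings = 20.0, bookings_squared = 250.0
```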
Experiment Definition and Evaluation
Firstly, you need to define the metrics you want to evaluate. You can define as many metrics as you want. When creating an instance of the class `SimpleMetric`, you need to specify the parameters `id`, `name`, `numerator` and `denominator`. You can further specify the optional parameter `metric_format`, e.g. `'${:,.1f}'` for RPM, and the parameter `metric_value_multiplier`, e.g. `1000` for RPM. The last optional parameter, `unit_type`, is present only for technical reasons. Be aware that only one `unit_type` can be used within one experiment. The value of this parameter has no impact on the evaluation.
Secondly, you can define checks by creating an instance of the class `SimpleSrmCheck`. Defining checks is not mandatory; keep the list empty if you do not need any.
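For intuition, a sample ratio mismatch (SRM) check is essentially a chi-square goodness-of-fit test of the observed unit counts against the expected traffic split. Below is a minimal sketch assuming an intended 50/50 split; this is an illustration only, not Ep-Stats' internal implementation.

```python
from scipy import stats

# Observed views per variant, taken from the example data above
views = [473661, 471485]
expected = [sum(views) / 2] * 2  # expected counts under an assumed 50/50 split

chi2, p_value = stats.chisquare(views, f_exp=expected)
# A very small p-value (many teams flag below a strict threshold such as 0.001)
# would indicate a sample ratio mismatch.
```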
You wrap both metric and check definitions inside the `Experiment` definition. For more details see `Experiment`.
Finally, you evaluate the experiment by calling the method `evaluate_wide_agg`; for details see `Experiment.evaluate_wide_agg()`. The results for metrics and checks are separated.
```python
from epstats.toolkit import Experiment, SimpleMetric, SimpleSrmCheck

unit_type = 'test_unit_type'  # this is only a technical detail; it has no impact on the results

# Experiment definition
experiment = Experiment(
    'test-simple-metric',
    'a',
    [
        SimpleMetric(1, 'Click-through Rate (CTR)', 'clicks', 'views', unit_type),
        SimpleMetric(2, 'Conversion Rate', 'conversions', 'views', unit_type),
        SimpleMetric(3, 'Revenue per Mille (RPM)', 'bookings', 'views', unit_type, metric_format='${:,.2f}', metric_value_multiplier=1000),
    ],
    [SimpleSrmCheck(1, 'SRM', 'views')],
    unit_type=unit_type,
)

# Experiment evaluation
# `goals` is the DataFrame you have prepared on your own, e.g. using SQL
ev = experiment.evaluate_wide_agg(goals)

# Results
ev.checks
ev.metrics
```
| | timestamp | exp_id | metric_id | metric_name | exp_variant_id | count | mean | std | sum_value | confidence_level | diff | test_stat | p_value | confidence_interval | standard_error | degrees_of_freedom |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1615928648 | test-simple-metric | 1 | Click-through Rate (CTR) | a | 473661 | 0.101748 | 0.302317 | 48194 | 0.95 | 0 | 0 | 1 | 0.0119665 | 0.00610546 | 947320 |
| 1 | 1615928648 | test-simple-metric | 1 | Click-through Rate (CTR) | b | 471485 | 0.100075 | 0.300101 | 47184 | 0.95 | -0.0164385 | -2.72161 | 0.00649657 | 0.0118382 | 0.00603998 | 945136 |
| 2 | 1615928648 | test-simple-metric | 2 | Conversion Rate | a | 473661 | 0.000871932 | 0.0295156 | 413 | 0.95 | 0 | 0 | 1 | 0.136333 | 0.0695586 | 947320 |
| 3 | 1615928648 | test-simple-metric | 2 | Conversion Rate | b | 471485 | 0.000763545 | 0.0276218 | 360 | 0.95 | -0.124306 | -1.96949 | 0.048897 | 0.123705 | 0.063116 | 941568 |
| 4 | 1615928648 | test-simple-metric | 3 | Revenue per Mille (RPM) | a | 473661 | 0.0362116 | 1.30162 | 17152 | 0.95 | 0 | 0 | 1 | 0.144766 | 0.0738616 | 947320 |
| 5 | 1615928648 | test-simple-metric | 3 | Revenue per Mille (RPM) | b | 471485 | 0.0307603 | 1.19805 | 14503 | 0.95 | -0.15054 | -2.29841 | 0.0215384 | 0.128373 | 0.0654974 | 939408 |
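The `mean` and `diff` columns can be sanity-checked by hand: for each metric, `mean` is numerator / denominator per variant, and `diff` is the relative difference of a variant against the control. A quick check against the CTR and RPM rows above:

```python
# CTR means: clicks / views
mean_a = 48194 / 473661  # variant a, ~0.101748
mean_b = 47184 / 471485  # variant b, ~0.100075

# `diff` is the relative impact of b against control a, ~-0.0164 (i.e. -1.6%)
rel_diff_b = mean_b / mean_a - 1

# RPM mean of variant a with metric_value_multiplier=1000, ~36.21 (shown as $36.21)
rpm_a = 17152 / 473661 * 1000
```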
Formatting Results
You may find two methods useful for a nice presentation of the results: `results_long_to_wide` and `format_results`.
The former simply converts the results from long format to wide format. The latter then provides extra tuning: you can set the number of decimals by defining the parameters `format_pct` and `format_pval` respectively.
```python
from epstats.toolkit.results import results_long_to_wide, format_results

ev.metrics.pipe(results_long_to_wide)
```
| exp_id | exp_variant_id | CTR mean | CTR diff | CTR conf_int_lower | CTR conf_int_upper | CTR p_value | Conversion Rate mean | Conversion Rate diff | Conversion Rate conf_int_lower | Conversion Rate conf_int_upper | Conversion Rate p_value | RPM mean | RPM diff | RPM conf_int_lower | RPM conf_int_upper | RPM p_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| test-simple-metric | A | 0.101748 | 0 | -0.0119665 | 0.0119665 | 1 | 0.000871932 | 0 | -0.136333 | 0.136333 | 1 | 0.0362116 | 0 | -0.144766 | 0.144766 | 1 |
| test-simple-metric | B | 0.100075 | -0.0164385 | -0.0282766 | -0.00460032 | 0.00649657 | 0.000763545 | -0.124306 | -0.248012 | -0.000601163 | 0.048897 | 0.0307603 | -0.15054 | -0.278913 | -0.0221675 | 0.0215384 |
```python
ev.metrics.pipe(results_long_to_wide).pipe(format_results, experiment, format_pct='{:.1%}', format_pval='{:.3f}')
```
| Experiment Id | Variant | CTR Mean | CTR Impact | CTR CI lower | CTR CI upper | CTR p-value | Conversion Rate Mean | Conversion Rate Impact | Conversion Rate CI lower | Conversion Rate CI upper | Conversion Rate p-value | RPM Mean | RPM Impact | RPM CI lower | RPM CI upper | RPM p-value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| test-simple-metric | A | 10.17% | 0.0% | -1.2% | 1.2% | 1.000 | 0.09% | 0.0% | -13.6% | 13.6% | 1.000 | $36.21 | 0.0% | -14.5% | 14.5% | 1.000 |
| test-simple-metric | B | 10.01% | -1.6% | -2.8% | -0.5% | 0.006 | 0.08% | -12.4% | -24.8% | -0.1% | 0.049 | $30.76 | -15.1% | -27.9% | -2.2% | 0.022 |