This implementation of statistical significance testing is designed for interpreting the results of controlled A/B tests when the population size is very large. My particular implementation was built to analyze the results of a longitudinal A/B email test, but the methodology can be re-purposed for any traditional A/B test. The biggest challenge I faced was calculating the statistical significance of continuous variables (such as sales per email recipient), which required customer-level data in order to calculate the standard deviation. In my case, pulling in all the customer-level data for each email across several months of email campaigns resulted in hundreds of millions of records (a number that would only continue to grow quickly throughout the year). We needed a reporting solution that would perform well with potentially 1 billion records. While I will include the calculations for dichotomous metrics (rates of binary values, such as open rate), the focus will be on implementing a method to calculate the statistical significance of continuous variables that won't jeopardize performance when working with large population sizes.
Use Case Checklist
- You want to see if there is a clear "winner" in a controlled A/B test
- You are slicing the data by a number of segments that makes it difficult to have the p-values for all the combinations pre-calculated by a statistical package and added into the data on the back-end
- You want to calculate the statistical lift for either continuous, discrete, or dichotomous metrics
- If you're including continuous/discrete metrics:
  - You have access to the individual-level data (such as customer level, website session level, etc.)
  - The number of individual records with a non-zero amount is small enough to feasibly bring into Tableau
This implementation uses a two-sided t-test to estimate the z-score and corresponding p-values. For dichotomous metrics I am using a proportion test with a pooled variance, and for continuous metrics I am using Welch's t-test with the Satterthwaite approximation of the standard error. (See the links below for more information on deciding between a pooled and an unpooled variance.)
The p-values must be obtained from a p-value lookup table (Tableau cannot calculate this value directly). Because my population sizes were large enough, I was able to use the p-values based on the estimated z-score from the t-test using a standard normal table. I created several p-value buckets (<5%, 5-10%, 10-15%, 15-20%, >20%) based on the z-score cut-offs from the lookup table.
Reducing the Data
As previously mentioned, I needed a way to calculate the statistical significance for the continuous variables without impacting dashboard performance. With continuous variables, you cannot aggregate the data because you need to calculate the difference from the mean for each observation in order to calculate the standard deviation. I also could not calculate the standard deviation prior to loading the data into Tableau because we wanted the users to be able to slice and dice the data by multiple dimensions and have Tableau dynamically calculate the statistical significance.
While I previously said that you cannot aggregate the data for continuous variables, there is an exception. The key to reducing the data is to aggregate all the observations with a value of 0. If you know how many records have a value of zero, you can calculate the weighted difference from the mean for those observations and plug this number into the standard deviation calculation. In my case, the data reduction was enormous: because such a small proportion of email recipients actually made a purchase, I was able to shrink the data to a fraction of its original size.
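The dashboard itself does this in Tableau, but the arithmetic behind the zero-aggregation trick can be checked in a few lines of Python. This is a minimal sketch with made-up sales figures: it verifies that the population variance reconstructed from only the non-zero records plus a count of zeros matches the variance computed over every record.

```python
import statistics

# Hypothetical raw data: sales per delivered email (most deliveries have zero sales).
raw_sales = [0.0] * 9_000 + [25.0, 40.0, 60.0] * 100  # 9,300 deliveries, 300 purchases

# Full-data population (biased) variance, computed from every record.
full_variance = statistics.pvariance(raw_sales)

# Reduced data: keep only the non-zero records plus a count of the zeros.
nonzero_sales = [s for s in raw_sales if s > 0]
zero_count = len(raw_sales) - len(nonzero_sales)

# Reconstruct the variance from the reduced data: each zero record
# contributes (0 - mean)^2 to the sum of squared residuals.
n = len(raw_sales)
spr = sum(nonzero_sales) / n                        # sales per recipient (mean)
ssr = zero_count * spr**2 + sum((s - spr) ** 2 for s in nonzero_sales)
reduced_variance = ssr / n
```

Here the 9,300-row dataset collapses to 300 individual rows plus a single count, with no loss of information for the variance calculation.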
Data Structure Overview
There will be two types of records: individual level records for when an email resulted in a purchase and aggregated records for the emails that did not result in a purchase. Each unique combination of dimensions (in this case date, segment, and group) should have an aggregated record summarizing the emails without a purchase. You can include a unique id field for the individual level data and a flag that indicates whether the data is aggregated.
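As a sketch of that structure (the field names and sample rows here are hypothetical, not the actual schema), the reduction can be expressed as: keep purchasers at the individual level, and roll every zero-sales delivery up into one aggregated record per date/segment/group combination, with a `delivered` weight and an `aggregated` flag.

```python
from collections import Counter

# Hypothetical individual-level deliveries: (date, segment, group, sales)
deliveries = [
    ("2014-01-01", "loyal", "TEST", 32.50),
    ("2014-01-01", "loyal", "TEST", 0.0),
    ("2014-01-01", "loyal", "TEST", 0.0),
    ("2014-01-01", "loyal", "BAU", 0.0),
    ("2014-01-01", "loyal", "BAU", 18.00),
]

rows = []                 # the reduced table that would be loaded into Tableau
zero_counts = Counter()

for date, segment, group, sales in deliveries:
    if sales > 0:
        # Purchasers stay at the individual level: one delivery per row.
        rows.append({"date": date, "segment": segment, "group": group,
                     "sales": sales, "delivered": 1, "aggregated": False})
    else:
        zero_counts[(date, segment, group)] += 1

# One aggregated record per dimension combination for the zero-sales deliveries.
for (date, segment, group), count in zero_counts.items():
    rows.append({"date": date, "segment": segment, "group": group,
                 "sales": 0.0, "delivered": count, "aggregated": True})
```

Summing `delivered` still recovers the true population size for any slice of the dimensions, which is what the weighted variance calculation below relies on.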
Manipulating the Calculations
In Tableau 8, it can be very cumbersome and/or impossible to aggregate comparisons of row level vs summarized values, which presents a challenge when trying to dynamically calculate the sum of squared residuals (required to get the z-score for continuous variables.) Luckily, Tableau provides a variance function that we can use. Unfortunately, it's not so simple because we must hack the formula in order to account for the weighted records with zero sales. Let's take a look at the formulas:
Z-Score Calculation for Continuous Variables:
Biased Weighted Sample Variance:
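The original formula images are not reproduced here; reconstructed from the surrounding description, the Welch z-score with the Satterthwaite standard error and the biased weighted sample variance take the following form (where $N$ is the number of delivered emails, $n_0$ is the number of deliveries with zero sales, and $x_i$ are the individual sales amounts):

```latex
z = \frac{SPR_{TEST} - SPR_{BAU}}
         {\sqrt{\dfrac{s^2_{TEST}}{N_{TEST}} + \dfrac{s^2_{BAU}}{N_{BAU}}}}

s^2 = \frac{n_0 \, (0 - SPR)^2 + \sum_{i:\, x_i > 0} (x_i - SPR)^2}{N}
```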
If we tried to use the variance function on the sales field (in order to calculate the variance for SPR), Tableau would count each row as a single observation and would not take into consideration that the aggregated records represent more than one observation. We need to edit the variance calculation to weight the records with zero sales by breaking the sum of squared residuals into two components.
For records with no sales, we will need to weight (0-SPR)^2 by the number of delivered emails without a purchase. For records with sales, we can use the variance function to get the sum of squared residuals. First, we need a new calculated field (replicated for both the test and control groups) that nullifies the sales value for records without a conversion, so that the variance function ignores the records with zero sales.
The variance function in Tableau is based on an unbiased estimator (the sum of squared residuals is divided by N-1). We must multiply the variance function by N-1 to get back the sum of squared residuals (where N is the number of delivered emails with sales).
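This step can be sanity-checked in Python with made-up numbers: multiplying the unbiased sample variance (the analogue of Tableau's VAR) by N-1 recovers the raw sum of squared residuals around the purchasers' mean.

```python
import statistics

# Hypothetical sales amounts for purchasers only (zero-sales records nullified).
purchaser_sales = [25.0, 40.0, 60.0, 15.0, 80.0]
n_sales = len(purchaser_sales)

# statistics.variance is the unbiased estimator: sum of squared residuals / (N - 1),
# matching the behavior of Tableau's VAR function.
unbiased_var = statistics.variance(purchaser_sales)

# Multiplying by N - 1 recovers the sum of squared residuals around the purchasers' mean.
ssr_purchasers = unbiased_var * (n_sales - 1)
```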
Unfortunately, when we apply the variance function to individuals with sales, we aren't using the true population mean for SPR; we are using the mean among those who purchased (the average order amount), but the true SPR will be much lower once we account for all the zeros. We need to adjust the calculation to correct for this error.
We are overestimating the SPR by AOV-SPR, which causes us to underestimate the squared difference from the mean for each record by (AOV-SPR)^2. If we multiply this by the number of impacted records (the number of people who purchased), we find we are underestimating the sum of squared residuals by delivered*OPR*(AOV-SPR)^2. We can simply add this amount back into the numerator to accurately calculate the biased weighted sample variance.
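Putting the three pieces together (the weighted zeros, the variance-function term, and the mean correction), we can verify with hypothetical numbers that the assembled sum of squared residuals exactly matches a direct computation over every individual record:

```python
import statistics

# Hypothetical deliveries: 9,700 with no purchase plus 300 purchasers.
purchaser_sales = [25.0, 40.0, 60.0] * 100
delivered = 10_000
n_sales = len(purchaser_sales)
n_zero = delivered - n_sales

spr = sum(purchaser_sales) / delivered   # sales per recipient (true mean)
aov = sum(purchaser_sales) / n_sales     # average order amount among purchasers
opr = n_sales / delivered                # order (purchase) rate

# Sum of squared residuals, pieced together as described in the text:
#   weighted zeros  +  VAR * (N-1) among purchasers  +  the mean correction
ssr = (n_zero * (0 - spr) ** 2
       + statistics.variance(purchaser_sales) * (n_sales - 1)
       + delivered * opr * (aov - spr) ** 2)

biased_weighted_variance = ssr / delivered

# Cross-check against the direct computation over all 10,000 records.
all_sales = [0.0] * n_zero + purchaser_sales
```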
We have now successfully calculated the z-score for the continuous variables. The z-score for the dichotomous variables is much simpler:
Note: BAU metrics are based on the control population and TEST metrics on the test population; metrics with no prefix are based on the pooled population.
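For a dichotomous metric like open rate, the pooled-variance z-score can be sketched in a few lines of Python (the counts below are hypothetical):

```python
import math

# Hypothetical open counts per group.
opens_test, delivered_test = 2_600, 10_000
opens_bau, delivered_bau = 2_450, 10_000

p_test = opens_test / delivered_test
p_bau = opens_bau / delivered_bau

# Pooled open rate across both groups (the "no prefix" population in the note above).
p_pooled = (opens_test + opens_bau) / (delivered_test + delivered_bau)

# Pooled-variance standard error and z-score for the difference in proportions.
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / delivered_test + 1 / delivered_bau))
z = (p_test - p_bau) / se
```

Because everything here is a sum or a count, this calculation aggregates cleanly in Tableau with no individual-level data required.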
Using the Z-Scores to Determine Statistical Significance
Unfortunately, the p-value calculation itself is far too complex to implement in Tableau; however, we can look up the z-score in the standard normal table to get the corresponding p-value. You can then create several p-value buckets based on the z-score cut-offs.
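Outside of Tableau, the same bucketing logic can be expressed directly, using the standard normal CDF in place of the lookup table (the bucket boundaries mirror the ones listed earlier; the function names are mine, for illustration):

```python
import math

def two_sided_p(z: float) -> float:
    """Two-sided p-value from a standard normal z-score (large-sample approximation)."""
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))  # standard normal CDF
    return 2.0 * (1.0 - phi)

def p_bucket(z: float) -> str:
    """Mimic the dashboard's p-value buckets derived from z-score cut-offs."""
    p = two_sided_p(z)
    if p < 0.05:
        return "<5%"
    elif p < 0.10:
        return "5-10%"
    elif p < 0.15:
        return "10-15%"
    elif p < 0.20:
        return "15-20%"
    return ">20%"
```

In Tableau the equivalent would be a CASE/IF calculation comparing the z-score against the fixed cut-off values taken from the standard normal table.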
You can also highlight/flag values in a chart based on whether the lift is statistically significant. In the email dashboard, I gave users the option to select different levels of significance via a parameter, and the lift values were color coded to indicate whether the results are statistically significant at the selected significance level.
*these numbers are not real and were created for illustrative purposes
Impact of Model on Sales per Delivery