Quantiles divide ordered data into a series of essentially equal-sized data subsets. The quantiles are the data values marking the boundaries between consecutive subsets. The most commonly used numbers of divisions are percentiles (100), deciles (10), quartiles (4) and median (2).
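As a quick illustration, Python's standard-library `statistics.quantiles` computes such boundary values directly (note that Python's default method uses a different interpolation rule from the one used in this article, so exact boundary values can differ slightly between tools):

```python
import statistics

# Eight ordered values; the three quartile cut points divide them
# into four equal-sized subsets.
data = [1, 2, 3, 4, 5, 6, 7, 8]
print(statistics.quantiles(data, n=4))  # three quartile boundaries
print(statistics.median(data))          # the single median boundary
```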
This is a simplified version of the calculation described previously. This version is designed for Tableau version 8 or above only, since it takes advantage of table calculation optimisations introduced in version 8.
// p = 0 => minimum, so lookup the first row
IF ([p] == 0) THEN
    LOOKUP(ATTR([A]), FIRST())
// p = 1 => maximum, so lookup the last row
ELSEIF ([p] == 1) THEN
    LOOKUP(ATTR([A]), LAST())
// otherwise select the appropriate value
ELSE
    // The WINDOW_MIN() finds the lowest value at or beyond the required quantile.
    WINDOW_MIN(
        // required quantile is at the mid-point between two values, so take the average
        IF ([p] * TOTAL(SUM([Number of Records])) = RUNNING_SUM(SUM([Number of Records]))) THEN
            (ATTR([A]) + LOOKUP(ATTR([A]), 1)) / 2
        // row is beyond required quantile
        ELSEIF ([p] * TOTAL(SUM([Number of Records])) < RUNNING_SUM(SUM([Number of Records]))) THEN
            ATTR([A])
        ELSE
            NULL
        END
    )
END
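The table-calculation logic can be mirrored in plain Python (an illustrative sketch, not Tableau code; here one count per row stands in for SUM([Number of Records]), and the function name is hypothetical):

```python
def quantile_r2(values, p):
    """R-2 quantile: walk a running count over the sorted values and return
    the first value at or beyond p * total, averaging with the next value
    when the boundary falls exactly between two rows. This mirrors the
    WINDOW_MIN / RUNNING_SUM logic of the Tableau calculation above."""
    xs = sorted(values)
    total = len(xs)
    if p == 0:                      # p = 0 => minimum (first row)
        return xs[0]
    if p == 1:                      # p = 1 => maximum (last row)
        return xs[-1]
    running = 0
    for i, x in enumerate(xs):
        running += 1                # running sum of row counts
        if p * total == running:    # boundary exactly between two values
            return (x + xs[i + 1]) / 2
        if p * total < running:     # first row at or beyond the quantile
            return x
```

For example, `quantile_r2([10, 20, 30, 40], 0.5)` returns 25.0, the average of the two middle values.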
Inputs and Setup:
[A]: A numeric field, typically a measure. All values of A need to be available to the calculation, so A must be placed on a shelf (typically Level of Detail) disaggregated (i.e. as a dimension).
[p]: The required quantile in the range 0 to 1 (e.g. 0.95 for the 95th percentile). [p] may be provided by a parameter, a field value or may be hard-coded.
Partitioning and Addressing:
Compute Using [A]
The calculation is partitioned by all other dimensions in use in the view and is well-behaved as dimensions are added or removed. Note that there is no need to use advanced partitioning and ordering - the default ordering is to sort by the Compute Using field, which is the order required for the calculation.
There are numerous different definitions of quantiles, producing subtly different answers; the differences mainly concern the truncation, rounding or interpolation used to choose the required value when the quantile boundary does not fall exactly on a value in the data set.
This is the 2nd out of 10 algorithms given on Wikipedia (referred to as R-2, SAS-5). It has been chosen for consistency, since it is the algorithm used by Tableau for reference lines.
This calculation scales well to very large data volumes, since it does not require all underlying rows to be returned to Tableau (just counts of each distinct value).
It can be sped up further, and made to work over essentially unlimited data set sizes, by reducing the precision of the result. The technique to achieve this and an example workbook are described on my company blog here.
As expressed above, the calculation returns the same value for each row in the partition. In most cases, however, what is really needed is an aggregate function returning a single row for each partition. This can be achieved by filtering to limit the results - either by defining a filter condition such as (FIRST() == 0), or by wrapping the entire calculation in an IF (FIRST() == 0) THEN ... END block and filtering out the NULLs. Both of these approaches are demonstrated in the attached sample workbook.
The sample workbook shown below calculates a chosen percentile Sales value for the sample Coffee Sales data. This demonstrates the display of percentiles against a continuous dimension, which is not possible with percentile reference lines. The percentile value to be shown is selected with a parameter.
Note that the sample workbook also includes an alternative, slightly simpler algorithm (the 1st of the Wikipedia algorithms, referred to as R-1, SAS-3); see the second tab. The simpler algorithm differs in that it does not interpolate when the requested quantile falls at the mid-point between two values. This can lead to slight differences, which are most commonly visible when calculating medians, as shown on the third tab.
To understand the reason for the differences, consider evaluating the median of the values:
10, 20, 30, 40
The algorithm used in the calculation given in this article would return 25 (the average of 20 and 30) whilst the simpler algorithm would return 20.
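The two behaviours can be compared with short Python sketches of the Wikipedia definitions, specialised to the median (illustrative helper names, not Tableau code):

```python
import math

def median_r2(values):
    """R-2 behaviour: average the two middle values when the count is even."""
    xs = sorted(values)
    n = len(xs)
    if n % 2 == 0:
        return (xs[n // 2 - 1] + xs[n // 2]) / 2
    return xs[n // 2]

def median_r1(values):
    """R-1 behaviour: no interpolation; take the value at ceil(n * 0.5)."""
    xs = sorted(values)
    return xs[math.ceil(len(xs) * 0.5) - 1]
```

`median_r2([10, 20, 30, 40])` gives 25.0 while `median_r1([10, 20, 30, 40])` gives 20, matching the values quoted above; for an odd number of values the two agree.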
On a large data set, the simpler algorithm can be around 20% faster than the one quoted above.
Thanks to Joe Mako for his assistance in reviewing this calculation, and in particular for suggesting some simplifications which reduced the run time significantly, and for suggesting including the alternative, simpler algorithm.
Equinox blog posting describing how the calculation works and also explaining how to make it faster and more scalable by reducing the precision of the result.