Name: Quantile (version 2)
Alternate names: Median, quartile, decile, percentile, etc (which are all specific examples of quantiles)
This is an improved version of the calculation described here.
Quantiles divide ordered data into a series of essentially equal-sized data subsets; the quantiles are the data values marking the boundaries between consecutive subsets. (This is paraphrased from the definition given on Wikipedia.)
// Only calculate for first row of the partition, otherwise return NULL IF (FIRST()==0) THEN // p = 0 => minimum, and we are on the first row, so we just want the current value IF ([p] == 0) THEN ATTR([A]) // p = 1 => maximum, so lookup the last row ELSEIF ([p] == 1) THEN LOOKUP(ATTR([A]), LAST()) // otherwise select the appropriate value ELSE // The WINDOW_MIN() finds the lowest value at or beyond the required quantile. WINDOW_MIN( // required quantile is at the mid-point between two values, so take the average IF [p] * TOTAL(SUM([Number of Records])) = RUNNING_SUM(SUM([Number of Records])) THEN (ATTR([A]) + LOOKUP(ATTR([A]), 1)) / 2 // row is beyond required quantile ELSEIF [p] * TOTAL(SUM([Number of Records])) < RUNNING_SUM(SUM([Number of Records])) THEN ATTR([A]) ELSE NULL END, 0, IIF(FIRST()==0, LAST(), 0) ) END END
Compute Using: A
Note that unlike the earlier version, there is no need to use advanced partitioning and ordering - the default ordering is to sort by the Compute Using field, which is the order required for the calculation.
Inputs and Setup:
A: A numeric field, typically a measure. All values of A need to be available to the calculation, so A must be placed on a shelf (typically Level of Detail) disaggregated (i.e. as a dimension).
p: The required quantile (e.g. 0.95 for the 95th percentile).
[p] may be provided by a parameter, a field value or may be hard-coded.
As quoted above, the calculation only returns the quantile value for the first row of each partition and NULL for all other rows. This is for efficiency reasons and is designed to allow the calculation to act as an aggregate function. Typically all other rows in the partition will be removed from the view after evaluation of the calculated field. This can be achieved either by placing a copy of the field on the filters shelf and choosing "Special", "Non-null values", or by formatting the marks to hide NULL values.
If the function is needed as an analytic function, with the result avaialble for all rows, the outer IF (FIRST()==0) THEN ... END should be removed.
IF (FIRST()==0) THEN IF ([p] == 0) THEN ATTR([Sales]) ELSEIF ([p] == 1) THEN LOOKUP(ATTR([Sales]), LAST()) ELSE WINDOW_MIN( IF [p] * TOTAL(SUM([Number of Records])) = RUNNING_SUM(SUM([Number of Records])) THEN (ATTR([Sales]) + LOOKUP(ATTR([Sales]), 1)) / 2 ELSEIF [p] * TOTAL(SUM([Number of Records])) < RUNNING_SUM(SUM([Number of Records])) THEN ATTR([Sales]) ELSE NULL END, 0, IIF(FIRST()==0, LAST(), 0) ) END END
There are numerous different definitions of quantiles, producing subtly different answers, with the differences mainly concerning truncation, rounding or interpolation to choose the required value when the quantile boundary does not fall exactly on a value in the data set.
This is the 2nd out of 10 algorithms given on Wikipedia (referred to as R-2, SAS-5). It has been chosen for consistency, since it is the algorithm used by Tableau for reference lines.
This calculation is a substantial improvement on the one used in the original TCRL quantile calculation, for several reasons.
1) The partitioning and ordering definition is much simpler, which makes it much easier to get the calculation working and much more robust as the view layout changes (i.e. dragging new dimensions onto shelves doesn't keep breaking it).
2) As it does not require all underlying rows to be returned to Tableau (just counts of each distinct value), it is much faster and works on much larger data sets.
3) It can be speeded up further and made to work over essentially unlimited size data sets by reducing the precision of the result. The technique to achieve this and an example workbook are described here.
The attached sample workbook calculates a chosen percentile Sales value for the sample Coffee Sales data. This demonstrates the display of percentiles against a continuous dimension, which is not possible with percentile reference lines.
Note that the sample workbook also includes an alternative, slightly simpler algorithm (the 1st one in the Wikipedia article, referred to as R-1, SAS-3). The simpler algorithm differs in that it does not interpolate when the requested quantile falls at the mid-point between two values. This can lead to slight differences, which are most commonly visible when calculating medians. For example, when evaluating the median of the values:
10, 20, 30, 40
the algorithm used in the above calculation would return 25 (the average of 20 and 30) whilst the simpler algorithm would return 20.
On a large data set, the simpler algorithm can be around 20% faster than the one quoted above.
Thanks to Joe Mako for his assistance in reviewing this calculation, and in particular for suggesting some simplifications which reduced the run time significantly, and for suggesting including the alternative, simpler algorithm.
quantile_version_2_sample.twbx 202.9 KB