1 2 3 Previous Next 86 Replies Latest reply on Dec 12, 2015 6:58 AM by Madhusudhan Khambham

    Table Calculation: Quantile (version 2)

    Richard Leeke

      Name: Quantile (version 2)

      Alternate names: Median, quartile, decile, percentile, etc (which are all specific examples of quantiles)

       

      Summary:

      This is an improved version of the calculation described here.

       

      Quantiles divide ordered data into a series of essentially equal-sized data subsets; the quantiles are the data values marking the boundaries between consecutive subsets. (This is paraphrased from the definition given on Wikipedia.)

       

      Definition:

       

      
       // Only calculate for first row of the partition, otherwise return NULL
      IF (FIRST()==0) THEN
          // p = 0 => minimum, and we are on the first row, so we just want the current value
          IF ([p] == 0) THEN
              ATTR([A])
          // p = 1 => maximum, so lookup the last row
          ELSEIF ([p] == 1) THEN
              LOOKUP(ATTR([A]), LAST())
          // otherwise select the appropriate value
          ELSE
              // The WINDOW_MIN() finds the lowest value at or beyond the required quantile.
              WINDOW_MIN(
                  // required quantile is at the mid-point between two values, so take the average
                  IF [p] * TOTAL(SUM([Number of Records])) = RUNNING_SUM(SUM([Number of Records])) THEN
                      (ATTR([A]) + LOOKUP(ATTR([A]), 1)) / 2
                  // row is beyond required quantile
                  ELSEIF [p] * TOTAL(SUM([Number of Records])) < RUNNING_SUM(SUM([Number of Records])) THEN
                      ATTR([A])
                  ELSE
                      NULL
                  END,
                  0,
                  IIF(FIRST()==0, LAST(), 0)
              )
          END
      END
      
      


       

      Compute Using: A

       

      Note that unlike the earlier version, there is no need to use advanced partitioning and ordering - the default ordering is to sort by the Compute Using field, which is the order required for the calculation.

       

      Inputs and Setup:

      A: A numeric field, typically a measure. All values of A need to be available to the calculation, so A must be placed on a shelf (typically Level of Detail) disaggregated (i.e. as a dimension).

      p: The required quantile (e.g. 0.95 for the 95th percentile).

       

      [p] may be provided by a parameter, a field value or may be hard-coded.

       

      As quoted above, the calculation only returns the quantile value for the first row of each partition and NULL for all other rows.  This is for efficiency reasons and is designed to allow the calculation to act as an aggregate function.  Typically all other rows in the partition will be removed from the view after evaluation of the calculated field.  This can be achieved either by placing a copy of the field on the filters shelf and choosing "Special", "Non-null values", or by formatting the marks to hide NULL values.

       

      If the function is needed as an analytic function, with the result avaialble for all rows, the outer IF (FIRST()==0) THEN ... END should be removed.

       

      Example:

       

      
       IF (FIRST()==0) THEN
          IF ([p] == 0) THEN
              ATTR([Sales])
          ELSEIF ([p] == 1) THEN
              LOOKUP(ATTR([Sales]), LAST())
          ELSE
              WINDOW_MIN(
                  IF [p] * TOTAL(SUM([Number of Records])) = RUNNING_SUM(SUM([Number of Records])) THEN
                      (ATTR([Sales]) + LOOKUP(ATTR([Sales]), 1)) / 2
                  ELSEIF [p] * TOTAL(SUM([Number of Records])) < RUNNING_SUM(SUM([Number of Records])) THEN
                      ATTR([Sales])
                  ELSE
                      NULL
                  END,
                  0,
                  IIF(FIRST()==0, LAST(), 0)
              )
          END
      END
      
      


       

      Commentary:

      There are numerous different definitions of quantiles, producing subtly different answers, with the differences mainly concerning truncation, rounding or interpolation to choose the required value when the quantile boundary does not fall exactly on a value in the data set.

       

      This is the 2nd out of 10 algorithms given on Wikipedia (referred to as R-2, SAS-5).  It has been chosen for consistency, since it is the algorithm used by Tableau for reference lines.

       

       

      This calculation is a substantial improvement on the one used in the original TCRL quantile calculation, for several reasons.

       

      1) The partitioning and ordering definition is much simpler, which makes it much easier to get the calculation working and much more robust as the view layout changes (i.e. dragging new dimensions onto shelves doesn't keep breaking it).

       

      2) As it does not require all underlying rows to be returned to Tableau (just counts of each distinct value), it is much faster and works on much larger data sets.

       

      3) It can be speeded up further and made to work over essentially unlimited size data sets by reducing the precision of the result.  The technique to achieve this and an example workbook are described here.

       

       

      The attached sample workbook calculates a chosen percentile Sales value for the sample Coffee Sales data.  This demonstrates the display of percentiles against a continuous dimension, which is not possible with percentile reference lines.

       

      Note that the sample workbook also includes an alternative, slightly simpler algorithm (the 1st one in the Wikipedia article, referred to as R-1, SAS-3).  The simpler algorithm differs in that it does not interpolate when the requested quantile falls at the mid-point between two values.  This can lead to slight differences, which are most commonly visible when calculating medians.  For example, when evaluating the median of the values:

       

          10, 20, 30, 40

       

      the algorithm used in the above calculation would return 25 (the average of 20 and 30) whilst the simpler algorithm would return 20.

       

      On a large data set, the simpler algorithm can be around 20% faster than the one quoted above.

       

       

      Thanks to Joe Mako for his assistance in reviewing this calculation, and in particular for suggesting some simplifications which reduced the run time significantly, and for suggesting including the alternative, simpler algorithm.

        • 1. Re: Table Calculation: Quantile (version 2)
          James Baker

          Awesome!  I look forward to spending some more time with this later... :)

          • 2. Re: Table Calculation: Quantile (version 2)
            Richard Leeke

            One slight correction:  If the function is needed as an analytic function, with the result available for all rows, the outer IF (FIRST()==0) THEN ... END should be removed AND the WINDOW_MIN() needs to have the partitioning clause removed (I missed the 2nd bit).  So the analystic version of the function would become:

             

             

            
             IF ([p] == 0) THEN
                ATTR([A])
            ELSEIF ([p] == 1) THEN
                LOOKUP(ATTR([A]), LAST())
            ELSE
                WINDOW_MIN(
                    IF [p] * TOTAL(SUM([Number of Records])) = RUNNING_SUM(SUM([Number of Records])) THEN
                        (ATTR([A]) + LOOKUP(ATTR([A]), 1)) / 2
                    ELSEIF [p] * TOTAL(SUM([Number of Records])) < RUNNING_SUM(SUM([Number of Records])) THEN
                        ATTR([A])
                    ELSE
                        NULL
                    END
                )
            END
            
            


            • 3. Re: Table Calculation: Quantile (version 2)
              Richard Leeke

              Actually, thinking about that last comment again - if there is a significant number of rows in the partitions (a few thousand or more), then it will be really slow without the code to optimise the WINDOW_MIN() function.  So for the analytic version (returning the result for all rows), the expression should be this:

               

              
               PREVIOUS_VALUE(
                  IF (FIRST()==0) THEN
                      IF ([p] == 0) THEN
                          ATTR([A])
                      ELSEIF ([p] == 1) THEN
                          LOOKUP(ATTR([A]), LAST())
                      ELSE
                          WINDOW_MIN(
                              IF [p] * TOTAL(SUM([Number of Records])) = RUNNING_SUM(SUM([Number of Records])) THEN
                                  (ATTR([A]) + LOOKUP(ATTR([A]), 1)) / 2
                              ELSEIF [p] * TOTAL(SUM([Number of Records])) < RUNNING_SUM(SUM([Number of Records])) THEN
                                  ATTR([A])
                              ELSE
                                  NULL
                              END,
                              0,
                              IIF(FIRST()==0, LAST(), 0)
                          )
                      END
                  END
              )
              
              


              • 4. Re: Table Calculation: Quantile (version 2)
                Jesse Gebhardt

                I came up with a slightly different take on the Quantile problem.  I believe my workbook answers a different question than your approach.  This will help users that want to 'break Customers (or some dimension) into deciles/quartiles/etc. and then sum/avg/etc. a particular measure for all the customers in that bin.

                 

                I dont know if this is the best place to post it, but I figure you'd help me find the best forum for it.

                • 5. Re: Table Calculation: Quantile (version 2)

                  Thank you very much for sharing this solution.  I believe I am trying to accomplish something very similar, but perhaps a bit more complex.

                   

                  I'm looking to create n percentiles (N tiles as labeled in the prior tableau example) where there is the same summed balance in each percentile. Some cohorts will have very many data elements (each with a small balance) while others will have just a few data elements (each with a large balance).  I'm then looking to calculate either the balance weighted average CPR (another field value) or just the last value CPR for a data element in that category.  Either is acceptable. 

                   

                  I've attached a sample data set.  Another wrinkle is that this will need to be performed based on additional groupings in the data (such as date, or coupon).  I included coupon in the sample data set to give you an example of it.  Hopefully this doesn't really complicate matters.

                   

                  One other thing that is key.  The data needs to first be sorted by CPR so that as one moves up the percentiles the CPR increases/decreases.  Here is an example I worked up in Excel:

                   

                  CouponSummed BalancePercentile SizeLast CPR
                  4.5 $ 378,136,728,898 $ 37,813,672,890        0.01
                  4.5 $ 378,136,728,898 $ 37,813,672,890        0.10
                  4.5 $ 378,136,728,898 $ 37,813,672,890        0.20
                  4.5 $ 378,136,728,898 $ 37,813,672,890        0.26
                  4.5 $ 378,136,728,898 $ 37,813,672,890        0.26
                  4.5 $ 378,136,728,898 $ 37,813,672,890        0.28
                  4.5 $ 378,136,728,898 $ 37,813,672,890        0.29
                  4.5 $ 378,136,728,898 $ 37,813,672,890        0.36
                  4.5 $ 378,136,728,898 $ 37,813,672,890        0.40
                  4.5 $ 378,136,728,898 $ 37,813,672,890        0.95
                  5.0 $ 284,355,665,826 $ 28,435,566,583        0.12
                  5.0 $ 284,355,665,826 $ 28,435,566,583        0.20
                  5.0 $ 284,355,665,826 $ 28,435,566,583        0.24
                  5.0 $ 284,355,665,826 $ 28,435,566,583        0.28
                  5.0 $ 284,355,665,826 $ 28,435,566,583        0.30
                  5.0 $ 284,355,665,826 $ 28,435,566,583        0.32
                  5.0 $ 284,355,665,826 $ 28,435,566,583        0.35
                  5.0 $ 284,355,665,826 $ 28,435,566,583        0.38
                  5.0 $ 284,355,665,826 $ 28,435,566,583        0.45
                  5.0 $ 284,355,665,826 $ 28,435,566,583        0.98

                   

                   

                  Thanks,


                  David

                  • 6. Re: Table Calculation: Quantile (version 2)
                    Natasha Thurston

                    This is fantastic, and does exactly what I need. Thanks so much for taking the time to share it with us!!!

                    • 7. Re: Table Calculation: Quantile (version 2)
                      Jonathan Drummey

                      This is an awesome bit of work. One note, in Tableau 7 I had to explicitly set the Compute using/Order Along to sort along the measure, otherwise the quantiles were calculated along a different sort and were incorrect. This might be a padding issue or due to other table calcs that were in the view.

                      • 8. Re: Table Calculation: Quantile (version 2)
                        Richard Leeke

                        Thanks very much for the heads-up on the Compute Using change, Jonathan - that would be such an easy trap to fall into. I'll check this out - and warn my team - it's quite possible that some workbooks may have stopped working since version 7.

                        • 9. Re: Table Calculation: Quantile (version 2)
                          Richard Leeke

                          Can you post a bit more detail (or a sample workbook) to explain the compute using issue with version 7, Jonathan? I'm really keen to understand it - but I've looked at a few workbooks using this approach and they give the same results in 6.1 and 7.0, so I'm guessing there's something to do with the struture of your view, as you say.

                          • 10. Re: Table Calculation: Quantile (version 2)
                            Jonathan Drummey

                            Hi Richard,

                             

                            You'll recognize the basic situation, this is the next step in the the partitioning table calculations post you replied to on Friday/Saturday. In the attached sample workbook, the goal is to compute an Overall Rank and Percentile based on an Average Rank calculated outside Tableau, and I found that for those calculations to work I had to explicitly set them to be computed using the Average Rank in ascending order.

                             

                            There are two things that I'm wondering about that might be affecting this: The raw data is sorted on the Provider name and then the Measure Name in ascending alphabetical order, and Tableau 7.0 seems to be using that same sort. Also, there are two dimensions being used in the view.

                             

                            Also, I tried this with 6.1 and go the same results. Any thoughts  you have are welcome!

                             

                            Jonathan

                            • 11. Re: Table Calculation: Quantile (version 2)
                              Richard Leeke

                              Hi Jonathan

                               

                              OK, I see what is happening. Basically you need to set Compute Using to [Average Rank], rather than Provider. Here's what I did in the attached version of your workbook.

                               

                              1) I removed [Measure Name] (as opposed to the built in [Measure Names] from the view as you really don't need these - since the external ranking has already produced the average across all of these.

                               

                              2) I also changed Provider to ATTR(Provider) - you don't want to partition by this - you want to compute the percentiles across all Providers.

                               

                              3) I changed the calculations to Compute Using [Average Rank] - this automatically sorts by the value that you are calculating the percentile of. Note that I had to make [Average Rank] a dimension - I explicitly want one row for each distinct value of [Average Rank]. The percentile calculation allows for the number of instances of each value by factoring in [Number of Records].

                               

                              Note that this last point (letting the database count the duplicates) is where this algorithm gets a massive performance gain from over previous ways I've tried. (I've run this successfully on 100 million rows.)

                               

                              So it's not a change in behaviour with Tableau 7 after all. I'm mighty relieved - I thought we might have been mis-reporting results these last couple of months.

                              • 12. Re: Table Calculation: Quantile (version 2)
                                Jonathan Drummey

                                Ok, setting the Compute Using to [Average Rank] totally makes sense - every once in awhile I confuse myself about how Tableau sorts, and this was one of those times. And I'd somehow skipped over that explicit instruction you'd written about setting the Compute Using to the measure. Thanks for the correction!

                                 

                                FYI, I'd had the [Measure Name] on the view because this is new enough that the physicians need to see the raw rates per measure so they can see enough of the data they already know (their raw rates) in order to get over their defensiveness about being ranked, and I wanted to be sure that the calculations would work properly in that case. I'll play with it to get it to work.

                                .

                                Thanks again,

                                 

                                Jonathan

                                • 13. Re: Table Calculation: Quantile (version 2)
                                  Joel Mullis

                                  I am trying to use quartile metrics to show specific trends as it relates to using the average.  My issue is how long Tableau takes with large data sets.  45k lines or so.  I am sure it is the way that I have written my formulas for the 1st,median,3rd quartiles.  I have read the string but can't figure out why my data is taking so long to update. 

                                   

                                  Attached is the workbook that I am working on.

                                   

                                  Any help would be greatly appreciated. 

                                   

                                  Joel

                                  • 14. Re: Table Calculation: Quantile (version 2)
                                    Richard Leeke

                                    Probably just the issue with WINDOW_xxx() functions not performing well. That's why I've got really convoluted window expresssions in my quantile formula.

                                     

                                    I'll have a go at tweaking your calculations and also see how my version behaves on your data when I get a chance.

                                    1 2 3 Previous Next