14 Replies Latest reply on Jan 15, 2012 9:11 PM by Richard Leeke

    TCRL: Quantile

    Richard Leeke

      My first TCRL contribution...

       

      Name: Quantile

      Alternate names: Median, quartile, decile, percentile, etc (which are all specific examples of quantiles)

       

      Summary:

      Quantiles divide ordered data into a series of essentially equal-sized data subsets; the quantiles are the data values marking the boundaries between consecutive subsets.  (This is paraphrased from the definition given at: http://en.wikipedia.org/wiki/Quantile.)

       

      Definition:

       

      
       IF ([k] == 0) THEN
          LOOKUP(A, 1 - INDEX())
      ELSEIF ([k] == [q]) THEN
          LOOKUP(A, SIZE() - INDEX())
      ELSEIF ((([k]/[q]) * SIZE()) == INT(([k]/[q]) * SIZE())) THEN
          (LOOKUP(A, (([k]/[q]) * SIZE()) - INDEX()) +
           LOOKUP(A, (([k]/[q]) * SIZE()) + 1 - INDEX())) / 2
      ELSE
          LOOKUP(AVG([Sales]), (ROUND((([k]/[q]) * SIZE()) + 0.5, 0) - INDEX()))
      END
      
      


       

      Compute Using: Ordered by A ascending.

      Note that the ordering must be per partition, which may require inclusion of partitioning fields in the list of addressing fields and the use of "Restarting every", as described in this thread: http://www.tableausoftware.com/forum/how-specify-ordering-within-each-partition-independently.

       

      Inputs and Setup:

      A: An aggregate measure such as AVG([Measure]) or a calculation producing one.  Typically the quantile will be calculated across the underlying detail data, so the combination of fields on the shelves must uniquely identify each row within the partition and the most appropriate aggregate function would be ATTR() (but the ordering can not be specified using ATTR()).

      q: The number of subsets to divide the data into (e.g. 2 for median or 100 for percentile).

      k: The required quantile (e.g. 95 for the 95th percentile).

       

      [q] and [k] may be provided by parameters, field values, calculated fields or may be hard-coded.

       

      Example:

       

      
       IF ([k] == 0) THEN
          LOOKUP(ATTR([Sales]), 1 - INDEX())
      ELSEIF ([k] == [q]) THEN
          LOOKUP(ATTR([Sales]), SIZE() - INDEX())
      ELSEIF ((([k]/[q]) * SIZE()) == INT(([k]/[q]) * SIZE())) THEN
          (LOOKUP(ATTR([Sales]), (([k]/[q]) * SIZE()) - INDEX()) +
           LOOKUP(ATTR([Sales]), (([k]/[q]) * SIZE()) + 1 - INDEX())) / 2
      ELSE
          LOOKUP(ATTR([Sales]), (ROUND((([k]/[q]) * SIZE()) + 0.5, 0) - INDEX()))
      END
      
      


       

      Commentary:

      There are numerous different definitions of quantiles, producing subtly different answers, with the differences mainly concerning truncation, rounding or interpolation to choose the required value when the quantile boundary does not fall exactly on a value in the data set.

       

      For the sake of consistency, this calculation implements the algorithm used by Tableau for reference lines.  (Or at least it is intended to!)

       

      This is the 2nd out of 10 algorithms given at http://en.wikipedia.org/wiki/Quantile (referred to as R-2, SAS-5).

       

      Note that the calculation given above does not attempt to validate the parameters given, so inconsistent values for [k] and [q] may produce spurious results.  In the example workbook attached I have included some basic validation (see calculated fields [Quantile Valid] which is used to suppress calculation of invalid results and [Quantile Message] which is displayed on the view Title in the case of error).  I'm not sure if this the best way to handle validation of parameters - I feel another thread coming on.

       

       

      The attached example calculates quantiles (initially percentiles, but this is parameterised) for the sample Coffee Sales data.  This demonstrates the display of percentiles against a continuous dimension, which is not possible with percentile reference lines.  A combo chart showing percentiles displayed over the detailed data is also included.

        • 1. Re: TCRL: Quantile
          Richard Leeke

          It's probably better to replace [k] and [q] with a single floating point number representing [k]/[q], in the range 0 .. 1, as Joe did in this thread:

           

          http://www.tableausoftware.com/forum/using-horizontal-%2526amp;-vertical-reference-lines-delineate-square-areas

           

          which uses this function.

          • 2. Re: TCRL: Quantile
            Jason Scarlett

            I've read through the threads multiple times, and still can't figure out how to apply the quartile/percentile to my data. I think the problem lies in understanding the partitions and how to get INDEX() to sort and number the data ... but I could be wrong.

             

            Attached is a simple workbook where I am trying to calculate the 80th percentile and display it in table form.

             

            As a side note, I also can't figure out how to display the example above in a table format instead of a line chart (I get a Tableau internal error).

            • 3. Re: TCRL: Quantile
              Richard Leeke

              Yes, you have the calculations fine, the problems are all in the partitioning.  That is the hardest part of table calculations - it took me a long time to get my head around them and I still have to stop and think.

               

              I've attached a modified version of your workbook with three extra sheets: [Raw Data] (so I could see what I was doing), [Corrected Partitioning] and [80th Percentile Reference Lines] (to check the results).  I suggest you duplicate your sheet 1 and follow along with these steps to make it come out the same as my [Corrected Partitioning] sheet.

               

              The first issue is that you were not pulling all the rows back from your data source: Tableau needs to see all of the detail rows in order to calculate the percentile.  So you need to include [start_dt] and [end_dt] on Level of Detail.  This immediately makes the table look funny - with lots of text overlaid on top of each other.  We'll come to that.

               

              I see you have included MEDIAN([Time]) in your table - presumably so you can check your answers by setting [k]=50.  That was fine before we dragged [start_dt] and [end_dt] onto Level of Detail, but now it's broken.  That's because the MEDIAN function is calculated in the database at the level of detail determined by all of the dimensions on the shelves.  As we now have enough dimensions to identify each row, we are now getting all of the individual values falling into each cell in the table (Disposition, Site and Date) on top of each other.

               

              To fix that we need to use a calculated field using a special table calculation function: TOTAL().  What TOTAL() does is it causes any aggregate functions in the expression it wraps to be evaluated over a partition of your choice (defined as for all the other table calculations), rather than to be evaluated at the lowest level of detail.  So I defined a calculated field as follows:

               

              TOTAL_MEDIAN_Time = TOTAL(MEDIAN([Time]))

               

              Drag that onto Level of Detail instead of MEDIAN([Time]).  Now you need to define the partitioning.

               

              Click on the pill for [TOTAL_MEDIAN_Time] and select Edit Table Calculation -> Advanced.  You want to calculate the median for each cell in your table, which means for each Disposition, Site and Date.  So leave them on the left in the Available Fields box (these are your partitioning fields) and move [start_dt] and [end_dt] to the Compute Using box.  That means you will calculate the median of the [Time] values for all combinations of [start_dt] and [end_dt] for each partition.  You don't need to specify the sort order for the built-in MEDIAN function to work - the database engine will work that out for itself.

               

              OK, that should give the right answer - but the text will look fairly bold because you've actually calculated the median once for every row in the partition and printed them all on top of each other.  So the last thing you need to do is filter the results so you only display a single mark per cell of the table.  To do that I changed the partitioning on [Index] on [Level of Detail] to be the same as for [TOTAL_MEDIAN_Time] and then copied (ctrl-drag) a copy of that onto the Filter shelf.  Setting the filter range to be 1 to 1 limits the final display to just one row per partition.  This works because filters on table calculation functions (which includes INDEX()) are applied after all table calculations have been evaluated.

               

              Right, now we just need to fix up the partitioning for the percentile calculations.  That is just a simple matter of applying exactly the same partitioning to your two percentile calculations - with the additional step that you need to sort the partitions into [Time] order.  It doesn't really matter which of the aggregate functions you use for the sort, since we have returned individual rows rather than aggregating over multiple rows in the database (MIN(), MAX(), AVG() and SUM() of a single value all give the same answer).  I tend to use MIN() just as a matter of habit.

               

              Lastly, you should see that the Index values aren't all 1 - even though you have filtered the view to only show rows where Index = 1.  The reason for that is that we haven't yet fixed the partitioning on the copy of Index on [Measure Values].  You can have the same table calculation in play on the view multiple times with different partitioning in each case.  To fix it, just drag a copy from [Level of Detail] and overwrite the one [Measure Values].  They should all show as 1.000 in the table now.

               

              You can check the results by setting k=50 and comparing against the median results in the table.  If you set k=80, you can check the results against the sheet I created using Tableau reference lines.

               

              Hopefully that makes things clearer.  let me know how you get on.

               

               

              And by the way - in case it makes you feel any better - it took me quite a while to get my head back around this, I haven't used this for a while and it's certainly not easy.

               

              I also took a quick look at your side-note and tried to turn the line chart in my original example into a table - but it kept breaking when I made all of the dimensions discrete.  I'll try to find time to have another look at that some time...

              • 4. Re: TCRL: Quantile
                Jason Scarlett

                It's a great day when you wake up to find the solution posted to the problem that has been plaguing you for the past week.

                Thanks Richard!

                 

                A few key lessons:

                - need detail (unique) rows .. figured as much but couldn't implement

                - "TOTAL" ... I'll have to investigate this one some more.

                - I was treating the "Compute Using" like the "Partition BY" in oracle, but since my table output table already has a number of dimensions as rows, I don't need to duplicate them here. I only need the within cell partitioning ... I think.

                - Partitioning done by "unique" identifiers ... in the original case end_dt and start_dt were suffice ... if these are not enough to identify an individual record, some sort of unique/primary key would be needed.

                 

                I altered the data slightly to include a primary key (unique) that is needed when the end_dt and start_dt combinations are not unique. I've altered the original data to create.

                 

                Where do I cast my vote to have Tableau to create a function to do this? Thanks again.

                • 5. Re: TCRL: Quantile
                  Richard Leeke

                  Glad that was useful.  A few more comments.

                   

                  Adding a unique identifier really helps a lot - so that was a good move.  Something I glossed over when I wrote that up last night was the issue of sorting within each partition.  Depending on the structure of your data you can end up having to include some of your partitioning fields in your addressing fields list and then use the Restarting Every drop down.  I think that having a unique identifier eliminates the need to worry about that.  I referred to this issue in the original posting with a link to a thread where Ross Bunker partially explained it.  I tried that out again on your data last night and couldn't get it to work - but didn't spend too long on it.  I never really got my head around that properly when Ross explained it (Ross promised that that would be explained fully in a detailed user guide he was writing - but that was six months ago and we're still waiting, so I'm guessing that user guide is getting quite big now).  ;-)

                   

                  > Where do I cast my vote...?

                   

                  There have been lots of postings on the forums over the years on this.  It's a difficult problem because the calculation needs access to all rows, so it doesn't scale well doing on the Tableau side - but most database engines don't support it as an aggregate function.

                   

                  Oracle does support it natively, so you can call that with RAW_SQL functions if you are using Oracle.  Under the covers the Tableau data engine also has a general quantile function - but that is not (yet!) exposed through the UI (and you can't use RAW_SQL with the data engine - so there's no way to get at it - I just happen to have spotted that the data engine implements median with a call of quantile([X], 0.5).

                   

                  Bear in mind too that Tableau will do the calculation for you if you pull all rows back by using reference lines - but this doesn't expose the results in the way that you often want to use.

                   

                  Here are a few threads discussing the issues and options.

                   

                  http://www.tableausoftware.com/support/forum/topic/plotting-quartiles

                   

                  http://www.tableausoftware.com/support/forum/topic/calculating-percentiles

                   

                  http://www.tableausoftware.com/support/forum/topic/calculating-percentiles-rawsql-or-custom-sql-connections

                   

                  Maybe add your voice to one of those or start a new thread in the Wouldn't it Be Nice If forum.

                   

                  Personally the things I would like to see, given the architectural constraints discussed in those threads are:

                   

                  1) Expose QUANTILE() as a function through the user interface for data sources that support it natively (Oracle and the data engine to my knowledge, perhaps others).

                   

                  2) Add a WINDOW_QUANTILE() table calculation function to support direct calculation in Tableau for datasources which don't support it (to save others having to do battle with the complexities you have just been fighting).

                  • 6. Re: TCRL: Quantile
                    Richard Leeke

                    I took another look at the issue I mentioned above about needing to sort each partition independently.  The issue is that the list of values needs to be sorted per partition in order for the quantile calculation to work (it has to find the Nth value in each partition, sorted in the order of the value for which the quantile is being evaluated).  But the sort order specified in the table calc advanced dialog works on the specified aggregate over the whole result set - not per partition.

                     

                    The solution I gave to this in the original TCRL posting was to add any partitioning fields necessary to get the required sort order to the top of the list of addressing fields (ie the Compute Using box), and then set Restarting Every to be the last of these extra fields.  This causes the rows to be sorted within each partition.

                     

                    That worked fine in the example I posted originally, but does not work in the example Jason posted and also doesn't work (for exactly the same reason) if you try to transform my original workbook to display a table.  After poring over this for a while and bouncing it off Joe, we eventually worked out what was happening (with the help of a shared desktop session).  I'll attempt to explain - but it's at the outer limits of both of our understandings, so this isn't going to sound very convincing!

                     

                    In each case the problem seems to be caused by something Tableau does to help with table calculations, when the data is presented in a tabular format with addressing fields on either Rows or Columns.  Ross Bunker explains about "padding the domain" The specified item was not found..  The trouble is, this has the effect of inserting lots of extra NULL rows into the partition - so the quantile calculation picking the Nth row no longer gets the correct row.

                     

                    Joe came up with a solution: create a calculated field which concatenates all of the partitioning fields, add that to LoD and make it the first addressing field.  There is now no longer any need to use Restarting Every.  This effectively achieves the sort within the partition, but in some way I don't quite understand avoids the domain being padded.  Possibly it just prevents Tableau from realising that one of the Rows or Columns fields is also effectively an addressing field.  Or something like that.

                     

                    I've attached an updated version of Jason's workbook, showing Joe's solution.  Note that in this case, the answers are exactly the same as the original sheet (which does not worry about the sort order) - this is just down to the data distribution in the small sample set of data.

                     

                    There is also a sheet showing what appears to be another side-effect of "padding the domain" - Tableau gets confused about how many result columns are expected and throws an error (which I've logged).

                     

                    I've also attached an updated copy of my original sample workbook, where the sort order is important.  This includes three tabular sheets.  The first has a continuous Date dimension on columns and works fine.  In the next sheet I've converted this to discrete, which for some reason I don't really understand results in the domain being padded, which breaks the calculation.  The third sheet uses Joe's composite field sorting trick to fix it up again.

                    • 7. Re: TCRL: Quantile
                      guest contributor

                      I am VERY fresh to Tableau and I have a requirement to create a bar graph with deciles running across the X axis based on Sales and count of Territories that fall into each decile on the Y axis.  Is the post above going down the correct road?  I have tried to adapt it to my needs, but unsuccessfully thus far.

                      • 8. Re: TCRL: Quantile
                        Joe Mako

                        Jay,

                         

                        It would be best if you start a new thread/topic in one of the other forum subject areas, like at http://www.tableausoftware.com/node/add/forum/97 and when you do, assistance can be provided if you provide a sample data set and describe what you want for a result, or mock up an image.

                        • 9. Re: TCRL: Quantile
                          Richard Leeke

                          I've come up with a much better way to calculate quantiles, so I'll be posting a new TCRL thread about that sometime soon.  In the meantime, there is an example on Public in this posting on my company blog.

                          • 10. Re: TCRL: Quantile
                            D W

                            Subscribing to thread. Can't wait to see the new method, Richard.

                            • 11. Re: TCRL: Quantile
                              Richard Leeke

                              I've written up the new version here.

                               

                              I think the new version will be better in all cases, so don't use the method described in this thread.

                              • 12. Re: TCRL: Quantile
                                Matt Cline

                                I'm having some difficulties with this method of finding quartiles. I've been trying to implement this method with no avail. Any help would be greatly appreciated. Attached is my current workplace. I'm trying to view quartiles by [stream total]. The logic for this operation is in the calculated field called 'quartiles'. Thanks.

                                • 13. Re: TCRL: Quantile
                                  Richard Leeke

                                  I'll have a look when I get a chance.

                                   

                                  I really recommend you use the new version described in Table Calculation: Quantile (version 2) rather than the version in this thread.

                                   

                                  I haven't looked at your workbook yet, so I'm not sure which one you've used.  If you've used the original version then first try switching to version 2 and report back.

                                  • 14. Re: TCRL: Quantile
                                    Richard Leeke

                                    I see you were already using the new version.  You had got the calculation defined correctly, but as you didn't have the calculation in use anywhere, I can't see exactly what trouble you were having with it.  I expect it will have been in getting the set-up of the sheet and the partitioning definition for the calculation correct - that's always the hardest part with table calculations.

                                     

                                    I've added a sheet showing how to use it and added that on to the dashboard so you can compare the values with the results of Tableau's reference lines.  It gives the same answer, which was a bit of a relief.  ;-)