1 2 3 4 5 6 Previous Next 86 Replies Latest reply on Dec 12, 2015 6:58 AM by Madhusudhan Khambham Go to original post
      • 30. Re: Table Calculation: Quantile (version 2)
        Joel Mullis



        Did not mean to confused.  I don't know the difference between packaged and unpackaged.  Just saved the file as a workbook and forwarded.


        I just tried to save a workbook with 45K lines of data but it only shows 10K.  How do get all 45K lines in Tableau?



        • 31. Re: Table Calculation: Quantile (version 2)
          Richard Leeke

          Use the File->Save As menu option and select Save as type: Tableau packaged workbook.


          That will include a copy of the your data in the file.


          Depending on your data source you may need to create an extract (it will bundle a text file or a spreadsheet into the packaged workbook, but you'll need an extract if you are connected to a live data source).


          Only do this with data you are happy to share publicly, of course!

          • 32. Re: Table Calculation: Quantile (version 2)
            Joel Mullis



            Let's try this one more time. 


            Attached is a workbook with the data I am working with.  I can't seem to get the paid loss quantiles to work.  You will notice that I created a measures view to have all of the drivers displayed at once instead of a variable to choose function.


            In addition, the calculations take over a minute to refresh.  Is that a normal speed for this amount of data.


            Thanks so much for your help and your patience with me.



            • 33. Re: Table Calculation: Quantile (version 2)
              Richard Leeke

              I just had a quick look and I can see a bit of what is going on - but there seems to be some deep dark magic in here which I haven't fully got my head around.


              The way that you have the data structured actually means that there are about 171,000 rows in the data extract (4 times 42,000 rows). What I don't quite understand yet is why there are over 1 million marks on display in sheet 1 (look in the status bar in the bottom left hand corner).


              In terms of the missing values paid loss - those blank entries in the grid are actually lots of nulls (drag the cursor over one of the blank places on the grid and you will see Tableau saying that you have just selected tens of thousands of rows).


              I'm not going to spend ages working out why your calculations aren't working, but I will have a go at recasting the workbook to use the approach to calculating percentiles which I describe at the top of the thread. Having a look at the distribution of numbers I think that will be much quicker (brave words). I'll do that as my brain-teaser this evening.

              • 34. Re: Table Calculation: Quantile (version 2)
                Richard Leeke

                Well this was a gnarly one! There was lots of interesting stuff going on here which had me scratching my head for a while. At least one of the things that was wrong looks like a mistake I introduced for you when I looked at this a few weeks ago!


                I've attached a version of your workbook with a sheet which refreshes in about 6 seconds and (I hope) gives the correct answer, plus a couple of sheets which help explain what was going on. I'll explain the issues and (for the benefit of anyone else reading the thread) a little bit about the technique you are using to allow you to perform the same set of calculations across multiple measures without having to repeat all of the calculations.


                This is going to turn into a monster post, again, I can just sense it...


                1) The first thing which really confused me for a while was that your data source has a field in it called [Number of Records]. The presence of this field means that the standard Tableau calculated field [Number of Records] (which just returns a value of 1 for each row) is not present. So if you use the expression SUM([Number of Records]) expecting to find the number of rows you get a completely wrong answer. Look at sheet [Number of Records]. There are actually 171,096 rows in the data source (4 copies of your 42,000 rows, as explained below), but SUM([Number of Records]) gives 3,659,315,700. CNT([Number of Records]) shows the correct number of rows. That's a bit of a trap I haven't seen before.


                2) Your original data source has 42,000 rows, but as you want to work with various calculations (min, 1st, 2nd and 3rd quartile, max and average) for each of 4 different measures, you have defined a custom SQL connection which returns all 42,000 rows 4 times, once for each of the 4 measures you want to work with. You have a single calculated field (called [Variable]) which returns the appropriate measure, depending on the value of the [measure] field, which has the names of each of the 4 measures you want to work with.


                I presume you got this technique from Joe - he uses this frequently. I'm ambivalent about this approach. It certainly avoids repetition of calculated fields with the associated chance of inconsistent definitions (roll on user defined functions!). But it increases the number of rows returned to Tableau, which can have a speed impact. It also increases the physical size of your data extract or packaged workbook considerably if you are using those.


                Anyway, I stuck with that approach for the workbook I've posted back.


                3) With 4 copies of your 42,000 rows you now have 171,000 rows in your data extract. The approach you are taking to calculate the quantiles requires you to return all rows from the data source to Tableau, so all of the calculations have to process all 171,000 rows. As you have 6 calculated fields in the view, using [Measure Values], that means 6 times 171,000 = 1,026,576 values to be displayed. If you look at the status bar in the bottom left hand corner for Sheet 1 you will see it says it has that many marks on display.


                Obviously it doesn't look as if you have a million marks on display - it looks as if you have a grid of 72 marks (3 divisions by 4 measures by 6 calculations), some of which are mysteriously blank. But actually what you have is lots of marks on top of each other in each of those cells, with all but one of the marks being NULL.


                The reason for all the NULL values is because you are using a technique for speeding up table calculations in situations like this, where you only want a single value to be displayed from a large number of underlying rows. The IF (FIRST()==0) THEN which wraps each of your calculations causes the calculation to return an answer for the first row in the partition and NULL for all other rows. But as you have it, Tableau still thinks that it has all those rows to display.


                I usually define a filter to suppress all the unwanted rows from the display - although in practice you don't need to and it doesn't make a lot of difference to the speed of refreshing the view. Your speed problem isn't that you are trying to draw the 1 million marks, its that you are processing 171,000 rows for each of 6 calculations. I'll come to that in a minute, lets work through why the results are wrong first. ;-)


                4) The first thing which stands out in looking at your [Sheet 1] results is that the second quartile (median) calculation is clearly wrong:


                Wrong Median.PNG


                The 2nd quartile can't be less than the first quartile.  This turns out to be an error I introduced when I was speeding up an earlier version of this for you a few weeks ago. I noticed that you were using the WINDOW_MEDIAN() function and I thought I would speed it up for you by changing it to use the MEDIAN() function instead. MEDIAN() is calculated in the data engine and is really quick, WINDOW_MEDIAN() is calculated in Tableau and takes much longer. But given the layout of the data, I should have made that TOTAL(MEDIAN()) - to force the median to be calculated over the whole partition. As I have it, the calculation is just done individually on each row (which just returns the value of that one row), and then the enclosing IF (FIRST()==0) just returns the value from the first row - which is why you are getting the same answer as for the minimum. a couple of rows above.


                Apart from the median, the answers are correct for Cycle Time, but not for any of the other measures.


                5) The issue you raised was that the Paid Loss Quantiles are (mostly) blank. There are two factors contributing to that. Firstly, most of the values for Paid Loss are NULL (see sheet [Counts of NULLs]), so the blank cells are ones where the value that has been picked out by your quartile calculation are NULL.


                Now look at sheet [Why Paid Loss is Wrong]. In that sheet I have showed all the values of Paid Loss for the WEST division, sorted in the order you have them in your table calculations - which is the order that the data source returns the field [Complete String 1]. For the quartile calculation to work the values need to be sorted in ascending of order of the value for which you are doing the calculation ([Variable] in this case). So you are essentially just picking out random values.


                6) What this also highlights is that the quartiles are actually wrong for Paid Expense and TCO, too - and in fact they are really only right by chance for Cycle Time, because the data source is returning the rows in the order of the numeric prefix on the [Complete String 1] field (which is Cycle Time). If you explicitly sort by that field it does a string sort and you will get 1.X, 10.X, 100.X, 2.X, 20.X instead of 1, 2, 10, 20, 100, but with it returning in data source order you happen to get what you need.


                I hope this doesn't come across as critical - I'm not meaning to pull this apart, just explain what was going on. It took me a lot of doing to unpick it and I'm sure it's been driving you up the wall! So after all that, how to get what you want.


                7) I've shown how to do it using the Quantile calculation from the start of this thread in the sheet called [Answer].


                The way this works is that instead of returning every single row from the datasource, it just returns one row for each distinct value of the field you want to calculate the quantile for for each permutation of dimensions, containing a count of the number of occurrences of that value. That reduces the number of rows that Tableau has to process - in this case from 171,000 to 14,000. I've explained that in detail at the start of the thread, so won't repeat that here, but I'll just note a few points.


                • The calculation as given at the top of the thread allows the percentile to be a parameter, so caters for the end cases (0 = min() and 1 = max()). As you have fixed quantiles in each field, I've cut each calculation down to the minimum (i.e. the min(), max() and quartile calculations each contain a part of the original calculation)..
                • As [Number of Records] is a trap for young players in your data source I replaced SUM([Number of Records]) with SUM(1).
                • I added a filter to eliminate the NULL Values of [Variable] before calculating the quartiles. This is something I didn't think about when I first posted this approach - but that is all you need to do to get the right answer.
                • I added a filter to keep only the NON NULL values of one of the calculations (Min, but it could have been any of them). That gets rid of everything except the first row in each partition, so if you look in the status bar it shows 72 marks.


                And that's all there is to it. 

                • 35. Re: Table Calculation: Quantile (version 2)
                  Joel Mullis



                  Thank you so much.  This is exactly what I need.  In addition, thanks for such a detailed explaination.  I have figured out the paid loss null issue with my data.


                  Last question, Is there a way that I can update the speadsheet that is the source document and tableau just pull in those update?



                  • 36. Re: Table Calculation: Quantile (version 2)
                    Richard Leeke

                    Just refresh your data extracts. Use the Refresh all Extracts option from the data menu.

                    • 37. Re: Table Calculation: Quantile (version 2)
                      Joel Mullis



                      I refreshed the data on the workbook you sent and the data displayed is correct.  I figured on the problem with Paid Loss on the excel file.  Thanks.


                      When I started working on another workbook after incorporating the changes you suggested, the quantiles displayed is accurate accept I can't get but one division to display.  I checked the filter.  I refreshed the data.  Can't figure it out.  Any suggestions.


                      I have attached your workbook with the refreshed data and the one I am working on with the same data.



                      • 38. Re: Table Calculation: Quantile (version 2)
                        Richard Leeke

                        Would be much better to attach the packaged workbook for the new one (twbx not twb).


                        I know it references the same spreadsheet, so in principle I could dig the extract of that spreadsheet out of the first workbook and hook it up to the second one, but there are all sorts of pitfalls with doing that. A data extract embodies quite a lot of state information from the data source, so even subtle changes in calculated field definitions can mean that the two workbooks expect the extract to be in slightly different states. I haven't got time to go figuring that out.

                        • 39. Re: Table Calculation: Quantile (version 2)
                          Joel Mullis

                          Sorry about that.  I thought I did.  Here are the packaged workbooks.

                          • 40. Re: Table Calculation: Quantile (version 2)
                            Richard Leeke

                            You didn't have the partitioning set right on the filter on Minimum - it needs to be set the same as on all the table calculations (i.e. Compute Using [Variable]). That filter (and the one on [Value]) should be set to include Non-NULL values only.

                            • 41. Re: Table Calculation: Quantile (version 2)
                              Li Chang

                              Hi, Richard:

                              Could you kindly let me know if the worksheet that appeared on the top of post is the final fruit of this wonderful quantile discussion? Thank you very much!



                              • 42. Re: Table Calculation: Quantile (version 2)
                                Richard Leeke

                                Yes, the calculations haven't changed, so quantile_version_2_sample.twbx will contain examples of using this calculation. Several of the other posts in the thread have examples which I've posted back for people, too, so having a look at some of those may also be helpful.

                                • 43. Re: Table Calculation: Quantile (version 2)
                                  Li Chang

                                  Thank you, Richard. You guys are awesome!

                                  • 44. Re: Table Calculation: Quantile (version 2)
                                    Jay Chang

                                    I'm sorry to bring up this thread again but I am having a hard time conceptualizing the calculated field that is at the heart of this posting.  I believe I'm trying to accomplish something that Richard's solution would facilitate but I cannot seem to get it to work properly.  I've attached a workbook that contains a some data that might be helpful, as well as my modification to Richard's formula (I basically modified it per this thread to allow me to create an n-tile for each row).


                                    The overall situation is that I am trying to identify every row in my data set with its percentile rank.  The goal is to create a version of Excel's PERCENTRANK function.  Ideally, the ranking would be dynamic such that it responds to the normal Tableau conditions.  I have included in the sample file only two sets of conditions - a period identifier and a region identifier.  So if I were to rank all my employees based on only performance period, it would split the group into however many n-tiles I specified across all employees for that performance period but if I were to include both performance period and region, the calculation should be smart enough to n-tile within the performance period and region.


                                    The workbook also contains my version of Richard's formula.  I'ts basically a direct copy, stripping out the first and last lines of the calculated field from his example file.  When I use this code, however, it does not provide any results that make sense to me.


                                    Is there any way to do what I'm trying to do?  Also, can the function be set up to return an integer value corresponding to the n-tile?  That is, if I am looking for quartiles and the percent target falls in the top quartile, it would return a 1 while if the percent target fell in the bottom quartile the calculation would return a 4?