13 Replies Latest reply on Dec 4, 2016 8:18 AM by Andrea Brice

    Math used by Tableau for Box Plot

    Shyamal Patel

      Hey All,

       

      I am trying to find the math used by Tableau to create the Box plot / get the Whiskers, median, etc.

       

      I believe its little different than the traditional math for Box plots.

       

      Can someone guide me. Appreciate your time and support

        • 1. Re: Math used by Tableau for Box Plot
          Keshia Rose

          Hi Shyamal,

           

          Compared to a standard box plot the only things that are different with the default box plot in tableau are the whiskers. The box is still made from the first, second (median) and third quartiles. However, instead of showing whiskers at the min and max data points, Tableau draws the whiskers at the farthest data points within 1.5 times the interquartile range. According to Wikipedia this is called a "Turkey Plot." Box Plot - Wikipedia. You can change the whiskers to show at the min and max by editing the box plot (Right click axis > Edit Reference Line) and changing the "Whiskers extend to" option to "Maximum extent of the data."

           

          Here's a quick example:

           

          For the Corporate Segment, the IQR is $2,690 (3rd Q - 1st Q). So for the upper whisker we look at 1.5 times that ($2,690 x 1.5 = $4,035) and add it to the upper quartile ($2,905 + $4,035 = $6,940). Once we have that number we look for the highest data point that is still less than or equal to $6,940. In this case it's at $5,817, so that is where Tableau draws the upper whisker line. Follow the reverse logic for the lower whisker. Hopefully I did my math right !

           

          Please let me know if this answers your question.

           

          Take care,

          Keshia

          2 of 2 people found this helpful
          • 2. Re: Math used by Tableau for Box Plot
            Shyamal Patel

            Thanks Keshia. I am aware of Upper / Lower whishkers and IQR calc's, etc.

             

            Its just that when i use Tableau and Excel the values for median, Q1, Q2,Q3 were coming little different.

            • 3. Re: Math used by Tableau for Box Plot
              Keshia Rose

              Which Excel functions were you using? I just did a comparison and using either QUARTILE.INC() or the old QUARTILE() returns the same exact numbers as Tableau. Could you give an example so I could understand your question better?

               

              Take care,
              Keshia

              • 4. Re: Math used by Tableau for Box Plot
                Andrea Brice

                I've been working through the reverse math myself.  I'm using R, Excel, and Tableau to check answers.  I've got the Lower Hinge, the Median, and the Upper Hinge figured out, so here's my summary.

                 

                Excel box plot throws out numbers & plot which don't match anything at all, so don't bother.  I've read they have their own special math and it doesn't seem to match any results I've been able to find.  I got a match between Tableau's boxplot & R's boxplot(WAWine$RetailPriceToUse,ylim=c(0, 65)).  However, R's boxplot will not return the same values as boxplot.stats.  R's boxplot.stats will return the standard math taught in schools, with general ideas of quartile calculations such as found in this hyperlink

                Box-and-Whisker Plots: Interquartile Ranges and Outliers

                 

                However, when I couldn't get any of those methods to give me a Q3 which matched Tableau or R, I found this article and its description of the Minitab calculation

                http://mathforum.org/library/drmath/view/60969.html

                When I used the Minitab method from this article I could get my Q3 to match - which was the number I was having the problem with.

                 

                Then that left me with the Upper Whisker.  My lower whisker is by default my minimum value, but my IQR and 1.5xIQR are not working out for the upper whisker(UW) calculation. Most places I've seen default say:  UW = Upper Hinge + (1.5 x IQR) , currently I'm seeing for (UW - UH) / IQR = .914894.

                 

                My data is obviously not normally distributed, but I haven't been able to figure out where the fractional multiplier / how the fractional multiplier to IQR is coming about. 

                 

                 

                https://public.tableau.com/static/images/wi/winedotcom/boxandwhiskers/1.png

                • 5. Re: Math used by Tableau for Box Plot
                  Shyamal Patel

                  Hi Andrea - Thanks for sharing this...I agree with most of the things you have mentioned here. I have been doing some research myself and was able to get a better understanding of Why the differences

                  • 6. Re: Math used by Tableau for Box Plot
                    Bora Beran

                    Per Tukey's definition of schematic box plot, hinges are computed based on letter values. Upper Hinge and lower Hinge are not always the same as 3rd and 1st quartile values (while depending on number of points you have in your table sometimes they coincide).

                     

                    Here is a quick summary of how letter values and hinges are computed:

                    http://www.stat.berkeley.edu/~gido/Exploratory%20Data%20Analysis.pdf

                     

                    Hinges: The median splits an ordered batch in half. We might naturally ask next about the middle of each of the halves. The hinges (H) are the summary values in the middle of each half of the data. They are denoted by the letters LH (=lower hinge) or UH (=upper hinge) and are about a quarter of the way in from each end of the ordered batch. Each hinge is at depth d(H): d(H) = [d(M)] + 1 2 where [x] means that you take the largest integer value smaller than or equal to x (i.e. if d(M) contains an 1 2 , you drop it).

                     

                    Quartiles: The hinges are almost similar to the quartiles, which are defined so that one quarter of the data lies below the lower quartiles and one quarter of the data lies above the upper quartile. The main difference between hinges and quartiles is that the depth of the hinges are calculated from the depth of the median with the result that the hinges often lie closer to the median than do the quartiles. Note that the hinges equal the quartiles for odd n, and slightly differ for even n.

                     

                    From a statistical correctness point of view, Tableau does it the right way (the way Tukey defined it). Devil is in the details and surprisingly many data analysis packages do box plots wrong and much of the online definition you will find about box plots are also inaccurate to the point it is really easy to get confused about what is right and what is wrong.

                    2 of 2 people found this helpful
                    • 7. Re: Math used by Tableau for Box Plot
                      Łukasz Majewski

                      Aside this discussion - it would be great if tableau allowed hiding outliers and zooming to what is between whiskers. Sometimes you get them so much off that the box itself becomes invisible.

                      • 8. Re: Math used by Tableau for Box Plot
                        Andrea Brice

                        If I'm understanding you correctly, I get around this by adjusting my y-coordinates.  By right clicking on the y-axis, Edit Axis, and setting the limits to "fixed" for the range I want, I visually ignore my outliers, but can see the box more clearly.

                        • 9. Re: Math used by Tableau for Box Plot
                          Łukasz Majewski

                          Right, but the thing is a fixed axis will not adjust to new data or new

                          filter setup. So you have to manualy adjust it for each selected set of

                          marks. Last year I discussed it here and my approach was to use my own stat

                          calcs to draw boxplots and have a filter which hides outlier on request.

                          But I guess it would be very easy to implement as boxplot feature.

                           

                          https://community.tableau.com/thread/179866

                          • 10. Re: Math used by Tableau for Box Plot
                            Andrea Brice

                            Bora,
                            Thank you for the additional paper.  Unfortunately, that didn't answer my central question, so hopefully I can clarify.  I'm not getting the expected result for the Upper Whisker calculation (also called the InnerFences where UH + 1.5 ∗ (H-spread)  where H-spread is the IQR) .  Tableau's Upper Whisker calculation is matched by R's quantile function. Oddly enough, R's boxplot function uses a straightforward quantile calculation and returns the same UW (59.99) that is the Tableau result - even though the Upper Hinges vary.

                             

                            They are both using what I've seen called the Minitab version of the calculation, where the fractional part of the calculation:  where n = 178
                            0.75*(n+1) = 134.25  (3rd quartile = .75, 1st quartile would be 0.25).  You then round down to 134 = U1 and 133 (U1-1).  The values in those two positions were $38.99 & $39.99.   The difference between them being 2.  Again, taking the fractional part of 134.25 (i.e., .25) and multiply it by that difference ( .25 * 2 = 0.5) which is added to the value found in U1 of $38.99.  $38.99 + 0.5 = $38.4

                             

                            The R Boxplot method is the only one which reports the value found in field 134, which is $38.99.  But here is where it gets difficult.

                            My Lower Hinge = 14.99, my Upper Hinge = 38.49, UH - LH = 23.5

                            23.5 x 1.5 = 35.25

                             

                            UH + 35.25 =

                            38.49 + 35.25 = 73.74

                            or

                            38.99 + 35.25 = 74.24

                             

                            no way = 59.99

                             

                            So, where is this $59.99 coming from?

                             

                            I checked with other Varietals.  And while the difference between the expected UW & the reported UW in both R & Tableau don't match, the variance is smaller.  There is something going on with that Upper Whisker calculation and I believe it has to do with the multiplier of 1.5 not being a simple multiplier.  I'm not a statistician and I've not been successful in my search for the variant calculations for IQR * X, although they obviously have to exist.  Tableau & R match consistently across multiple checks.  But why... I dunno

                             

                            ActIQRmult = The actual IQR multiplier where I take the BoxPlot or Tableau (UW-UH) / IQR  to get X (the multiplier which is supposed to be 1.5, but isn't).  As you can see, those values for X are not consistent. 

                            UWDiff is the difference between an expected UW using IQR*1.5 and what is reported.

                             

                             

                            Tableau
                            GeneralStatsMean53.54
                            VarietalnmeansdminmaxmedianskewkurtosisUpper Whisker59.99
                            Grenache17853.54101110.50447.9999924.496.05976344.20602Upper Hinge38.49
                            CompareDataMedian24.5
                            MethodLHMedUHLWUWIQRIQR1.5ActIQRmultActual.UWUWDiffLower Hinge14.99
                            BoxPlot14.9924.4938.997.9959.9924360.87574.9915Lower Whisker7.99
                            Quantile14.9924.4938.497.99NA23.535.25NA73.74NA
                            VarietalnmeansdminmaxmedianskewkurtosisTableau
                            Cabernet Sauvignon164980.16928255.97746.998999.9936.9926.94396904.7711Mean80.2
                            CompareDataUpper Whisker190
                            MethodLHMedUHLWUWIQRIQR1.5ActIQRmultActual.UWUWDiffUpper Hinge90
                            BoxPlot18.9936.9989.996.99189.9971106.51.408451196.496.5Median37
                            Quantile18.9936.9989.996.99NA71106.5NA196.49NALower Hinge19
                            Lower Whisker7
                            GeneralStats
                            VarietalnmeansdminmaxmedianskewkurtosisTableau
                            PetiteSirah8932.7315731.913789.99249.9924.994.25457224.6701Mean
                            CompareDataUpper Whisker56
                            MethodLHMedUHLWUWIQRIQR1.5ActIQRmultActual.UWUWDiffUpper Hinge37
                            BoxPlot15.9924.9936.999.9955.992131.50.904761968.4912.5Median25
                            Quantile15.9924.9936.999.99NA2131.5NA68.49NALower Hinge16
                            Lower Whisker10
                            1 of 1 people found this helpful
                            • 11. Re: Math used by Tableau for Box Plot
                              Bora Beran

                              Hi Andrea,

                              Whiskers extend to the nearest point within that range. So if you're getting 74.24 but there is no actual data point that is 74.24 but the next one within the 1.5 IQR range is 59.99, that's where the whisker will extend.

                               

                              So unless there is an exact matching data point it will always be smaller or larger depending on upper or lower whisker since it snaps to the nearest actual data point in range.

                               

                              On visual inspection of the screenshot you shared, there is a big gap between 60 and 80 with no real data point near 74.24 hence it is snapping to 59.99.

                               

                              In Box Plot drop down the option is explained as "whiskers extend to data within 1.5 times IQR" to hint at the fact that it will extend to only actual data points.

                               

                              I hope this explains.

                               

                              Thank you,

                               

                              Bora

                              3 of 3 people found this helpful
                              • 12. Re: Math used by Tableau for Box Plot
                                Bora Beran

                                Lukasz,

                                Do you mind making an idea in ideas forum for this? I am the product manager for the group that did forecasting, clustering, box plots, trendlines, lod calcs.... This is an area of interest to us as we're hearing this a lot from many customers. 

                                 

                                There are multiple definitions of outliers e.g. box plot outliers, distance based outliers, density based outliers, time series outliers... depending on the context, different methods would be needed. Hiding box plot outliers would be just one option and my sense is that you would not only want to hide them visually, you would rather want to have an option e.g. like an automatically generated set that you can use to hide those datapoints in other parts of your analysis as well. 

                                 

                                There are tons of questions on forums about outliers but no single idea. It would be good to see people votes and comments on an idea.

                                 

                                Thank you,

                                 

                                Bora

                                • 13. Re: Math used by Tableau for Box Plot
                                  Andrea Brice

                                  I do believe you have explained the meaning of life.  THANK you! I had no data points between $59.99 and $79.99, so the drop back was to the lower number.  Very much appreciate your patience with this.Bora Beran.

                                   

                                  Weirdly, the Excel boxplot did spit out $79.99 as the upper whisker, but that was nearly the only value I could get to match from that source.  blergh!