2 of 2 people found this helpful
Compared to a standard box plot the only things that are different with the default box plot in tableau are the whiskers. The box is still made from the first, second (median) and third quartiles. However, instead of showing whiskers at the min and max data points, Tableau draws the whiskers at the farthest data points within 1.5 times the interquartile range. According to Wikipedia this is called a "Turkey Plot." Box Plot - Wikipedia. You can change the whiskers to show at the min and max by editing the box plot (Right click axis > Edit Reference Line) and changing the "Whiskers extend to" option to "Maximum extent of the data."
Here's a quick example:
For the Corporate Segment, the IQR is $2,690 (3rd Q - 1st Q). So for the upper whisker we look at 1.5 times that ($2,690 x 1.5 = $4,035) and add it to the upper quartile ($2,905 + $4,035 = $6,940). Once we have that number we look for the highest data point that is still less than or equal to $6,940. In this case it's at $5,817, so that is where Tableau draws the upper whisker line. Follow the reverse logic for the lower whisker. Hopefully I did my math right !
Please let me know if this answers your question.
Thanks Keshia. I am aware of Upper / Lower whishkers and IQR calc's, etc.
Its just that when i use Tableau and Excel the values for median, Q1, Q2,Q3 were coming little different.
Which Excel functions were you using? I just did a comparison and using either QUARTILE.INC() or the old QUARTILE() returns the same exact numbers as Tableau. Could you give an example so I could understand your question better?
I've been working through the reverse math myself. I'm using R, Excel, and Tableau to check answers. I've got the Lower Hinge, the Median, and the Upper Hinge figured out, so here's my summary.
Excel box plot throws out numbers & plot which don't match anything at all, so don't bother. I've read they have their own special math and it doesn't seem to match any results I've been able to find. I got a match between Tableau's boxplot & R's boxplot(WAWine$RetailPriceToUse,ylim=c(0, 65)). However, R's boxplot will not return the same values as boxplot.stats. R's boxplot.stats will return the standard math taught in schools, with general ideas of quartile calculations such as found in this hyperlink
However, when I couldn't get any of those methods to give me a Q3 which matched Tableau or R, I found this article and its description of the Minitab calculation
When I used the Minitab method from this article I could get my Q3 to match - which was the number I was having the problem with.
Then that left me with the Upper Whisker. My lower whisker is by default my minimum value, but my IQR and 1.5xIQR are not working out for the upper whisker(UW) calculation. Most places I've seen default say: UW = Upper Hinge + (1.5 x IQR) , currently I'm seeing for (UW - UH) / IQR = .914894.
My data is obviously not normally distributed, but I haven't been able to figure out where the fractional multiplier / how the fractional multiplier to IQR is coming about.
Hi Andrea - Thanks for sharing this...I agree with most of the things you have mentioned here. I have been doing some research myself and was able to get a better understanding of Why the differences
2 of 2 people found this helpful
Per Tukey's definition of schematic box plot, hinges are computed based on letter values. Upper Hinge and lower Hinge are not always the same as 3rd and 1st quartile values (while depending on number of points you have in your table sometimes they coincide).
Here is a quick summary of how letter values and hinges are computed:
Hinges: The median splits an ordered batch in half. We might naturally ask next about the middle of each of the halves. The hinges (H) are the summary values in the middle of each half of the data. They are denoted by the letters LH (=lower hinge) or UH (=upper hinge) and are about a quarter of the way in from each end of the ordered batch. Each hinge is at depth d(H): d(H) = [d(M)] + 1 2 where [x] means that you take the largest integer value smaller than or equal to x (i.e. if d(M) contains an 1 2 , you drop it).
Quartiles: The hinges are almost similar to the quartiles, which are defined so that one quarter of the data lies below the lower quartiles and one quarter of the data lies above the upper quartile. The main difference between hinges and quartiles is that the depth of the hinges are calculated from the depth of the median with the result that the hinges often lie closer to the median than do the quartiles. Note that the hinges equal the quartiles for odd n, and slightly differ for even n.
From a statistical correctness point of view, Tableau does it the right way (the way Tukey defined it). Devil is in the details and surprisingly many data analysis packages do box plots wrong and much of the online definition you will find about box plots are also inaccurate to the point it is really easy to get confused about what is right and what is wrong.
Aside this discussion - it would be great if tableau allowed hiding outliers and zooming to what is between whiskers. Sometimes you get them so much off that the box itself becomes invisible.
If I'm understanding you correctly, I get around this by adjusting my y-coordinates. By right clicking on the y-axis, Edit Axis, and setting the limits to "fixed" for the range I want, I visually ignore my outliers, but can see the box more clearly.
Right, but the thing is a fixed axis will not adjust to new data or new
filter setup. So you have to manualy adjust it for each selected set of
marks. Last year I discussed it here and my approach was to use my own stat
calcs to draw boxplots and have a filter which hides outlier on request.
But I guess it would be very easy to implement as boxplot feature.
1 of 1 people found this helpful
Thank you for the additional paper. Unfortunately, that didn't answer my central question, so hopefully I can clarify. I'm not getting the expected result for the Upper Whisker calculation (also called the InnerFences where UH + 1.5 ∗ (H-spread) where H-spread is the IQR) . Tableau's Upper Whisker calculation is matched by R's quantile function. Oddly enough, R's boxplot function uses a straightforward quantile calculation and returns the same UW (59.99) that is the Tableau result - even though the Upper Hinges vary.
They are both using what I've seen called the Minitab version of the calculation, where the fractional part of the calculation: where n = 178
0.75*(n+1) = 134.25 (3rd quartile = .75, 1st quartile would be 0.25). You then round down to 134 = U1 and 133 (U1-1). The values in those two positions were $38.99 & $39.99. The difference between them being 2. Again, taking the fractional part of 134.25 (i.e., .25) and multiply it by that difference ( .25 * 2 = 0.5) which is added to the value found in U1 of $38.99. $38.99 + 0.5 = $38.4
The R Boxplot method is the only one which reports the value found in field 134, which is $38.99. But here is where it gets difficult.
My Lower Hinge = 14.99, my Upper Hinge = 38.49, UH - LH = 23.5
23.5 x 1.5 = 35.25
UH + 35.25 =
38.49 + 35.25 = 73.74
38.99 + 35.25 = 74.24
no way = 59.99
So, where is this $59.99 coming from?
I checked with other Varietals. And while the difference between the expected UW & the reported UW in both R & Tableau don't match, the variance is smaller. There is something going on with that Upper Whisker calculation and I believe it has to do with the multiplier of 1.5 not being a simple multiplier. I'm not a statistician and I've not been successful in my search for the variant calculations for IQR * X, although they obviously have to exist. Tableau & R match consistently across multiple checks. But why... I dunno
ActIQRmult = The actual IQR multiplier where I take the BoxPlot or Tableau (UW-UH) / IQR to get X (the multiplier which is supposed to be 1.5, but isn't). As you can see, those values for X are not consistent.
UWDiff is the difference between an expected UW using IQR*1.5 and what is reported.
Tableau GeneralStats Mean 53.54 Varietal n mean sd min max median skew kurtosis Upper Whisker 59.99 Grenache 178 53.54101 110.5044 7.99 999 24.49 6.059763 44.20602 Upper Hinge 38.49 CompareData Median 24.5 Method LH Med UH LW UW IQR IQR1.5 ActIQRmult Actual.UW UWDiff Lower Hinge 14.99 BoxPlot 14.99 24.49 38.99 7.99 59.99 24 36 0.875 74.99 15 Lower Whisker 7.99 Quantile 14.99 24.49 38.49 7.99 NA 23.5 35.25 NA 73.74 NA Varietal n mean sd min max median skew kurtosis Tableau Cabernet Sauvignon 1649 80.16928 255.9774 6.99 8999.99 36.99 26.94396 904.7711 Mean 80.2 CompareData Upper Whisker 190 Method LH Med UH LW UW IQR IQR1.5 ActIQRmult Actual.UW UWDiff Upper Hinge 90 BoxPlot 18.99 36.99 89.99 6.99 189.99 71 106.5 1.408451 196.49 6.5 Median 37 Quantile 18.99 36.99 89.99 6.99 NA 71 106.5 NA 196.49 NA Lower Hinge 19 Lower Whisker 7 GeneralStats Varietal n mean sd min max median skew kurtosis Tableau Petite Sirah 89 32.73157 31.91378 9.99 249.99 24.99 4.254572 24.6701 Mean CompareData Upper Whisker 56 Method LH Med UH LW UW IQR IQR1.5 ActIQRmult Actual.UW UWDiff Upper Hinge 37 BoxPlot 15.99 24.99 36.99 9.99 55.99 21 31.5 0.9047619 68.49 12.5 Median 25 Quantile 15.99 24.99 36.99 9.99 NA 21 31.5 NA 68.49 NA Lower Hinge 16 Lower Whisker 10
3 of 3 people found this helpful
Whiskers extend to the nearest point within that range. So if you're getting 74.24 but there is no actual data point that is 74.24 but the next one within the 1.5 IQR range is 59.99, that's where the whisker will extend.
So unless there is an exact matching data point it will always be smaller or larger depending on upper or lower whisker since it snaps to the nearest actual data point in range.
On visual inspection of the screenshot you shared, there is a big gap between 60 and 80 with no real data point near 74.24 hence it is snapping to 59.99.
In Box Plot drop down the option is explained as "whiskers extend to data within 1.5 times IQR" to hint at the fact that it will extend to only actual data points.
I hope this explains.
Do you mind making an idea in ideas forum for this? I am the product manager for the group that did forecasting, clustering, box plots, trendlines, lod calcs.... This is an area of interest to us as we're hearing this a lot from many customers.
There are multiple definitions of outliers e.g. box plot outliers, distance based outliers, density based outliers, time series outliers... depending on the context, different methods would be needed. Hiding box plot outliers would be just one option and my sense is that you would not only want to hide them visually, you would rather want to have an option e.g. like an automatically generated set that you can use to hide those datapoints in other parts of your analysis as well.
There are tons of questions on forums about outliers but no single idea. It would be good to see people votes and comments on an idea.
I do believe you have explained the meaning of life. THANK you! I had no data points between $59.99 and $79.99, so the drop back was to the lower number. Very much appreciate your patience with this.Bora Beran.
Weirdly, the Excel boxplot did spit out $79.99 as the upper whisker, but that was nearly the only value I could get to match from that source. blergh!