6 Replies Latest reply on May 8, 2012 4:58 AM by Jonathan Drummey

    Table Calculations Broken by Tableau 7.0 Changes

    Richard Leeke

      I've just come across a fairly extreme example of an undesirable side-effect of the changes in the way Table Calculations work in Tableau 7.0.


      The change is that Tableau now "pads the domain" (which means filling in dummy rows for all permutations of dimensions) in a lot more circumstances than it did before.


      I have just been putting together an example workbook to go with the answer to this question - which asks about finding the latest available data for each product in an inventory table. I thought I'd use the Coffee Sales sample data and demonstrate how to do it with table calculations, data blending and perhaps raw SQL. I was going to say that the table calculation method would work fine on a small dataset like the coffee sales data, but wouldn't scale to really big data sets. However, when I went to refresh the view, it took several minutes to refresh (during which time that tell-tale signs black bar where the menu should be appeared). Eventually it warned me that Tableau was almost out of memory and might crash at any moment. Sure-enough - the process was over 3 GB. Oh, and it got the answer completely wrong.


      I'm never quite sure of anything with table calculations, but I was pretty sure I hadn't done anything stupid, so I tried it in 6.1. That worked fine - the view took about a second to refresh and gave the answer I expected.


      The view just displays the latest order date and order quantity for each product (or at least it does in 6.1) and has the following fields:


      Rows: [Product Name], [Order Date], [Order Quantity]


      Level of Detail: [Latest Order] (defined as LAST()==0, Compute Using [Order Date], [Order Quantity], sorted by MIN([Order Date]) - ie true for the latest order for each product)


      Filters: [Latest Order]  = True


      In 6.1 this works fine, filtering down from 8,000 odd order rows to show only the latest order for each product - a bit over 1,000 rows.


      In 7.0 it appears to pad the domain so that there is a dummy (NULL) row for each permutation of product, date and quantity, which makes the memory blow out excessively. And now the last order date for each product is set to the very latest order date across all products - so all products except the few with orders on that latest date return a blank order quantity.


      I modified it slightly (sheet 2) to have SUM([Order Quantity]) instead of [Order Quantity] as a dimension. That reduces the number of permutations and lets it get the wrong answer much more quickly and without taking anywhere near so much memory.


      I haven't seen any sign of there being any user selection governing this behaviour, either. Anyone know of a way to intervene?


      I've attached the 6.1 and 7.0 versions of the workbook. Opening the 6.1 version in 7.0 also behaves the same as the 7.0 version (i.e. it's not that I put the 7.0 version together differently).

        • 1. Re: Table Calculations Broken by Tableau 7.0 Changes
          Jonathan Drummey

          Hi Richard,


          I tried duplicating your results and I couldn't, then I went back to your workbook to look at it more closely. While I totally agree that Tableau's v7 padding behavior can cause problems, I think in this case you ran into the issue of lack of feedback in the UI that compounded the padding behvaior situation.


          In the Latest Orders v7 workbook, on Sheet 1 (the one that takes forever and uses up memory) the Latest Order filter calc has the Compute Using set to Order Date, which is different from the Latest Order calc on the Level of Detail shelf, which has the Compute Using you specified. When I adjust the Compute Using for the Latest Order filter calc to the one you specified, the view has sub-second refreshes like we'd expect. When I set the Compute Using on the filter calc back to Order Date, then Tableau zooms up to using 1.7+GB of RAM on my machine then crashes. In fact, it does this on 6.1 as well.


          So, I suspect the problem you were running into was A) that the UI hadn't indicated to you that the Compute Using for the Latest Order calc was different for the version in the Filters shelf than the level of detail and then B) that Tableau was doing its v7 padding on the Latest Order calc on the Filters shelf that consumes memory and crashes.


          While testing this out, I added some DATETRUNC fields for Day of Order Date, Month of Order Date, and Year of Order Date, with the Latest Order calc set to Compute using whatever the DATETRUNC was. The Year of Order Date returns 30860 marks and goes pretty fast. The Month of Order Date goes up to ~880MB of RAM on my machine, has 370320 marks and takes about a minute. The Day of Order Date goes to 1.7+GB of RAM and crashes. This is all on v7.


          On v61, using Day of Order Date and Compute Using set to Day of Order Date runs just as fast and returns the same results as using Order Date.


          My thoughts right now are that:


          - There's a bug in Tableau in that it crashes when its padding too much.

          - There's a UI issue in v7 that Tableau doesn't warn us that it's about to pad and create many, many rows.

          - There's another UI issue that Tableau doesn't inform us that the table calc settings on the Filter shelf are different than the settings used for the same calc elsewhere in the view.















          1 of 1 people found this helpful
          • 2. Re: Table Calculations Broken by Tableau 7.0 Changes
            Richard Leeke

            Good catch on the inconsistent Compute Using settings, Jonathan. It's so easy to do that when you're changing settings around isn't it?


            Fixing that on sheet1 makes that behave like v6.1 as you say. But interestingly, sheet2, which is simpler, is still broken. That only has 2 dimensions in play ([Product Name] and [Order Date]) and the compute using really is set to [Order Date] on both LoD and Filters for the [Latest Order] calc field. That still pads the domain, resulting in all products showing the same last order date and mostly null order quantities. So there's still something I don't understand going on here.


            If you take [Latest Order] off filters and also remove (SUM([Order Quantity]) from rows, just to make it as simple as possible, you can see the difference very clearly.


            Here it is with version 7, with each product having a row for each date in the database, giving 1.8 million rows.




            Here it is in 6.1, just the original 8,000 rows.





            I can see that there are times when it's useful or even essential to pad the domain in order to be able to get meaningful results - but equally there are times when it's absolutely not what you want. My take on this is that control of the behaviour needs to be exposed in the UI. I know Tableau are very keen to avoid over-complicating the product by exposing too many confusing options, but I think this is one case where it's just too hard for Tableau to guess what is the required behaviour - so hiding what's going on leaves it more confusing than if it were exposed.

            • 3. Re: Table Calculations Broken by Tableau 7.0 Changes
              Jonathan Drummey

              Maybe we can get Ross Bunker to help us out with understanding the date padding?



              • 4. Re: Table Calculations Broken by Tableau 7.0 Changes
                Ross Bunker

                Hmm..the new forum feature of sending me e-mail is kind of cool...."What, did someone say my name?"  :-)

                Anyway, you guys raise a number of good points here about our behavior.  Thank you for doing so as this kind of feedback really helps us refine the user experience.


                First, the easy issues:

                --Agreed that having the same calc with different orderings in the viz can be frustrating.  I like the idea of a visual cue.  We'll put that into the hopper.

                --Crashes and data explosion with padding are something we're seeing now and then.  Again, I agree that we should be a bit smarter about recognizing a potential problem and warning you.


                Now, the harder issue:

                Padding is a long requested feature in Tableau.  Particularly for dates.  We're trying to introduce it gently and unobtrusively.  We're still refining the behavior and trying to get it right, though you are correct that we can't always guess correctly.  Surfacing it somewhere in the UI may be needed and we'll look for the best way to do that.


                One thing we did in 7.0 is to be smarter about dates when padding.  We've long padded table calculations based on the structure of the viz.  When going across the table, we'd add extra values to cover any missing things because in the table structure there would be a spot for them.  People expected values to be there (in a running total for example) even if that particular cell didn't have data.  In 7.0, we tried to recognize that dates are almost always independent, so we pad a little more aggressively.  I'm not sure that we're doing the right thing in this case.  although, if your table calc was something like difference from previous and you were only missing a couple of values, you probably would want it to pad.


                One thing that makes this case particularly unusual is that you really only have date on there so you can do a table calc and then get at the last one and filter to it.  In this case, its using table calcs to do a computation at one level of detail (date) and then filter out that level to just one row for each date.  While this is certainly a reasonable use of table calcs, it's a use case we'd really like to find other ways to enable since table calcs bring some unneeded complication into the picture.  In this case, the padding is an example of the unnecessary complication  :-).


                All that said, there is a workaround here.  Instead of Last() == 0, you really want a boolean like:  ATTR([Order Date]) == WINDOW_MAX(ATTR([Order Date])) 


                I've also been kicking around the notion of introducing some variants of FIRST/LAST/SIZE/INDEX that handle NULLs.  So you could say LAST(<col>) and it would give you the distance to the index of the last non-NULL <col>.  SIZE(), INDEX(), FIRST() could all take a column and would give you the information about the non-NULL all would return NULL for items that are not part of the non-NULL partition w.r.t. the given measure.  That would greatly simplify this sort of thing.  You could just say LAST(COUNT(1)) == 0.  While simpler, it's actually harder to understand than the WINDOW_MAX example i gave above.

                • 5. Re: Table Calculations Broken by Tableau 7.0 Changes
                  Richard Leeke

                  Excellent, feedback Ross - thanks for the crystal-clear explanation.


                  > Hmm..the new forum feature of sending me e-mail is kind of cool...."What, did someone say my name?"  :-)


                  We just all have to promise not to over-use it - only cry for help when we really need it or Ross will have to start ignoring the cries or he won't have time to get on with his real work.  ;-)


                  • 6. Re: Table Calculations Broken by Tableau 7.0 Changes
                    Jonathan Drummey

                    Thanks again, Ross, this post and your other on restart every/at the level have been really helpful. And I promise not to overuse the forum-to-email, though I can imagine being tempted!


                    The FIRST()/LAST()/etc. variants that find non-NULL's sound great, it seems like they could be useful anytime when we'd need to do a "search" through the data. I can imagine something like finding the first purchase amount for a customer and then using that to compare to the current purchase amount would be a lot easier: SUM([Sales])-LOOKUP(SUM([Sales], FIRST(SUM([Sales]))).


                    One place where I've been bitten multiple times now by padding is in text tables when going from having all dimensions on Rows or Columns to having one or more dimensions on the other shelf. The first approach makes it easier to see the table calculation in action and also use Totals to check results, while the crosstab is typically desired final layout, or a step to having some sort of bar or line chart. There are cases where calculations that produce desired results in the first option (everything on Rows or Columns) break when moving into a crosstab because of padding. However, dimensions not on Rows or Columns don't seem to cause padding, so the results can change again (to the desired results even) by moving a dimension onto the Color Shelf. I've attached a workbook using the superstore sales data that shows some simple versions of this using INDEX().


                    The issue for me with this is twofold: One is that when I believe I'm changing just one aspect of a view (moving a pill from one shelf to another) I'm really changing M+1 things, where M is the number of marks whose results are changed by the shift in padding. The second is a little more nebulous, it's again about expectation/experience. It makes complete sense that Table (Across), Table (Down), etc. are totally dependent on the visual layout, so if the pills move the marks change. In v61, when you set Advanced Compute Using then table calcs are mostly invariant to that, so my expectation has been to use Table (Across) when I'm comfortable with variant results, and use Advanced... when I want to retain invariant results. In v7, even with Advanced... the new padding behavior can return unexpected (at least until you understand it) results, and results that are more difficult to verify. I think this makes table calculations ultimately harder to learn for new users because another variable is at play - users now have to be much more aware of *where* the dimensions are in the view and know how they affect table calcs.





                    Adding one more comment to that last paragraph - I don't have any particular notions of how to resolve those issues, given that padding is evolving. And, in the scheme of things, it's a lesser issue for me than table calculation desires, like wanting functionality that removes the need for the IF FIRST()==0 and PREVIOUS_VALUE() techniques that Richard developed to return one result for a partition or the same result for all rows in a partition, or partitioning by table calculation results, or having drag-and-drop compute using, or ...