10 Replies Latest reply on May 7, 2013 2:25 AM by Richard Leeke

    Creating  set over a large table

    Ertan ERISIK

      Hi there,

       

      I am currently working on customer with  a large data warehouse implementation and trying to figure out how Tableau will fit into their requirements. One of the business requirements was to identify those customers who didn't make any purchase in the last three months but did so in previous period: "customers who left us". I tried to create a set, who made purchases in the last three month so that I could apply this set into the the worksheet of those customers who made purchases in the previous months to simulate "not in" sql operator. My selection narrows down the number of customers who made purchases recently to something around 50K, but from this point on, when I want to put these into a set, Tableau is trying to extract whole customer table,  probably to apply the filter in memory. But the customer table has 20+M rows, and Tableau desktop is failing with out of memory error while trying to get this data.

       

      I would appreciate any help or suggestions for the questions below:

       

      1. I am on the right track for the scenario above ? i.e. create a set and then apply it to another worksheet to simulate "not in". Any other alternative methods or approaches?

      2. Is there any way, to tell Tableau to apply the set creation criteria in the data source? without bringing whole table.

      3. From time to time, I am facing with discussions talking about  100+M rows extractions, but in my case one field select is causing Tableau to consume whole memory. I also tried to create extracts with 20M rows, but failed to do so. What is the method of extracting large data sets in Tableau desktop without causing memory problems?

       

      Unfortunately I cannot send the large data set since it contains sensitive customer information. I created a similar report with demo data, but, due to data size, the strategy worked perfectly there. So, it seems I need to find a way to bypass the sizing problem.

       

      thanks a lot in advance for any suggestions.

        • 1. Re: Creating  set over a large table
          Tamas Foldi

          Maybe not the nicest, but obviously the fastest solution is to use "analytic functions" in your SQL to identify if a customer purchased in a different time window (by using RAWSQL in a calculated dimension which could be used as a filter). What is your database engine?

          • 2. Re: Creating  set over a large table
            Ertan ERISIK

            Hi,

             

            The database engine is Teradata.

             

            Thanks a lot for the suggestion, I will give a try but I would prefer a solution without using SQL.

             

            Best Regards

             

            Ertan

            • 3. Re: Creating  set over a large table
              Tamas Foldi

              I think you can try it with pure Tableau table calculations as well instead of rawsql. Could you send your workbook? I know it has a small dataset but just to know the kind of data

              • 4. Re: Creating  set over a large table
                Ertan ERISIK

                Tamas Hi,

                 

                Thanks a lot for the response. I am adding the workbook that I have created with sample dataset, which is acutally very similar to what I am trying to achieve in the real environment.

                 

                What happens in real environment (a Teradata impelemntation)  is as follows:I can create "first three months customer" set, and then I can also create "second three months customer" set, but after that the set operations either finding intersection or difference of them by combining sets is failing with "invalid cursor state" error. This might relate with the volume of data, but I am not sure.

                 

                The thing is,it will be very useful, if I can find a way to handle this situation without sql embedding.

                 

                Best Regards,

                • 5. Re: Creating  set over a large table
                  Ertan ERISIK

                  Hi all,

                   

                  I would really appreciate if I can get any answer from the community about the workbook I posted to understand that I am doing it correctly.

                  If you try to create large sets, then out of memory error, even though the resulting set will be constructed of small number of rows, Tableau still tries to bring every row into memory and fails with out of memory.

                   

                  I also need some information and tips to use Tableau in large scale data  warehouse environment, as in my original post, I sometimes face with  people claiming working with 100+ million rows using Tableau,but I do  not find any way to make it possible.

                   

                  Thanks a lot in advance for any suggestion.

                  • 6. Re: Creating  set over a large table
                    Jonathan Drummey

                    Pinging Robert Morton and/or Richard Leeke for help on this one...

                    • 7. Re: Creating  set over a large table
                      Robert Morton

                      Hi Ertan,

                       

                      I'm sorry to hear about the problems you're experiencing, and I hope that we can help resolve this quickly. Since the problem appears to be nuanced, I would like for you to please contact our excellent Support team (support@tableausoftware.com) so we may work with you to collect log files and explore the problem in more detail.

                       

                      Thanks,

                      Robert

                      • 8. Re: Creating  set over a large table
                        Richard Leeke

                        My top-of-the-head first thought on an approach to answering your question which ought to scale well by doing all the heavy lifting in the database is this:

                         

                        Create yourself a couple of calculated fields defining whether or not a given purchase falls within each of the date ranges of interest (past three months, previous three months, say). I would define those as two boolean fields, called something like [Is in last 3 months] and [Is in Previous 3 months]. They will just need a bit of date arithmetic.

                         

                        Then define a couple more calculated fields, counting the number of orders within those two date ranges. So you might have:

                         

                        [Number of Orders in Last 3 Months] = SUM(IIF([Is in last 3 months], 1, 0))

                         

                        and similarly for the other date range. You could actually roll the boolean and the aggregate into a single calculation for each date range, but I think it's clearer doing it in stages and the boolean can come in handy.

                         

                        Then just drop customer ID on the rows shelf (say) and your two count fields on the filter shelf. Set filter criteria to find customers where the number of orders in the past 3 months is 0 and the number in the previous 3 months is > 0.

                         

                        I've attached a Superstore Sales example (for This Year, Last Year for simplicity of date calculations).

                         

                        I think that should scale well and not bring too much back to Tableau from the back end.

                        • 9. Re: Creating  set over a large table
                          Bruce Segal

                          Hey Richard:

                          Two questions for you about the formulas in the workbook

                           

                          First, the formula:

                          DATETRUNC('year',[Order Date]) == DATETRUNC('year', DATEADD('year', -1, TODAY()))

                           

                          Do I understand that the operator you use == is the same as = ? If so, I figure you have a good reason to use == instead of =. What's your thinking on that?

                           

                          Second

                          For that same formula, it actually may be inaccurate b/c each row of data represents one "Row Id" on an order. And an order may have one or may have more than "Row Id" on it. This is yet another reminder for me how it's always critical in Tableau to know the granularity of the underlying data.

                           

                          I recognize you just whipped up a solution to help so I'm not at all criticizing you. I often encounter this issue where my data might be more granular than I want.

                           

                          Here's the table in the workbook as you show it. I show only customer ids 2 thru 12. Look at the row for Customer 6. It shows what looks like 3 orders last year.

                           

                          Data as shown in RLs orig workbook 2.jpg

                           

                          The image shows the formula above returns 3 orders for customer 6 for last year. I believe the correct answer is customer 6 had only 1 order last year and 0 orders this year. When I looked deeper at the data structure I discovered that the formula returns the number of "Row Id's" across all orders in one year, not the number of orders id's.

                           

                          See image showing one example for Order Id 6, which has 1 order with 1 line in 2011 and 1 order w/ 3 "Row Id's" in 2012. Is there a way to modify the formula to count distinct order numbers in the time periods? I tried doing something like summing the countd of Order Id but couldn't get it to work.

                           

                          Formula to extract if Order is this or last year.jpg

                           

                          Message was edited by: Bruce Segal (added detail about how to modify formula so it counted distinct order numbers in a year, when the underlying data is more granular than that.)

                          • 10. Re: Creating  set over a large table
                            Richard Leeke

                            First the easy one. The "==" operator in Tableau is just a synonym for the "=" comparison operator.

                             

                            Some languages that I use a lot distinguish between "=" meaning assignment (so "[A] = 1" would mean assign the value 1 to the variable [A]) and "==" meaning comparison (so "[A] == 1" would return the boolean result of the comparison). In those languages you can construct statements where the meaning of the operator wouldn't be clear from context, so you need to be explicit. That ambiguity doesn't exist in Tableau - I just do it out of habit.

                             

                            Good catch on the second one - as you figured I just grabbed some sample data and knocked something together in no time.

                             

                            The easy (for me!) fix is just to redefine what the example is showing - namely the number of order rows made by each customer who placed orders in the previous period and not the last period. Your origional question was only asking about identifying customers, not counting orders. 

                             

                            If you really do want to count orders and your datasource really is at the order item level there are two ways that spring to mind. Firstly you could define yourself a new, less granular, datasource. The other way off the top of my head is to use datablending to join to a copy of the datasource, but with a relationship defined to filter the blended datasource to just those in the previous period. You should be able to define a calculated field such as this:

                             

                            [True] with a formula that always returns true (eg "1 == 1 ").

                             

                            Then you could blend that to the last period calculation.

                             

                            Then use COUNTD([Order_Id]) on the secondary datasource.

                             

                            Should work, haven't tried.