3 Replies Latest reply on Nov 28, 2013 6:02 AM by Russell Christopher

    Creating a custom extract using data extract API

    Glen Park

      Hello everyone,


      I'm on Tableau Server version 8.0.2.  I have a very large sales fact table, tens of millions of rows, and it is currently impossible to maintain a single data extract of this table for reporting, primarily because the refresh times out after 6 hours.  I'm looking to reduce this extract to something manageable that can meet our SLAs (hopefully around a 2-hour extract).  Recently I was made aware of the Tableau Data Extract API as a possible avenue for a solution.  Essentially, this is my problem:


      The fact table is an insert/update table, meaning new records get inserted and pre-existing ones get updated.  This happens multiple times a day.  Tableau's incremental extract refresh can't key off a last-modified timestamp; it's an insert-only process, so I'm currently stuck with having to recreate the entire extract every time (not sustainable long term).


      What I am wondering: is it possible to use the Tableau API to create an extract that only drops and re-inserts the last 6 months of data?  Essentially, the majority of the original extract would be a one-time load, and only the latest segment would be touched in subsequent refreshes.


      I'm pretty sure I'm not the only one who has run into this problem and hoping someone has found a workable solution to this.  Any suggestions are appreciated.


      Thanks in advance!


        • 1. Re: Creating a custom extract using data extract API
          Russell Christopher

          Hey Glen -


          Tens of millions of rows shouldn't really be an issue for us - we have plenty of folks creating extracts that are hundreds of millions of rows and some into the 1-2 billion range (although that's extreme).


          There are two phases of creating an extract (an oversimplification, but whatever):


          • An "Executing Query" phase - your RDBMS is executing the query it needs to and is streaming rows/columns back to us, where we store them as temporary files
          • A "Tableau does real work" phase, where we take those temp files, sort and compress them, and then stitch them into an extract


          Do you have a feel for when you're failing? If you duplicate this process in Tableau Server, which "phase" is taking the longest? If it's simply taking your database a long time to finish doing work (not uncommon), you can simply bump up the timeout setting in Tableau Server itself.
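
          (For what it's worth, here's a sketch of bumping that timeout from the command line. This assumes a Tableau Server 8.x box where the background refresh limit is the backgrounder.querylimit setting, in seconds; double-check the setting name for your exact version before applying anything:)

```shell
# Hedged sketch: raise the background extract refresh timeout.
# backgrounder.querylimit (seconds) is assumed to be the relevant
# setting on Tableau Server 8.x; verify for your version first.
tabadmin stop
tabadmin set backgrounder.querylimit 28800   # e.g. 8 hours
tabadmin configure
tabadmin start
```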


          It sounds like you're looking for Change Data Capture, and the Extract API won't help you there. While the 8.1 API will let you create a "base" extract and then APPEND new rows to it, that won't really help you if "sales" for "record 8" changed from $10 to $5, right?

          • 2. Re: Creating a custom extract using data extract API
            Glen Park

            That's correct.  I don't have a feel for where in the extract it's failing, except that it started timing out a few weeks ago.  The table is growing by a couple hundred thousand rows daily, and extract times have been steadily increasing along with it.  Extending the timeout setting would be a last resort, given that this isn't the only extract activity going on, and the server where the source table lives is busy as well (source data pulls, fact table loading, etc.).  If anything, that activity will only increase as our data needs grow.


            What's really needed is indeed a change data capture option that's not an insert-only incremental feature, which would greatly reduce the number of rows that need to be pulled into the extract.  It seems to be a waste to pull the entire table every time when only a few hundred thousand rows have either been inserted or had some values changed. 


            The only thing I can think of at this point is to figure out a mechanism to copy a base .tde file containing the older, static data into the primary directory, run an incremental extract on top of it, and then repeat as necessary.  That would shorten the incremental load time, I suppose.  But it requires periodic hands-on management to reset a larger and larger base file...
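
            A minimal sketch of that base-file reset step in Python; the paths and the surrounding scheduling are hypothetical placeholders, and the incremental refresh itself would still run through Tableau (e.g. tabcmd or the Server scheduler):

```python
import shutil

def reset_extract(base_tde: str, live_tde: str) -> None:
    """Overwrite the live extract with the saved 'static history' base file.

    The idea: the base .tde holds the older, unchanging rows; after this
    reset, an insert-only incremental refresh appends only the recent
    segment on top. Both paths are placeholders for this sketch.
    """
    shutil.copyfile(base_tde, live_tde)
```

            You'd run this just before each incremental refresh cycle, and periodically rebuild (and re-save) a fresh base file as the "static" portion grows.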

            • 3. Re: Creating a custom extract using data extract API
              Russell Christopher

              Yup - a CDC mechanism in the extract engine would be nice, but building something like that "right" takes a lot of work.


              For better or for worse, your scenario is a bit advanced and something that many (most?) of our users will probably never bump into.  I'd suggest you search the Ideas forum for a request that looks similar to what you need (I'm sure there's one in there; I hear this question now and then) and vote it up.


              The problem with your idea is this:


              Original Extract has two rows:


              Record1 $10

              Record2 $15


              ...Over time, someone comes in and changes Record1 to $3 and adds Record3 at $8


              Your "change" extract is:


              Record1 $3

              Record3 $8


              If you lump these two datasets together, your Record1 value is now wrong: $13 when aggregated, not $3. You'd need to load a delta into your change dataset instead, like -$7 for Record1, and then hope users don't inadvertently do something silly that prevents the $10 - $7 = $3 math from occurring correctly. Lots of ways this could go wrong.
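
              That arithmetic is easy to sanity-check with a throwaway sketch (plain Python; the record names are just the ones from the example above):

```python
from collections import defaultdict

def aggregate(*datasets):
    """Sum the measure per record across appended datasets, mimicking
    what a SUM sees after an append-only extract refresh."""
    totals = defaultdict(int)
    for rows in datasets:
        for record, value in rows.items():
            totals[record] += value
    return dict(totals)

base = {"Record1": 10, "Record2": 15}

# Naively appending the changed/new rows double-counts Record1:
naive = aggregate(base, {"Record1": 3, "Record3": 8})   # Record1 -> 13, wrong

# Appending a delta (-$7) instead makes the sum land on $3, as long as
# nothing downstream disturbs the 10 + (-7) = 3 math:
with_delta = aggregate(base, {"Record1": -7, "Record3": 8})  # Record1 -> 3
```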


              Ultimately, you'd be modeling slowly changing facts, which is what data warehouses are for, not in-memory caches. You'd need a mechanism for flagging which fact is the latest/greatest and then set up filters in Tableau to ignore the other noise. I guess this could be done, but it would be somewhat unnatural...