6 Replies Latest reply on Jun 4, 2011 12:52 AM by Richard Leeke

    My 4-hour extract

    Michael Mixon



      I've got a source file (.txt) from which it takes almost 4 hours to create an extract.  I'm curious if there might be ways to speed things up, as not only does it take a long time to refresh the extract, but I find that my overall system slows down while this extract is being created.


      The specs I can give you are: 1) the text file is approx. 800MB, 2) ~300K rows, 3) 70 dimensions, and 4) 30 measures.


      The extract doesn't include all of these fields, and I've selected 'Aggregate for visible dimensions' and 'Hide unused fields'.


      Any ideas?



        • 1. Re: My 4-hour extract
          Richard Leeke

          My first thought would be to establish what phase(s) of the extract creation are taking the time - 4 hours sounds like a long time for 300K rows, but you do have a lot of columns.  I'd have a careful look at the timings in the Tableau log file to work that out.


          How many rows and how many columns in the resulting extract?


          Off the top of my head, a couple of possibilities for the slow parts are:


          1) Execution of the query (via JET) to return the required results for creation of the extract (it could be that it takes JET a long time to generate the necessary aggregates, for example).


          2) Generation of the columnar representation of all of those columns (which is why I was asking how many columns in the resulting extract - i.e. not hidden).



          Depending on where the time is going, a couple of things it might be worth experimenting with (I'd do this on a sample of 10K rows, or so, initially):


          1) If the initial query is slow, it might be worth trying to create the extract without selecting 'Aggregate for visible dimensions'.  The resulting extract will probably be bigger, but it might just get built faster.


          2) Something I've never tried with the data engine, but sometimes used to do back in the old days with Firebird: create a simple extract (no aggregation, etc.), then open a new connection directly to that extract and generate a second extract from it.  (I used that in Firebird to run custom SQL more quickly than JET could, but it might also help with the aggregation in your case.)


          That second one would be something of a last resort - it makes managing the workbooks, refreshing extracts, etc. a very clunky two-stage process - but it might just throw some light on where the time is going.
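          The 10K-row sample I mentioned is easy to produce without opening the full 800MB file; for example, a short Python sketch (file names here are placeholders, and I'm assuming the first line of your .txt file is the header):

          ```python
          def take_sample(src, dst, n_rows=10_000):
              """Copy the header line plus the first n_rows data rows of a
              delimited text file to a new, smaller file."""
              with open(src, "r", encoding="utf-8") as fin, \
                   open(dst, "w", encoding="utf-8") as fout:
                  for i, line in enumerate(fin):
                      if i > n_rows:   # line 0 is the header, then n_rows data lines
                          break
                      fout.write(line)

          # take_sample("source.txt", "sample_10k.txt")  # adjust names to suit
          ```

          Point Tableau at the sample file and you can try both extract options in minutes rather than hours.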



          6.1 may also offer some relief - especially if speed rather than size of the resulting extract is the main issue.  There is a faster option for loading text files coming, though it won't offer the 'Aggregate' or 'Hide' options.  I think I recall that there are some other optimisations coming, even for the normal extract creation process, but I've no idea how significant they would be in your case.

          • 2. Re: My 4-hour extract
            Michael Mixon

            Hi Richard,


            I'm not terribly proficient at interpreting the log files, but here are the main things that jumped out at me from this morning's 4-hour extract process:


            - Start time: 8:26AM

            - End time: 12:25PM


            There's a bunch of activity from 8:26 to 8:44, where it looks like Tableau is finding and preparing all of the fields in the data set.  No individual step appears to take very long; there are just a lot of lines.


            Then there is a 3 hour gap, from 8:44 to 11:49, where it appears Tableau executed the main query.  It states that it took 11122.35 seconds to complete the query.


            Then from 11:49 to 12:16 it appears to have created the actual extract.  It states it inserted 50,310,244 values.


            Then there are a bunch of lines from 12:16 to 12:25, which look like additional extract steps (optimization, maybe?).  No individual step takes more than a second or two.


            So, clearly, the issue lies in the actual data query.  Is there anything I can look for in the steps leading up to the 3-hour gap in the log file that might shed light on why the data query would take 3 hours?  The data file is a .txt file, stored locally on my machine.
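            For what it's worth, a low-tech way to find gaps like that is to scan the log for large elapsed-time figures and convert them into something readable.  The exact wording varies by Tableau version, so the regex below is an assumption based on the 'took 11122.35 seconds' line I saw - adjust it to match your log:

            ```python
            import re

            def long_steps(log_lines, threshold_s=60.0):
                """Return (seconds, line) pairs for log lines reporting an
                elapsed time above threshold_s.  Assumes durations appear
                as e.g. 'took 11122.35 seconds' (an assumption about the
                log format - tweak the pattern for your version)."""
                pat = re.compile(r"(\d+(?:\.\d+)?)\s*seconds")
                hits = []
                for line in log_lines:
                    m = pat.search(line)
                    if m and float(m.group(1)) >= threshold_s:
                        hits.append((float(m.group(1)), line.strip()))
                return hits

            def hms(seconds):
                """Format a duration in seconds as H:MM:SS."""
                s = int(round(seconds))
                return f"{s // 3600}:{s % 3600 // 60:02d}:{s % 60:02d}"

            print(hms(11122.35))  # 3:05:22 - matches the 8:44 -> 11:49 gap
            ```

            That confirms the query alone accounts for essentially the whole three-hour gap.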


            I'll hold off on trying the second option for now.

            • 3. Re: My 4-hour extract
              Richard Leeke

              Your interpretation all makes sense - and it certainly seems to agree with my hunch about where the time might be going.


              In order to speculate on why the query is taking so long I'd really need to see the query and also understand the data distribution in the columns (and even then it would be a bit of a guess).


              As I said before, I suspect that the time may be taken in JET evaluating the grouping required to 'Aggregate for visible dimensions'.  With the number of rows and columns you have, I can imagine that taking a long time in JET.  (I still remember one occasion when I extrapolated from a progress indicator to work out how long a very complex, multi-table JET query for a data extract was going to take, and found the answer was about 8 months - so I'm quite prepared to believe it can be slow.)


              What I understand that option is doing is pre-calculating aggregate measures, to reduce the size and probably speed up queries on the resulting extract.  I have never really experimented with that option, so I don't have a good feel for how much difference it makes - but it could just be that in the case of JET the cost of generating it outweighs the benefits.


              If it was me, the next thing I'd do before spending too long trying to analyse it is just test without setting 'Aggregate for visible dimensions'.  I would probably still select 'Hide unused fields' - I find it hard to imagine that that can make JET slower.


              I would generate that with a copy of the workbook and give the extract a different name.  That way you could easily do side-by-side speed comparisons of views on the two versions of the extract.  Obviously the other factors are how long it takes to generate the extract and how big the extract file is (if you even care about that).


              I'm interested to hear what you find.

              • 4. Re: My 4-hour extract
                Richard Leeke

                I had a bit of a play to see if there was anything in my hunch about 'Aggregate for visible dimensions' potentially slowing down the extract creation.


                In short, it does seem to make a difference - but to find out how much difference you'd have to try it on your data.


                First file I tried had 4 million rows, 20 measures, 20 dimensions.


                Creating the extract without 'Aggregate for visible dimensions' selected took 16 minutes.


                I then tried selecting 'Aggregate for visible dimensions'; this time it failed after about 5 minutes, with an MS JET error claiming to have run out of disk space for temporary storage.  That seemed surprising, because there was 90 GB free, so I tried again, monitoring free space throughout, and never saw it drop below 90 GB.  So unless JET calculates that it needs more than 90 GB and throws the error without actually allocating the space, the message is almost certainly spurious.  A bit of Googling showed that JET displays that message in various inappropriate contexts - generally associated with some sort of resource constraint, but not necessarily disk space.
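                If you want to check whether JET really is chewing through temporary disk space on your machine, you can log free space at intervals while the extract runs.  A minimal sketch (the path and interval are arbitrary - point it at the drive holding your temp directory):

                ```python
                import shutil
                import time

                def monitor_free_space(path=".", interval_s=5.0, samples=720):
                    """Print free disk space on the drive holding `path` every
                    interval_s seconds, for up to `samples` readings (the
                    defaults cover about an hour).  Run this in a separate
                    console while the extract is being built."""
                    for _ in range(samples):
                        free_gb = shutil.disk_usage(path).free / 1024**3
                        print(f"{time.strftime('%H:%M:%S')}  {free_gb:,.1f} GB free")
                        time.sleep(interval_s)

                # monitor_free_space("C:\\")  # uncomment to run alongside the extract
                ```

                If free space never dips, that's reasonable evidence the 'out of disk' error is spurious rather than a genuine allocation failure.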


                It's odd that I ran out of anything: this is a Windows 7 64-bit machine with 8 cores, 8 GB of memory and 90 GB of free disk space - but who knows what JET is up to.  In any case, it does hint that JET is working quite hard, so I can easily believe that the 'Aggregate for visible dimensions' setting accounts for a lot of your time.


                Out of interest, I also tried generating the same extract with the 6.1 beta (without setting 'Aggregate for visible dimensions').  That took 3 minutes.


                I then repeated the test with a smaller file (1 million rows, 10 dimensions, 10 measures).  That took 3 minutes with 6.0 without 'Aggregate for visible dimensions', 5 minutes with 6.0 with it, and 40 seconds with the 6.1 beta without it.



                • 5. Re: My 4-hour extract
                  Michael Mixon

                  Hi Richard,


                  I went ahead and created a copy of my workbook, and refreshed it without setting "Aggregate for visible dimensions", keeping everything else the same.  My findings are:


                  - That shaved about 30 minutes off the refresh time.

                  - The resulting extract was about 37MB larger (144MB vs 107MB).

                  - The resulting workbook was 4MB larger.

                  - Performance between the two workbooks was identical.  No slow-down due to the non-aggregated extract.


                  So, not selecting 'Aggregate for visible dimensions' saves a little time and doesn't cause any performance slow-down, but it does produce a slightly larger file.  Given the already large size of the workbook (22MB), I may opt to stay with the slightly smaller version and just wait the extra half hour. :-)



                  • 6. Re: My 4-hour extract
                    Richard Leeke

                    Thanks for posting the feedback - it's always good to have some hard numbers to base things on rather than just guesses.


                    I can see why you'd stay with the status quo for now, but since you didn't notice any performance difference in the resulting workbook, it may well be worth accepting the slightly larger workbook and extract in return for the much bigger improvement in extract generation time that 6.1 should give you.


                    I'm still intrigued by JET claiming I didn't have enough disk space - but life's too short to go worrying about all of JET's foibles.