4 Replies Latest reply on Jul 10, 2014 7:37 AM by Steve Martin

    Performance optimisation using multiple extracts

    Steve Martin

      Hi gang,


      I have a very large and very under-performing db2 table, it forms part of the enterprise system so I am unable to make changes or request the dba's to make changes in short, I'm stuck with it and at 230m records over 29 columns made-up for primarily VarChar(100) and growing daily, it is becoming slower and slower. I have no etl options open to me to reduce the set based on segmentation data and my viz, despite utilising context filters to cause a temp table to be created on the segment of choice takes at best 12 minutes to render and at worst, over 2 hours - not the best advert for Tableau marketing I know.


      The only suitable way I can think of getting around this is to create a suite of extracts based against each segment; I do have additional dimensions I can create a context filter against to reduce the data further though as the segments are not evenly distributed, the lowest has 500k records with the largest containing around 65m records.

      The only way I can think of getting around this is to create the multiple extracts and then create multiple versions of the same viz - one per segment, layered one on top of another in the same container controlled by a parameter to show/hide sheets.


      Has anyone got any better ideas on how I can achieve this? As an example, when testing using one of the smaller segments (~2m records) rendering time was reduced from 18 mins against the main source to 43 secs.

        • 1. Re: Performance optimisation using multiple extracts
          Tom W

          Do you need that level of granularity or could you aggregate it to a higher level to reduce the amount of records?

          Are all the columns required?


          When you say it takes 12 minutes to render, is this after data has been retrieved? Or are you connecting live?

          • 2. Re: Performance optimisation using multiple extracts
            Steve Martin

            Unfortunately, this level of granularity is needed and yes all of these columns are needed (the table is actually 56 columns for which I need 29 of them). In an ideal world, I would be provided with dimension id's and dimension tables to blend inside of Tableau but this is not that case and my protestations to the contrary are falling on deaf ears which means a lot of unnecessary data being pulled from disc.


            On rendering, currently connecting live so it takes a minimum of 12 mins for the query to return (worse in the afternoon) before the viz is rendered - apologies, that was misleading - and this is why I am looking at the extracts.


            My thoughts are that they can be updated once, they would be significantly smaller but running from the beefy Tableau Server rather than the lower performing and heavier loaded db2 enterprise solution; it matters not to some degree how many users are opening the dashboard as they are not hitting the server; we can significantly reduce network latency as db2 (for updating) and Tableau Server are in Seattle whereas we are in London etc.


            Essentially, I am using the benefits of the offline [type] extract and doing that which is not available to me for the server solution by breaking-down the table by [Segment_Name].

            • 3. Re: Performance optimisation using multiple extracts
              Tom W

              Are you proposing to create a separate extract for each segment just to speed it up or because you don't need data for all segments?


              Why not just pull one extract of everything? Sure it would take more work upfront everyday, but once you've pulled it down all segments would be available.

              • 4. Re: Performance optimisation using multiple extracts
                Steve Martin

                Both but primarily from a speed perspective.


                This book came to me to optimise and I have to admit that with such tight rules I am running into blockers. It was originally setup with the segment_name (segment) field as a context filter, the previous creator realising that this would significantly reduce the data being worked-on; the segments are used only to provide a basis for the viz so switching the segment only serves to adjust the view which is why in this instance I could ultimately get away with having a single extract per segment and a the same viz duplicated once per data-source.


                I did originally look at loading all data into the extract; I had already been made aware by the head of data engineering that this had been tried but there was simply too much data to make this viable - tried this morning and got a rap across the knuckles as I was causing a bottle-neck and preventing other key areas from updating; also, given the way the data is coming in (from Hadoop) being processed through a convoluted etl process etc, incremental updates are unavailable meaning a full data refresh is needed each time.


                So, this is what has brought me to what I perceive as being the only method left - individual extracts for each segment, individual viz's running from each extract and a parameter controlling which viz is in view