3 Replies Latest reply on Dec 10, 2018 7:10 AM by Jonathan Drummey

    Optimizing performance for large blended data set

    Cameron Gocke

  I have a large data set in a MySQL data store that consists of survey data and data on the members who completed the surveys.  The data set with the survey questions and answers is right at 1B rows, and some of the individual surveys have 200+ rows (one row for each possible answer).  We need to slice the data up based on attributes of the member demographics (age, gender, ethnicity, etc.).  Because of the size of the data set we use a Tableau Server data extract that is updated via incremental refreshes.  We currently have 2 data sources defined, one for the survey data and one for the member data.  The problem is that blending the 2 based on the member Id is way too slow.


      The only solution I've come up with is to publish the member attributes as additional columns in the survey data source.  That is going to add a lot of columns to that data source, however, and I'm not sure it is the right/best answer.
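      To make that flattening concrete, here is a minimal sketch in Python/pandas (column names like `member_id` and `question_id` are made up for illustration, not taken from the actual schema): the member attributes get attached to every survey row so a single extract replaces the blend.

```python
import pandas as pd

# Hypothetical miniature versions of the two data sources
surveys = pd.DataFrame({
    "member_id":   [1, 1, 2],
    "question_id": ["Q1", "Q2", "Q1"],
    "answer":      ["Yes", "No", "Yes"],
})
members = pd.DataFrame({
    "member_id": [1, 2],
    "age":       [34, 57],
    "gender":    ["F", "M"],
})

# Denormalize: left-join the member attributes onto every survey row,
# trading extra columns for a single source with no cross-source blend
flat = surveys.merge(members, on="member_id", how="left")
```

      The row count stays the same as the survey table; only the width grows, which is the trade-off being weighed here.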


      Along a similar line, there are also scenarios where the data needs to be sliced by the survey values themselves: for example, slicing one set of survey answers based on values like blood pressure or weight that were captured as part of other survey answers.
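      In a tall question/answer layout, that kind of slicing amounts to a self-join on the respondent: find the members whose answer to one question meets a condition, then filter the rest of their answers. A hedged sketch in pandas (the question IDs and values below are invented for illustration):

```python
import pandas as pd

# Hypothetical tall survey data: one row per (member, question)
answers = pd.DataFrame({
    "member_id":   [1, 1, 2, 2, 3, 3],
    "question_id": ["bp", "smoker", "bp", "smoker", "bp", "smoker"],
    "value":       ["150", "Yes", "118", "No", "160", "No"],
})

# Step 1: members whose blood-pressure answer exceeds a threshold
bp = answers[answers["question_id"] == "bp"]
high_bp_members = bp.loc[bp["value"].astype(float) > 140, "member_id"]

# Step 2: slice ALL of those members' answers (a self-join on member_id)
sliced = answers[answers["member_id"].isin(high_bp_members)]
```

      In SQL or in Tableau this is the same idea (a subquery or a set/filter keyed on the respondent ID), and on 1B rows it is the join strategy, not the logic, that determines performance.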


      Should this just be one large (wide and long) data set, or is there a better way to join one data set to the other without paying such a huge performance penalty?


      Here are some screenshots of example data from the survey and demographics data sets:


        • 1. Re: Optimizing performance for large blended data set
          Patrick Van Der Hyde

          Hello Cameron,


          As this relates to survey data, I would recommend reviewing Zen Master Steve Wexler's site https://www.datarevelations.com as a starting place for working with survey data, and perhaps engaging with Steve to help with how to tackle a project with so much data.  I would normally recommend tall data for Tableau (only a few columns but plenty of rows), but your case is challenging given the total size of the data.  My guess is that the best way to handle this is still to get the data into a format Tableau handles well, via an ETL process that runs frequently and stores the data in a summary format that can be updated as often as desired.


          I hope this gets you started and please come back and let us know how things are going.



          • 2. Re: Optimizing performance for large blended data set
            Steve Wexler



            So, this is uncharted territory for me, as the largest survey I've worked with is 100m rows.  The original data had 13,000 columns and 3,500 respondents.  When we reshaped / pivoted the data it yielded 20 columns (demographic data, question ID, question metadata) and 100m rows.  It worked very well.
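            The reshape Steve describes (one wide row per respondent pivoted into a tall question/answer layout, with demographics kept as id columns) can be sketched in pandas; the column names here are hypothetical stand-ins:

```python
import pandas as pd

# Hypothetical wide export: one row per respondent, one column per question
wide = pd.DataFrame({
    "respondent_id": [101, 102],
    "age":           [29, 63],
    "Q1":            ["Agree", "Disagree"],
    "Q2":            ["Brand A", "Brand B"],
})

# Pivot to tall: demographics stay as id columns; every question column
# becomes a (question_id, answer) row, so rows = respondents x questions
tall = wide.melt(
    id_vars=["respondent_id", "age"],
    var_name="question_id",
    value_name="answer",
)
```

            This is how 3,500 respondents times thousands of question columns multiplies out to tens of millions of tall rows while the column count collapses to a handful.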


            For anyone thinking "13,000 columns?", it was a brand comparison survey, so it was something like 100 different brands being compared in one survey.  Is that the type of thing you're working with?


            BTW, performance was quite good, but there's a big difference between 100m rows and 1B rows.


            I've always joined the demographic data with the pivoted / reshaped survey data so I have the separate columns.  Can you describe how it is that you are getting 1B rows? 


            Steve Wexler

            • 3. Re: Optimizing performance for large blended data set
              Jonathan Drummey

              Hi Cameron,


              A couple of other points to add to what Patrick & Steve wrote:


              1) You mentioned that you are using extracts. Tableau is extremely efficient at compressing data in extracts, so joining the dimension table to the answers table is likely to add less size to the answers extract than you might think.


              2) That said, if you're not getting the performance that you need out of the extracts and you're on 2018.3, you could try out the multiple tables extract option; it may or may not improve performance in this situation.


              3) If #1 and #2 aren't fast enough, then I'd suggest using two data sources: your original 1B record source for the details, and "before"/"above" that an aggregated data source (potentially an aggregated extract itself). This builds data sources along the lines of Ben Shneiderman's visual information seeking mantra: "Overview first, zoom and filter, then details on demand". Users would start out in views built on the aggregated source to get an overview and to zoom & filter into the data; once they've gone far enough to want specific details, Filter Actions and/or cross data source filters would take them into the 1B record source. The idea is that the initial interaction is fast because you're using a smaller data source, and by the time the user is diving into details there are sufficient filters in place that queries on the detail source are also fast.
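              The aggregated "overview" source in #3 is essentially a pre-computed roll-up of the detail rows. A minimal sketch in pandas, assuming hypothetical column names (in practice this would be a scheduled query or ETL job against the 1B-row table):

```python
import pandas as pd

# Hypothetical detail-level answers (stand-in for the 1B-row source)
detail = pd.DataFrame({
    "question_id":  ["Q1", "Q1", "Q1", "Q2", "Q2"],
    "gender":       ["F", "M", "F", "F", "M"],
    "answer_score": [4, 3, 5, 2, 4],
})

# Pre-aggregate to an overview source: one row per question/segment,
# so the dashboard's first views query far fewer rows
overview = (
    detail.groupby(["question_id", "gender"], as_index=False)
          .agg(responses=("answer_score", "size"),
               avg_score=("answer_score", "mean"))
)
```

              The overview source grows with the number of question/segment combinations rather than with the number of answers, which is what keeps the top-level views fast.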