5 Replies Latest reply on Mar 1, 2016 11:30 AM by Jeff Strauss

    Creating a scatterplot from a massive data set

    Kristina Anderson

      Tableau Community,

       

      I'm currently on a mission/directive to create a scatterplot that compares two metrics (as all scatterplots do), but lets the user define which two metrics he or she is looking at.

       

      Essentially, a list of 100 metrics or so would be a dropdown filter or parameter for the X axis, and the same list of 100 metrics would be a dropdown filter or parameter for the Y axis. In my case, the metrics are various student assessment or demographic metrics, and each dot point would be a student. Example: The user might select "Score - Statewide Math Assessment" as the X-axis metric and "Score-Citywide English Growth Assessment" as the Y-axis metric and each of the 10,000 dots on the scatterplot would represent one student's results on those two metrics.

       

      My question is: Given the huge number of metric combinations we would essentially be comparing to one another (I anticipate millions of rows... 100 X-axis metrics X 100 Y-axis metrics X 10,000 students = 100M rows), does anyone out there have suggestions for how to approach this? I'm looking for general ideas/approaches to querying such large amounts of data while keeping the tool fast for users.

       

      Our data is in SQL Server, and we'll be publishing our workbook to Tableau Server for our users. Based on similar work done previously, I fear that a live query driven by parameters would be way too slow, and that an extract would likely be way too massive.
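For context on the data shape: one way to avoid materializing every X/Y metric pair is to store a long (student, metric, value) table - roughly 10,000 students X 100 metrics = 1M rows - and let the two parameter selections drive a pivot of just the chosen pair. A minimal sketch, with SQLite standing in for SQL Server and hypothetical table/metric names:

```python
import sqlite3

# Hypothetical long-format table: one row per (student, metric) pair.
# 10,000 students x 100 metrics is ~1M rows, rather than materializing
# every X/Y metric combination up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (student_id INTEGER, metric TEXT, value REAL)")
conn.executemany("INSERT INTO scores VALUES (?, ?, ?)", [
    (1, "statewide_math", 82.0), (1, "citywide_english_growth", 3.1),
    (2, "statewide_math", 74.5), (2, "citywide_english_growth", 2.4),
])

# The two dropdown parameters choose which metrics become X and Y;
# the query only ever touches those two metrics, not all 100.
x_metric, y_metric = "statewide_math", "citywide_english_growth"
points = conn.execute("""
    SELECT student_id,
           MAX(CASE WHEN metric = ? THEN value END) AS x,
           MAX(CASE WHEN metric = ? THEN value END) AS y
    FROM scores
    WHERE metric IN (?, ?)
    GROUP BY student_id
    ORDER BY student_id
""", (x_metric, y_metric, x_metric, y_metric)).fetchall()
print(points)  # one (student_id, x, y) row per student
```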

       

      Thanks y'all!

        • 1. Re: Creating a scatterplot from a massive data set
          Ajo Abraham

          How about moving the dataset to a dedicated instance of SQL Server?  Instead of producing an extract, you could create a process to host this data set separately.  In that case even an average server should be able to handle 100M rows.  If you're okay with storing your data in the cloud, Redshift would be awesome for this and quite fast.  In either case you would have a table with Student ID as the primary key and the 100+ metrics as columns.
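A rough sketch of the wide layout described above (hypothetical names; SQLite standing in for SQL Server/Redshift, and only two of the 100+ metric columns shown):

```python
import sqlite3

# Hypothetical wide table: one row per student, each metric as a column.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE student_metrics (
        student_id INTEGER PRIMARY KEY,
        statewide_math REAL,
        citywide_english_growth REAL
    )
""")
conn.executemany("INSERT INTO student_metrics VALUES (?, ?, ?)",
                 [(1, 82.0, 3.1), (2, 74.5, 2.4)])

# A scatterplot query then reads just the two selected columns
# across ~10,000 rows -- cheap even on modest hardware.
xy = conn.execute("""
    SELECT statewide_math, citywide_english_growth
    FROM student_metrics
    ORDER BY student_id
""").fetchall()
print(xy)
```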

          • 2. Re: Creating a scatterplot from a massive data set
            Mark Fraser

            Hi Kristina

            "I'm looking for general ideas/approaches to querying such large amounts of data while keeping the tool fast for users."

            I would create an extract and, if possible, hide/remove any redundant fields (reduces size) and pre-aggregate the data (again to save space, by reducing the row count and hopefully thereby improving performance). By pre-aggregate I mean roll day-level data up to week, month, or year level.

             

            Also (and I do this where possible), swap out long strings for short numeric values - if your student IDs are ABC123456789, and you don't need to identify each student, just swap them all to 1; you'll still get the counts, volumes, etc., but it will run faster and be smaller.
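A toy illustration of that swap (hypothetical IDs; in practice you'd do the mapping in SQL before extracting). Small integers store more compactly than long strings, and counts and volumes are unchanged:

```python
import sys

# Hypothetical long string IDs, replaced by small integer codes
# (or by a single constant like 1, as suggested, if per-student
# identity isn't needed at all).
string_ids = ["ABC123456789", "ABC123456790", "ABC123456791"]
codes = {sid: i for i, sid in enumerate(sorted(set(string_ids)))}
numeric_ids = [codes[sid] for sid in string_ids]

print(numeric_ids)  # row counts/volumes are preserved, only labels change
print(sys.getsizeof(0), "<", sys.getsizeof(string_ids[0]))  # ints are smaller
```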

             

            Along the same theme - if you can reduce the data further by splitting it by year, area, state, country, etc., that would obviously benefit performance.

            Using the page field could also help with display/usage.

             

            Pre-calculating some of the metrics, rather than computing them on the fly, will also help.

             

            Just a few ideas

             

            Cheers

            Mark

            2 of 2 people found this helpful
            • 3. Re: Creating a scatterplot from a massive data set
              Kristina Anderson

              Thanks for the suggestion! This is something I will investigate with my team.

              1 of 1 people found this helpful
              • 4. Re: Creating a scatterplot from a massive data set
                Kristina Anderson

                Mark,

                 

                Thanks for the help! Unfortunately, while I appreciate the idea, our leadership team wants the data kept at a per-student level and doesn't want to remove student-identifying information. Rolling it up (to weekly, monthly, etc.) isn't relevant to our data set, but again, I appreciate the suggestions. It's always good to look at it through another paradigm!

                • 5. Re: Creating a scatterplot from a massive data set
                  Jeff Strauss

                  OK, so here's my take on your quandary and what to look for.

                   

                   

                  1. Do you really need all 100 metrics, or can you scale down the list?

                   

                  2. Assuming the answer is no, I initially lean toward the extract approach. How the data is structured internally depends not so much on the number of rows (certainly that's a factor) as on the cardinality and sparsity of the data.  In other words, if values repeat across rows, or many measures are missing across the population, they will either compress well or take minimal space.  So I'd advise doing some up-front data profiling to estimate the likely size and speed.
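A back-of-the-envelope way to do the profiling mentioned in point 2 - checking each column's cardinality (distinct values) and sparsity (fraction missing) - could look like this, with hypothetical columns and data:

```python
# Hypothetical profiling sketch: extract size depends less on raw row
# count than on cardinality (distinct values per column) and sparsity
# (fraction of missing values), since repeated values compress well
# and nulls cost little.
rows = [
    {"grade": "5", "math": 82.0, "reading": None},
    {"grade": "5", "math": 74.5, "reading": 71.0},
    {"grade": "6", "math": None, "reading": 65.5},
]

def profile(rows, column):
    values = [r[column] for r in rows]
    present = [v for v in values if v is not None]
    cardinality = len(set(present))            # distinct non-null values
    sparsity = 1 - len(present) / len(values)  # fraction missing
    return cardinality, round(sparsity, 2)

print(profile(rows, "grade"))    # low cardinality -> compresses well
print(profile(rows, "reading"))  # some sparsity -> stored cheaply
```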

                   

                  3. Get some fast disk in your server hardware - either SSD or even RAID-10.

                   

                  4. Make sure that all your metrics are numeric.  If they are, they will take less space and therefore render faster.

                   

                   

                   

                  Have you tried it yet with the entire dataset?  How does it perform?