4 Replies Latest reply on Oct 10, 2013 6:37 AM by Sawan Ruparel

    Super Big Data

    Robert Sutter

      I'm working with an organization that would like to have ad-hoc reporting at the ready and get out ahead of some of its data needs.


      In the past I've created flexible canned reports that offer slice-and-dice options, pre-aggregating the data so that it's smaller.


      Much of the data here needs to be shown at the hour level, and distinct counts are needed across varying slices of the data. As soon as I pre-aggregate the data, the distinct counts are lost.
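
      To illustrate the problem described above, here is a minimal Python sketch (with hypothetical event data) of why distinct counts cannot be recovered from pre-aggregated slices: the same user can appear in several hourly buckets, so summing per-hour distinct counts overstates the true total.

```python
# Hypothetical (hour, user) event rows.
events = [
    ("2013-10-01 09:00", "alice"),
    ("2013-10-01 09:00", "bob"),
    ("2013-10-01 10:00", "alice"),   # alice appears again in a later hour
    ("2013-10-01 10:00", "carol"),
]

# Pre-aggregation: distinct users per hour.
per_hour = {}
for hour, user in events:
    per_hour.setdefault(hour, set()).add(user)
hourly_distinct = {h: len(users) for h, users in per_hour.items()}

# Rolling up the pre-aggregated numbers double-counts alice...
summed = sum(hourly_distinct.values())             # 4
# ...while the true distinct count over the raw rows is smaller.
true_distinct = len({user for _, user in events})  # 3
print(summed, true_distinct)
```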


      Because we're talking really big data (Oracle, but moving to Hadoop because of size issues), how are Tableau users extracting and processing billions of records to make ad-hoc visualizations possible?



        • 1. Re: Super Big Data
          Robert Morton

          Hi Robert,

          Several systems, such as Cloudera Impala, are making big advances in fast queries against massive amounts of data. However, the standard Hive SQL interface will expose your queries to the batch nature of MapReduce, so you may not get terrific response times if that's your only option. For data that large, even Data Engine extracts may not give you enough of a performance boost at the level of granularity you require. Consider that you may need to pay for a commercial Hadoop distribution if you need better performance than what the standard Apache distribution provides.


          • 2. Re: Super Big Data
            Russell Christopher

            +1 on Robert’s comments about Impala. You can download Cloudera's “Quick Start” virtual machine for testing; I was pleasantly surprised by how quickly I got the solution up and running for personal testing, even though I rate my Linux and Hadoop stack knowledge as “somewhere south of very, very low”.  This might be a good fit for you.

            • 3. Re: Super Big Data
              Cristian Vasile



              If you are an MS Windows shop, try the Hortonworks Hadoop for Windows distribution.

              Another little-known distribution, paired with a SQL engine, is the one produced by Pivotal (EMC & VMware); take a look here: Pivotal HD | GoPivotal

              Hope this helps.




              • 4. Re: Super Big Data
                Sawan Ruparel

                I think Tableau Product Design answers your question.


                Let's see if I understand the question correctly:

                1. You have big data, and it may not be possible for you to bring that data in as an extract.

                2. You plan to move to Hadoop to store the data generated by the system. There are connectors to Hadoop, but you are concerned about the performance of queries against it.

                3. You would like to keep the distinct values available for drill-down if required by the user.

                4. You want ad-hoc reporting capability on big data.


                The way we try to solve this problem:

                1. We create a knowledge matrix from the big data that we use to show counts and aggregate-level data. The knowledge matrix can be built by understanding the dimensions and measures in your data; hopefully they are well defined and not dynamic.

                2. We have drill-down configured from there to fetch the row-level data into a different workbook.

                3. We realized there is no magic. We just have to structure the data correctly into a polyglot data management system composed of a high-speed in-memory database and a slower, larger database. The distribution of the data follows the business needs: for example, the daily transaction log is always available for drill-down in the extract, historic queries are served from the knowledge matrix, and row-level data is fetched from the Hadoop system on an as-needed basis.
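
                The two-tier layout above can be sketched in a few lines. This is only an illustration, using sqlite3 in place of the real in-memory database and the Hadoop row store; the table and column names are hypothetical. The point is that the "knowledge matrix" is built from the row-level data with the distinct counts computed at build time for the known slices, while drill-down goes back to the row-level table on demand.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Row-level store (stands in for the Hadoop tier): one row per transaction.
    CREATE TABLE transactions (day TEXT, region TEXT, user TEXT, amount REAL);
    INSERT INTO transactions VALUES
        ('2013-10-01', 'east', 'alice', 10.0),
        ('2013-10-01', 'east', 'bob',    5.0),
        ('2013-10-01', 'west', 'alice',  7.5);

    -- "Knowledge matrix" (stands in for the fast in-memory tier):
    -- pre-aggregated over the known dimensions, with the distinct count
    -- computed at build time so it is not lost in the rollup.
    CREATE TABLE knowledge_matrix AS
    SELECT day, region,
           COUNT(*)             AS row_count,
           COUNT(DISTINCT user) AS distinct_users,
           SUM(amount)          AS total_amount
    FROM transactions
    GROUP BY day, region;
""")

# Ad-hoc reporting hits the small aggregate table...
report = con.execute(
    "SELECT region, distinct_users FROM knowledge_matrix ORDER BY region"
).fetchall()

# ...and drill-down fetches row-level detail only when the user asks for it.
detail = con.execute(
    "SELECT user, amount FROM transactions WHERE region = 'east'"
).fetchall()
print(report, detail)
```

                In a real deployment the aggregate table would be rebuilt on a schedule as new data lands, and each slice the business actually queries needs its own pre-computed distinct count, since distinct counts cannot be rolled up further after the fact.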


                Hope this helps.


                - Sawan