6 Replies Latest reply on Feb 18, 2017 3:04 PM by Christian Schwehm

    How to organize 15bn data points?

    Christian Schwehm

      Hi all,


      I am about to structure the data for a huge project. We have about 100 properties for some 30,000 items at a daily level for the last 20 years => roughly 100 * 30,000 * 5,000 ≈ 15 bn data points.

      In most cases we will be interested in a subset of the properties of a subset of the items at a specific date, but sometimes also in the weighted value of a subset of the items for a subset of the properties, calculated over time.


      My question now is whether you would use the 100 properties as columns with the property names or whether you think it is better to use a normalized datasource within Tableau:


      15 bn rows with date, item and property as the key, or 150 mn rows with date and item as the key and all the properties as columns?


      Curious about your thoughts and experience. The 15 bn rows would make it easier to filter and reference, but I guess the 150 mn version is faster...
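      To illustrate the two shapes being compared, here is a minimal pandas sketch (hypothetical item and property names; the real data would of course be vastly larger) showing the same miniature data set in the "wide" (150 mn-style) and "tall" (15 bn-style) layouts:

```python
import pandas as pd

# Hypothetical miniature of the data set: 2 items, 2 dates, 3 properties
wide = pd.DataFrame({
    "date": ["2017-01-02", "2017-01-02", "2017-01-03", "2017-01-03"],
    "item": ["A", "B", "A", "B"],
    "prop1": [1.0, 2.0, 1.1, 2.1],
    "prop2": [10.0, 20.0, 11.0, 21.0],
    "prop3": [0.5, 0.6, 0.5, 0.7],
})

# "Tall" (normalized) version: one row per date/item/property
tall = wide.melt(id_vars=["date", "item"],
                 var_name="property", value_name="value")

print(len(wide))   # 4 rows  (items x dates)
print(len(tall))   # 12 rows (items x dates x properties)
```

      The tall row count is always the wide row count multiplied by the number of properties, which is exactly the 150 mn vs 15 bn trade-off above.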





        • 1. Re: How to organize 15bn data points?

          Hi Christian Schwehm,


          First of all



          Second: A normalised data source would work well; however, it could take some time to load. I would prefer your first option here.

          Would CrossTab data work?

          1 of 1 people found this helpful
          • 2. Re: How to organize 15bn data points?
            Jonathan Drummey

            I'm thinking "tall" w/15B records, here's why:


            If the properties are created as columns then calculations to make subsets of properties are going to be a pain to create e.g. [Property1] + [Property2] + [Property3]... and if there can be Nulls that means you'll have to wrap every single column in IFNULL([Property1],0) and so on. This will be slow to build, potentially impact performance more than having a pivoted "tall" source, and certainly be painful to maintain.
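            To make that maintenance pain concrete, here is a small pandas sketch (hypothetical column names) contrasting the per-column null-wrapping the wide shape forces with a single aggregate over the tall shape:

```python
import pandas as pd
import numpy as np

# Hypothetical wide table with a Null in one property column
wide = pd.DataFrame({
    "item": ["A", "B"],
    "prop1": [1.0, np.nan],
    "prop2": [2.0, 3.0],
})

# Wide shape: every column must be null-wrapped individually --
# the analogue of IFNULL([Property1],0) + IFNULL([Property2],0) + ...
wide["subset_sum"] = wide["prop1"].fillna(0) + wide["prop2"].fillna(0)

# Tall shape: one groupby covers any subset of properties,
# and the aggregate skips Nulls on its own
tall = wide.melt(id_vars="item", value_vars=["prop1", "prop2"],
                 var_name="property", value_name="value")
tall_sum = tall.groupby("item")["value"].sum()

print(wide["subset_sum"].tolist())  # [3.0, 3.0]
print(tall_sum.tolist())            # [3.0, 3.0]
```

            With 100 property columns the wide-shape formula has 100 terms to write and keep in sync; the tall-shape aggregate stays one line no matter how many properties there are.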


            Another thing to keep in mind if you're building dashboards is that the "one data source to rule them all" approach often doesn't work for Tableau; for performance and speed of interactivity we'll often need to create aggregated views and/or extracts of the data. For example, if there's an overview view showing the last 5 years at a monthly level per property, then a data source with 5 years * 12 months * 100 properties = 6,000 rows will be fast to load and work with. From there the user could use Filter Actions that apply to other detail views/dashboards accessing the larger data set, but in a filtered way that keeps them fast.
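            The pre-aggregation step could be sketched like this in pandas (made-up property names and random values; only the row counts matter here):

```python
import pandas as pd
import numpy as np

# Hypothetical daily fact table in the "tall" shape: 5 years,
# 2 properties (the real case would have 100)
rng = pd.date_range("2012-01-01", "2016-12-31", freq="D")
daily = pd.DataFrame({
    "date": np.repeat(rng, 2),
    "property": ["p1", "p2"] * len(rng),
    "value": np.random.rand(len(rng) * 2),
})

# Pre-aggregated overview source: one row per month per property
monthly = (daily
           .groupby([daily["date"].dt.to_period("M"), "property"])["value"]
           .mean()
           .reset_index())

# 5 years x 12 months x 2 properties = 120 rows,
# down from ~1,800 days x 2 properties of daily detail
print(len(monthly))  # 120
```

            The overview dashboard then queries the tiny `monthly` source, and only the filtered drill-down views touch the full daily table.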


            FYI the best reference on improving Tableau performance is the Designing Efficient Workbooks white paper by Alan Eldridge: http://www.tableau.com/learn/whitepapers/designing-efficient-workbooks, be sure to check that out as well.



            2 of 2 people found this helpful
            • 3. Re: How to organize 15bn data points?
              Yuriy Fal

              May I add my 2c please?


              "Organizing" can mean different things (all of them matter):


              -- loading data from original sources in raw format (into a "staging area")

              -- storing data effectively (typically it means sorting / compressing / cleaning)

              -- making (derived) data structures useful for different kinds of analyses.


              The latter point is typically underestimated (even when doing the former two "right").

              What kind of analyses are you expecting? Two main groups would be ML and OLAP.


              The former -- applying Machine Learning algorithms -- is better served by a flat table with Items-Dates as Rows ("instances") and all Attributes (and their derivatives) as Columns ("features").


              The latter (OLAP) is what Tableau is best at. As Jonathan mentioned above (and rightfully so), for most OLAP (aggregate & filter) scenarios the better shape is a so-called "tall" dataset comprising facts & dimensions, which in your case would be a "periodic snapshot" (a daily one) of Items' Attribute Names & Values (current as of date).


              How wide / tall this data source should be depends on many things:


              -- the data types of the Attribute Values: are they just one data type (INT, FLOAT, BOOLEAN, DATE/TIME, STRING) or more? If the latter, then at least one column per data type is recommended for a "tall" DS


              -- do filters on different Attributes go mainly with an AND clause? For example:

              ( [Attribute1] = 'Value1' OR [Attribute1] = 'Value2' )  AND  ( [Attribute2] < 0.5 )

              If mostly yes, then a somewhat "wider" table -- with [Attribute1] and [Attribute2] as distinct columns -- would be more performant than its "taller" cousin.


              ... and that could go as far as reconsidering the "wide-flat-table" approach :-)
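              To see why AND filters favour the wider shape, here is a small pandas sketch (hypothetical attribute names and values): in the wide layout the condition is one row predicate, while the tall layout needs one pass per attribute plus an intersection of the matching items:

```python
import pandas as pd

# Hypothetical wide table: one row per item, attributes as columns
wide = pd.DataFrame({
    "item":  ["A", "B", "C"],
    "attr1": ["Value1", "Value2", "Value3"],
    "attr2": [0.3, 0.7, 0.4],
})

# Wide: the AND filter across attributes is a single row predicate
hits_wide = wide[wide["attr1"].isin(["Value1", "Value2"])
                 & (wide["attr2"] < 0.5)]

# Tall: one pass per attribute, then an intersection of the item sets
tall = wide.melt(id_vars="item", var_name="attribute", value_name="value")
a1 = tall[(tall["attribute"] == "attr1")
          & tall["value"].isin(["Value1", "Value2"])]
a2 = tall[tall["attribute"] == "attr2"]
a2 = a2[a2["value"].astype(float) < 0.5]
hits_tall = set(a1["item"]) & set(a2["item"])

print(hits_wide["item"].tolist())  # ['A']
print(sorted(hits_tall))           # ['A']
```

              A database has to do essentially the same extra work (a self-join or set intersection) for every AND clause against a tall table, which is where the wide shape wins.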


              Besides, the most restricting factors are typically these:

              -- How easy is it to transform the original raw data into the "right" shape?

              -- Could these transformations be automated / scheduled / set & forgotten?


              'Feeling that's enough for a 2c :-)




              1 of 1 people found this helpful
              • 4. Re: How to organize 15bn data points?
                Christian Schwehm

                hello Lenaic, that's also my current intention - although we are still deciding...

                • 5. Re: How to organize 15bn data points?
                  Christian Schwehm

                  hi Jonathan, your point is very valid and is the only reason why I am considering the tall version at all. Luckily the data is rather clean and the properties (columns) are rather stable, so I think we will risk the small version :-)

                  • 6. Re: How to organize 15bn data points?
                    Christian Schwehm

                    hi Yuriy, thanks for the great comments. The use case is OLAP and the attributes are floats. Filters would usually go with AND, and the transformation of the data is easy, as most of it is fed via an OLAP layer built especially for this project.


                    Cheers, Christian