12 Replies Latest reply on Oct 15, 2013 9:06 AM by Russell Christopher

    What server hardware requirements will I need to run large extracts (60+ GB)?

    Rich Lucas

      I know Tableau's customers include some of the largest companies in the world, and I imagine they run VERY large extracts.

       

      Currently we have hardware with:

       

      CPU - 8-Core

      RAM - 32 GB

      Storage - 250 GB

       

      We have extracts anywhere between 20 and 80 GB in size that are timing out. Does anyone have any recommendations to help improve performance?

       

      Running the query directly in SQL for one of the data sources takes about 23 minutes, but building the extract in Tableau takes over 2 hours.

       

      Any input would be EXTREMELY helpful!

       

      Thanks,

       

      Rich

        • 1. Re: What server hardware requirements will I need to run large extracts (60+ GB)?
          Toby Erkson

          Your hardware is fine; it's the server tuning you need to work on now.


          Here's part of my batch file I run after installing Tableau Server:

          @ECHO OFF
          REM  Set PROD Tableau Server v.8 settings for high extract environment.
          REM  20May2013

          ECHO Stopping server...
          tabadmin stop

          REM  Turn on echoing to show what's being changed.
          @ECHO ON

          REM  28800 seconds = 8 hours
          tabadmin set backgrounder.querylimit 28800

          REM  900 seconds = 15 minutes
          tabadmin set vizqlserver.querylimit 900

          REM  20 minutes
          tabadmin set vizqlserver.session.expiry.timeout 20

          REM  120 minutes = 2 hours
          tabadmin set wgserver.session.idle_limit 120

          REM  Set VizQL Server processes (X = worker number)
          tabadmin set workerX.vizqlserver.procs 3

          REM  Set Background Server processes
          REM  Currently must be manually set to two.
          ...

          @ECHO OFF
          ECHO Configuring the server, please wait...
          tabadmin configure
          ECHO Starting the server...
          tabadmin start
          ECHO Done!

           

          Now go to the TabWiki to see what these commands mean and adjust them for your environment. Mine were just guesses, with a little geek's intuition based on how the server is being used, so I can't really give you any solid numbers. Changes are easy, so if something doesn't work, just change the value and re-run the batch file.

           

          It goes without saying (but I'll say it anyway) that well-formed extracts are key to quicker data retrieval, so take a critical eye to how you build and filter them.

          • 2. Re: What server hardware requirements will I need to run large extracts (60+ GB)?
            Rich Lucas

            Thank you, Toby.

             

            I am not familiar with this code, but forwarded this to our DBA in charge of the configuration. 

             

            Is this just changing the timeout limits on the server, or does it actually help improve the performance of the extracts?

             

            Thanks again for your help!

             

            Rich

            • 3. Re: What server hardware requirements will I need to run large extracts (60+ GB)?
              Ron Chipman

              Toby,

               

              Thanks a million for your input.  Love having great comrades within the community to help us newbies out.

               

              To your comment about well-formed extracts: what is the 'definition' of one? I would hope, given all the advances in technology, that we could leverage our well-formed star schemas 'as-is' and use a single extract to support many custom analytics. That reduces maintenance and enhancement effort, which is the largest cost to an organization as you build out new capabilities.

               

              We have a simple fact table with 8 dimensions. If we limit the time parameters to a rolling three years (2011-2013, for now), we have 60-100M rows of data. We have tried to cut out unique text fields and redundant IDs, but we want to keep as many attributes as possible to enable exploratory analysis. Rich Lucas mentioned 60+ GB; some of ours may be closer to 200 GB, but I have to imagine that still isn't 'large'.

               

              Am I expecting too much from Tableau?

              • 4. Re: What server hardware requirements will I need to run large extracts (60+ GB)?
                Toby Erkson

                Rich, the code is for an MS-DOS batch file (.bat file extension). This is something your Tableau Administrator would execute, not a DBA (DataBase Administrator).

                 

                Yes, the values affect how long TS (Tableau Server) waits for certain things; again, see the TabWiki link for the details.

                Check with your database admin(s) to see whether they have implemented any time limits of their own; if so, they may need to adjust them for your business needs.

                • 5. Re: What server hardware requirements will I need to run large extracts (60+ GB)?
                  Toby Erkson

                  Ron, I'll admit I'm not the best person to fully explain "good" extracts. There are others here who are far better at it than I am, and there are multiple posts in the forum about it. Sorry I can't give you a better answer.

                  • 6. Re: What server hardware requirements will I need to run large extracts (60+ GB)?
                    Russell Christopher

                    Ron, a star schema is a dandy way to source an extract. Even if you were using a clunky OLTP design, the only "cost" you'd pay is that it would take longer to pull the data from your database into the extract.

                     

                    Normally, people like to keep their extracts fairly simple - in other words, a single fact table and however many related dimensions are associated with it. It's more difficult to create "one extract to rule them all" that contains many (unrelated or semi-related) fact tables and their dimensions.

                     

                    We can do a couple hundred million rows without even breaking a sweat.

                     

                    Hope this helps!

                    • 7. Re: What server hardware requirements will I need to run large extracts (60+ GB)?
                      Ron Chipman

                      Thanks for the response, Russell. I think, based on your answer, that I wasn't clear about our design.

                       

                      We are indeed only looking to include a single star schema in a single extract.  The one we will use in this example can be defined as:

                      * Fact Table with 55M rows with data from 01-01-2011 through present

                      * 6 Dimension tables with say 75% of the fields from each dimension table being used (we left out redundant keys, large text fields, and fields we felt were not going to be widely used)

                       

                      The resulting extract, before compression, was about 80GB. It will enable us to answer tons of questions around the single meter of performance we are trying to tackle. With our prior approach, we would have had 7-10 separate extracts built from OLTP-like tables, each tackling 'sections' of the newly formed dimensions. While those extracts ran great, the approach created a lot of confusion for end users about where to go to get the 'right' answer, and multiple places to go if you had multiple questions.

                       

                      So, based on your feedback above, I feel like the design is right, but something must not be, because there is no way we could have a couple hundred million rows in our fact table and ever have the extract complete.

                       

                      For example, let's go with 300M rows with a handful of dims, and say each row is 1,500 bytes wide in total (guessing here). That is 450 GB in total size (right?). If the transfer rate over the network to Tableau is 100 MB/s (is that all it can support?), the raw transfer alone works out to 4,500 seconds, or roughly 1.25 hours, before any extract processing on top of it. This is the kind of math our DBA/Admins are walking us through.
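                      Here is the rough math we are doing, as a quick sanity check (the 1,500 bytes per row and the 100 MB/s throughput are just guesses on our side, not measured numbers):

                      # Back-of-the-envelope transfer-time estimate; every input below is a guess.
                      rows = 300_000_000            # fact rows
                      bytes_per_row = 1500          # assumed uncompressed row width
                      throughput = 100 * 10**6      # assumed network throughput: 100 MB/s, in bytes/sec

                      total_bytes = rows * bytes_per_row    # 450 GB of raw data
                      seconds = total_bytes / throughput    # raw wire-transfer time only
                      print(f"{total_bytes / 1e9:.0f} GB, {seconds:,.0f} s = {seconds / 3600:.2f} hours")
                      # -> 450 GB, 4,500 s = 1.25 hours (before any extract processing on top)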

                       

                      How do you get it to transfer into Tableau Server?

                       

                      Much appreciated !!

                      • 8. Re: What server hardware requirements will I need to run large extracts (60+ GB)?
                        Russell Christopher

                        Hey Ron -

                         

                        Your mileage will always vary based on the number of dimensions you have in play, the cardinality of their values, etc.

                         

                        That being said, your numbers look pretty high to me. I have an extract on my Server with > 200M rows that comes in right at 5GB. I created a one-billion-row extract "for kicks" last quarter, and it's only 22GB. Are you bringing in hundreds of dimension fields or something? Normally, once we have the dimension values in (which we compress), there is only a nominal increase in extract size as you add more fact rows - you're only adding a measure value and some dictionary keys we use to look up dimension values internally, after all.
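                        To make that "nominal increase" concrete, here is a toy model of how extract size grows with row count (the byte counts below are invented purely for illustration; they are not real data engine numbers):

                        # Toy model of extract size growth; the byte counts are made up for
                        # illustration and are NOT real data engine numbers.
                        dictionary_bytes = 50 * 10**6     # every distinct dimension value, stored once
                        bytes_per_fact_row = 8 + 15 * 2   # one 8-byte measure + ~15 small dictionary keys (a guess)

                        for rows in (200_000_000, 1_000_000_000):
                            approx_bytes = dictionary_bytes + rows * bytes_per_fact_row
                            print(f"{rows:>13,} rows -> roughly {approx_bytes / 1e9:.0f} GB")

                        # Once the dictionaries exist, each extra fact row costs only a few bytes,
                        # which is why billion-row extracts can stay in the tens of GB.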

                         

                        We have tons of customers using extracts with < 500M rows in them, and a respectable number with larger ones (1-2B+).

                         

                        Keep in mind that whatever number you come up with in terms of "publish time", you'll only pay it once when you initially publish it - after that you'll be doing incremental extract refreshes which complete much, much faster.

                        • 9. Re: What server hardware requirements will I need to run large extracts (60+ GB)?
                          Ron Chipman

                          Thanks Russell.

                           

                          Your 5GB number has to be a compressed number. And yes, once we got the extracts up originally (when we were on Server v7), they were compressed to about that size.

                           

                          Initially, though, they start off quite large. We aren't bringing in hundreds of dimension fields, but it is close to 50-60. A lot of them are smallint fields (flags), but there are the occasional text fields like customer name or, worse, SFDC IDs.

                           

                          You mention compressing your dimension values; is there a step we are missing to do this explicitly? Doesn't the extract build a denormalized file to bring in, such that each additional fact row pays the price of all the dimension values that come with it? How are you able to bring in only the fact measures and keys?

                           

                          As far as publishing only once, we haven't found a way yet, using the extract dialog, to leverage incremental refreshes. Our fact tables have "hot month" updates: while it is a monthly snapshot fact, we are constantly updating the current month. Whichever options I use for the incremental refresh, I end up with either duplicate rows or missing rows.

                          • 10. Re: What server hardware requirements will I need to run large extracts (60+ GB)?
                            Russell Christopher

                            Hey Ron –

                             

                            Nope, I’m talking “before and after” on-disk size – not the file which is temporarily zipped so it is smaller for transfer between client and Server. The 5 & 22 GB files are the TDEs as they exist on my desktop machine and on the disk at Server after they have been delivered. Here are some examples:

                            [screenshot]

                            You can see two one-billion row extracts – 5.2GB and 22 GB.

                            …here’s the 22GB file stored in Server (FYI, 1 B rows, ~25 fields):

                            [screenshot]

                            And the report which tells us about this sucker:

                            [screenshot]

                            We compress all the dimension values with techniques like run-length encoding and dictionary lookups, so we're often storing the actual dimension value only once, in an "internal dictionary". Then we save a small, lightweight placeholder value in the original "big/long" dimension value's place (1 = cat, 2 = dog, 3 = congressman).
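                            Purely as an illustration of the dictionary idea (this is a toy sketch, not our actual storage format):

                            # Toy sketch of dictionary encoding -- not Tableau's actual on-disk format.
                            values = ["cat", "dog", "cat", "cat", "congressman", "dog", "cat"]

                            dictionary = {}   # full value -> small integer key, each value stored once
                            encoded = []      # one tiny key per row instead of the full string
                            for v in values:
                                key = dictionary.setdefault(v, len(dictionary) + 1)
                                encoded.append(key)

                            print(dictionary)  # {'cat': 1, 'dog': 2, 'congressman': 3}
                            print(encoded)     # [1, 2, 1, 1, 3, 2, 1]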

                             

                            Those (very unique) Salesforce IDs obviously won't compress very well, and the text fields may not either.

                             

                            You may find it makes more sense to split some of that stuff into another data source, especially the text, which I bet won't be used in analysis all the time (more of a "let's go look at the raw rows now and then" scenario). Maybe even keep it IN the database vs. putting it in a second extract. You could then do the lion's share of your analysis with the numeric data and let people view an already-filtered report to see the text, etc.

                            • 11. Re: What server hardware requirements will I need to run large extracts (60+ GB)?
                              Ron Chipman

                              Thanks Russell,

                               

                              How do I learn more about these run-length encoding and dictionary lookup capabilities? We are definitely bringing over some text fields, as the DBAs denormalized a lot of the lookups coming out of the OLTP environment, so instead of a lot of 1, 2, 3 values we have the actual text values. I imagine this adds up over the depth of the fact tables.

                              • 12. Re: What server hardware requirements will I need to run large extracts (60+ GB)?
                                Russell Christopher

                                Hi Ron -

                                 

                                These are internal details of how the data engine compresses data - nothing that you have control over, really.

                                 

                                The simplest way to figure out which fields are your "problem children" is to create a small-ish extract of < 10M rows. Hide the text fields and/or dimensions that contain very unique information. Extract, check the extract size, unhide one or two fields, refresh the extract, and repeat. You'll probably find several fields that are difficult to compress.

                                 

                                Another thing you can do is "watch" your file system while extracts are being built. You'll see temporary folders and files being created on a per-column basis on your machine. Monitoring the file size for each field may give you an idea of which fields are causing the problem. This technique is not supported / documented -- it's just a "trick" you can use if you know how (there's a rough script sketch at the end of this walkthrough)...

                                 

                                I start building an extract. A temp folder (tmp718...) and a temp file per field (tmp????.tmp) are created. Data being pulled from the database is dropped into these files...one file per field in the data source. You can't read / open these files:

                                 

                                ScreenHunter_05 Oct. 15 08.59.jpg

                                 

                                If you drill down into the temp718 folder, you'll see an area where "human readable" files (not the data, just the names of the fields) are present. In the screenshot below, they're all empty because we're still pulling data down from the server and putting it into the tmp??.tmp files above.

                                 

                                ScreenHunter_06 Oct. 15 08.59.jpg

                                 

                                Once all of the data is pulled down, the file sizes will change to show how much stuff they have in them. This happens when you see Tableau has started to "Sort" and "Compress" your data. In the shot below, I'm sorting descending by file size:

                                 

                                ScreenHunter_07 Oct. 15 09.06.jpg

                                You'll note that the largest files contain "names" - highly unique values like the name of an image (Create_File_Name) I might serve to someone in an online ad.
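                                If you would rather script the watching than eyeball Explorer, something rough like this works (the %TEMP% location and the tmp* naming are just what I happen to see on my machine; none of this is a supported interface):

                                # Rough, unsupported sketch: list the temp files created while an extract
                                # builds, largest first. The %TEMP% location and tmp* naming are just what
                                # I observe on my machine, not a documented interface.
                                import glob
                                import os

                                temp_dir = os.environ.get("TEMP", ".")
                                candidates = glob.glob(os.path.join(temp_dir, "tmp*.tmp"))
                                candidates += glob.glob(os.path.join(temp_dir, "tmp*", "**", "*"), recursive=True)

                                sizes = []
                                for path in candidates:
                                    try:
                                        if os.path.isfile(path):
                                            sizes.append((os.path.getsize(path), path))
                                    except OSError:
                                        pass  # temp files come and go while the extract is building

                                for size, path in sorted(sizes, reverse=True)[:20]:
                                    print(f"{size / 1e6:10.1f} MB  {path}")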

                                 

                                I can't tell you much more about how this works (I frankly don't know - I just watch what's going on, and inferring size from this "trick" has worked pretty well for me in the past).

                                 

                                Hope this helps!