Your hardware is fine; it's the server tuning you need to work on now.
Here's part of my batch file I run after installing Tableau Server:
REM Set PROD Tableau Server v8 settings for a high-extract environment.
ECHO Stopping the server...
tabadmin stop
REM Turn on echoing to show what's being changed.
ECHO ON
REM 28800 seconds = 8 hours
tabadmin set backgrounder.querylimit 28800
REM 900 seconds = 15 minutes
tabadmin set vizqlserver.querylimit 900
REM 20 minutes
tabadmin set vizqlserver.session.expiry.timeout 20
REM 120 minutes = 2 hours
tabadmin set wgserver.session.idle_limit 120
REM Set VizQL Server processes (workerX is a placeholder for your worker name).
tabadmin set workerX.vizqlserver.procs 3
REM Set Backgrounder processes.
REM Currently must be manually set to two.
ECHO Configuring the server, please wait...
tabadmin configure
ECHO Starting the server...
tabadmin start
Now go to the TabWiki to see what these commands mean and adjust for your environment. Mine were just guesses with a little geek's intuition based on how the server is being used so I can't really give any solid numbers for you. Changes are easy so if something doesn't work then just change the value and re-run the batch file.
It goes w/o saying (but we'll say it anyway) that well-formed extracts are key to quicker data retrieval, so take a critical eye to building and filtering them.
Thank you, Toby.
I am not familiar with this code, but forwarded this to our DBA in charge of the configuration.
Is this just changing the time out limit on the server or does this actually help improve the performance of the extracts?
Thanks again for your help!
Thanks a million for your input. Love having great comrades within the community to help us newbies out.
To your comment about well-formed extracts: what is the 'definition' of this? I would hope, given all the advances in technology, that we should be able to leverage our well-formed star schemas 'as-is' so that we can use a single extract to support many custom analytics -- this reduces maintenance and enhancement efforts -- the largest cost to an organization as you build out new capabilities.
We have a simple fact table with 8 dimensions. If we limit the time parameters to a rolling three years (2011 - 2013, for now), then we have 60-100M rows of data. We have attempted to cut out unique text fields and redundant IDs, but we want to keep as many attributes as possible to enable exploratory analysis. Rich Lucas mentioned 60GB+; some of these may be closer to 200GB, but I've got to imagine that still isn't 'large'.
Am I expecting too much from Tableau?
Rich, the code is for a MS-DOS batch file (.bat file extension). This is something your Tableau Administrator would execute, not a DBA (DataBase Administrator).
Yes, the values will affect how long TS (Tableau Server) waits for certain things, again, reference the TabWiki link.
Check with your database admin(s) to see if they have implemented time limits; they may need to adjust them for your business needs if they've been set up.
Ron, I'll admit I'm not a good person to fully explain "good" extracts. There are others here who are far better at it than I am, and there are multiple posts in the forum about this. Sorry I can't give you a better answer.
Ron, a star schema is dandy in terms of a way to source an extract. Even if you were using a clunky OLTP design, the only "cost" that you'd pay is that it takes longer to source the data from your database to drop into the extract.
Normally, people like to keep their extracts fairly simple - in other words, a single fact table and however many related dimensions are associated with said fact table. It'll be more difficult to create "one extract to rule them all" which contains many (unrelated or semi-related) fact tables and their dimensions.
We can do a couple hundred million rows without even breaking a sweat.
Hope this helps!
Thanks for the response Russell. I think, based on your answer, that I wasn't clear in our design.
We are indeed only looking to include a single star schema in a single extract. The one we will use in this example can be defined as:
* Fact Table with 55M rows with data from 01-01-2011 through present
* 6 Dimension tables with say 75% of the fields from each dimension table being used (we left out redundant keys, large text fields, and fields we felt were not going to be widely used)
The resulting extract, before compression, was about 80GB. It will enable us to answer tons of questions around this single measure of performance we are trying to tackle. With our prior approach, by contrast, we would have had 7-10 separate extracts from OLTP-like tables, each tackling 'sections' of the newly formed dimensions. While those extracts ran great, they created a lot of confusion for end users as to where to go to get the 'right' answer, and meant multiple places to go if you had multiple questions.
So, I feel like the design is right based on your feedback above, but I have to assume it isn't, because there is no way we could have a couple hundred million rows in our fact table and have the extract ever complete.
For example, let's go with 300M rows with a handful of dims and say that each row is 1,500 bytes wide in total (guessing here). That is 450GB in total size (right?). If the transfer rate is 100MB/s over the network to Tableau (that's all it can support?), then it will take 4,500 seconds, or about 1.25 hours. Our DBA/Admins, though, are quoting us 125 hours (450,000 seconds), which would imply a rate closer to 1MB/s.
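A quick sanity check on the back-of-the-envelope numbers above (the row count, row width, and 100MB/s rate are all guesses from the discussion, not measured values):

```python
# Back-of-the-envelope transfer-time estimate. All inputs are guesses
# taken from the thread above, not measured values.
rows = 300_000_000        # fact rows
row_bytes = 1500          # assumed uncompressed width per row
rate_mb_per_s = 100       # assumed sustained network throughput, MB/s

total_gb = rows * row_bytes / 1e9                   # total uncompressed size
seconds = rows * row_bytes / (rate_mb_per_s * 1e6)  # transfer time at that rate
hours = seconds / 3600

print(f"{total_gb:.0f} GB, {seconds:.0f} s, {hours:.2f} h")
# -> 450 GB, 4500 s, 1.25 h
```

At a genuine 100MB/s, 450GB moves in about 1.25 hours; a 125-hour estimate corresponds to a sustained rate of roughly 1MB/s.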
How do you get it to transfer into Tableau Server?
Much appreciated !!
Hey Ron -
Your mileage will always vary based on the number of dimensions you have in play, the cardinality of their values, etc...
That being said, your numbers look pretty high to me. I have an extract on my Server with > 200M rows that comes in right at 5GB. I created a one-billion-row extract "for kicks" last quarter, and it's only 22GB. Are you bringing in hundreds of dimension fields or something? Normally, once we have the dimension values in (which we compress), there is only a nominal increase in extract size as you add more fact rows - you're only adding a measure value and some dictionary keys we use to look up dimension values internally, after all.
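To put that "nominal increase per fact row" claim in perspective, here's a rough model. The byte widths are hypothetical, chosen only for illustration; Tableau's internal storage layout is not public:

```python
# Rough per-row cost model for a columnar, dictionary-encoded extract.
# All widths are illustrative guesses, NOT Tableau's actual format.
measure_bytes = 8          # one 8-byte measure per fact row
dims = 6                   # dimension columns
key_bytes = 4              # assumed dictionary key width per dimension

rows = 1_000_000_000

per_row = measure_bytes + dims * key_bytes   # bytes added per fact row
total_gb = rows * per_row / 1e9              # before run-length compression

print(per_row, "bytes/row;", total_gb, "GB for 1B rows")
# -> 32 bytes/row; 32.0 GB for 1B rows
```

Under these made-up widths, a billion extra fact rows cost tens of GB, the same ballpark as the 22GB billion-row extract mentioned above, and further compression of the key columns shrinks it more.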
We have tons of customers using extracts with < 500M rows in them, and a respectable number with larger ones (1-2B+).
Keep in mind that whatever number you come up with in terms of "publish time", you'll only pay it once when you initially publish it - after that you'll be doing incremental extract refreshes which complete much, much faster.
Your 5GB number has to be a compressed number. Which, yes: once we got the extracts up originally (when we were on Server v7), they were compressed to about this size.
Initially, though, they start off quite large. We aren't bringing in 100s of dimensional fields, but close to 50-60?! A lot of them are small INT fields (flags), but there are the occasional text fields like customer name or, worse, SFDC IDs.
You mention compressing your dimension values; is there a step here that we are missing to do this explicitly? Doesn't the extract build a denormalized file to bring in, such that you pay the price, for each additional fact row, of all the dimensional values that are brought in? How are you able to only bring in the fact measures and keys?
As far as publishing only once, we haven't found a way yet using the extract dialog modal to leverage the incremental refreshes. Our fact tables have "hot month" updates, and so while it is a monthly snapshot fact, we are constantly updating the current month. If I use a variety of options to do the incremental, I have found that either I get duplicate rows or missing rows.
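The duplicate/missing-row behavior described above can be sketched concretely. An append-only incremental refresh keyed on a date column cannot handle rows that are updated in place; the table values and account names below are made up for illustration:

```python
# Why an append-only incremental refresh struggles with a "hot month":
# rows already loaded for the current month get restated in the source,
# but the refresh can only append rows relative to a high-water mark.
# All names and values here are hypothetical.
extract = [("2013-06", "acct1", 100), ("2013-06", "acct2", 50)]
source  = [("2013-06", "acct1", 120),   # acct1 was restated upward
           ("2013-06", "acct2", 50),    # unchanged
           ("2013-07", "acct3", 75)]    # genuinely new row

high_water = max(month for month, _, _ in extract)   # "2013-06"

# Strategy A: append rows strictly AFTER the high-water mark.
# The restated acct1 row (still "2013-06") is silently missed.
appended_a = [r for r in source if r[0] > high_water]

# Strategy B: append rows >= the high-water mark to catch restatements.
# Now acct2 comes in again unchanged, and acct1 appears twice with
# conflicting values -- duplicates either way.
appended_b = [r for r in source if r[0] >= high_water]

print(appended_a)                  # misses acct1's update
print(len(extract + appended_b))   # 5 rows, with duplicates
```

The clean fix is "delete the hot month, then append it fresh," which an append-only incremental refresh cannot express; that matches the duplicate-or-missing symptoms described above.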
Hey Ron –
Nope, I’m talking “before and after” on-disk size – not the file which is temporarily zipped so it is smaller for transfer between client and Server. The 5 & 22 GB files are the TDEs as they exist on my desktop machine and on the disk at Server after they have been delivered. Here are some examples:
You can see two one-billion row extracts – 5.2GB and 22 GB.
…here’s the 22GB file stored in Server (FYI, 1 B rows, ~25 fields):
And the report which tells us about this sucker:
We compress all the dimension values with techniques like run-length encoding and dictionary lookups so that we’re often only storing the actual dimension value once – in an “internal dictionary”…Then, we’re saving a small, lightweight placeholder value in the original “big/long” dimension value’s place: (1 = cat, 2 = dog, 3 = congressman)
Those (very unique) SalesForce IDs obviously won't compress very well, and the text fields may not either.
You may find it makes more sense to actually split some of that stuff (especially the text, which I bet won't be used in analysis all the time; more of a "let's go look at the raw rows now and then" scenario) into another data source. Maybe even keep it IN the database vs. putting it in a second extract. You could then do the lion's share of your analysis with the numeric data and allow people to view an already-filtered report to see the text, etc.
How do I learn more about these run-length encoding and dictionary lookup capabilities? We are definitely bringing over some text fields as the DBAs denormalized a lot of the lookups coming out of the OLTP environment, so instead of a lot of 1,2,3 values we have the actual text values. I imagine this is adding up over the depth of the fact tables.
Hi Ron -
These are internal details of how the data engine compresses data - nothing that you have control over, really.
The simplest way to figure out which fields are your "problem children" is to create a small-ish extract of < 10M rows. Hide the text fields and/or dimensions which will contain very unique information. Extract, check the extract size, "unhide" one or two fields, refresh the extract, and repeat. You'll probably find several fields that are difficult to compress.
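Before running the hide/unhide experiment, you can often predict the problem children straight from cardinality: under dictionary encoding, a column costs roughly (distinct values x average value width) for the dictionary plus a small key per row. A rough ranking sketch, with made-up column names, data, and key width:

```python
# Rank columns by a crude estimate of dictionary-encoded cost.
# The 4-byte key width and the sample data are hypothetical.
def estimated_cost(values, key_bytes=4):
    distinct = set(values)
    dict_bytes = sum(len(str(v)) for v in distinct)  # dictionary storage
    return dict_bytes + key_bytes * len(values)      # plus one key per row

columns = {
    "region_flag": ["N", "S", "N", "S", "N", "S"],       # low cardinality
    "sfdc_id": ["001A%06d" % i for i in range(6)],       # every value unique
}
ranked = sorted(columns, key=lambda c: estimated_cost(columns[c]),
                reverse=True)
print(ranked)   # unique-ID column ranks as the most expensive
```

High-cardinality text like SFDC IDs dominates the estimate, which is exactly what the hide/unhide experiment tends to confirm.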
Another thing you can do is "watch" your file system while extracts are being built. You'll see temporary folders and files being created on a per-column basis on your machine. Monitoring the file size for each field may give you an idea of the fields causing the problem. This technique is not supported / documented -- it's just a "trick" you can use if you know how to...
I start building an extract. A temp folder (tmp718...) and a temp file per field (tmp????.tmp) are created. Data being pulled from the database is dropped into these files, one file per field in the data source. You can't read / open these files:
If you drill down into the tmp718 folder, you'll see an area where "human readable" names are present (not the data, just the names of the fields). In the screenshot below, they're all empty because we're still pulling data down from the server and putting it in the tmp??.tmp files above.
Once all of the data is pulled down, the file sizes will change to show how much stuff they have in them. This happens when you see Tableau has started to "Sort" and "Compress" your data. In the shot below, I'm sorting descending by file size:
You'll note that the largest files contain "names" - highly unique values like the name of an image (Create_File_Name) I might serve to someone in an online ad.
I can't tell you much more about how this works (I frankly don't know - I just watch what's going on), but inferring size from this "trick" has worked pretty well for me in the past.
Hope this helps!