Hey Jonathan -
When thinking about extracts, the "width" of the final extract (all columns in the extract) is often as important as, and sometimes more important than, the height/number of rows. So keep that in mind for the rest of the "commentary".
I just pulled ~5M rows from SQL (via a 10-table join) and it took 5 min, 10 sec - If the output isn't too wide and your database is responsive, ~1M rows / minute is fairly standard.
If you have a particularly "fast" data source, you can do better. For example, when extracting rows from a huge local extract with relatively few columns, I was doing about 80M rows an hour, or ~1.33M / minute. Yay!
Flip this equation - I had a client pulling ~190M rows from a SQL Server via a fairly ugly query. It also returned 100+ columns, some of which contained highly unique values (which makes for slower sorting). It took these folks about 8 hours for SQL to return results, and then an additional 10+ hours to process/sort because of the super-high cardinality in some of their fields - about 174K rows / min overall. Boo!
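The throughput figures above pencil out like this (a quick back-of-the-envelope sketch in Python; all numbers are taken from the examples just described):

```python
def rows_per_minute(rows, minutes):
    """Sustained extract throughput."""
    return rows / minutes

# ~5M rows in 5 min 10 sec (the 10-table SQL join example)
sql_join = rows_per_minute(5_000_000, 5 + 10 / 60)   # ~967,742 rows/min

# ~80M rows/hour from a fast, narrow local extract
fast_local = rows_per_minute(80_000_000, 60)          # ~1,333,333 rows/min

# ~190M rows in roughly 18 hours (8 h of SQL + 10+ h of sorting)
slow_wide = rows_per_minute(190_000_000, 18 * 60)     # ~175,926 rows/min
```

So the "fast" and "slow" scenarios differ by nearly an order of magnitude, driven mostly by extract width and column cardinality rather than row count.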
So...I'd say that 11M rows in 2-3 hours is too slow. The things you should look at:
- How long does it take after you start the extract process for SQL Server to start returning records (based on the Tableau dialog that says "X rows retrieved so far")? This is pure "SQL think time"
- How long does it take to return those 11M rows? This is a combination of SQL Processing time, latency in terms of returning 11M rows across your network, and our engine dropping temporary "batches" of rows to disk locally.
- How long does Tableau take in the "Optimizing" and "Sorting" phases - this is pure Tableau time
If you see the lion's share of your time being sucked up in the first two bullets, you should see if you can optimize SQL - look at the tables/query you're using, add appropriate indexes, etc.
If Tableau is taking a long time to optimize/sort, you may have lots of columns and/or columns that contain highly unique values, which take longer to sort. Consider removing those fields from your data source before you create the extract.
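As an illustration of the cardinality point, here's a hypothetical Python sketch for spotting the columns most likely to slow the sort phase (the column and row data are made up for illustration):

```python
def cardinality_ratio(values):
    """Fraction of distinct values in a column (1.0 = every row unique)."""
    return len(set(values)) / len(values)

# Made-up sample: a low-cardinality dimension and a per-row unique ID
rows = {
    "region":   ["East", "West", "East", "West", "North", "East"],
    "order_id": ["A1", "A2", "A3", "A4", "A5", "A6"],
}

# Columns that are mostly unique are candidates to drop before extracting,
# if no view actually needs them
high_cardinality = [
    col for col, vals in rows.items() if cardinality_ratio(vals) > 0.9
]
```

Here `order_id` would be flagged while `region` would not; run the same check against a sample of your real data before deciding what to remove.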
Hope this helps!
Thanks for the reply. Definitely the bulk of the time is spent in the database, based on your description. So, I will look into optimizing the database.
I modified the data sources to remove unused columns as you suggested and it helped a bit (it definitely made the file size smaller for some of the extracts). I'm still working on optimizing the SQL side. Thanks again.
Since you were able to explain the extract process so well, I thought I'd ask a related question. When publishing this workbook the "saving packaged workbook" step takes almost an hour. Is there something I could do to optimize this step, or is this what I should expect?
Interesting! I'm frankly not sure.
Let's step back from Server for a moment - If you save your workbook as a packaged workbook to the local machine, how long does it take? When we create the packaged workbook, we actually "zip" the extract, which can take a little while depending on the size..but I've never seen it take an hour.
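For context, the "zip" step can be sketched roughly like this in Python (the file names and archive layout are illustrative assumptions, not Tableau's exact internals):

```python
import os
import zipfile

def package_workbook(twbx_path, workbook_file, extract_paths):
    """Bundle a workbook file plus its extracts into one compressed archive,
    roughly what 'saving a packaged workbook' does under the hood."""
    with zipfile.ZipFile(twbx_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(workbook_file, arcname=os.path.basename(workbook_file))
        for path in extract_paths:
            # DEFLATE-compressing multi-GB extract files is the slow part:
            # it is CPU- and disk-bound on the local machine
            zf.write(path, arcname="Data/" + os.path.basename(path))
```

The point is that packaging time is dominated by compressing the extract files locally, so CPU and disk speed matter far more than anything Server-side.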
I just saved the workbook as a packaged workbook (without publishing it) and it took 61 minutes.
I have 9 extracted data sources of varying sizes, now that I have removed unneeded columns.
[Attachment: sizes of the extracts]
So I'm assuming it has to package these all up. The resulting twbx is ~ 354MB
I'm hoping to make the 520 MB extract smaller, but the 420 and 430 MB extracts are as small as they're going to get for now.
Hey Jonathan -
I've just called the Guinness Book of World Records - "Longest Packaged Workbook Creation Time, Ever" - Your plaque is on the way
~2.5 GB of "raw extract" isn't small, but it's not super-duper large either. I just took a 9 GB extract (57 columns, 135M rows) I had on my box and saved it as a packaged workbook - it took just over 4 minutes and left me with a ~4 GB workbook.
So, I've got to ask - what sort of machine are you doing this work on? Is it relatively up-to-date in terms of hardware? I'm beginning to wonder if part of the reason your extract process AND saving your packaged workbook are slow is that your CPU and/or disk and/or RAM is a bit "long in the tooth".
It's definitely no powerhouse PC.
Intel i3-2120 @ 3.3 GHz
32-bit windows 7 professional
Unfortunately, it's a work PC, and not likely to get upgraded (since I just got upgraded recently).
Do you think this is what's causing the problems?
Hello Russell Christopher,
I have been researching and found your conversation with Jonathan. This is a really useful thread, and I was surprised by how quickly you were able to complete your extracts.
May I ask what platform or PC specification you use to develop Tableau reports?
I am facing the same case Jonathan did, and I found that my slowest point is the second step (returning the rows).
Looking forward to your response,
4-core i7 processor with 32 GB RAM
What is your storage? Is it an SSD? And do you regularly maintain it, e.g. defragment it?
I am facing some issues extracting 16M rows worth of data. The database is sitting on AWS, so connectivity is one factor, but the complete process takes more than 12 hours to process 2M records and then throws one error or another. I have not been able to get a complete extract to date (I have been trying since last week).
I have 20-odd joined tables with 150+ fields. However, there are some ID fields which I think may be slowing down the process, since they are unique. Can I go ahead and hide those as well? Is hiding the same as removing the fields? Since these ID/index fields are the basis of the joins, I am a little skeptical about hiding them.
Any advice is welcome. Thank you.
I wouldn't worry about the Tableau part of this equation yet.
Do you know:
- How long it takes to execute your query with 20 joins against the db on AWS?
- How long it takes to return ALL that data over the wire from AWS?
I'd set up some sort of query tool on your local machine and actually try it out. Write a query that returns only the fields you want from the 20 joined tables and see how long it takes to execute and return to your query tool. This represents time that Tableau "has nothing to do with".
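A minimal sketch of that measurement in Python, using sqlite3 as a stand-in for the AWS database (swap in your actual driver, connection string, and the real 20-table query):

```python
import sqlite3
import time

# Build a throwaway in-memory table so the sketch is self-contained;
# in practice you'd connect to your real database instead
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, val TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(i, f"row{i}") for i in range(100_000)])

start = time.perf_counter()
cur = conn.execute("SELECT * FROM t")   # your 20-table join goes here
first_row = cur.fetchone()              # time to first row ~ "SQL think time"
t_first = time.perf_counter() - start

remaining = cur.fetchall()              # time to pull everything "over the wire"
t_all = time.perf_counter() - start

print(f"first row after {t_first:.3f}s, all rows after {t_all:.3f}s")
```

If `t_first` dominates, the query itself is slow (indexes, joins); if the gap between `t_first` and `t_all` dominates, transfer from AWS is the bottleneck.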
After you understand this variable, add Tableau into the picture.
You were right. It's taking a lot of time to execute the query itself. This is the best dataset I have and I can't reduce the fields any further. What options do I have to create a full extract?
How can I measure the total Tableau extract time for each dashboard?