5 Replies Latest reply on Sep 21, 2012 9:30 AM by Robert Morton

    Speed of "Importing data" (extract)

      Hi,

      I was wondering what determines the speed at which rows are imported into an extract?

       

      I have a Hadoop cluster which prepares 9 million rows in 1 minute to be extracted. But when Tableau starts to import the data, its speed is only about 100k rows per 2 minutes. What affects this speed? The network could be a bottleneck, but we have quite a decent connection with the Amazon cloud. What could be the bottleneck? Tableau's engine?

       

      I have tried making an extract from a CSV file, and the speed is about 50k rows/s, so the Tableau engine should not be the problem...

       

      What is the maximum speed anybody has reached extracting data from an RDBMS?

       

      V.

        • 1. Re: Speed of "Importing data" (extract)
          Robin Kennedy

          My 1.2m row, 20 col SQL Server data source extracted to Tableau in about 20 seconds - so that's about 60k rows/s. Both Tableau and SQL Server are running on the same machine.

          • 2. Re: Speed of "Importing data" (extract)

            Thanks, Robin, for the comment. My machines are separate, but network bandwidth is at least 1 Mb/s between the machines. I still can't think what the bottleneck could be.

            • 3. Re: Speed of "Importing data" (extract)
              Robert Morton

              Hi Vaidas,

               

              A common cause of poor performance with Hive is when a user attempts to extract a data source with a large number of String fields, or when the String fields contain a very large amount of data.

               

              If you have any fields in your data source which you find are not analytically useful, you can Hide them and they will no longer be pulled as part of the extract you create. This can substantially improve the performance for the import speed as well as the speed of later extract operations like sorting.

               

              I hope this helps,

              Robert

              • 4. Re: Speed of "Importing data" (extract)

                Thanks, Robert, for the suggestion. It's helpful. Do you know the technical background behind this? Of course I will test performance based on your idea, but how much do you think strings can affect extract speed? 20%? 100%? 5x?

                Right now the table contains ~30 columns, and most of them are strings. The speed is less than 1000 rows per second.

                I don't think that changing all strings to integers will give me a 10x increase.

                 

                Also, I have noticed that I have quite high network latency to the server - around 100 ms. Do you think this may be the reason?

                 

                If anyone has experience, maybe they can share what extract speeds they were able to get from the cloud?

                 

                V.

                • 5. Re: Speed of "Importing data" (extract)
                  Robert Morton

                  Hi Vaidas,

                   

                  As the size of each row of data increases, Tableau adjusts its buffers so that the total memory used for result set transfer is constrained. Strings can easily dominate the buffers, causing Tableau to have to fetch fewer rows in each batch. With a high-latency connection this can add up quickly, and in the worst case Tableau will only fetch a single row at a time.
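                  The effect Robert describes can be sketched with a back-of-envelope model. This is only an illustration of the batch-size/latency interaction, not Tableau's actual buffer arithmetic, and all the numbers (buffer size, row width, round-trip time) are assumptions:

```python
# Illustrative model: if the total fetch buffer is capped, fewer wide rows
# fit per batch, and each batch costs roughly one network round trip.
# All constants below are hypothetical, not Tableau's real values.

def rows_per_second(buffer_bytes, row_bytes, round_trip_s):
    """Approximate rows fetched per second when each batch costs one round trip."""
    rows_per_batch = max(1, buffer_bytes // row_bytes)  # worst case: 1 row/trip
    return rows_per_batch / round_trip_s

# Narrow rows: many rows fit in one batch, so latency is amortized.
narrow = rows_per_second(buffer_bytes=1_000_000, row_bytes=100, round_trip_s=0.1)

# Wide, string-heavy rows: only a handful fit, so latency dominates.
wide = rows_per_second(buffer_bytes=1_000_000, row_bytes=50_000, round_trip_s=0.1)

print(f"narrow rows: {narrow:,.0f} rows/s")  # 100,000 rows/s
print(f"wide rows:   {wide:,.0f} rows/s")    # 200 rows/s
```

                  With the hypothetical numbers above, the same 100 ms round trip yields 100k rows/s for 100-byte rows but only 200 rows/s for 50 KB rows, which is in the ballpark of the sub-1000 rows/s figure reported earlier in the thread.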

                   

                  You can try using an unsupported but useful tool we have for diagnosing and tuning such problems, and possibly working around them. The explanation is below. This technique is more broadly used for general-purpose ODBC connections in Tableau, covered in the KB article Customizing and Tuning ODBC Connections, but only a subset of that article is applicable to data sources like our Hive connector. In particular, read the section on "Making Customizations Global" which will explain the purpose of the file I have attached.

                   

                  Download the attached file, unzip it, and copy the '.tdc' file into the "My Tableau Repository\Datasources" folder inside your Documents folder. This file provides an override for Tableau's behavior with query result set buffers. Note that the file is intended for use with the Cloudera Hadoop Hive connector, and you'll need a different file if you intend to use the MapR Hadoop Hive connector. Once you've copied the file, quit and relaunch Tableau so it can recognize it. Then try creating an extract and see if performance has improved. You will likely also notice that Tableau uses more memory during this process.
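                  For readers who no longer have access to the attachment: a .tdc file is a small XML document following the structure described in the KB article. The skeleton below shows that structure only; the class, vendor, driver, and customization entries are illustrative placeholders, since the actual contents of Robert's attachment are not available here:

```xml
<!-- Illustrative .tdc skeleton only; the attribute values and the
     customization name/value pairs in the real attachment may differ. -->
<connection-customization class='hive' enabled='true' version='7.0'>
  <vendor name='Cloudera' />
  <driver name='Hive' />
  <customizations>
    <!-- Placeholder entry; the real file overrides Tableau's ODBC
         result-set buffer behavior as described in the KB article. -->
    <customization name='CAP_ODBC_FETCH_BUFFERS_RESIZABLE' value='yes' />
  </customizations>
</connection-customization>
```

                  As the "Making Customizations Global" section of the KB article explains, placing such a file in "My Tableau Repository\Datasources" applies the customizations to every connection that matches the class, vendor, and driver.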

                   

                  I hope this helps!

                  Robert
