7 Replies Latest reply on Aug 8, 2017 1:20 AM by Kelly Stirman

    Tableau-Elasticsearch WDC Performance Issues

    k v

      Hey guys,

       

      I'm fairly new to Tableau (most of our analysts use it, but as a developer it's new territory for me), and I've run into some issues that I hope someone can help with.  We currently have massive datasets stored as files across various layers of our application, and I'm trying to connect that data to Tableau for visualization.  I've also written a custom Elasticsearch-Tableau WDC, which works well for datasets under 10K records, but anything larger takes an extremely long time to process.  Is there a size limit for data extracts built through the WDC?

       

      1.) What is the maximum number of records that a WDC supports?  For example, say I have a datastore (ES) with an index/type holding about 10 million records of 20 fields each.  Is that well beyond what the Tableau WDC can handle when creating a TDE?

       

      2.) Using Elasticsearch and the Scroll API, I get fairly responsive REST calls for batched requests of around 5,000-10,000 records per call.  When I debug my custom Tableau WDC in the Simulator (which is very helpful, thanks btw), the REST calls return fairly quickly, but after I call tableau.dataCallback it takes about 20-30 seconds before the next REST call is made (and this gets progressively longer as more queries are sent).  Is this the expected behavior?

       

      Does anyone have tips on the best approach to visualizing such a large dataset (preferably from Elasticsearch)?  I'm willing to make concessions and use other data stores, but I'd like to keep that as a last resort.

       

      ---Additionally, any chance we'll see a native Tableau connector for Elasticsearch anytime soon?  It would be nice to have a live connection to Elasticsearch, with Tableau executing ES queries to return data.  (I couldn't find any sources saying whether this is planned or in the works.)

       

      Thanks!

       

      -Using Tableau 9.3 (wdc 1.1.1)

      -Using Elasticsearch 2.2.x+

      -Dataset consists of millions of records (< 10 million), 20 columns across.

        • 1. Re: Tableau-Elasticsearch WDC Performance Issues
          Jeff D

          The web connector isn't really meant for large datasets.  Web connectors provide great flexibility, not necessarily great scalability.

           

          The data retrieved from a web connector is used to build an extract.  Extracts can be quite large, so that won't be a limitation for you.

           

          However, 20-30 seconds between callbacks doesn't seem right.

           

          Here's an experiment to try: write a JavaScript function that invokes your callback in a loop until it completes.  How long would it take to retrieve the 10 million rows?  Taking Tableau out of the picture, this will give you a sense of the lower bound on performance.
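A minimal sketch of that experiment, assuming a hypothetical `fetchBatch(token, callback)` function that wraps whatever REST call the connector makes. `drainAll` and `makeFakeFetch` are illustrative names, not part of the WDC API. The loop pulls batches until the source reports no more data, then reports the row count and elapsed time, with Tableau out of the picture.

```javascript
// Repeatedly calls fetchBatch until it reports no more data, then reports totals.
// fetchBatch(token, cb) is a hypothetical wrapper around the connector's REST call;
// it must invoke cb(rows, nextToken, hasMore).
function drainAll(fetchBatch, onDone) {
  var started = Date.now();
  var totalRows = 0;
  var requests = 0;

  function next(token) {
    fetchBatch(token, function (rows, nextToken, hasMore) {
      requests += 1;
      totalRows += rows.length;
      if (hasMore) {
        next(nextToken);
      } else {
        onDone({ rows: totalRows, requests: requests, elapsedMs: Date.now() - started });
      }
    });
  }

  next(null); // null means "start from the beginning"
}

// Stand-in data source so the sketch runs without an Elasticsearch endpoint:
// serves `total` fake rows in batches of `batchSize`.
function makeFakeFetch(total, batchSize) {
  return function (token, cb) {
    var start = token === null ? 0 : token;
    var end = Math.min(start + batchSize, total);
    var rows = [];
    for (var i = start; i < end; i++) {
      rows.push({ id: i });
    }
    cb(rows, end, end < total);
  };
}

var summary = null;
drainAll(makeFakeFetch(12500, 5000), function (result) {
  summary = result;
  console.log(result.rows + ' rows in ' + result.requests + ' requests, ' + result.elapsedMs + ' ms');
});
```

Swapping `makeFakeFetch` for a function that issues the real REST call gives the baseline time; only the row count is kept here, so memory use stays flat however many rows come back.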

           

          The Ideas forum is a good place to make requests.  Here's a page for an Elasticsearch connector: https://community.tableau.com/ideas/3776

          • 2. Re: Tableau-Elasticsearch WDC Performance Issues
            k v

            Thanks for the prompt reply, Jeff.  Making REST calls straight to ES to find the baseline time for retrieving all records was the first thing I did.  Here are the start and end times (along with some other metrics I used for debugging).

             

            Starting Call to ES At: 1:57:46AM

            Total Records to be queried: 1314606

            Start Index: 0

            End Index: 1314606

            Batch Size Per Request: 5000

             

            Finished Processing All Records: 1:58:13AM

            Total Records Processed: 1314606

            Total Requests Made: 263

             

            I just did this quickly with a 1.3 million row dataset: I aggregate the "data.hits.hits" field of each ES response into an array and then print the array length once all the data has been retrieved.

             

            As you can see, the request completes in a fairly decent amount of time (albeit it's not doing any processing of the data, just getting the results and continuing on to the next call).  I'm using the scroll API from ES to get these results and it's been fairly consistent.  I know this isn't a 10 million row dataset (I don't have one locally in my dev environment), but it shows that the calls to ES are not the bottleneck and that something is happening between getting the data back from ES and calling the tableau.dataCallback() method.

             

            I will post my connector code here in a second (along with the simulator).  I must be doing something incorrectly if you're telling me that the tableau.dataCallback method shouldn't be a bottleneck.

             

            Oh, and I should mention that I'm testing this using the Simulator (with the SDK), so it's trying to print all the results as HTML to the screen, which might be the other issue.  I don't have a Tableau instance on my local machine to test the WDC locally (my trial license expired a couple of days ago).

            • 3. Re: Tableau-Elasticsearch WDC Performance Issues
              k v

              Just an update.  I'm still running the same dataset above through my Elasticsearch WDC in the Tableau Simulator (it's been running for about 20 minutes).

               

              But here's a much smaller dataset with some debug information to show you how long it's been taking with my Elasticsearch WDC in the Tableau Simulator.

               

              Below is a 30k dataset:

               

              Total Records to be queried: 30293

              Start Index: 0

              0

              Batch Size Per Request: 5000

              Starting Call to ES At: 4:13:48 am

               

              elasticsearchRestUtils.js:15 Making Rest Call to: http://elastic-01.dev.sandbox.com:9200/test6/typed/_search?scroll=1m to get the ElasticSearch Data Scrolling: true

              Counter: 5000 StartIndex: 0 EndIndex: 30293 HasMoreData?: true

              elasticsearchRestUtils.js:21 Making Rest Call to: http://elastic-01.dev.sandbox.com:9200/_search/scroll to get the Scrolled ElasticSearch Data

              Counter: 10000 StartIndex: 0 EndIndex: 30293 HasMoreData?: true

              elasticsearchRestUtils.js:21 Making Rest Call to: http://elastic-01.dev.sandbox.com:9200/_search/scroll to get the Scrolled ElasticSearch Data

              Counter: 15000 StartIndex: 0 EndIndex: 30293 HasMoreData?: true

              elasticsearchRestUtils.js:21 Making Rest Call to: http://elastic-01.dev.sandbox.com:9200/_search/scroll to get the Scrolled ElasticSearch Data

              Counter: 20000 StartIndex: 0 EndIndex: 30293 HasMoreData?: true

              elasticsearchRestUtils.js:21 Making Rest Call to: http://elastic-01.dev.sandbox.com:9200/_search/scroll to get the Scrolled ElasticSearch Data

              Counter: 25000 StartIndex: 0 EndIndex: 30293 HasMoreData?: true

              elasticsearchRestUtils.js:21 Making Rest Call to: http://elastic-01.dev.sandbox.com:9200/_search/scroll to get the Scrolled ElasticSearch Data

              Counter: 30000 StartIndex: 0 EndIndex: 30293 HasMoreData?: true

              elasticsearchRestUtils.js:21 Making Rest Call to: http://elastic-01.dev.sandbox.com:9200/_search/scroll to get the Scrolled ElasticSearch Data

               

              Finished Processing All Records: 4:15:09 am

              Total Records Processed: 30293

              Counter: 30293 StartIndex: 0 EndIndex: 30293 HasMoreData?: false

               

              wdc-simulator.js:299 No More Data

              Calling Tableau to Shutdown......

               

               

              Now the above is reasonable, but once you get past the 30k threshold, it takes far longer.  It gets to around 45k records and crashes my browser (after about 20 mins).
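For comparison, the timestamps in the two logs can be turned into throughput figures. The sketch below uses only the numbers printed above (1,314,606 records between 1:57:46 and 1:58:13 for the direct REST calls, and 30,293 records between 4:13:48 and 4:15:09 in the Simulator). The helper names are made up for illustration. By this arithmetic the direct calls run at about 48,700 rows per second and the Simulator run at about 374, a gap of roughly 130 times.

```javascript
// Seconds since midnight for an "h:mm:ss" clock reading. AM/PM is ignored
// because both readings in each run fall within the same hour.
function toSeconds(clock) {
  var parts = clock.split(':');
  return Number(parts[0]) * 3600 + Number(parts[1]) * 60 + Number(parts[2]);
}

// Average rows per second between two clock readings.
function rowsPerSecond(rows, start, end) {
  return rows / (toSeconds(end) - toSeconds(start));
}

var direct = rowsPerSecond(1314606, '1:57:46', '1:58:13');    // direct REST calls to ES
var simulator = rowsPerSecond(30293, '4:13:48', '4:15:09');   // connector run in the Simulator

console.log('direct:    ' + Math.round(direct) + ' rows/s');
console.log('simulator: ' + Math.round(simulator) + ' rows/s');
console.log('ratio:     ' + Math.round(direct / simulator) + 'x');
```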

               

               

              Here is my code.  Just start up the Simulator and everything else should be filled out.

               

              Dropbox - webdataconnector-master.zip

              • 4. Re: Tableau-Elasticsearch WDC Performance Issues
                Jeff D

                If you're talking about performance, the simulator doesn't count.  You're not using Tableau, you're using your browser.  (Sorry I won't have time to read through your message until next week; in the meantime, perhaps other folks will jump in.  Good luck!)

                • 5. Re: Tableau-Elasticsearch WDC Performance Issues
                  Brendan Lee

                  I wouldn't recommend doing a performance evaluation using the simulator.  It's just a simple web app designed to help you debug your web data connectors, and if you try to bring back a ton of rows, you'll just hit the normal performance limitations of a browser (the simulator programmatically generates a giant HTML table, which would be super slow).

                   

                  I would recommend using the simulator to bring back a subset of data and make sure you are getting the right data, but then use the WDC in Tableau desktop to measure performance.

                   

                  The WDC can support a large volume of data (I've worked with someone who had a web data connector that pulled hundreds of millions of rows).  But that can take a really long time! There are some other options for getting data into Tableau (like the Tableau SDK) if you have too much data and need better performance.


                  The WDC will have two performance bounds that you can monitor:

                  1. The time it takes for the web data connector to pull data from Elasticsearch (as Jeff mentioned, you can measure this to get a bound on how fast it could possibly be before Tableau is even involved).
                  2. The time it takes for Tableau to create an extract from the data it receives from the web data connector.  Once all of the data has been fetched from the web service, Tableau should be able to create an extract in roughly the same amount of time it would for another data source (for example, creating an extract from an Excel file with the same number of rows and columns).

                   

                   

                  A miscellaneous performance tip:  Tableau is better at handling a large volume of rows than it is at handling a large volume of columns.  If your connector brings back 60+ columns, you might see some performance degradation there.

                  • 6. Re: Tableau-Elasticsearch WDC Performance Issues
                    k v

                    Brendan:

                     

                    I understand the limitations of the Web Simulator (see my Post #2 above).  I made straight REST calls to ES to retrieve ~1.3 million records, and they came back in a fairly short amount of time (comparatively, anyway).  I'm now having issues where there are discrepancies between the Simulator/WDC and Tableau.

                     

                    If I run my code in the Simulator, it works fine (still super slow), but in Tableau it does not.

                     

                    I modified my Web Data Connector code to use the Elasticsearch Scroll API: Scroll | Elasticsearch Reference [2.3] | Elastic


                    And I was able to achieve a fairly quick response from Elasticsearch using approximately 1.3 million records (again, see Post #2 above).

                     

                    The problem is a discrepancy between the Simulator and the Tableau Desktop (WDC) APIs: the code is fully functional in the Simulator, apart from being very slow, whereas in Tableau Desktop it does not seem to function correctly.

                     

                    The basic logic behind my rest calls is as follows:

                     

                    1.) Make rest call to Elasticsearch Scroll API

                    2.) Get Initial Result and then increment the counter and add results to the result Data Extract

                    3.) Also get the Scroll ID that was returned and use it for all subsequent calls to get the next batch of records.
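The three steps above can be sketched as follows. This is a sketch under assumptions, not the code from the attached zip: `post(url, body, cb)` is a hypothetical transport (for example, a thin wrapper over XMLHttpRequest) that calls `cb` with the parsed JSON response, and `scrollAll` and `makeFakePost` are illustrative names. The URLs match the ones in the log output above, and the `_scroll_id` and `hits.hits` fields are the ones the Elasticsearch scroll response returns.

```javascript
// Walks an Elasticsearch scroll to the end, handing each batch of hits to onBatch.
// post(url, body, cb) is a hypothetical transport that calls cb(parsedJsonResponse).
function scrollAll(post, baseUrl, indexAndType, batchSize, onBatch, onDone) {
  var counter = 0;

  function handle(response) {
    var hits = response.hits.hits;
    if (hits.length === 0) {
      // An empty page means the scroll is exhausted.
      onDone(counter);
      return;
    }
    counter += hits.length;      // step 2: increment the counter and pass on the results
    onBatch(hits, counter);
    // Step 3: later calls go to /_search/scroll and carry the scroll ID
    // from the most recent response.
    post(baseUrl + '/_search/scroll',
         { scroll: '1m', scroll_id: response._scroll_id },
         handle);
  }

  // Step 1: the initial call opens the scroll context and returns the first batch.
  post(baseUrl + '/' + indexAndType + '/_search?scroll=1m',
       { size: batchSize, query: { match_all: {} } },
       handle);
}

// Stand-in transport so the sketch runs without an Elasticsearch endpoint.
// It ignores the URL and body and simply serves `total` fake hits in order.
function makeFakePost(total, batchSize) {
  var served = 0;
  return function (url, body, cb) {
    var n = Math.min(batchSize, total - served);
    var hits = [];
    for (var i = 0; i < n; i++) {
      hits.push({ _id: String(served + i), _source: {} });
    }
    served += n;
    cb({ _scroll_id: 'fake-scroll-id', hits: { total: total, hits: hits } });
  };
}

var batchSizes = [];
var finalCount = -1;
scrollAll(makeFakePost(12, 5), 'http://localhost:9200', 'test6/typed', 5,
  function (hits) { batchSizes.push(hits.length); },
  function (count) { finalCount = count; });
console.log(batchSizes, finalCount);
```

In a connector, `onBatch` would be the place to convert the hits to rows and hand them to Tableau, one batch per callback.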

                     

                    So basically, after every REST call I take the scroll_id and a counter (which increments after every successful call), place them into a map called "retMap", and return that to the tableau.dataCallback method like so:

                     

                    tableau.dataCallback(dataToReturn, JSON.stringify(retMap), hasMoreData);

                     

                    This logic works fine in the Simulator using Firefox/Chrome, but for some reason when I run it in Tableau Desktop, it seems like the retMap is not being passed back to the method on the next run (looking at the logs, it's undefined).
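One defensive option is sketched below. It assumes the WDC 1.x contract in which the second argument to tableau.dataCallback comes back as the lastRecordToken argument of getTableData on the next call. `readPagingState` and `writePagingState` are hypothetical helpers, and the `scrollId`/`counter` field names are illustrative. The sketch does not explain why Tableau Desktop hands back undefined; it only lets the connector tolerate a missing or unparseable token by starting from a fresh state.

```javascript
// Turns the lastRecordToken string back into paging state. On the first call
// (or if the token is lost) it may be undefined or empty, so fall back to a
// fresh state instead of letting JSON.parse throw.
function readPagingState(lastRecordToken) {
  var fresh = { scrollId: null, counter: 0 };
  if (typeof lastRecordToken !== 'string' || lastRecordToken.length === 0) {
    return fresh;
  }
  try {
    var parsed = JSON.parse(lastRecordToken);
    return {
      scrollId: parsed.scrollId || null,
      counter: Number(parsed.counter) || 0
    };
  } catch (e) {
    // Covers malformed JSON as well as a token that parses to null.
    return fresh;
  }
}

// Serializes paging state into the string handed to tableau.dataCallback.
function writePagingState(scrollId, counter) {
  return JSON.stringify({ scrollId: scrollId, counter: counter });
}

// Intended use inside a connector (assumed shape, shown as comments only):
//   myConnector.getTableData = function (lastRecordToken) {
//     var state = readPagingState(lastRecordToken);
//     ... fetch the next batch using state.scrollId ...
//     tableau.dataCallback(dataToReturn, writePagingState(newScrollId, newCounter), hasMoreData);
//   };

console.log(readPagingState(undefined));
console.log(readPagingState(writePagingState('abc123', 5000)));
```

Logging the raw token at the top of getTableData in both environments would show whether it arrives as a string, an object, or nothing at all.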

                     

                    Please see attached for my Code:

                     

                    Dropbox - webdataconnector-master.zip

                     

                    Unpack the Zip. Put it in your C drive (or somewhere on a mac) and run either the

                    "start_server_windows.bat" or

                    "start_server_mac.sh" from command line.

                    Then open your browser and go to localhost:8888/Simulator/
                    Then click the "Run Interactive Phase" button.
                    In the new window, enter a connection name, e.g. "testConnection".
                    Fill out the Elasticsearch endpoint details, e.g. http://my-elastic-search-url:9200
                    Hit the refresh button.
                    Pick an index from the "index" dropdown.
                    Pick a type from the "types" dropdown.
                    Make sure the "fields" dropdown populates.
                    Then fill out the start index/end index/batch size appropriately.
                    And hit extract.

                    If you want to get all records, leave the start index/end index blank.

                     

                    If you don't have an ES endpoint, I can create a public one for you (for testing).

                     

                    Additionally my JS files are in the ElasticSearch folder.

                    • 7. Re: Tableau-Elasticsearch WDC Performance Issues
                      Kelly Stirman

                      In case it is of interest, Dremio is a new open source project that works with very large datasets. Dremio compiles SQL queries into the underlying DSL, including Painless scripts where appropriate. JSON files are read into Apache Arrow in-memory buffers and processed using Dremio's distributed SQL execution engine.

                       

                      While it's designed to run in clusters up to 1000+ nodes, Dremio will run fine on your laptop to give it a try:

                       

                      https://www.dremio.com/download

                       

                      Here are two tutorials for using Dremio with Elasticsearch:

                       

                      Unlocking SQL on Elasticsearch - Dremio

                      Unlocking Tableau on Elasticsearch - Dremio

                       

                      Dremio works with other data sources too, like MongoDB, Hadoop, Amazon S3, relational databases, and more.

                       

                      Give it a shot, and good luck!