5 Replies Latest reply on Sep 27, 2017 10:26 AM by Jaros Kachniarz

    Creating an extract efficiently with the Java SDK

    Rob R

      I’m working on an application that converts a stream of documents into a Tableau extract. Each document has approximately 100 fields and the document stream has about 40 million total documents. In production, the fields are a mixture of string and numeric types.

       

      In our scale testing, there have been concerns about how long the Tableau SDK consumes a sizeable portion of the CPU while creating the extract. Creating the extract appears to be a computationally heavy operation and keeps one CPU core busy for the entire duration (a single busy thread creates the extract). Processing 2 million documents takes about 10 minutes on my dev machine, which means 40 million documents will take about 3.5 hours (I’ve seen a linear relationship between document count and processing time). It also uses 2 GB of memory. I am using the Java Tableau SDK for macOS (Tableau-SDK-10-2-1).
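The extrapolation above can be reproduced with a few lines of Java. This is only a minimal sketch, assuming the linear relationship between document count and processing time holds at 40 million documents; the class and method names are hypothetical and not part of the Tableau SDK.

```java
public class ExtractTimeEstimate {
    /**
     * Scales a measured duration linearly to a larger document count.
     * Assumption: processing time grows linearly with document count.
     */
    static double estimateMinutes(long measuredDocs, double measuredMinutes, long targetDocs) {
        double scale = (double) targetDocs / measuredDocs;
        return measuredMinutes * scale;
    }

    public static void main(String[] args) {
        // Figures from the post: 2 million documents in about 10 minutes.
        double minutes = estimateMinutes(2_000_000L, 10.0, 40_000_000L);
        System.out.printf("Estimated: %.0f minutes (%.2f hours)%n", minutes, minutes / 60.0);
    }
}
```

With the figures from the post, the strict linear estimate comes to about 200 minutes, which is in the same range as the rough figure quoted above.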

       

      I’ve written a sample app that shows how we create extracts. The production CPU profiling shows a sizeable amount of time is spent in com.tableausoftware.common.StringUtils.ToTableauString[native]() so I chose to use strings for my sample app.


      Is there anything that we could be doing more efficiently, in terms of memory or CPU usage? Thanks for taking a look.

       

      import com.tableausoftware.TableauException;
      import com.tableausoftware.common.Type;
      import com.tableausoftware.extract.Extract;
      import com.tableausoftware.extract.Row;
      import com.tableausoftware.extract.Table;
      import com.tableausoftware.extract.TableDefinition;
      
      import java.io.File;
      import java.io.IOException;
      import java.util.ArrayList;
      import java.util.List;
      import java.util.Random;
      
      public class TestApp {
          public static void main(String[] args) throws IOException {
              //delete existing extract file
              String extractFile = "extract.tde";
              deleteFile(extractFile);
      
              try (Extract extract = new Extract(extractFile)) {
                  //create table definition
                  TableDefinition tableDef = new TableDefinition();
                  int columnCount = 100;
                  for (int i = 0; i < columnCount; i++) {
                      tableDef.addColumn(Integer.toString(i), Type.UNICODE_STRING);
                  }
      
                  //create a table
                  Table table = extract.addTable("Extract", tableDef);
      
                  //create a sample document
                  //reuse the document for each row to minimize the non-tableau CPU and memory consumption
                  Random random = new Random();
                  List<String> rowData = new ArrayList<>();
                  for (int i = 0; i < columnCount; i++) {
                      rowData.add(Integer.toString(random.nextInt()));
                  }
      
                  long startTime = System.nanoTime();
                  System.out.println("Populating table...");
      
                  //populate the table
                  int tableRows = 4000000;
                  for (int i = 0; i < tableRows; i++) {
                      //convert input document to tableau row
                      Row row = new Row(tableDef);
                      for (int j = 0; j < rowData.size(); j++) {
                          row.setString(j, rowData.get(j));
                      }
      
                      //add the row to the table
                      table.insert(row);
                  }
      
                  long endTime = System.nanoTime();
                  System.out.println(String.format("Total time (ms): %s", (endTime - startTime) /
                          1000000));
      
              } catch (TableauException e) {
                  e.printStackTrace();
              }
          }
      
          private static void deleteFile(String file) {
              File extractFile = new File(file);
              if (extractFile.exists()) {
                  extractFile.delete();
              }
          }
      }
      

       

      Dev machine profiling. My machine has 8 logical cores and 16GB RAM.

      [Attachments: overall_stats.png, cpu_stats.png, memory_stats.png]

        • 2. Re: Creating an extract efficiently with the Java SDK
          Jeff Strauss

          Rob, it's a great explanation and analysis that you are doing, so first off, thanks for including all the details.  As I understand it, you don't want the extract build to adversely affect the rendering of vizzes?  One option is to isolate the extract build on its own server and then use tabcmd publish, once the extract is successfully populated, to make it resident on Tableau Server (TS).  And I don't believe you need TS licensing in order to build extracts on a separate machine.

          • 3. Re: Creating an extract efficiently with the Java SDK
            Rob R

            Thanks for the reply. We actually do build the extract on a separate server. At a high level you’re still correct though. The extract generation is part of a larger product running on that server. Our concern is the amount of CPU time the extract API uses to process the input data, which reduces the CPU time available for everything else.

             

            We can’t use tabcmd in this case because we need to support Linux environments. I’ve seen posts here on how to work around the Windows limitation, but we’d prefer to stick with the supported options because this is a product deployed in customer environments.

             

            We were hoping that perhaps we were doing something inefficiently and would be able to make a software change before suggesting a hardware change to our customers.

            • 4. Re: Creating an extract efficiently with the Java SDK
              Jeff Strauss

              Hey Rob.  An alternative to tabcmd that you may explore is the REST API as there are functions within that allow you to publish to TS.  https://onlinehelp.tableau.com/current/api/rest_api/en-us/help.htm#REST/rest_api_ref.htm#Publish_Datasource%3FTocPath%3D…

               

              Also, here are my thoughts on the backgrounder refreshes:

               

              1. The extract refreshes are CPU intensive, as you've obviously seen already.  In our deployment, we run 4 backgrounders on an isolated server cluster node (16 logical cores), and CPU does spike to 100%.

               

              2. My experience has been that extracts built through the custom Extract API refresh more slowly than the standard built-in refresh.  Have you tested this to see whether there's a significant time difference in your case?  It could be because I wrote my extract code in Python, and perhaps there are more optimal programming methods for speeding it up.

               

              3. The primary ways that I have found for speeding up the extract refreshes are:

              • Hide the columns that are not needed
              • Decrease the # of rows that need to be imported
              • Change the underlying disk architecture to get higher IOPS.  We run RAID-10 and have sufficient throughput with very little latency
              • Increase CPU clock speed.  See this post, where our cumulative extract times for our top priority schedule were cut in half: "Performance improvement - rendering and extracts - big find"
              • 5. Re: Creating an extract efficiently with the Java SDK
                Jaros Kachniarz

                Unfortunately, our experience is very similar. We have bigger documents, with about 500 fields, and our speed is only about 2.5 million rows per hour. It looks like the SDK does not scale up well, as it is a single-threaded component. The only solution is to increase CPU speed, but there is not much we can do here, as I have not seen processors with 100 GHz clocks yet.
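To put the two reports in this thread on a common scale, the row rates can be converted to fields written per second. This is a rough sketch that takes the approximate figures quoted in the thread at face value (2 million rows of about 100 fields in about 10 minutes, and 2.5 million rows of about 500 fields per hour). It ignores differences in hardware and field types, and the class and method names are hypothetical.

```java
public class FieldThroughput {
    /** Fields written per second for a given row count, row width, and duration. */
    static double fieldsPerSecond(long rows, int fieldsPerRow, double seconds) {
        return (double) rows * fieldsPerRow / seconds;
    }

    public static void main(String[] args) {
        // First report: ~2 million rows x ~100 fields in ~10 minutes (600 s).
        double first = fieldsPerSecond(2_000_000L, 100, 600.0);
        // Second report: ~2.5 million rows x ~500 fields in one hour (3600 s).
        double second = fieldsPerSecond(2_500_000L, 500, 3600.0);
        System.out.printf("First report:  %.0f fields/s%n", first);
        System.out.printf("Second report: %.0f fields/s%n", second);
    }
}
```

Under these assumptions both reports land in the same ballpark, roughly a third of a million fields per second on a single thread.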

                We reported it to Tableau support, but I am not very optimistic. It is a roadblock for our project. Any thoughts or ideas?

                 

                Thanks Jaros