
    Tableau extract API Python and Java issues

    roman.feldblum

      My use case: 60 million rows, 800 variables.

      PC: 32 GB of memory, Intel i7.

       

      I built a Python script and implemented identical functionality in Java.

       

      Data was loaded from a CSV file.

       

      Test results below apply to both Python and Java APIs.

       

      When testing with 10,000, 100,000, and 1 million sample records, the Tableau extract was created successfully.

       

      When testing with a 33 million record sample, the script (Python and Java) completed with no errors and data was flushed to the TDE, but the TDE was empty.  The Python script ran for 29 hours and the Java jar for 15 hours.

       

      It is possible that the Tableau Extract API has some kind of memory limit. It would be great if the API were self-sufficient and flushed data into the TDE on its own when it reaches that limit.

       

      Performance issues: Due to the nature of the Tableau Extract API, each field in each row needs to be set individually on a Row object, which is not very practical and is very slow.  That is no problem for small files, but in my case 60 million rows x 800 fields = 48 billion iterations.
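
      For reference, here is a minimal sketch of the per-field, per-row pattern the Extract API requires (Python, Extract API / tableausdk module). The file names, the three-column schema, and the CSV layout are made up for illustration, and exact module and method names may differ between API releases:

      import csv

      # Extract API 2.0 ships as the "tableausdk" Python module
      # (older releases called it "dataextract").
      from tableausdk.Extract import Extract, ExtractAPI, Row, TableDefinition
      from tableausdk.Types import Type

      ExtractAPI.initialize()

      extract = Extract('orders.tde')          # hypothetical output file

      # Illustrative 3-column schema; the real case has 800 columns.
      schema = TableDefinition()
      schema.addColumn('id',     Type.INTEGER)
      schema.addColumn('name',   Type.UNICODE_STRING)
      schema.addColumn('amount', Type.DOUBLE)
      table = extract.addTable('Extract', schema)

      # Every field of every record has to be set individually,
      # and every row has to be inserted individually.
      with open('orders.csv') as f:
          for record in csv.reader(f):
              row = Row(schema)
              row.setInteger(0, int(record[0]))
              row.setString(1, record[1])
              row.setDouble(2, float(record[2]))
              table.insert(row)

      extract.close()
      ExtractAPI.cleanup()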

       

      It would be great if the Tableau Extract API supported bulk loading, or if it were possible to interact with the TDE and import data via SQL, e.g. INSERT INTO the TDE.

       

      Java runs much faster.  For very large data sets, Java will probably outperform Python by at least 60-80%.

       

      Test - generate an extract with 1 million records from a CSV file.

      Results:

      Python = 58 minutes

      Java = 30 minutes

       

      Test - generate an extract with 33 million records from a CSV file (note: no extract was generated, even though the script completed with no errors).

      Results:

      Python = 29 hours

      Java = 15 hours

       

      Suggestions:

      The Python and Java APIs are great for small data sets, but not practical for very large record sets.

      It would be great to have some kind of CLI to communicate with the TDE and the ability to bulk import data instead of going one field and one row at a time.

      The API seems to have some kind of memory limitation. With very large data sets, no TDE was generated, even though the code completed with no errors. It would be great if the Extract API were self-managing and flushed data into the TDE when reaching its own memory limits.
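
      A possible workaround (a sketch only, not verified at the 33 million row scale): process the CSV in chunks and close/reopen the extract between chunks, relying on the documented behavior that opening an existing .tde appends to it. The schema, file names, and chunk size below are made up for illustration, and the append behavior may differ between API versions:

      import csv

      from tableausdk.Extract import Extract, ExtractAPI, Row, TableDefinition
      from tableausdk.Types import Type

      CHUNK_SIZE = 1000000   # arbitrary chunk size, chosen for illustration

      def make_schema():
          # Same illustrative 3-column schema as above; the real case has 800 columns.
          schema = TableDefinition()
          schema.addColumn('id',     Type.INTEGER)
          schema.addColumn('name',   Type.UNICODE_STRING)
          schema.addColumn('amount', Type.DOUBLE)
          return schema

      def append_chunk(path, records):
          # Opening an existing .tde appends to it; a new file gets the table added.
          extract = Extract(path)
          if extract.hasTable('Extract'):
              table = extract.openTable('Extract')
          else:
              table = extract.addTable('Extract', make_schema())
          schema = table.getTableDefinition()
          for record in records:
              row = Row(schema)
              row.setInteger(0, int(record[0]))
              row.setString(1, record[1])
              row.setDouble(2, float(record[2]))
              table.insert(row)
          extract.close()   # forces this chunk to be written to disk

      ExtractAPI.initialize()
      chunk = []
      with open('orders.csv') as f:
          for record in csv.reader(f):
              chunk.append(record)
              if len(chunk) == CHUNK_SIZE:
                  append_chunk('orders.tde', chunk)
                  chunk = []
      if chunk:
          append_chunk('orders.tde', chunk)
      ExtractAPI.cleanup()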