1 Reply Latest reply on Aug 30, 2016 1:47 PM by eric templeton

    Python Tableau SDK - Issue with multithreading

    . videla



      I am working on creating TDEs outside Tableau using the Python Tableau SDK. I downloaded the Python SDK and started extending the csv2tde.py sample program to work for my case.

      Converting a 22M-record CSV file (40 columns, raw size: 4 GB, .gz file size: 190 MB) to a TDE took 1 hour 15 minutes. I tried a multiple-producers (each thread reading from the file and converting lines into the row format Tableau needs) and single-consumer (doing the final table.insert(row)) approach.


      Producers: I read the CSV file line by line, appending lines to a list; on hitting 25K rows, I submit the batch to a thread pool via the addtoQ function. Below is the relevant code snippet.

          def addtoQ(list_lines):
              for line in list_lines:
                  row = Row(tableDef)
                  colNo = 0
                  for field in line:
                      # Empty non-string fields become NULLs; everything else
                      # goes through the per-type setter.
                      if colTypes[colNo] != Type.UNICODE_STRING and field == "":
                          row.setNull(colNo)
                      else:
                          fieldSetterMap[colTypes[colNo]](row, colNo, field)
                      colNo += 1
                  rowQ.put(row)    # hand the finished row to the consumer's queue



          list_lines = []
          cnt = 0
          pool = ThreadPoolExecutor(max_workers=5)

          for line in csvReader:
              list_lines.append(line)
              cnt = cnt + 1
              if cnt % 25000 == 0:
                  print time.strftime("%Y-%m-%d %H:%M:%S ") + " 25K rows reading completed"
                  future = pool.submit(addtoQ, list_lines)
                  cnt = 0
                  list_lines = []

          if list_lines:    # flush the final partial batch
              pool.submit(addtoQ, list_lines)
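      For anyone who wants to try this batching pattern outside the SDK, here is a minimal self-contained sketch of the same structure. convert_batch is a stand-in for addtoQ, and the per-row work is a dummy conversion, since Row and fieldSetterMap are Tableau SDK objects:

```python
# Minimal sketch of the batch-submit pattern above, with a dummy
# per-row conversion standing in for the Tableau SDK calls.
from concurrent.futures import ThreadPoolExecutor, wait
import queue

BATCH_SIZE = 3          # 25000 in the real script
row_q = queue.Queue()   # a consumer would drain this and call table.insert(row)

def convert_batch(lines):
    # Stand-in for addtoQ: convert each line and hand it to the consumer.
    for line in lines:
        row = tuple(field.upper() for field in line)  # dummy "Row" build
        row_q.put(row)

def produce(csv_reader, pool):
    batch, futures = [], []
    for line in csv_reader:
        batch.append(line)
        if len(batch) == BATCH_SIZE:
            futures.append(pool.submit(convert_batch, batch))
            batch = []
    if batch:  # flush the final partial batch
        futures.append(pool.submit(convert_batch, batch))
    wait(futures)

fake_csv = [["a", "b"], ["c", "d"], ["e", "f"], ["g", "h"]]
with ThreadPoolExecutor(max_workers=5) as pool:
    produce(iter(fake_csv), pool)
print(row_q.qsize())  # -> 4 (all rows, including the partial batch)
```

      Note the flush after the loop; without it, any rows left over after the last full 25K batch are silently dropped.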


      The addtoQ function takes about 5 seconds per 25K-row batch when I run it sequentially.

      With 5 parallel threads, the batches take around 50 seconds each to complete; in effect, worse than sequential execution. I suspect some blocking is going on, resulting in the longer time. I would really appreciate your help with this optimization.
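      One possibility I am considering is CPython's GIL: the per-row conversion is pure Python, so only one thread can execute it at a time, and adding threads just adds contention. A hedged sketch of the usual workaround, doing the CPU-bound conversion in processes instead of threads (this assumes the batches can be reduced to plain picklable tuples; SDK Row objects cannot cross process boundaries, so only the single-threaded table.insert would stay in the parent):

```python
# Sketch: run the CPU-bound conversion in worker processes (no shared GIL),
# returning plain tuples; the parent alone would touch the Tableau SDK.
from concurrent.futures import ProcessPoolExecutor

def convert_batch(lines):
    # Must return plain picklable data, not SDK Row objects.
    return [tuple(field.strip() for field in line) for line in lines]

def run(batches):
    converted = []
    with ProcessPoolExecutor(max_workers=4) as pool:
        # map preserves batch order; the parent would call table.insert here.
        for rows in pool.map(convert_batch, batches):
            converted.extend(rows)
    return converted

if __name__ == "__main__":
    batches = [[[" a ", "b"], ["c", " d "]], [["e", "f "]]]
    print(run(batches))  # -> [('a', 'b'), ('c', 'd'), ('e', 'f')]
```

      Does that sound like the right direction, or is there a way to make the threaded version actually run in parallel?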