
    Tableau Hyper API, geometry objects, and memory issues

    Martin Henrion

      First of all, I'm generally loving the new Hyper API. Nice work team! Much easier to work with and great documentation to go with it! But I've come across a problem I'm keen to hear Tableau's thoughts on. Adrian Vogelsgesang

       

      I have a situation that causes a memory error when using the Hyper API in Python on Windows. I read a large view (containing PostGIS geometry objects as WKT) from Postgres into the .hyper file, first as text (no problem), and then cast it to geography as suggested in the Tableau Hyper API Reference (this is where the problem arises).

       

      My code works fine on smaller (but still quite large) datasets, even those including geometry objects, so I am 100% certain that it is not a problem with my code. The issue arises when casting the WKT text fields to geography type fields within the hyper file.

       

      The situation:

      • A very large table in Postgres containing numeric data, including the IDs of geometry objects (but not the objects themselves)
      • A view that joins the geometry objects to the aforementioned table; I note that this view, if materialized all at once, would be unmanageably large and would certainly cause memory errors if not stored in a columnar format like a Hyper extract
      • The view is read one row at a time by iterating over a cursor object, and the rows are added to a Hyper file, initially with the geometry objects as text (in WKT format)
      • That temporary hyper table, with the geometry objects stored as SqlType.text(), is then inserted into a table within the same hyper file where the geometry columns are defined as SqlType.geography(), as directed in the reference linked above
      • During the hyper INSERT step, memory use climbs steadily until all available RAM/cache is consumed and the program crashes; I note that it did not crash when the data was originally added to the extract one row at a time

       

      So my hypothesis is that the Hyper API might be re-creating the entire non-columnar, row-style view of the data table in memory during the insertion, which is obviously not desirable in situations like this where the table is impractically large if not kept in columnar format. It would be good to hear from someone at Tableau whether this is the case.

       

      The code (simplified here as it's just for context; indentation was lost on paste, sorry; the problem occurs at the INSERT INTO command near the end):

       

       

      # MAIN -----------------------------------------------------------------------------------------------------------------
      # NOTE: getTableInfo, database_connection_object, hyper_file_name, options
      # and cursor are defined elsewhere in the full script (simplified here).
      from tableauhyperapi import (Connection, CreateMode, HyperProcess, Inserter,
                                   TableName, Telemetry)


      def main():
          """
          Procedure for creating Tableau Hyper files from a postgresql database.
          :return:
          """
          # get table info - NOTE: getTableInfo is a process for creating hyper
          # table definition objects based on the database table/view to be read
          print("Create Tableau table definition.")
          table_definition_geom_as_text, table_definition, \
              column_names, column_types, row_count = \
              getTableInfo(database_connection_object, hyper_file_name)

          # prepare to read data
          sql = f"SELECT * FROM {str(options['tablename'])};"
          cur = cursor  # a pyodbc cursor object connected to the table/view to be read from the database

          # select a hyper file creation mode
          if options['overwrite']:
              file_creation = CreateMode.CREATE_AND_REPLACE
          elif options['append']:
              file_creation = CreateMode.NONE
          else:
              file_creation = CreateMode.CREATE

          # power up the Hyper service
          print("Starting Hyper service.")
          with HyperProcess(telemetry=Telemetry.DO_NOT_SEND_USAGE_DATA_TO_TABLEAU,
                            hyper_path=r'C:\Python37\Lib\site-packages\tableauhyperapi\bin\hyper',
                            parameters={'log-dir': r'C:\Users\Public'}) as ths:

              # connect a hyper file to the service
              with Connection(ths.endpoint, options['filename'], file_creation) as hypercon:

                  # set up the tables
                  hypercon.catalog.create_schema('Extract')
                  hypercon.catalog.create_table(table_definition_geom_as_text)
                  hypercon.catalog.create_table(table_definition)

                  # add the data, one row at a time
                  print("Adding data to temporary table (geometries as text).")
                  with Inserter(hypercon, table_definition_geom_as_text) as writer:
                      for row in cur.execute(sql):
                          # convert the record tuple to a list and append it
                          writer.add_row(list(row))
                      writer.execute()

                  # cast the text to geography -- THIS IS WHERE THE PROBLEM OCCURS
                  print("Casting geometries.")
                  hypercon.execute_command(
                      command=f"INSERT INTO {TableName('Extract', 'Extract')} "
                              f"SELECT * FROM {TableName('Extract', 'GeomAsText')};")

                  # remove the temporary text table
                  print("Removing temporary table.")
                  hypercon.execute_command(
                      command=f"DROP TABLE {TableName('Extract', 'GeomAsText')};")

              # hypercon.close() happens automatically at the end of the "with" block

          # ths.shutdown() happens automatically at the end of the "with" block

          # tidy up
          cur.close()

          return 0

       

       

      The Error:

      Traceback (most recent call last):

        File "C:\Users\mghenri\qfm\v2019p1\global\exporting\pgtohypercmd.py", line 345, in <module>

          return_value = main()

        File "C:\Users\mghenri\qfm\v2019p1\global\exporting\pgtohypercmd.py", line 321, in main

          f"INSERT INTO {TableName('Extract','Extract')} SELECT * FROM {TableName('Extract', 'GeomAsText')};")

        File "C:\Python37\lib\site-packages\tableauhyperapi\connection.py", line 224, in execute_command

          row_count_cdata))

        File "C:\Python37\lib\site-packages\tableauhyperapi\impl\dllutil.py", line 110, in check

          raise errp.to_exception()

      tableauhyperapi.hyperexception.HyperException: Cannot allocate 1552941056 bytes of memory: exceeding allocation limit of 25947294925 bytes for global memory limit

      Context: 0x5fdfad59

       

      Any thoughts from Tableau on the inner workings of the API that might be causing this issue?

        • 1. Re: Tableau Hyper API, geometry objects, and memory issues
          Adrian Vogelsgesang

          > So my hypothesis is that the Hyper API might be re-creating the entire non-columnar row-style view of the data table in memory during the insertion

           

          That hypothesis is pretty close to what is going on during an INSERT, although not 100% accurate.
          On insertion, we always buffer up 131,072 tuples in a row-style intermediate buffer (easier to append new rows to) and then hand the whole batch off to compression, which turns it into a columnar representation.
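
          Conceptually, that batching looks something like the sketch below (illustrative only; the class and the store_columnar sink are made up for this example, this is not Hyper's actual code):

             # Illustrative sketch only -- not Hyper's actual implementation.
             # Rows go into a cheap row-style buffer; once 131,072 tuples have
             # accumulated, the batch is transposed into per-column arrays,
             # which is where a real engine would apply compression.
             BATCH_SIZE = 131_072

             def store_columnar(columns):
                 """Hypothetical sink: compress and persist one columnar batch."""
                 pass

             class RowBatchBuffer:
                 def __init__(self):
                     self.rows = []  # row-style intermediate buffer

                 def add_row(self, row):
                     self.rows.append(row)  # cheap append, no per-column work
                     if len(self.rows) >= BATCH_SIZE:
                         self.flush()

                 def flush(self):
                     if self.rows:
                         # transpose the row batch into a columnar representation
                         columns = [list(col) for col in zip(*self.rows)]
                         store_columnar(columns)
                         self.rows.clear()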

           

          However, I don't think this is the root cause of the issues you are seeing.

          > which is impractically large if not kept in columnar format

          I tend to disagree here. Note that data in columnar and row-wise format usually takes approximately the same amount of storage. Columnar vs. row-wise is really about access times/caching effects, which influence query time, not so much about memory usage.

          That being said, columnar storage is usually combined with compression, while row-wise is usually uncompressed (and this is why people generally tend to think "columnar == smaller").
          In the case of geography, Hyper usually doesn't apply a lot of compression (except if you have a lot of duplicated geometries in your data. Do you?).
          Hence, I don't think that columnar storage, or rather the lack of it, is the issue here.

           

          Also, I would expect the Geography-type representation to be approx. the same size as the corresponding WKT representation.
          So I am really a bit surprised that inserting the text data works, but we then run out of memory during the second insertion.

           

          Without a repro, I am mostly guessing here, though.
          Can you provide a repro?
          E.g., are you using any publicly available geography shapes on which we could reproduce this?

          Also, the hyperd.log would be very useful.

           

          Cheers,
          Adrian

          • 2. Re: Tableau Hyper API, geometry objects, and memory issues
            Martin Henrion

            > On insertion, we always buffer up 131,072 tuples in a row-style intermediate buffer (easier to append new rows to) and then hand the whole batch off to compression, which turns it into a columnar representation.

            Interesting! Is there some check done here to ensure you're not attempting to buffer more than the memory of the machine can handle?

             

            > In the case of geography, Hyper usually doesn't apply a lot of compression (except if you have a lot of duplicated geometries in your data. Do you?).

            Yes, very much duplication. The table is basically a huge list of movements (from origin, to destination, of a particular type of commodity), and attached to each record (after the aforementioned join in the what-would-be-extremely-large-if-materialized view) is the zone geometry of the origin and destination (in WKT format). There are only 75 unique origins or destinations, but around 30 million rows of data, which is why I suggested the table is so much smaller in columnar format.
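
            To put rough numbers on that (the 10 KB average WKT size per polygon below is an assumption for illustration, not a measured value):

               # Back-of-envelope: why dictionary-style columnar storage is so much smaller here.
               rows = 30_000_000      # movement records
               unique_shapes = 75     # distinct origin/destination zone geometries
               wkt_bytes = 10_000     # assumed average WKT size per polygon
               geo_columns = 2        # origin + destination geometry columns

               row_wise = rows * geo_columns * wkt_bytes                      # every row repeats both shapes
               columnar = unique_shapes * wkt_bytes + rows * geo_columns * 4  # dictionary + 4-byte codes

               print(f"row-wise: {row_wise / 1e9:,.0f} GB")  # ~600 GB
               print(f"columnar: {columnar / 1e6:,.0f} MB")  # ~241 MB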

             

            > Also, I would expect the Geography-type representation to be approx. the same size as the corresponding WKT representation.

            Yep, I agree, and it probably is. I suspect that if I tried the same exercise and directly inserted from one identical table to another (text to text), the same issue would arise, because of the level of duplication and thus the efficiency of columnar storage in this case, which is perhaps lost during insertion. I don't think it's a problem with the .geography() type.

             

            > Can you provide a repro?

            It's not public data, but I will email you and then we can report back here (I just can't share the data/repro here).

             

            And thanks for your replies/help!

            • 3. Re: Tableau Hyper API, geometry objects, and memory issues
              Adrian Vogelsgesang

              After some private emails going back and forth between Martin and me, there are now two solutions:

               

              1. HyperAPI got patched

               

              The memory management around geography was fixed as part of yesterday's release (Hyper API 0.0.9273 | Tableau Software; see the changelog: What's New in the Hyper API).

              Hyper now needs far less memory for ingesting geography.


              With this fix, Martin's use case no longer runs out of memory. Unfortunately, it still takes more than 6 hours to finish the WKT import (I aborted the query after 6 hours, so I am not sure how long it would have taken to complete). The old workflow using Extract API 2.0 used to take 2 days to rebuild the extract, so this is not a regression compared to Extract API 2.0. But still, I think this is an unbearable solution.

               

              2. Avoid duplication introduced by denormalizing the data

               

              In Martin's use case, all geography objects are originally stored in dimension tables. Those dimension tables are rather small (~100 entries). However, the remote Postgres server already pre-joined those geography objects with the fact table containing multiple million rows of data, which inflated the data volume enormously.


              We can avoid that inflation by creating normalized extracts instead of denormalizing the data.

               

              Initial experiments with switching to normalized extract creation showed promising results:

              • Ingesting the geography objects now works, even with the old version of the Hyper API that still had the memory-management issues around geography
              • Creating the complete extract now takes 45 minutes instead of 2 days
              • Viz performance and usability are comparable; we did not see any regressions (yet)

               

              And even if we do still find a regression/show-stopper for using a normalized extract, we can always circumvent those issues by denormalizing the data inside Hyper using something like

                 INSERT INTO "Extract"
                 SELECT f.*, d1.geo, d2.geo
                 FROM fact f
                     JOIN dimension1_geo d1 ON d1.geo_id = f.dimension1_key
                     JOIN dimension2_geo d2 ON d2.geo_id = f.dimension2_key

              directly against the HyperAPI.

               

              In this case, the final extract would still be denormalized.
              However, by doing the denormalization as the last step of extract creation, we decreased the amount of work, and the volume of data moved, by multiple orders of magnitude.
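
              For completeness, that last step could look roughly like the sketch below against the Hyper API (the file, table, and column names mirror the SQL above but are assumptions about the actual schema, and the fact and dimension tables are assumed to already be loaded into the .hyper file):

                 # Sketch: denormalize inside Hyper as the final step of extract creation.
                 # File, table, and column names are illustrative assumptions.
                 from tableauhyperapi import (Connection, CreateMode, HyperProcess,
                                              TableName, Telemetry)

                 with HyperProcess(telemetry=Telemetry.DO_NOT_SEND_USAGE_DATA_TO_TABLEAU) as hyper:
                     with Connection(hyper.endpoint, 'movements.hyper', CreateMode.NONE) as conn:
                         # join the tiny geography dimension tables onto the fact table
                         # inside Hyper, so each shape is transferred once instead of
                         # being repeated on every fact row
                         conn.execute_command(f'''
                             INSERT INTO {TableName('Extract', 'Extract')}
                             SELECT f.*, d1.geo, d2.geo
                             FROM {TableName('Extract', 'fact')} f
                             JOIN {TableName('Extract', 'dimension1_geo')} d1
                               ON d1.geo_id = f.dimension1_key
                             JOIN {TableName('Extract', 'dimension2_geo')} d2
                               ON d2.geo_id = f.dimension2_key''')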

              • 4. Re: Tableau Hyper API, geometry objects, and memory issues
                Martin Henrion

                Thanks very much for all your help, in terms of both bug fixes and alternate solutions, Adrian Vogelsgesang!