3 Replies Latest reply on Dec 5, 2019 6:54 AM by Ross Helenius

    File Size after Hyperapi delete

    Selva Prakash S

      I tried deleting records from Hyper files using Python Hyperapi. It works as expected functionally. But I see that the size of the file increases after the deletion. This happens every time. Why does this happen? Is there any way to get around it?

        • 1. Re: File Size after Hyperapi delete
          Adrian Vogelsgesang

          Hi Selva,

           

          yes, file size tends to increase when deleting a tuple.
          This is because Hyper only marks the tuple as deleted but doesn't actually re-compact the whole table.
          In a way, a DELETE will leave a hole in your data, i.e. some unused space in the middle of your table.

          These holes will be filled again as you insert new data after your DELETE.

           

          By the way, in case you were not aware of it, yet:
          Deleting a row from a Hyper file is not a safe way to dispose of this data.
          E.g., I can usually recover deleted rows from the Hyper file using a hex-editor.
          So don't use DELETE to get rid of data which really needs to be deleted in an unrecoverable way.


          So far, this only explains why DELETE doesn't decrease the file size.

          The observed increase, however, comes from a bad feature interaction inside Hyper:

          As soon as a table chunk contains a row which is marked as deleted, this inhibits our compression and we store that table chunk uncompressed.
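
          If you want to observe this yourself, here is a small sketch (the file, table, and column names are just placeholders) which prints the file size before and after a DELETE:
                import os
                from tableauhyperapi import HyperProcess, Connection, Telemetry

                path = "demo.hyper"  # placeholder: path to an existing Hyper file

                with HyperProcess(telemetry=Telemetry.DO_NOT_SEND_USAGE_DATA_TO_TABLEAU) as hyper:
                    size_before = os.path.getsize(path)
                    with Connection(hyper.endpoint, database=path) as connection:
                        # rows are only marked as deleted; their space is not reclaimed
                        connection.execute_command('DELETE FROM public."Extract" WHERE my_column <= 50')
                    size_after = os.path.getsize(path)
                print(size_before, size_after)  # size_after is typically >= size_before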

           

          By the way, a similar thing also applies for DROP TABLE.
          DROP TABLE will also leave a hole in your database file but will not immediately reorganize all other tables within the file and free the corresponding disk space.
          The disk space will only be reused as soon as you create another table or insert additional data into an already existing table, but the file will usually not shrink in size.

           

          We know that both of those issues are very unsatisfying, and we could envision two solutions:
          * get compression to work even in the presence of DELETEd tuples
          * introduce a VACUUM command which does "garbage collection" and frees all unused space inside the Hyper file
          I can't give you a timeline for either of those features.

           

          In the meantime, you can use the following workaround:
          Recreate your Hyper file from scratch, directly excluding the tuples you want to delete.
          Note that this is easier/more efficient than it might sound, as Hyper can copy tables between files.
          E.g., in Python:
                from tableauhyperapi import HyperProcess, Connection, Telemetry

                # input_file / output_file: paths of the existing and the new .hyper file
                with HyperProcess(telemetry=Telemetry.DO_NOT_SEND_USAGE_DATA_TO_TABLEAU) as hyper:
                    with Connection(hyper.endpoint) as connection:
                        # attach the existing file and create & attach a fresh output file
                        connection.catalog.attach_database(input_file, alias="input")
                        connection.catalog.create_database(output_file)
                        connection.catalog.attach_database(output_file, alias="output")
                        # copy only the rows you want to keep into the new file
                        connection.execute_command(
                            'CREATE TABLE output.public."Extract" AS '
                            'SELECT * FROM input.public."Extract" WHERE my_column > 50')


          The trick here is to use CREATE TABLE ... AS to create a table in the output file based on the data in the input file, but only copy over part of it, thanks to the WHERE condition.
          The exact SQL might need some tweaking to match your table names.
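
          If you then want the smaller file to take the place of the original one on disk, one possible follow-up (assuming the connection and HyperProcess above have already been closed, and using the same placeholder paths) is:
                import os

                # swap the freshly written, compact file into place of the original one
                os.replace(output_file, input_file)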

           

          Cheers,
          Adrian

          • 2. Re: File Size after Hyperapi delete
            Adrian Vogelsgesang

            Oh, one more ask...


            Selva Prakash S, can you describe your specific use case in more detail?
            That way we can make sure that we build the right thing to solve your concrete issue.

            • 3. Re: File Size after Hyperapi delete
              Ross Helenius

              Thanks for illustrating the workaround; I wouldn't have understood this without reading the post.

               

              Obviously, true CRUD operations are ideal for this scenario, and I toss in my vote for an in-place operation.