Yes, file size tends to increase when deleting a tuple.
This is because Hyper only marks the tuple as deleted but doesn't actually re-compact the whole table.
In a way, a DELETE will leave a hole in your data, i.e. some unused space in the middle of your table.
These holes will be filled again as you insert new data after your DELETE.
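As an illustration of that behavior (a toy model, not Hyper's actual storage implementation), a tombstone-style table that leaves holes on DELETE and reuses them on INSERT might look like:

```python
# Toy model of tombstone-style deletion: DELETE only marks a slot as free,
# and a later INSERT reuses the hole instead of growing the table.
class ToyTable:
    def __init__(self):
        self.slots = []   # physical storage; never shrinks on delete
        self.free = []    # indexes of "holes" left by deletes

    def insert(self, row):
        if self.free:                    # fill a hole first, if any
            self.slots[self.free.pop()] = row
        else:                            # otherwise grow the storage
            self.slots.append(row)

    def delete(self, index):
        self.slots[index] = None         # mark deleted; space stays allocated
        self.free.append(index)

    def physical_size(self):
        return len(self.slots)

t = ToyTable()
for r in range(4):
    t.insert(r)
t.delete(1)
assert t.physical_size() == 4   # deleting did not shrink the storage
t.insert(99)
assert t.physical_size() == 4   # the new row filled the hole instead
```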
By the way, in case you were not aware of it yet:
Deleting a row from a Hyper file is not a safe way to discard that data.
E.g., I can usually recover deleted rows from the Hyper file using a hex editor.
So don't use DELETE to get rid of data which really needs to be deleted in an unrecoverable way.
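To see why that recovery works, here is a small stand-alone simulation (this is not the actual Hyper file format): a record is "deleted" by flipping a one-byte flag, yet its payload bytes stay in the file and remain readable at the byte level:

```python
import os
import tempfile

# Toy record layout: 1 flag byte (b'\x01' = live, b'\x00' = deleted)
# followed by a 16-byte payload.
path = os.path.join(tempfile.mkdtemp(), 'toy.db')
with open(path, 'wb') as f:
    f.write(b'\x01' + b'secret-salary-42')

# "Delete" the record by flipping only the flag byte.
with open(path, 'r+b') as f:
    f.write(b'\x00')

# A hex editor (or any byte-level read) still sees the payload.
with open(path, 'rb') as f:
    raw = f.read()
assert raw[0:1] == b'\x00'            # record is marked deleted...
assert b'secret-salary-42' in raw     # ...but the data is still there
```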
So far, this only explains why DELETE doesn't decrease the file size.
The observed increase comes from a bad feature interaction inside Hyper:
As soon as a table chunk contains a row which is marked as deleted, this inhibits our compression and we store that table chunk uncompressed.
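The effect can be sketched with a toy chunk store (zlib standing in for Hyper's own compression, which it is not): a chunk is stored compressed only if it contains no deleted rows, so a single tombstone makes the whole chunk fall back to its raw size:

```python
import zlib

def stored_size(rows, deleted_flags):
    """Size of a toy on-disk chunk: compressed iff no row is tombstoned."""
    raw = b''.join(rows)
    if any(deleted_flags):
        return len(raw)                # tombstone present: stored uncompressed
    return len(zlib.compress(raw))     # clean chunk: compressed

rows = [b'AAAA' * 64] * 100            # highly compressible data
clean = stored_size(rows, [False] * 100)
dirty = stored_size(rows, [False] * 99 + [True])
assert clean < dirty                   # one DELETE inflates the stored chunk
```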
By the way, a similar thing also applies for DROP TABLE.
DROP TABLE will also leave a hole in your database file but will not immediately reorganize all other tables within the file and free the corresponding disk space.
The disk space will only be reused as soon as you create another table or insert additional data into an already existing table, but the file will usually not shrink in size.
We know that both of those issues are very unsatisfying, and we could envision two solutions:
* get compression to work also in the presence of DELETEd tuples
* introduce a VACUUM command which does "garbage collection" and frees all unused space inside the Hyper file
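A VACUUM along those lines would essentially rewrite only the live rows into fresh, hole-free storage. As a concept sketch (not Hyper's implementation), continuing the tombstone idea where a hole is a `None` slot:

```python
def vacuum(slots):
    """Compact a tombstone-style table: keep live rows, drop holes (None)."""
    return [row for row in slots if row is not None]

fragmented = [10, None, 20, None, None, 30]   # three holes left by DELETEs
compacted = vacuum(fragmented)
assert compacted == [10, 20, 30]
assert len(compacted) < len(fragmented)        # unused space reclaimed
```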
I can't give you a timeline for either of those features.
In the meantime, you can use the following workaround:
Recreate your Hyper file from scratch, directly excluding the tuples you want to delete.
Note that this is easier/more efficient than it might sound, as Hyper can copy tables between files.
E.g., in Python (a sketch using the tableauhyperapi package; the file names and the my_column filter are placeholders for your own):

from tableauhyperapi import HyperProcess, Connection, Telemetry, CreateMode

with HyperProcess(telemetry=Telemetry.DO_NOT_SEND_USAGE_DATA_TO_TABLEAU) as hyper:
    # Create a fresh output file; it is attached under the alias "output".
    with Connection(hyper.endpoint, 'output.hyper', CreateMode.CREATE_AND_REPLACE) as connection:
        # Attach the old file; it becomes visible under the alias "input".
        connection.catalog.attach_database('input.hyper')
        # Copy only the rows we want to keep into the fresh file.
        connection.execute_command(
            'CREATE TABLE output.public."Extract" AS '
            'SELECT * FROM input.public."Extract" WHERE my_column > 50')
The trick here is to use the CREATE TABLE AS to create a table in the output file based on the data in the input file, but only copy over parts of it, thanks to the WHERE condition.
The exact SQL might need some tweaking to match your table and column names.
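When adapting the statement to other names, quoting the identifiers correctly matters. A small hypothetical helper (using the double-quote escaping rule common to SQL dialects, including Hyper's) could build the statement for you:

```python
def quote_ident(name):
    """Double-quote an SQL identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'

def copy_statement(src_db, dst_db, schema, table, predicate):
    """Build a CREATE TABLE ... AS SELECT copying rows matching predicate."""
    src = f'{quote_ident(src_db)}.{quote_ident(schema)}.{quote_ident(table)}'
    dst = f'{quote_ident(dst_db)}.{quote_ident(schema)}.{quote_ident(table)}'
    return f'CREATE TABLE {dst} AS SELECT * FROM {src} WHERE {predicate}'

sql = copy_statement('input', 'output', 'public', 'Extract', 'my_column > 50')
assert sql == ('CREATE TABLE "output"."public"."Extract" AS '
               'SELECT * FROM "input"."public"."Extract" WHERE my_column > 50')
```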
Thanks for illustrating the workaround; I wouldn't have understood it without reading this post.
Obviously, true CRUD operations would be ideal for this scenario, and I toss in my vote for an in-place operation.