Move sort & join operations to SQLite when possible #1317

Open
norberttech opened this issue Jan 1, 2025 · 4 comments

@norberttech
Member

This is an experimental feature that might not work at all.

So the idea is, whenever possible (or when configured), to offload those heavy tasks to a better-optimized engine.

We should start with sorting since it's much easier. What needs to happen is the following (see the sketch after the list):

  • create an SQLite db with columns reflecting the sort-by columns and their types
  • serialize the entire row and save it as a serialized string in a blob column
  • read from the sorted results of the SQL query and yield out the unserialized row to make sure nothing changed
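
A minimal sketch of that flow, assuming the sqlite3 extension (the SqliteSort class, table name, and plain PHP serialize() are placeholders, not existing Flow APIs):

```php
<?php

// Hypothetical sketch: external sort of rows through a temporary SQLite table.
// Everything here (class name, table name, serializer) is illustrative only.
final class SqliteSort
{
    public function __construct(private readonly string $dbPath)
    {
    }

    /**
     * @param iterable<array<string, mixed>> $rows
     *
     * @return \Generator<array<string, mixed>>
     */
    public function sortBy(iterable $rows, string $column) : \Generator
    {
        $db = new \SQLite3($this->dbPath);
        // one column for the sort key, one blob for the whole serialized row
        $db->exec('CREATE TABLE sort_buffer (sort_key INTEGER, row BLOB)');

        $insert = $db->prepare('INSERT INTO sort_buffer (sort_key, row) VALUES (:key, :row)');

        foreach ($rows as $row) {
            $insert->bindValue(':key', $row[$column], \SQLITE3_INTEGER);
            // serialize() stands in for a pluggable (and optionally encrypting) serializer
            $insert->bindValue(':row', \serialize($row), \SQLITE3_BLOB);
            $insert->execute();
            $insert->reset();
        }

        // let SQLite do the heavy lifting, then yield rows back exactly as they were serialized
        $result = $db->query('SELECT row FROM sort_buffer ORDER BY sort_key ASC');

        while ($sorted = $result->fetchArray(\SQLITE3_ASSOC)) {
            yield \unserialize($sorted['row']);
        }

        $db->close();
        \unlink($this->dbPath); // temporary db is removed once reading is done
    }
}
```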

One important thing here is that the SQLite db should be created and configured in a way that allows for the following (a configuration sketch follows the list):

  • setting the maximum allowed memory consumption
  • passing a serializer that can encrypt the serialized row in order to avoid any data leaks
  • removing the db once reading is done
  • defining where that temporary db is created (in memory should also be an option)
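
Roughly what I have in mind for the configuration side (PRAGMA values are only examples, not tuned recommendations):

```php
<?php

// Hypothetical setup of the temporary sort database; all values are illustrative.
$path = \sys_get_temp_dir() . '/flow_sort_' . \bin2hex(\random_bytes(8)) . '.sqlite';
// $path = ':memory:'; // in-memory variant for small datasets

$db = new \SQLite3($path);

// cap the page cache (a negative value is a limit in KiB) to bound memory consumption
$db->exec('PRAGMA cache_size = -65536'); // ~64 MB
// keep temporary structures (e.g. the ORDER BY b-tree) on disk instead of in memory
$db->exec('PRAGMA temp_store = FILE');
// the sort buffer is throwaway data, so durability can be traded for speed
$db->exec('PRAGMA journal_mode = OFF');
$db->exec('PRAGMA synchronous = OFF');

// ... write, sort, read ...

$db->close();

if ($path !== ':memory:') {
    \unlink($path); // remove the temporary db once reading is done
}
```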

Once a proof of concept is created, we should measure its performance and compare it with the native sort. The main bottleneck might be writing/reading to/from SQLite.

My recommendation would be to start with the sqlite3 extension and, if it works, also provide a pdo_sqlite alternative.

I would make it part of core/etl and detect whether SQLite is even available.
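
Detection could look something like this (the fallback to the native sort is just an assumption):

```php
<?php

// Pick an available SQLite driver, or fall back to the native in-memory sort.
$driver = match (true) {
    \extension_loaded('sqlite3') => 'sqlite3',
    \extension_loaded('pdo_sqlite') => 'pdo_sqlite',
    default => null,
};

if ($driver === null) {
    // SQLite not available: keep using the current native sort implementation.
}
```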

@norberttech moved this to Todo in Roadmap Jan 1, 2025
@norberttech added this to the 0.11.0 milestone Jan 1, 2025
@Bellangelo
Contributor

@norberttech I think we should investigate whether we should offload to SQLite only the fields that are used for sorting, instead of inserting the whole row. Then, based on the returned ids, which could be the hash of the row or the index in the array, we return the real data.

@norberttech
Member Author

> @norberttech I think we should investigate whether we should offload to SQLite only the fields that are used for sorting, instead of inserting the whole row. Then, based on the returned ids, which could be the hash of the row or the index in the array, we return the real data.

So let's say we want to sort by id, date.

Then in the SQLite db, we need to create an id column (type based on the Flow column type, probably int) and a date column (probably also int, as there is no native date type).
Once those columns are created, one more is needed: _serialized_dataframe_row_ or something similar, holding the entire row serialized.

Then, when we execute the query and iterate over the results, we only read that _serialized_dataframe_row_ and yield it as is, without any type conversions (the id and date will stay exactly as they were serialized).
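
For the id, date example, the table could look roughly like this (all names are placeholders):

```php
<?php

// Hypothetical schema: one column per sort key plus a blob with the serialized row.
$db = new \SQLite3(':memory:');

$db->exec(
    'CREATE TABLE sort_buffer (
        id INTEGER,
        date INTEGER, -- e.g. a unix timestamp, since SQLite has no native date type
        _serialized_dataframe_row_ BLOB
    )'
);

// only the blob is read back; id and date exist purely so SQLite can do the ordering
$result = $db->query('SELECT _serialized_dataframe_row_ FROM sort_buffer ORDER BY id ASC, date ASC');
```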

Why do we need to store those serialized rows that are not even used in the sort operation? Because otherwise we would need to go back to the source dataset and read the data in exactly the same order as the output of the sort operation, which for most datasets would not be scalable at all.

We also can't keep the whole dataset in memory while the SQLite sort operation is executed, as datasets might be bigger than the available memory.

That's why I think the only reasonable option is to store serialized rows in SQLite and just yield them back directly from it, but maybe I'm missing something here.

I hope it makes more sense now!

@Bellangelo
Contributor

The main reason I was trying not to store a serialized row in the database is that serialization is also a really expensive operation. But your arguments seem valid. I will leave the theory aside for now, since creating a PoC will help us test any assumptions.

@norberttech
Member Author

> The main reason I was trying not to store a serialized row in the database is that serialization is also a really expensive operation. But your arguments seem valid. I will leave the theory aside for now, since creating a PoC will help us test any assumptions.

If we could figure out a way to avoid serialization, that would be fantastic, but I'm not sure that's even possible :/
