-
-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move sort & join operations to SQLite when possible #1317
Comments
@norberttech I think we should investigate if we should offload to the sqlite only the necessary fields that are used for sorting instead of inserting the whole row. Then, based on the returned ids, that could be the hash of the row or the index in the array, we return the real data. |
So let say we want to sort by Then in SQLite db, we need to create an id column (type based on the flow column type probably int) and date column (probably also int as there is no native date type). So then when we execute the query and iterate over the results we only read that Why do we need to store those serialized rows that are not even used in the sort operation? Because otherwise, we would need to go to the source dataset and read the data in exactly the same order as an output of the sort operation which for most of the datasets would not be scalable at all. We also can't keep the whole dataset in the memory when SQLite sort operation is executed as datasets might be bigger than available memory. That's why I think the only reasonable option is to store serialized rows in the SQLite and just yield them back directly from it but maybe I'm missing something here. I hope it makes more sense now! |
The main reason why I was trying not to store a serialized row in the database is that serialization is also a really expensive operation. But, your arguments seem valid ones. I will leave the theory for now since creating a PoC will help us test any assumptions. |
if we could figure out a way to avoid serialization that would be fantastic, but I’m not sure if that's even possible :/ |
This is an experimental feature that might not work at all.
So the idea is to whenever possible (or when configured) to offload those heavy tasks to better optimized engine.
We should start from sorting since its much easier. What needs to happen is following:
One important thing here is that sqlite db should be created and configured in a way that allows for:
Once a proof of concept is created we should measure it performance and compare with native sort. The main bottleneck might happen at writing/reading to/from sqlite.
My recommendation would be to start with sqlite3 extension and if it works provide also pdo_sqlite alternative.
I would make it part of the core/etl and detect if sqlite is even available
The text was updated successfully, but these errors were encountered: