Skip to content
This repository has been archived by the owner on Jan 28, 2021. It is now read-only.

Count only partitions #659

Closed
wants to merge 1 commit into from
Closed

Count only partitions #659

wants to merge 1 commit into from

Conversation

kuba--
Copy link
Contributor

@kuba-- kuba-- commented Apr 4, 2019

Signed-off-by: kuba-- [email protected]
Closes #658

Basically #658 should be already closed, because we already count only partitions so progress looks better (does not overflow total).
This PR (just in case) memorize partitions and updates progress only for new ones.

TL;DR
The total number for process list status is a number of partitions (what so far is equal number of repositories). Previously wrong fraction was related to number_of_rows/number_of_partitions. We (I assume @erizocosmico ) has already changed the implementation and updates progress when we get EOF (what means after process partition). This PR just memorizes string keys (what so far are partition keys) and updates progress only for keys which does not exist. It's just in case change which allow us change progress counter in the future (it can be per partition, per repo, ...).

@kuba-- kuba-- added the bug Something isn't working label Apr 4, 2019
@kuba-- kuba-- requested a review from a team April 4, 2019 14:35
@erizocosmico
Copy link
Contributor

As I explained in the issue, we can't solve this. Not without fully rethinking how we do inner joins.
Partitions may be iterated more times than they should if the inner join is done in multipass mode and there is just no way we can fix that because we don't know how many times we will iterate a partition.

While this fix would technically fix it because there would be no more overflow, it's not correctly reporting to the user. Yes, once a partition has been seen we mark it as done, but is it really? Because we may iterate that partition another 3000 times after that.

If you have 300 rows on the left side and 300 partitions on the right side once you've iterated fully the right side (and only 1 row of the left side) you will have 300/300 reported in to the user. But is it really a 300/300? Because every partition will be iterated 299 more times and the query is going to take a lot after the user thinks it may be finishing.

@kuba--
Copy link
Contributor Author

kuba-- commented Apr 4, 2019

Yes, totally agreed, but I think people mainly looking into the state to check if it's doing something.
So maybe, we'll keep partition fraction and add an extra rows counter. Thanks to this we will have some frequently changing counter.
I hope it won't have a bad impact on performance.

@kuba--
Copy link
Contributor Author

kuba-- commented Apr 4, 2019

It's similar to counting how many partitions we processed over time and based on total predict how much time left.

Signed-off-by: kuba-- <[email protected]>
@kuba--
Copy link
Contributor Author

kuba-- commented May 24, 2019

I don't believe we gonna merge this PR and it's hanging for a while, so I decided to close it.

@kuba-- kuba-- closed this May 24, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Wrong state of SquashedTable in SHOW PROCESSLIST
3 participants