fix: Improve quorum caching (again) #5761

UdjinM6 · 2023-12-11T18:50:10Z

Issue being fixed or feature implemented

1. `scanQuorumsCache` is a special cache, and we have been using it incorrectly.
2. Platform doesn't really use anything that calls `ScanQuorums()` directly: it specifies the exact quorum hash in RPCs, so `GetQuorum()` is used instead. The only place `ScanQuorums()` is used for Platform-related work is `StartCleanupOldQuorumDataThread()`, because we want to preserve quorum data used by `GetQuorum()`. This can be optimised with its own (much more compact) cache.
3. RPCs that use `ScanQuorums()` should in most cases be fine with a smaller cache; for other use cases there is now a note in the help text.
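To illustrate the idea of a dedicated, compact cache for the cleanup path, here is a minimal sketch. It is not the code from this PR: `CleanupCacheSketch`, `GetOrScan`, the integer `llmq_type` key and the string hashes are all hypothetical stand-ins (the real code works with `uint256` hashes and LLMQ type enums). The sketch only remembers the latest set of quorum hashes to preserve per quorum type, and rescans only when the tip changes.

```cpp
#include <functional>
#include <map>
#include <set>
#include <string>
#include <utility>

// Hypothetical stand-in for a block/quorum hash (uint256 in the real code).
using Hash = std::string;

// Compact cache for a cleanup routine: per quorum type it stores only the
// hashes that must be preserved, together with the tip they were computed at.
class CleanupCacheSketch {
public:
    using KeepSet = std::set<Hash>;
    using ScanFn = std::function<KeepSet()>;

    // Returns the cached keep-set for (llmq_type, tip); calls `scan` only on a miss.
    const KeepSet& GetOrScan(int llmq_type, const Hash& tip, const ScanFn& scan)
    {
        auto it = m_latest.find(llmq_type);
        if (it == m_latest.end() || it->second.first != tip) {
            // Miss: run the (expensive) scan once and remember its result.
            m_latest[llmq_type] = std::make_pair(tip, scan());
            it = m_latest.find(llmq_type);
        }
        return it->second.second;
    }

private:
    // One entry per quorum type: (tip hash, hashes to keep).
    std::map<int, std::pair<Hash, KeepSet>> m_latest;
};
```

Storing only hashes for the most recent tip keeps the structure small, at the cost of one rescan whenever the tip moves.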

What was done?

Please see the individual commits.

How Has This Been Tested?

Ran the tests and ran a node (~~in progress~~ looks stable).

Breaking Changes

n/a

Checklist:

- [x] I have performed a self-review of my own code
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] I have added or updated relevant unit/integration/functional/e2e tests
- [ ] I have made corresponding changes to the documentation
- [x] I have assigned this pull request to a milestone

ogabrielides

utACK

PastaPastaPasta · 2023-12-17T18:21:44Z

Is this not a breaking change, in the sense that the responses of some RPCs may have been changed or reduced in a way that is not fully backwards compatible? Maybe I am misunderstanding the scope of the potential RPC changes.

But I think that if an RPC previously would have returned, say, 10 quorums but now only returns 5, that would be a breaking change, no?

UdjinM6 · 2023-12-17T19:30:29Z

There should be no changes in RPC results; in 6200640 we simply warn users that fast results are only guaranteed for active quorum sets. It's actually an improvement, because the `keepOldKeys` value counts quorums, not blocks, while `scanQuorumsCache` caches quorum sets per block. So here we were only using the cache for the `keepOldKeys` most recently requested blocks. And even scanning back with the default count (`signingActiveQuorumCount`) was mostly not using the cache (only Platform quorums were using the cache all the time, thanks to the huge `keepOldKeys`).
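The distinction between "number of cached blocks" and "number of quorums per cached set" can be sketched as follows. This is an illustrative LRU cache, not the implementation in `src/llmq/quorums.cpp`: `ScanCacheSketch`, its two limits, and the string-based hashes are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-ins: the real code uses uint256 hashes and quorum pointers.
using BlockHash = std::string;
using QuorumSet = std::vector<std::string>;

// LRU cache of quorum sets keyed by block hash, with two independent limits:
// how many blocks are cached, and how many quorums each cached set may hold.
class ScanCacheSketch {
public:
    ScanCacheSketch(std::size_t max_blocks, std::size_t max_quorums_per_set)
        : m_max_blocks(max_blocks), m_max_quorums(max_quorums_per_set)
    {
        assert(max_blocks > 0);
    }

    void Insert(const BlockHash& block, QuorumSet quorums)
    {
        // Limit the stored vector, independently of the number of cached blocks.
        if (quorums.size() > m_max_quorums) quorums.resize(m_max_quorums);
        auto it = m_index.find(block);
        if (it != m_index.end()) {
            m_order.erase(it->second);
            m_index.erase(it);
        }
        m_order.emplace_front(block, std::move(quorums));
        m_index[block] = m_order.begin();
        // Evict the least recently used block once the block limit is exceeded.
        if (m_order.size() > m_max_blocks) {
            m_index.erase(m_order.back().first);
            m_order.pop_back();
        }
    }

    // Returns the cached set for `block`, or nullptr on a miss.
    const QuorumSet* Get(const BlockHash& block)
    {
        auto it = m_index.find(block);
        if (it == m_index.end()) return nullptr;
        m_order.splice(m_order.begin(), m_order, it->second); // mark as recently used
        return &it->second->second;
    }

    std::size_t BlockCount() const { return m_order.size(); }

private:
    using Entry = std::pair<BlockHash, QuorumSet>;
    std::size_t m_max_blocks;
    std::size_t m_max_quorums;
    std::list<Entry> m_order; // most recently used first
    std::unordered_map<BlockHash, std::list<Entry>::iterator> m_index;
};
```

In this sketch, sizing `max_blocks` from a value that counts quorums would conflate the two limits, which is the mismatch described in the comment above.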

PastaPastaPasta

utACK for squash merge

…ms (#5784)

## Issue being fixed or feature implemented

Cache population for old quorums is a CPU-heavy operation and should be avoided for inactive quorums, _at least_ in `ScanQuorums`. This issue is critical for testnet and other small networks, because every masternode participates in almost every Platform quorum, and cache population for 2 months of quorums can easily block everything for 15+ minutes on a 4-CPU node. On mainnet, quorum distribution is much better, but it's still a small waste of CPU (or not so small for unlucky nodes).

#5761 follow-up

## What was done?

Do not start cache population for outdated quorums; improve logs in `StartCachePopulatorThread` to make it easier to see what's going on.

## How Has This Been Tested?

Ran a masternode on testnet.

## Breaking Changes

n/a

## Checklist:

- [x] I have performed a self-review of my own code
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have added or updated relevant unit/integration/functional/e2e tests
- [ ] I have made corresponding changes to the documentation
- [x] I have assigned this pull request to a milestone _(for repository code-owners and collaborators only)_
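A rough sketch of the follow-up's idea, skipping cache population for quorums that are no longer active, might look like this. The struct, the function name, and the definition of "active" (one of the newest `active_count` quorums in a newest-first scan result) are assumptions for illustration, not the code merged in #5784.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified quorum record.
struct QuorumSketch {
    int height;            // height of the quorum's base block
    bool cache_populated;  // whether the expensive cache was already built
};

// Returns the indices of quorums for which cache population should be started.
// Assumes `quorums` is ordered newest first; only the newest `active_count`
// entries are treated as active, and older (outdated) ones are skipped.
std::vector<std::size_t> SelectForCachePopulation(const std::vector<QuorumSketch>& quorums,
                                                  std::size_t active_count)
{
    std::vector<std::size_t> selected;
    for (std::size_t i = 0; i < quorums.size() && i < active_count; ++i) {
        if (!quorums[i].cache_populated) selected.push_back(i);
    }
    return selected;
}
```

With a scan result spanning months of quorums, only the first few entries would trigger the CPU-heavy work under this rule.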

UdjinM6 added the backport-candidate-20.1.x label Dec 11, 2023

UdjinM6 added this to the 20.1 milestone Dec 11, 2023

UdjinM6 added 6 commits December 13, 2023 19:56

fix: scanQuorumsCache should only store cache for blocks in the mining phase

b07c5cf

fix: scanQuorumsCache should have enough space to store all quorums

a6e3b05

fix: the vector we store in scanQuorumsCache should be limited by `keepOldConnections`

f7791d3

doc: add a note to quorum list and quorum memberof RPCs that some values might be CPU/disk heavy

6200640

refactor: introduce and use max_store_depth() helper

4d13a30

fix: Add cleanup cache to avoid excessive quorum scans

27bef79

UdjinM6 force-pushed the fix_quorum_caching_again branch from 6549784 to 27bef79 December 13, 2023 17:31

UdjinM6 marked this pull request as ready for review December 14, 2023 14:06

UdjinM6 requested review from ogabrielides, knst and PastaPastaPasta December 14, 2023 14:06

ogabrielides previously approved these changes Dec 16, 2023

knst reviewed Dec 17, 2023

fix comment in ScanQuorums

d241a65

UdjinM6 dismissed ogabrielides’s stale review via d241a65 December 17, 2023 19:32

refactor: add and use max_cycles() helper

947c626

UdjinM6 force-pushed the fix_quorum_caching_again branch from bb47181 to 947c626 December 17, 2023 19:43

PastaPastaPasta requested changes Dec 18, 2023

fix: apply some suggestions, adjust few things

f614d95

PastaPastaPasta reviewed Dec 18, 2023

refactor: use try_emplace

a196812

PastaPastaPasta approved these changes Dec 20, 2023

PastaPastaPasta merged commit 6fe36cc into dashpay:develop Dec 20, 2023
11 checks passed

UdjinM6 mentioned this pull request Dec 22, 2023

fix: ScanQuorums should not start cache population for outdated quorums #5784

Merged

5 tasks

UdjinM6 modified the milestones: 20.1, 20.0.3 Dec 24, 2023

UdjinM6 removed the backport-candidate-20.1.x label Jan 2, 2024
