Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Improve quorum caching (again) #5761

Merged
merged 10 commits into from
Dec 20, 2023

Conversation

UdjinM6
Copy link

@UdjinM6 UdjinM6 commented Dec 11, 2023

Issue being fixed or feature implemented

  1. scanQuorumsCache is a special one and we use it incorrectly.
  2. Platform doesn't really use anything that calls ScanQuorums() directly, they specify the exact quorum hash in RPCs so it's GetQuorum() that is used instead. The only place ScanQuorums() is used for Platform related stuff is StartCleanupOldQuorumDataThread() because we want to preserve quorum data used by GetQuorum(). But this can be optimised with its own (much more compact) cache.
  3. RPCs that use ScanQuorums() should in most cases be ok with smaller cache, for other use cases there is a note in help text now.

What was done?

pls see individual commits

How Has This Been Tested?

run tests, run a node (in progress looks stable)

Breaking Changes

n/a

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added or updated relevant unit/integration/functional/e2e tests
  • I have made corresponding changes to the documentation
  • I have assigned this pull request to a milestone

@UdjinM6 UdjinM6 added this to the 20.1 milestone Dec 11, 2023
@UdjinM6 UdjinM6 force-pushed the fix_quorum_caching_again branch from 6549784 to 27bef79 Compare December 13, 2023 17:31
@UdjinM6 UdjinM6 marked this pull request as ready for review December 14, 2023 14:06
ogabrielides
ogabrielides previously approved these changes Dec 16, 2023
Copy link
Collaborator

@ogabrielides ogabrielides left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utACK

@PastaPastaPasta
Copy link
Member

Is this not a breaking change in the sense that the response of some RPCs may have been changed / reduced, in a way that is not fully backwards compatible? Maybe I am mis-understanding the scope of the potential RPC changes.

But I think if an RPC previously would have returned say 10 quorums but now only returns 5 that'd be a breaking change no?

src/llmq/quorums.cpp Outdated Show resolved Hide resolved
@UdjinM6
Copy link
Author

UdjinM6 commented Dec 17, 2023

There should be no changes in RPC results, in 6200640 we simply warn users that fast results are only granted for active quorum sets. It's actually an improvement because keepOldKeys value means quorums, not blocks while scanQuorumsCache is caching quorum sets per block. So we were using cache for keepOldKeys of recently requested blocks only here. And even scanning back with the default count (signingActiveQuorumCount) was not using cache for the most part really (only platform quorums were using cache all the time thanks to the huge keepOldKeys).

@UdjinM6 UdjinM6 force-pushed the fix_quorum_caching_again branch from bb47181 to 947c626 Compare December 17, 2023 19:43
src/llmq/quorums.cpp Outdated Show resolved Hide resolved
src/llmq/quorums.cpp Show resolved Hide resolved
src/llmq/quorums.cpp Outdated Show resolved Hide resolved
src/llmq/quorums.cpp Outdated Show resolved Hide resolved
src/llmq/quorums.cpp Outdated Show resolved Hide resolved
src/llmq/quorums.cpp Outdated Show resolved Hide resolved
src/llmq/quorums.cpp Show resolved Hide resolved
src/llmq/quorums.cpp Show resolved Hide resolved
src/llmq/quorums.h Outdated Show resolved Hide resolved
Copy link
Member

@PastaPastaPasta PastaPastaPasta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utACK for squash merge

@PastaPastaPasta PastaPastaPasta merged commit 6fe36cc into dashpay:develop Dec 20, 2023
11 checks passed
ogabrielides pushed a commit to ogabrielides/dash that referenced this pull request Dec 22, 2023
## Issue being fixed or feature implemented
1. `scanQuorumsCache` is a special one and we use it incorrectly.
2. Platform doesn't really use anything that calls `ScanQuorums()`
directly, they specify the exact quorum hash in RPCs so it's
`GetQuorum()` that is used instead. The only place `ScanQuorums()` is
used for Platform related stuff is `StartCleanupOldQuorumDataThread()`
because we want to preserve quorum data used by `GetQuorum()`. But this
can be optimised with its own (much more compact) cache.
3. RPCs that use `ScanQuorums()` should in most cases be ok with smaller
cache, for other use cases there is a note in help text now.

## What was done?
pls see individual commits

## How Has This Been Tested?
run tests, run a node (~in progress~ looks stable)

## Breaking Changes
n/a

## Checklist:
- [x] I have performed a self-review of my own code
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] I have added or updated relevant unit/integration/functional/e2e
tests
- [ ] I have made corresponding changes to the documentation
- [x] I have assigned this pull request to a milestone
PastaPastaPasta pushed a commit that referenced this pull request Dec 22, 2023
…ms (#5784)

## Issue being fixed or feature implemented
Cache population for old quorums is a cpu heavy operation and should be
avoided for inactive quorums _at least_ oin `ScanQuorums`. This issue is
critical for testnet and other small network because every mn
participate in almost every platform quorum and cache population for 2
months of quorums can easily block everything for 15+ minutes on a 4 cpu
node. On mainnet quorum distribution is much better but it's still a
small waste of cpu (or not so small for unlucky nodes).

#5761 follow-up

## What was done?
Do not start cache population for outdated quorums, improve logs in
`StartCachePopulatorThread` to make it easier to see what's going on.

## How Has This Been Tested?
run a mn on testnet

## Breaking Changes
n/a

## Checklist:
- [x] I have performed a self-review of my own code
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have added or updated relevant unit/integration/functional/e2e
tests
- [ ] I have made corresponding changes to the documentation
- [x] I have assigned this pull request to a milestone _(for repository
code-owners and collaborators only)_
ogabrielides pushed a commit to ogabrielides/dash that referenced this pull request Dec 24, 2023
…ms (dashpay#5784)

## Issue being fixed or feature implemented
Cache population for old quorums is a cpu heavy operation and should be
avoided for inactive quorums _at least_ oin `ScanQuorums`. This issue is
critical for testnet and other small network because every mn
participate in almost every platform quorum and cache population for 2
months of quorums can easily block everything for 15+ minutes on a 4 cpu
node. On mainnet quorum distribution is much better but it's still a
small waste of cpu (or not so small for unlucky nodes).

dashpay#5761 follow-up

## What was done?
Do not start cache population for outdated quorums, improve logs in
`StartCachePopulatorThread` to make it easier to see what's going on.

## How Has This Been Tested?
run a mn on testnet

## Breaking Changes
n/a

## Checklist:
- [x] I have performed a self-review of my own code
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have added or updated relevant unit/integration/functional/e2e
tests
- [ ] I have made corresponding changes to the documentation
- [x] I have assigned this pull request to a milestone _(for repository
code-owners and collaborators only)_
@UdjinM6 UdjinM6 modified the milestones: 20.1, 20.0.3 Dec 24, 2023
PastaPastaPasta pushed a commit to ogabrielides/dash that referenced this pull request Dec 24, 2023
## Issue being fixed or feature implemented
1. `scanQuorumsCache` is a special one and we use it incorrectly.
2. Platform doesn't really use anything that calls `ScanQuorums()`
directly, they specify the exact quorum hash in RPCs so it's
`GetQuorum()` that is used instead. The only place `ScanQuorums()` is
used for Platform related stuff is `StartCleanupOldQuorumDataThread()`
because we want to preserve quorum data used by `GetQuorum()`. But this
can be optimised with its own (much more compact) cache.
3. RPCs that use `ScanQuorums()` should in most cases be ok with smaller
cache, for other use cases there is a note in help text now.

## What was done?
pls see individual commits

## How Has This Been Tested?
run tests, run a node (~in progress~ looks stable)

## Breaking Changes
n/a

## Checklist:
- [x] I have performed a self-review of my own code
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] I have added or updated relevant unit/integration/functional/e2e
tests
- [ ] I have made corresponding changes to the documentation
- [x] I have assigned this pull request to a milestone
PastaPastaPasta pushed a commit to ogabrielides/dash that referenced this pull request Dec 24, 2023
…ms (dashpay#5784)

## Issue being fixed or feature implemented
Cache population for old quorums is a cpu heavy operation and should be
avoided for inactive quorums _at least_ oin `ScanQuorums`. This issue is
critical for testnet and other small network because every mn
participate in almost every platform quorum and cache population for 2
months of quorums can easily block everything for 15+ minutes on a 4 cpu
node. On mainnet quorum distribution is much better but it's still a
small waste of cpu (or not so small for unlucky nodes).

dashpay#5761 follow-up

## What was done?
Do not start cache population for outdated quorums, improve logs in
`StartCachePopulatorThread` to make it easier to see what's going on.

## How Has This Been Tested?
run a mn on testnet

## Breaking Changes
n/a

## Checklist:
- [x] I have performed a self-review of my own code
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have added or updated relevant unit/integration/functional/e2e
tests
- [ ] I have made corresponding changes to the documentation
- [x] I have assigned this pull request to a milestone _(for repository
code-owners and collaborators only)_
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants