Parallel ledger close #4543

Merged (10 commits into stellar:master on Jan 8, 2025)

Conversation

@marta-lokhova (Contributor) commented on Nov 14, 2024

Resolves #4317
Concludes #4128

The implementation of this proposal requires massive changes to the stellar-core codebase and touches almost every subsystem. There are some paradigm shifts in how the program executes, which I will discuss below for posterity. The same ideas are reflected in code comments as well, as they will be important for code maintenance and extensibility.

Database access

Currently, only the Postgres DB backend is supported, as it required minimal changes to how DB queries are structured (Postgres provides a fairly nice concurrency model).

SQLite concurrency support is a lot more rudimentary, with only a single writer allowed and the whole database locked during writing. This necessitates further changes in core (such as splitting the database into two). Given that most network infrastructure is on Postgres right now, SQLite support can be added later.
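
To illustrate the concurrency model this relies on, here is a minimal sketch of one database session per thread (a hypothetical helper written against soci with the Postgres backend, not stellar-core's actual Database class): the main thread and the ledger-close thread each get their own session, so isolation between their concurrent transactions is provided by Postgres itself rather than by application-level locking.

```cpp
// Hypothetical sketch, not stellar-core's Database API: one soci session per
// thread so the main thread and the ledger-close thread never share a
// connection. Postgres isolates their concurrent transactions; SQLite's
// single-writer model is why its support is deferred.
#include <soci/postgresql/soci-postgresql.h>
#include <soci/soci.h>

#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>

class PerThreadSessions
{
    std::mutex mMutex;
    std::unordered_map<std::thread::id, std::unique_ptr<soci::session>>
        mSessions;
    std::string const mConnectString;

  public:
    explicit PerThreadSessions(std::string connectString)
        : mConnectString(std::move(connectString))
    {
    }

    // Returns the calling thread's session, creating it on first use.
    soci::session&
    forCurrentThread()
    {
        std::lock_guard<std::mutex> lock(mMutex);
        auto& slot = mSessions[std::this_thread::get_id()];
        if (!slot)
        {
            slot = std::make_unique<soci::session>(soci::postgresql,
                                                   mConnectString);
        }
        return *slot;
    }
};
```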

Reduced responsibilities of SQL

SQL tables have been trimmed as much as possible to avoid conflicts; essentially, we only store persistent state such as the latest LCL and SCP history, as well as the legacy OFFER table.

Asynchronous externalize flow

There are three important subsystems in core that are in charge of tracking consensus, externalizing and applying ledgers, and advancing the state machine to a catchup or synced state:

  • Herder: receives SCP messages, forwards them to SCP, decides if a ledger is externalized, and triggers voting for the next ledger.
  • LedgerManager: implements closing of a ledger, sets catchup vs. synced state, and advances and persists the last closed ledger.
  • CatchupManager: keeps track of any externalized ledgers that are not LCL+1. That is, it tracks future externalized ledgers, attempts to apply them to keep core in sync, and triggers catchup if needed.

Prior to this change, there were two different externalize flows:

  • If core received LCL+1, it would immediately apply it, meaning the sequence externalize → closeLedger → set “synced” state happened in one synchronous function. After application, core triggered the next ledger, usually asynchronously, as it needed to wait to meet the 5s ledger time requirement.
  • If core received ledger LCL+2..LCL+N, it would buffer it asynchronously and continue buffering new ledgers. If core couldn’t close the gap and apply everything sequentially, it would go into the catchup flow.

With the new changes, the logic for triggering ledger close moved entirely to CatchupManager. Essentially, CatchupManager::processLedger became the centralized place to decide whether to apply a ledger or trigger catchup. Because ledger close happens in the background, the transition between externalize and “closeLedger → set synced” becomes asynchronous.
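
As a rough model of that centralized decision (the names mSyncingLedgers and mLastQueuedToApply appear elsewhere in this PR, but the types and logic below are a simplified, hypothetical sketch rather than the real implementation):

```cpp
// Toy model of the decision CatchupManager::processLedger now centralizes.
// Field names mirror concepts from this PR; the logic is a simplified sketch,
// not the actual stellar-core code.
#include <cstdint>
#include <map>

struct LedgerCloseData
{
    // stand-in for the real externalized-ledger payload
};

struct ExternalizeState
{
    std::map<uint32_t, LedgerCloseData> syncingLedgers; // buffered ledgers
    uint32_t lastQueuedToApply = 0;                     // "Q"
    bool catchupWorkRunning = false;
};

// Returns true if core stays in sync, false if the caller (LedgerManager)
// should transition to catchup.
bool
processLedger(ExternalizeState& st, uint32_t seq, LedgerCloseData const& lcd)
{
    st.syncingLedgers.emplace(seq, lcd); // always buffer, even LCL+1

    if (st.catchupWorkRunning)
    {
        // Catchup owns application; keep buffering (and trimming) only.
        return false;
    }
    if (seq == st.lastQueuedToApply + 1)
    {
        // Contiguous: hand the ledger to the background apply task (omitted
        // here) and advance Q.
        st.lastQueuedToApply = seq;
        return true;
    }
    // Gap detected: trigger catchup.
    return false;
}
```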

Concurrent ledger close

Below is the list of core items that moved to the background, each followed by an explanation of why it is safe to do so.

Emitting meta

Ledger application is the only process that touches the meta pipe, so there are no conflicts with other subsystems.

Writing checkpoint files

Only the background thread writes in-progress checkpoint files. The main thread deals exclusively with “complete” checkpoints, which, once complete, must not be touched by any subsystem except publishing.

Updating ledger state

The rest of the system operates strictly on read-only BucketList snapshots and is unaffected by the changing state. Note: there are still some calls to LedgerTxn in the codebase, but those only appear during setup on startup (when the node is not operational) or in offline commands.
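
A generic, self-contained sketch of this pattern (not the actual BucketList snapshot classes): readers hold a shared_ptr to an immutable snapshot, and ledger close publishes a new one under a lock, so readers never observe a half-applied state; at worst they see a slightly stale one.

```cpp
// Generic sketch of the read-only snapshot pattern; names are illustrative.
#include <cstdint>
#include <memory>
#include <mutex>

struct LedgerSnapshot
{
    uint32_t ledgerSeq;
    // ... immutable view of the BucketList at ledgerSeq ...
};

class SnapshotHolder
{
    mutable std::mutex mMutex;
    std::shared_ptr<LedgerSnapshot const> mCurrent;

  public:
    // Apply thread: publish a new snapshot after committing a ledger.
    void
    publish(std::shared_ptr<LedgerSnapshot const> snap)
    {
        std::lock_guard<std::mutex> lock(mMutex);
        mCurrent = std::move(snap);
    }

    // Read-only consumers (main thread, overlay, etc.): the returned snapshot
    // stays internally consistent even if a newer ledger closes meanwhile.
    std::shared_ptr<LedgerSnapshot const>
    get() const
    {
        std::lock_guard<std::mutex> lock(mMutex);
        return mCurrent;
    }
};
```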

Incrementing current LCL

Because ledger close moved to the background, guarantees about ledger state and its staleness are now different. Previously, ledger state queried by subsystems outside of apply was always up to date. With this change, the snapshot used by the main thread may become slightly stale (if the background thread has just closed a new ledger but the main thread hasn’t refreshed its snapshot yet). The main thread’s ledger state has different use cases, which must be treated with caution and evaluated individually:

  • When it is safe: in cases where LCL is used more like a heuristic or an approximation, and program correctness does not depend on its exact state. Example: post-externalize cleanup of the transaction queue. We load LCL’s close time to purge invalid transactions from the queue. This is safe because even if LCL is updated while we call this, the queue remains in a consistent state. In fact, anything in the transaction queue is essentially an approximation, so a slightly stale snapshot is safe to use.
  • When it is not safe: when LCL is needed in places where the latest ledger state is critical, like voting in SCP, validating blocks, etc. To avoid unnecessary headaches, we introduce a new invariant: “applying” is a new state in the state machine, which does not allow voting or triggering the next ledger. Core must first complete applying to be able to vote on the “latest state”. In the meantime, if ledgers arrive while applying, we treat them like “future ledgers” and apply the same procedures in herder that we do today (don’t perform validation checks, don’t vote on them, and buffer them in a separate queue). The state machine remains on the main thread only, which ensures SCP can safely execute as long as the state transitions are correct; for example, a block production function can safely grab LCL at the beginning of the function without worrying that it might change in the background (see the sketch after this list).
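
To make the safe pattern concrete, here is a minimal sketch; the types and names are placeholders rather than the real Herder/LedgerManager interfaces. It shows a main-thread function that captures LCL once and relies on the state machine (no voting or triggering while “applying”) to keep it stable for the duration of the call.

```cpp
// Placeholder types; the real code uses LedgerManager, Herder and the
// application's state machine.
#include <cassert>
#include <cstdint>

enum class NodeState
{
    Synced,
    Applying,
    Catchup
};

struct LedgerState
{
    NodeState state = NodeState::Synced;
    uint32_t lastClosedLedgerSeq = 100;
};

// Main-thread only. Because "applying" is an explicit state that forbids
// voting and triggering, capturing LCL once at the top of the function is
// safe: the background thread cannot advance LCL while this runs.
void
triggerNextLedger(LedgerState const& ls)
{
    assert(ls.state != NodeState::Applying);
    uint32_t nextSeq = ls.lastClosedLedgerSeq + 1;
    // ... assemble a transaction set and start SCP voting for nextSeq,
    //     consistently using the same LCL captured above ...
    (void)nextSeq;
}
```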

Reflecting state change in the bucketlist

Ledger close is the only place in the code that updates the BucketList; other subsystems may only read it. An example is garbage collection, which queries the latest BucketList state to decide which buckets to delete. These reads are protected with a mutex (the same LCL mutex used in LM, as the BucketList is conceptually part of LCL as well).
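
A minimal sketch of that arrangement, with assumed names rather than the real BucketManager interface: ledger close updates the set of live buckets under the mutex, and garbage collection reads it under the same mutex, so GC always sees a consistent view.

```cpp
// Hypothetical sketch (names are assumptions): garbage collection reads the
// live bucket set under the same mutex that ledger close holds while
// committing a new LCL.
#include <mutex>
#include <set>
#include <string>
#include <utility>

class BucketGcView
{
    std::mutex mLclMutex; // conceptually shared with the ledger-close path
    std::set<std::string> mLiveBucketHashes; // updated only by ledger close

  public:
    // Called by ledger close (background thread) when a ledger commits.
    void
    setLiveBuckets(std::set<std::string> hashes)
    {
        std::lock_guard<std::mutex> lock(mLclMutex);
        mLiveBucketHashes = std::move(hashes);
    }

    // Called by garbage collection (main thread): anything on disk that is
    // not in the returned set is a deletion candidate.
    std::set<std::string>
    liveBuckets()
    {
        std::lock_guard<std::mutex> lock(mLclMutex);
        return mLiveBucketHashes;
    }
};
```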

@marta-lokhova force-pushed the parallelLedgerClose branch 7 times, most recently from 1e53b11 to 6d1ce1f on November 16, 2024
@marta-lokhova force-pushed the parallelLedgerClose branch 4 times, most recently from db9a9e0 to 097cb43 on December 16, 2024
@marta-lokhova marked this pull request as ready for review on December 16, 2024
(Inline review comment on src/bucket/BucketManager.h)
@marta-lokhova self-assigned this on Dec 17, 2024
@graydon (Contributor) commented on Dec 18, 2024

Ok so my guide-level understanding of this follows. Could you confirm I've got it right and am not missing anything?

  1. The database gets multiple sessions, one per thread, and nearly everything that touches it gets session-qualified. This allows a degree of concurrent access.
  2. The hilariously-named "persistent state" table, full of miscellaneous stuff, gets split up to avoid serialization conflicts arising from various table-scan predicates that might otherwise execute concurrently.
  3. A new ledger close thread is added.
  4. The ledger close path gets modified so that:
    • Herder on main thread calls LM::valueExternalized as before
    • LM::valueExternalized does not close ledger anymore, passes ledger to CM::processLedger to enqueue.
    • CM::processLedger enqueues ledger in mSyncingLedgers
      • If CM is in sync:
        • Call CM::tryApplySyncingLedgers, which posts a task over to the close thread to call LM::closeLedger
        • Return true to LM::valueExternalized, which then returns to herder to carry on doing SCP.
      • Else CM is not in sync, return false so LM::valueExternalized can transition LM and CM to catchup.
    • Task on close thread runs LM::closeLedger
      • Which is mostly the same as the old closeLedger path, except now it's racing the main thread on the BL, DB and HM
      • When LM::closeLedger completes, it posts a task back to the main thread that does history steps and bucket GC, and notifies Herder of the completion that was previously signaled synchronously after LM::valueExternalized (see the sketch after this list).
  5. The BL and HM (and Config and a few other things) get made more threadsafe in general.
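
A minimal, self-contained sketch of the thread hand-off in step 4, written against standalone Asio rather than stellar-core's actual thread/queue helpers (all names and structure here are illustrative; with Boost the asio:: namespace becomes boost::asio::):

```cpp
// Toy model of the hand-off: the "close thread" applies a ledger and posts a
// completion event back to the main thread, which then advances LCL, runs
// history/GC steps and notifies Herder.
#include <asio.hpp>
#include <cstdint>
#include <iostream>
#include <thread>

int
main()
{
    asio::io_context mainCtx;
    asio::io_context closeCtx;
    auto mainWork = asio::make_work_guard(mainCtx); // keep main loop alive

    uint32_t const ledgerSeq = 101;

    // Main thread (Herder/LM): hand the externalized ledger to the close thread.
    asio::post(closeCtx, [&] {
        // Close thread: apply transactions, update buckets, write checkpoints.
        std::cout << "applying ledger " << ledgerSeq << "\n";

        // Completion: back on the main thread, advance LCL and notify Herder.
        asio::post(mainCtx, [&mainWork, ledgerSeq] {
            std::cout << "ledger " << ledgerSeq << " closed; LCL advanced\n";
            mainWork.reset(); // no further main-thread work in this sketch
        });
    });

    std::thread closeThread([&] { closeCtx.run(); });
    mainCtx.run(); // blocks until the completion handler runs
    closeThread.join();
    return 0;
}
```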

If my understanding is correct... I think this should basically work. And I should express a tremendous congratulations for finding a cut-point that seems like it might work. This is no small feat! It's brilliant.

That said, I remain quite nervous about the details, in I think 3 main ways:

  • General fine-grained lack-of-threadsafety or data-consistency issues. It's hard to be sure you protected the last race.
  • The potential for bugs in the ledger sequence arithmetic and state-transition conditions of LM, CM and Herder. Just because so much has changed here. It's very hard to keep straight the whole set of possible system-state transitions and queue contents, and what the correct thing is to happen in all cases.
  • The potential for bigger blocks of logic on the "two sides of the split" -- main thread doing SCP/herder/enqueue/completion, and close thread doing dequeue/tx-apply/bucket-transfer -- having a subtle unstated assumption of coordinated / synchronous operation. In other words the risk that the cut is "not clean", and some correctness invariant of the system I've forgotten about is violated by running the close thread concurrently.

All 3 of these are diffuse, vague worries I can't point to any specific code actually inhabiting. You've done great here, I would never have thought to make this cut point. But I remain worried. I wonder if there are ways we could audit, detect or mitigate any of those risks.

@graydon (Contributor) left a comment:

Generally great work. Handful of minor nits, lots of questions to make sure I'm understanding things, handful of potential clarifications to consider around naming / comments / logic. Plus I wrote a larger "overview question" in the PR comments.

But all that aside, congratulations on the accomplishment here!

(Inline review comments on: src/bucket/BucketManager.cpp, src/catchup/CatchupManagerImpl.h, src/catchup/CatchupManager.h, src/catchup/CatchupManagerImpl.cpp, src/test/FuzzerImpl.cpp, src/herder/HerderImpl.cpp)
github-merge-queue bot pushed a commit that referenced this pull request Jan 3, 2025
Improve access to ledger state, to better support parallelization
changes in #4543

Note that management of SorobanNetworkConfig is still not great, as
currently LM manages multiple versions of the config. Ideally, soroban
network config lives inside of the state snapshot (either BucketList
snapshot or LedgerTxn), but this was too tricky to implement at this
time due to how network config is currently implemented. We may need to
clean this up later.

This change also partially addresses
#4318
@marta-lokhova force-pushed the parallelLedgerClose branch 2 times, most recently from 4c3bf0c to 9dce230 on January 6, 2025
@marta-lokhova force-pushed the parallelLedgerClose branch 2 times, most recently from 67e7306 to 64ebb6c on January 7, 2025
@graydon (Contributor) commented on Jan 7, 2025

Some points came up in conversation today which were, if not strictly new to me, nonetheless fresh enough that I think it'd be good to get them into comments / ASCII-art / docs somewhere. I will put them in writing here so that at least one of us can come back to this later and try to transcribe them into comments and/or invariants in the code.

The following is all stuff that (AIUI) is true today and maintained by this change:

  • The CM+LM complex only buffers ledgers into CM::mSyncingLedgers when it's "in sync" or in the part of catchup mode that hasn't yet hit the trigger ledger and started running a catchup work.
  • When buffering ledgers into mSyncingLedgers it will try to regain sync state whenever possible (i.e. whenever it gets a contiguous extension of ledgers forward from LCL), at least until it hits the trigger ledger and the catchup work starts running.
  • Once the catchup work is running, it will not apply anything in mSyncingLedgers -- that data is essentially abandoned and the only subsequent applies that will happen will be driven by the catchup work.

Whereas the following is stuff that is new as of this change:

  • The CM+LM complex will now (at least while the LM's state is "in sync" and catchup is not running) always buffer ledgers into mSyncingLedgers even when receiving LCL+1 -- the LM will no longer attempt to special-case LCL+1 and the CM will no longer have to perform compensatory arithmetic for that special case.
  • There is a new conceptual "Apply task" -- which is either a separate "apply thread" or merely a separate "apply step" -- that always consumes entries off the low end of mSyncingLedgers. Its progress is tracked by a new number mLastQueuedToApply a.k.a. Q.
  • The invariant LCL <= Q <= L is the key fact governing 3 producer-consumer processes and their relationships.
    1. Herder consumes SCP progress and produces new ledgers at the L end of mSyncingLedgers, advancing L.
    2. The "Apply task" consumes ledgers from the Q end of mSyncingLedgers, advancing Q toward L, while producing apply-completion events as it applies ledgers.
    3. The remaining LM / trailing main state of the program consumes the completion-events of the apply task, advancing LCL toward Q.
  • In other words: any ledger between Q and L is a buffered entry in mSyncingLedgers and any ledger between LCL and Q is somewhere in the apply task. If the apply task is synchronous there should only ever be 1 ledger in it: Q advances by 1, then LCL advances to meet Q. If the apply task is asynchronous / on a thread, a ledger can get posted to the apply task's ASIO queue while another ledger might be being applied and yet another ledger might be completed and returning from the apply task, in the main thread's ASIO queue.
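
For reference, the LCL <= Q <= L relationship can be written down directly as a check; this is an illustrative model with assumed member names, not the actual CatchupManager/LedgerManager fields:

```cpp
// Illustrative check of the LCL <= Q <= L invariant described above.
// L is the highest buffered ledger, or Q when the buffer is empty.
#include <cassert>
#include <cstdint>
#include <map>

struct ApplyPipeline
{
    uint32_t lcl = 0;                        // last closed ledger
    uint32_t lastQueuedToApply = 0;          // "Q"
    std::map<uint32_t, int> syncingLedgers;  // buffered ledgers, keyed by seq

    void
    checkInvariant() const
    {
        uint32_t q = lastQueuedToApply;
        uint32_t l =
            syncingLedgers.empty() ? q : syncingLedgers.rbegin()->first;
        assert(lcl <= q && q <= l);
    }
};
```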

@marta-lokhova force-pushed the parallelLedgerClose branch 2 times, most recently from 410b794 to a89cb68 on January 7, 2025
@marta-lokhova (Contributor, Author) commented:

@graydon your comment is almost exactly correct. The only thing I want to note is that buffered ledgers aren't abandoned when catchup starts. We continue buffering and trimming them based on checkpoint boundaries, but we won't apply them until normal catchup application is done. Here's how catchup schedules buffered ledger application:

if (mApp.getCatchupManager().maybeGetNextBufferedLedgerToApply())
Importantly, CatchupManager itself won't ever try to apply ledgers if catchup is already running.

@dmkozh (Contributor) left a comment:

LGTM, just a few typos/comment questions

(Inline review comments on: src/history/test/HistoryTestsUtils.cpp, src/ledger/LedgerManagerImpl.cpp, src/bucket/LiveBucketList.h, src/herder/HerderImpl.cpp, src/herder/HerderPersistenceImpl.cpp)
@graydon (Contributor) left a comment:

Handful of minor changes that can wait for a followup.

(Inline review comments on: src/catchup/CatchupManagerImpl.cpp, src/catchup/CatchupManagerImpl.h, src/history/test/HistoryTests.cpp, src/history/test/HistoryTestsUtils.cpp)
@marta-lokhova added this pull request to the merge queue on Jan 7, 2025
@github-merge-queue bot removed this pull request from the merge queue due to failed status checks on Jan 8, 2025
@marta-lokhova added this pull request to the merge queue on Jan 8, 2025
Merged via the queue into stellar:master with commit c669c8f Jan 8, 2025
13 checks passed
@marta-lokhova deleted the parallelLedgerClose branch on January 8, 2025
Successfully merging this pull request may close these issues.

background ledger close: rewrite externalize path to continue buffering ledgers during ledger close