diff --git a/core/cap-0063.md b/core/cap-0063.md
new file mode 100644
index 000000000..4f2fa0c63
--- /dev/null
+++ b/core/cap-0063.md
@@ -0,0 +1,331 @@
+## Preamble
+
+```
+CAP: 0063
+Title: Parallelism-friendly Transaction Scheduling
+Working Group:
+    Owner: dmkozh@stellar.org
+    Authors: Dmytro Kozhevin <@dmkozh>
+    Consulted: Siddharth Suresh <@sisuresh>, Nicolas Barry <@MonsieurNicolas>, Graydon Hoare <@graydon>
+Status: Draft
+Created: 2024-12-18
+Discussion: https://github.com/stellar/stellar-protocol/discussions/1602
+Protocol version: 23
+```
+
+## Simple Summary
+
+This CAP defines a new structure for transaction sets that allows for applying Smart Contract transactions with multiple threads while maintaining bounded application time.
+
+## Working Group
+
+As specified in the Preamble.
+
+## Motivation
+
+Every node in the Stellar network has to apply transactions to the current ledger state in order to produce the next block. Currently, transaction application happens in a single-threaded fashion, which results in CPU under-utilization on most modern systems. Utilizing the idle cores for applying transactions provides a straightforward way of increasing the number of transactions that may be included in a single ledger block without compromising the time for closing that ledger.
+
+Smart Contract transactions on Stellar were designed with parallelism support in mind (see [CAP-46](cap-0046-05.md#static-footprints)). Specifically, every transaction has to declare the entries it may read and modify. This in theory allows Core nodes to pick a set of Smart Contract transactions and come up with a parallel execution schedule that will produce exactly the same result 'as if' the transaction set had been applied sequentially. However, there is currently nothing in the protocol that prevents validators from nominating 'parallel-unfriendly' transaction sets that can't be parallelized at all (e.g. when all transactions want to modify the same key). Thus, the time necessary for applying any given transaction set may vary wildly between `x` (a fully parallelizable set) and `T*x` (a fully sequential one), where `T` is the number of threads a validator may use.
+
+This CAP aims to solve this issue and allow for parallelization with a guaranteed upper bound on the execution time of transaction sets (given at least the minimum specified number of physical threads).
+
+### Goals Alignment
+
+This CAP is aligned with the following Stellar Network Goals:
+
+ - The Stellar Network should run at scale and at low cost to all participants of the network.
+
+## Abstract
+
+This CAP introduces a new format for the Smart Contract (Soroban) phase of the transaction set. The new format organizes the transactions in the following way:
+
+- Transactions are split into a sequence of one or more 'stages' that have to be applied sequentially
+- Every stage consists of one or more clusters of transactions, such that every cluster is completely independent of every other cluster within the stage, so the clusters may be executed in parallel
+- Every cluster consists of potentially dependent transactions, so these transactions generally have to be applied sequentially with respect to each other
+
+The CAP also adds a new network setting for controlling the maximum number of clusters per stage, which effectively regulates the maximum degree of parallelism supported by the network.
+
+The transaction sets defined by this CAP must not exceed the ledger limits specified by the network configuration, including the new limit on the maximum number of parallel clusters. The validation process allows validators to execute the smart contract transactions using multiple threads while maintaining the upper bound on the total modelled instructions executed by every thread, thus capping the overall execution time.
+
+In order to maintain the read-only TTL update semantics, the CAP also defines a new algorithm for updating the TTLs of ledger entries, specifically for the case when multiple transactions within the same ledger update the TTL of the same entry.
+
+## Specification
+
+### XDR Changes
+
+This patch of XDR changes is based on the XDR files in commit `a41b2db15ea34a9f9da5326b996bb8a7ceb5740f` of stellar-xdr.
+```
+diff --git a/Stellar-contract-config-setting.x b/Stellar-contract-config-setting.x
+index 9f09c7b..a8d3325 100644
+--- a/Stellar-contract-config-setting.x
++++ b/Stellar-contract-config-setting.x
+@@ -23,6 +23,16 @@ struct ConfigSettingContractComputeV0
+     uint32 txMemoryLimit;
+ };
+ 
++// Settings for running the contract transactions in parallel.
++struct ConfigSettingContractParallelComputeV0
++{
++    // Maximum number of clusters with dependent transactions allowed in a
++    // stage of a parallel tx set component.
++    // This effectively sets the lower bound on the number of physical threads
++    // necessary to effectively apply transaction sets in parallel.
++    uint32 ledgerMaxDependentTxClusters;
++};
++
+ // Ledger access settings for contracts.
+ struct ConfigSettingContractLedgerCostV0
+ {
+@@ -302,7 +312,8 @@ enum ConfigSettingID
+     CONFIG_SETTING_STATE_ARCHIVAL = 10,
+     CONFIG_SETTING_CONTRACT_EXECUTION_LANES = 11,
+     CONFIG_SETTING_BUCKETLIST_SIZE_WINDOW = 12,
+-    CONFIG_SETTING_EVICTION_ITERATOR = 13
++    CONFIG_SETTING_EVICTION_ITERATOR = 13,
++    CONFIG_SETTING_CONTRACT_PARALLEL_COMPUTE_V0 = 14
+ };
+ 
+ union ConfigSettingEntry switch (ConfigSettingID configSettingID)
+@@ -335,5 +346,7 @@ case CONFIG_SETTING_BUCKETLIST_SIZE_WINDOW:
+     uint64 bucketListSizeWindow<>;
+ case CONFIG_SETTING_EVICTION_ITERATOR:
+     EvictionIterator evictionIterator;
++case CONFIG_SETTING_CONTRACT_PARALLEL_COMPUTE_V0:
++    ConfigSettingContractParallelComputeV0 contractParallelCompute;
+ };
+ }
+diff --git a/Stellar-ledger.x b/Stellar-ledger.x
+index 0fc03e2..77db46c 100644
+--- a/Stellar-ledger.x
++++ b/Stellar-ledger.x
+@@ -164,6 +164,33 @@ enum TxSetComponentType
+     TXSET_COMP_TXS_MAYBE_DISCOUNTED_FEE = 0
+ };
+ 
++// A collection of transactions that *may* have arbitrary read-write data
++// dependencies between each other, i.e. in a general case the transaction
++// execution order within a cluster may not be arbitrarily shuffled without
++// affecting the end result.
++typedef TransactionEnvelope DependentTxCluster<>;
++// A collection of clusters that are *guaranteed* to not have read-write
++// data dependencies between each other, i.e. such that the cluster execution
++// order can be arbitrarily shuffled without affecting the end result. Thus
++// clusters can be executed in parallel with respect to each other.
++typedef DependentTxCluster ParallelTxExecutionStage<>;
++
++// Transaction set component that contains transactions organized in a
++// parallelism-friendly fashion.
++//
++// The component consists of several stages that have to be executed in
++// sequential order, each stage consists of several clusters that can be
++// executed in parallel, and the cluster itself consists of several
++// transactions that have to be executed in sequential order in a general case.
++struct ParallelTxsComponent
++{
++    int64* baseFee;
++    // A sequence of stages that *may* have arbitrary data dependencies between
++    // each other, i.e. in a general case the stage execution order may not be
++    // arbitrarily shuffled without affecting the end result.
++    ParallelTxExecutionStage executionStages<>;
++};
++
+ union TxSetComponent switch (TxSetComponentType type)
+ {
+ case TXSET_COMP_TXS_MAYBE_DISCOUNTED_FEE:
+@@ -178,6 +205,8 @@ union TransactionPhase switch (int v)
+ {
+ case 0:
+     TxSetComponent v0Components<>;
++case 1:
++    ParallelTxsComponent parallelTxsComponent;
+ };
+ 
+ // Transaction sets are the unit used by SCP to decide on transitions
+-- 
+```
+
+### Semantics
+
+#### `ledgerMaxDependentTxClusters` network configuration setting
+
+A new network configuration setting is introduced in order to regulate the maximum potential degree of parallelism supported by the network.
+
+The setting is `ledgerMaxDependentTxClusters` and it's introduced in a new `ConfigSettingContractParallelComputeV0` setting struct with the `CONFIG_SETTING_CONTRACT_PARALLEL_COMPUTE_V0` `ConfigSettingID`. During the protocol upgrade to version 23 the respective configuration setting ledger entry will be created and `ledgerMaxDependentTxClusters` will be set to 1 (no parallelism).
+
+The setting is used for the validation of transaction sets and its exact semantics are described in the following sections.
+
+#### Soroban phase structure
+
+The Soroban phase (phase 1 of the transaction set) has to contain a single component with version 1 (`ParallelTxsComponent`). The phase still has to contain only Soroban transactions, and only a single Soroban phase is allowed; there is no protocol change around that.
+
+`parallelTxsComponent` contains an optional `baseFee` that represents the discounted base fee that must be used as the base fee for every transaction. If set, `baseFee` must not be greater than the base fee of any transaction in the component.
+
+`parallelTxsComponent` must consist of zero or more `executionStages`. Every stage in `executionStages` must contain at least one non-empty `DependentTxCluster`. Thus, if a transaction set contains no Soroban transactions, then it must contain a `parallelTxsComponent` with 0 `executionStages`.
+
+#### Effective fee computation
+
+The effective fee computation works in the same fashion as for the `TxSetComponent` component (described in [CAP-0042](./cap-0042.md#effective-fee-computation)).
+
+The effective fee for a given transaction in `ParallelTxsComponent` is computed in the following way:
+
+ * If `baseFee` is empty, then the transaction's effective fee is its fee bid.
+ * If `baseFee` is set, then the transaction's effective fee is the value of `baseFee`.
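+
+As an illustration, here is a minimal Python sketch of the per-transaction effective fee computation described above (the `fee_bid` and `base_fee` parameter names are hypothetical, not part of the XDR):
+
+```python
+from typing import Optional
+
+def effective_fee(fee_bid: int, base_fee: Optional[int]) -> int:
+    # No discounted base fee is set: the transaction pays its full fee bid.
+    if base_fee is None:
+        return fee_bid
+    # Validation guarantees that the discounted base fee never exceeds the
+    # base fee of any transaction in the component.
+    assert base_fee <= fee_bid
+    return base_fee
+```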
+
+#### Soroban phase validation
+
+As per the already existing specification, every phase in a transaction set has to be valid in order for the set to be applied to the ledger. Beyond the basic XDR representation format described above, the full validation specification for the parallel Soroban phase is defined as follows:
+
+- When set, `baseFee` must not be higher than the base fee of any transaction in the phase
+- Every `DependentTxCluster` has to have its transactions sorted by their SHA-256 hashes in increasing order
+- Every stage in `executionStages` must have its clusters sorted by the SHA-256 hash of the first transaction in the cluster (recall that empty clusters are not allowed)
+- Stages must be sorted by the SHA-256 hash of the first transaction in the first cluster of the stage (recall that empty stages are not allowed)
+- The number of clusters per stage must not exceed the value of the `ledgerMaxDependentTxClusters` network configuration setting
+- Within a stage, footprint conflicts between the dependent transaction clusters are not allowed. A footprint conflict between two transactions is defined as follows: if a transaction A has a ledger key in its read-write footprint, and another transaction B has the same ledger key in its footprint (either read-only or read-write), then they're conflicting. A pair of clusters is considered to have a footprint conflict if any pair of transactions A from the first cluster and B from the second cluster have a conflict.
+- The sum of 'sequential' instructions for the phase must not exceed the value of the `ledgerMaxInstructions` network configuration setting. 'Sequential' instructions are defined as follows (see the sketch after this list):
+  - 'Sequential' instructions per cluster are defined as the sum of `sorobanData.resources.instructions` across all the transactions in the cluster
+  - 'Sequential' instructions per stage are defined as the maximum of 'sequential' instructions across its clusters
+  - 'Sequential' instructions for the phase are defined as the sum of 'sequential' instructions across all the stages in the phase
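+
+The following Python sketch expresses the 'sequential' instructions computation (the transaction accessor name is hypothetical; a stage is modeled as a list of clusters, and a cluster as a list of transactions):
+
+```python
+def sequential_instructions(execution_stages) -> int:
+    total = 0
+    for stage in execution_stages:
+        # Clusters within a stage run in parallel, so a stage costs as much
+        # as its most expensive cluster.
+        total += max(
+            sum(tx.soroban_resources_instructions for tx in cluster)
+            for cluster in stage
+        )
+    # Stages run sequentially, so their costs add up.
+    return total
+
+def validate_phase_instructions(execution_stages, ledger_max_instructions) -> bool:
+    return sequential_instructions(execution_stages) <= ledger_max_instructions
+```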
+
+#### Soroban phase application
+
+Even though some of the transactions in the new phase are data-independent by design, there is still a protocol-defined order of application of the transactions. That is, when transactions are applied, the results must be 'as if' they had been applied in that order. The application order is defined as follows:
+
+- _unchanged_ We define the transaction 'application comparison key' as `sha256(transaction_envelope_xdr) ^ sha256(transaction_set_xdr)`, where `^` is the bit-XOR operation. Note that this is the transaction application order key used by the current protocol as well.
+- First, transactions in every cluster are sorted for apply using the comparison key
+- Then clusters in every stage are sorted using the comparison key of the first transaction in the sorted cluster
+- Then all the stages in the phase are sorted using the comparison key of the first transaction of the first sorted cluster in the sorted stage
+- The application order is defined as the order in which transactions are visited while iterating over the stages in the sort order defined above, i.e. it can be defined by the following pseudo-code:
+```
+for each stage in sorted_stages:
+    for each cluster in sorted_clusters(stage):
+        for each transaction in sorted_transactions(cluster):
+            application_order.append(transaction)
+```
+
+During the application, the fees must be withdrawn and the sequence numbers must be bumped sequentially, in the application order defined above and before performing any operations inside the transactions. This is consistent with the current protocol behavior.
+
+Then all the operations must be applied, and the changes in *non-TTL* ledger entries should be 'as if' the operations from every transaction had been applied sequentially. The semantics for processing the TTL changes and computing the rent fees are changed with this CAP. The changes are described in the following section.
+
+#### TTL update semantics
+
+Currently, the changes to an entry's TTL are immediately applied to the ledger state, in the same way as changes to any other ledger entry. However, the protocol allows increasing the TTL even for the entries that have been declared in the footprint as read-only. Thus, if we wanted to preserve the current behavior, all the read-only footprint entries would have to be treated as potentially read-write, and thus only transactions with non-overlapping footprints could belong to different clusters. This is not optimal for parallelization, given that common entries such as Stellar Asset Contract instances and some popular Wasms can be referenced by multiple transactions, but almost always in a read-only fashion.
+
+Thus, this CAP proposes a new way of applying TTL changes that makes the TTL updates of read-only entries commutative. The new algorithm is as follows:
+
+For every stage:
+  - After fees have been withdrawn, but before applying any operations:
+    - Snapshot the initial `liveUntilLedger` of every entry within the stage's footprint and store it in a 'current TTL map' keyed by the TTL entry keys
+    - Also initialize an empty 'TTL extension map' that has a value of 0 for every key in the footprint
+  - When applying a Soroban operation:
+    - When reading the initial state of the entries, use the 'current TTL map' to determine the `liveUntilLedger` of every entry in the footprint.
+    - If an operation succeeds and the ledger changes caused by the operation must be materialized:
+      - For every key in the read-only footprint that had its `liveUntilLedger` increased:
+        - Add `new_liveUntilLedger - initial_liveUntilLedger` to the value corresponding to the key in the 'TTL extension map'
+      - For every key in the read-write footprint (unconditionally):
+        - Set the value corresponding to the key in the 'current TTL map' to the new `liveUntilLedger` of the entry
+        - Set the value corresponding to the key in the 'TTL extension map' to 0
+    - _unchanged_ The effective rent fee for the operation is still computed based on the initial state of the entry at operation start and the end state of the entry.
+      - Notice, however, that the initial TTL of the entry itself is defined differently now.
+  - After applying all the operations within the stage:
+    - Clamp every value in the 'TTL extension map' to be at most `maxEntryTTL` from the `StateArchivalSettings` configuration setting entry
+    - Add the values in the 'TTL extension map' to the `liveUntilLedger` of the corresponding entries
+
+Notice that, while the algorithm is defined in terms of a single mutable 'current TTL map', in an implementation it can actually be separated into two parts: a fully immutable part for the entries that are always read-only within a stage, and a mutable part that may only be modified while applying transactions in a single cluster. Thus, when applying transactions in parallel, no locking is necessary to update the maps during operation application.
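+
+Here is a Python sketch of the per-stage TTL reconciliation above, written in its sequential 'as if' form (the ledger and operation accessors are hypothetical helpers, not protocol definitions):
+
+```python
+def apply_stage_ttls(ops_in_apply_order, stage_footprint, ledger, max_entry_ttl):
+    # Snapshot TTLs at the start of the stage; only read-write TTL updates
+    # modify this map.
+    current_ttl = {key: ledger.live_until(key) for key in stage_footprint}
+    # Accumulated read-only TTL extensions, materialized at the end of the stage.
+    ttl_extension = {key: 0 for key in stage_footprint}
+
+    for op in ops_in_apply_order:
+        # The operation observes TTLs through the 'current TTL map'.
+        initial = {key: current_ttl[key] for key in op.footprint_keys()}
+        result = op.apply(ttl_view=initial)
+        if not result.success:
+            continue
+        for key in op.read_only_keys():
+            if result.new_live_until(key) > initial[key]:
+                # Read-only bumps accumulate instead of being applied in
+                # place, which makes them commutative across clusters.
+                ttl_extension[key] += result.new_live_until(key) - initial[key]
+        for key in op.read_write_keys():
+            # Read-write TTL changes are materialized immediately and reset
+            # the accumulated extension for the key.
+            current_ttl[key] = result.new_live_until(key)
+            ttl_extension[key] = 0
+
+    for key in stage_footprint:
+        extension = min(ttl_extension[key], max_entry_ttl)
+        ledger.set_live_until(key, current_ttl[key] + extension)
+```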
+
+#### Parallel application of operations
+
+The protocol only defines the application order, and it cannot strictly define how to apply the operations, as long as the end result is equivalent to applying the transactions sequentially in that order. However, the phase structure and validation rules strongly suggest parallelization between the dependent transaction clusters. Specifically:
+
+- Within a single stage, every cluster is completely data-independent of every other cluster, and thus every cluster may be applied by a separate thread
+- After every thread in the stage has succeeded, the changed ledger entries should be passed to the following stage to be applied in parallel as well
+- While the protocol guarantees that there are no data conflicts between the clusters, it's also possible that clusters have 'sub-clusters' that are independent of each other, so an implementation may use a higher degree of parallelism than the protocol requires. This doesn't, however, have any protocol guarantees in terms of the degree of parallelization.
+
+#### Candidate value generation
+
+Just like today, this CAP does not specify the exact algorithm used to produce a valid value, so as to make it easier for implementations to compete on the quality of transaction sets.
+
+Here is a sketch of a possible greedy algorithm that produces relatively efficient Soroban phase contents (a code sketch of the key step follows the list):
+
+- Define a small number of stages S (1-4) to generate
+- For every stage, maintain a set of the dependent transaction clusters (initially empty)
+- For every transaction, in the order of descending inclusion fee:
+  - For every stage, try to fit the transaction into the stage:
+    - Determine the potential new contents of the dependent transaction clusters:
+      - A transaction that doesn't conflict with any other transaction forms a new, single-transaction cluster
+      - Otherwise, merge the new transaction and all the clusters that have at least one conflict with the new transaction into a new cluster
+    - Validate the potential new clusters: verify that it's possible to pack these clusters into `ledgerMaxDependentTxClusters` bins such that in every bin the total amount of instructions across all the transactions doesn't exceed `ledgerMaxInstructions / S` instructions
+    - If the new clusters are valid, store them in the stage and consider the transaction added; otherwise, roll them back to the previous state
+  - If a transaction doesn't fit into any stage, postpone it to the next ledger and trigger surge pricing
+- The output phase consists of the final packings of the clusters of every stage into `ledgerMaxDependentTxClusters` bins
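+
+Here is a Python sketch of the 'fit the transaction into the stage' step under the assumptions above. Transactions are modeled as hypothetical objects with `instructions`, `ro_keys` and `rw_keys` attributes (the latter two being sets of ledger keys), and a greedy decreasing packing heuristic stands in for the bin-packing check:
+
+```python
+def conflicts(cluster, tx):
+    # Footprint conflict as defined in the validation section.
+    return any(
+        tx.rw_keys & (other.rw_keys | other.ro_keys)
+        or other.rw_keys & tx.ro_keys
+        for other in cluster
+    )
+
+def cluster_instructions(cluster):
+    return sum(tx.instructions for tx in cluster)
+
+def try_add_tx(stage_clusters, tx, max_clusters, stage_insn_limit):
+    """Return the updated cluster set if `tx` fits into the stage, else None."""
+    merged = [tx]
+    candidate = []
+    for cluster in stage_clusters:
+        if conflicts(cluster, tx):
+            # All clusters conflicting with `tx` collapse into a single cluster.
+            merged.extend(cluster)
+        else:
+            candidate.append(cluster)
+    candidate.append(merged)
+    # Greedy decreasing packing into `max_clusters` bins, each bounded by the
+    # per-stage instruction budget (ledgerMaxInstructions / S).
+    bins = [0] * max_clusters
+    for cluster in sorted(candidate, key=cluster_instructions, reverse=True):
+        target = min(range(max_clusters), key=bins.__getitem__)
+        if bins[target] + cluster_instructions(cluster) > stage_insn_limit:
+            # Doesn't fit; the caller keeps the previous clusters and tries
+            # the next stage.
+            return None
+        bins[target] += cluster_instructions(cluster)
+    return candidate
+```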
+
+## Design Rationale
+
+This proposal tries to strike a balance between parallelization efficiency, nomination flexibility, and simplicity of transaction set validation and application.
+
+Parallelization efficiency:
+
+- Sequential stages provide a simple scheduling solution for parallelizing sets of transactions with a moderate amount of conflicts:
+  - For example, if a single 'hot' entry that multiple transactions read (e.g. an instance entry of a popular token contract) is updated, then the update can be scheduled in one stage, and all the reads may be efficiently parallelized in the remaining stages.
+- On the other hand, the small, limited number of stages minimizes the amount of synchronization necessary - only a couple of barriers are needed
+- Application time is limited in terms of modelled instructions, so the real apply time should be relatively predictable
+- The TTL reconciliation logic allows parallelizing TTL changes, which are prevalent for otherwise read-only entries
+
+Flexibility:
+
+- The number and size of the stages are not fixed by the protocol, so it's possible to innovate on the nomination algorithms without protocol changes
+- The maximum number of clusters is a network setting, so increasing the maximum parallelization degree supported by the network is a change that only requires a single vote and no code changes
+- The protocol is also not prescriptive about the exact application logic
+
+Simplicity:
+
+- Storing the dependency clusters and stages inside transaction sets makes the validation fast and straightforward
+- A minimal amount of new settings is necessary
+
+### TTL accumulation changes
+
+The changed TTL update logic also includes a slight change in TTL update semantics when several transactions try to update the TTL for the same key.
+
+Currently, if N transactions want to increase the TTL of entry E by L ledgers, only the first one to be applied (an effectively arbitrary choice) will actually perform the increase and have the rent fee withdrawn.
+
+With the changed algorithm (considering that the entry is always read-only), the entry TTL will be increased by N * L ledgers and the respective rent fees will be withdrawn from every transaction. The only real alternative would be to increase the TTL by just L ledgers, but still withdraw the fees from every transaction, so the approach taken makes the fee consumption more fair.
+
+Maintaining the current behavior is not an option in the general case, because a read-only key might have its TTL increased by transactions in different clusters, and thus the transactions can't observe the TTL changes made by each other.
+
+Expanding the accumulation logic even to the transactions that must be sequential both makes the logic more consistent and allows for further parallelization within a cluster where it's otherwise possible, i.e. with this logic read-only keys can be safely considered truly read-only no matter the execution schedule.
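+
+As a concrete worked example of the accumulation semantics (the numbers are illustrative): three read-only transactions in one stage each bump entry E from `liveUntilLedger = 1000` to `1100`. Each observes the initial TTL of 1000, accumulates an extension of 100 ledgers, and pays rent for those 100 ledgers:
+
+```python
+initial_ttl = 1000
+bumps = [1100, 1100, 1100]  # target liveUntilLedger of each read-only bump
+extension = sum(b - initial_ttl for b in bumps)  # 3 * 100 = 300
+final_ttl = initial_ttl + extension              # 1300
+```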
+
+## Protocol Upgrade Transition
+
+As soon as this CAP becomes active, validators will produce the new (v1) format of the Soroban phase in transaction sets in SCP.
+
+The configuration entry of type `CONFIG_SETTING_CONTRACT_PARALLEL_COMPUTE_V0` will be created with `ledgerMaxDependentTxClusters` set to the initial value of 1 during the protocol transition.
+
+This way, the only immediate change realized in the protocol will be the TTL update semantics change. Any protocol-supported parallel execution logic won't be active until `ledgerMaxDependentTxClusters` is increased by a separate SLP and a validator vote.
+
+### Backwards Incompatibilities
+
+As transaction sets are transmitted over the network, the overlay network will have to be updated to support the new format.
+
+The XDR changes introduced by the CAP should have minimal ecosystem effect, since only the internal SCP data structures are changed and these are not typically observed outside of the consensus protocol.
+
+### Resource Utilization
+
+As the `ledgerMaxDependentTxClusters` setting grows, the demands on the hardware running Stellar Core will grow as well: multi-core processors will become necessary for nodes to stay in sync with the network when most of the network bandwidth is being used. Specifically, at least `ledgerMaxDependentTxClusters + 1` cores would be necessary.
+
+## Security Concerns
+
+The changes on the security front are minimal, as transaction semantics are not changed.
+
+As usual, the validators have some degree of control over the contents of the transaction sets, especially in terms of simply censoring the transactions that are being included. However, this CAP doesn't meaningfully affect this risk in either direction.
+
+## Future work
+
+Improvements to the mempool, nomination, and application logic can be implemented as a follow-up to the initial CAP implementation. These are not protocol changes and may be a part of any Core release (before or after the protocol upgrade).
+
+## Test Cases
+
+TBD
+
+## Implementation
+
+TBD