incorporate most recent changes from stellar-core gh admin guide (#590)
stellar/stellar-core@075df81 included some changes to the soon-
to-be-deprecated `admin.md` file within the codebase. This commit
includes the changes from that commit, and adds those new details
to `monitoring.mdx` in the core-node admin guide.

Refs: stellar/stellar-core#4312
ElliotFriend authored May 28, 2024
1 parent f08724a commit 9501ebf
Showing 1 changed file with 27 additions and 4 deletions.
31 changes: 27 additions & 4 deletions network/core-node/admin-guide/monitoring.mdx
@@ -174,15 +174,25 @@ There is a survey mechanism in the overlay that allows a validator to request co

By default, a node will relay or respond to a survey message if the message originated from a node in the receiving node's transitive quorum. This behavior can be overridden by setting the `SURVEYOR_KEYS` field in the config file to a more restrictive set of nodes to relay or respond to.

The survey works in two phases: the collecting phase, and the reporting phase. During the collecting phase, nodes record information about themselves and their peers, such as the number of messages sent to a given peer. During the reporting phase, the surveyor requests the results of the collecting phase from nodes on the network.

The surveyor begins the collecting phase by broadcasting a `TimeSlicedSurveyStartCollectingMessage`. The surveyor ends the collecting phase and initiates the reporting phase by broadcasting a `TimeSlicedSurveyStopCollectingMessage`. These start/stop collecting messages ensure that the collecting phase lasts roughly the same amount of time for all nodes that are present throughout it.

During the reporting phase, the surveyor sends `TimeSlicedSurveyRequestMessage`s to individual nodes to gather the information the node recorded during the collecting phase.

#### Example Survey Command

In this example, we have three nodes: `GBBN`, `GDEX`, and `GBUI` (we'll refer to them by the first four letters of their public keys). We will execute the commands below from `GBUI`. Note that `GBBN` has `SURVEYOR_KEYS=["$self"]` in its config file, so `GBBN` will not relay or respond to any survey messages.

```diff
-# 1. Request connection information from the `GBBN` node
-stellar-core http-command 'surveytopology?duration=1000&node=GBBNXPPGDFDUQYH6RT5VGPDSOWLZEXXFD3ACUPG5YXRHLTATTUKY42CL'
-# 2. Request connection information from the `GDEX` node
-stellar-core http-command 'surveytopology?duration=1000&node=GDEXJV6XKKLDUWKTSXOOYVOYWZGVNIKKQ7GVNR5FOV7VV5K4MGJT5US4'
+# 1. Begin the surveyor collecting phase
+stellar-core http-command 'startsurveycollecting?nonce=1234'
+# 2. Stop the surveyor collecting phase, and begin the reporting phase
+stellar-core http-command 'stopsurveycollecting?nonce=1234'
+# 3. Request survey results from the `GBBN` node
+stellar-core http-command 'surveytopologytimesliced?node=GBBNXPPGDFDUQYH6RT5VGPDSOWLZEXXFD3ACUPG5YXRHLTATTUKY42CL&inboundpeerindex=0&outboundpeerindex=0'
+# 4. Request survey results from the `GDEX` node
+stellar-core http-command 'surveytopologytimesliced?node=GDEXJV6XKKLDUWKTSXOOYVOYWZGVNIKKQ7GVNR5FOV7VV5K4MGJT5US4&inboundpeerindex=0&outboundpeerindex=0'
 # 3. Retrieve and display the results of issued survey commands
 stellar-core http-command 'getsurveyresult'
```
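The two-phase flow above could be wrapped in a small helper script. The following is a minimal sketch, not part of stellar-core: the function names `survey_request_query` and `run_survey`, the `STELLAR_CORE_BIN` variable, and the fixed wait between starting and stopping collection are all assumptions made for illustration. It relies only on the `http-command` endpoints shown in the example.

```bash
#!/usr/bin/env bash
# Sketch only: helper names and STELLAR_CORE_BIN are hypothetical.

# Binary to invoke; overridable so the flow can be dry-run (e.g. with `echo`).
STELLAR_CORE_BIN="${STELLAR_CORE_BIN:-stellar-core}"

# Build the query string for a per-node results request.
# Peer indexes default to 0, as in the example above.
survey_request_query() {
  local node="$1" inbound="${2:-0}" outbound="${3:-0}"
  printf 'surveytopologytimesliced?node=%s&inboundpeerindex=%s&outboundpeerindex=%s' \
    "$node" "$inbound" "$outbound"
}

# Run both phases: collect for a number of seconds, then request results
# from each node public key passed as a remaining argument.
run_survey() {
  local nonce="$1" collect_seconds="$2"
  shift 2

  # Collecting phase
  "$STELLAR_CORE_BIN" http-command "startsurveycollecting?nonce=${nonce}"
  sleep "$collect_seconds"

  # Stop collecting; this also begins the reporting phase
  "$STELLAR_CORE_BIN" http-command "stopsurveycollecting?nonce=${nonce}"

  # Reporting phase: one request per node
  local node
  for node in "$@"; do
    "$STELLAR_CORE_BIN" http-command "$(survey_request_query "$node")"
  done

  # Responses may not have arrived yet; this call may need repeating later.
  "$STELLAR_CORE_BIN" http-command 'getsurveyresult'
}
```

For instance, `run_survey 1234 60 GBBNXPPGDFDUQYH6RT5VGPDSOWLZEXXFD3ACUPG5YXRHLTATTUKY42CL GDEXJV6XKKLDUWKTSXOOYVOYWZGVNIKKQ7GVNR5FOV7VV5K4MGJT5US4` would collect for 60 seconds (an arbitrary value chosen for illustration) before requesting results from the two nodes. Setting `STELLAR_CORE_BIN=echo` prints the commands instead of running them.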
@@ -237,6 +247,12 @@ Once the responses are received, the `getsurveyresult` command will return a res
"numTotalOutboundPeers": 0,
"maxInboundPeerCount": 64,
"maxOutboundPeerCount": 8,
"addedAuthenticatedPeers": 0,
"droppedAuthenticatedPeers": 0,
"p75SCPFirstToSelfLatencyNs": 121042,
"p75SCPSelfToOtherLatencyNs": 112452,
"lostSyncCount": 0,
"isValidator": false,
"outboundPeers": null
}
}
@@ -251,6 +267,7 @@ Some notable fields from this `getsurveyresult` endpoint are:
- `badResponseNodes`: List of nodes that sent a malformed response
- `topology`: Map of nodes to connection information
- `inboundPeers`/`outboundPeers`: List of connection information by nodes
- `averageLatencyMs`: Average latency with this peer in milliseconds.
- `bytesRead`: The total number of bytes read from this peer.
- `bytesWritten`: The total number of bytes written to this peer.
- `duplicateFetchBytesRecv`: The number of bytes received that were duplicate transaction sets and quorum sets.
@@ -268,6 +285,12 @@ Some notable fields from this `getsurveyresult` endpoint are:
- `version`: stellar-core version.
- `numTotalInboundPeers`/`numTotalOutboundPeers`: The number of total inbound and outbound peers this node is connected to. The response will have a random subset of 25 connected peers per direction (inbound/outbound). These fields tell you if you're missing nodes so you can send another request out to get another random subset of nodes.
- `maxInboundPeerCount`/`maxOutboundPeerCount`: The maximum number of inbound and outbound peers that this node can accept. These fields correspond to the stellar-core configuration values `MAX_ADDITIONAL_PEER_CONNECTIONS` and `TARGET_PEER_CONNECTIONS`, respectively.
- `addedAuthenticatedPeers`: The number of authenticated peers added.
- `droppedAuthenticatedPeers`: The number of authenticated peers dropped.
- `p75SCPFirstToSelfLatencyNs`: 75th-percentile latency, in nanoseconds, for this node to hear about new SCP messages.
- `p75SCPSelfToOtherLatencyNs`: 75th-percentile latency, in nanoseconds, for other nodes to hear this node's SCP messages.
- `lostSyncCount`: The number of times this node lost sync.
- `isValidator`: Whether this node is a validator.
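
When reading these fields, two small calculations come up often: converting the nanosecond latency values into milliseconds, and checking whether a response covered all of a node's peers. The helpers below are a minimal sketch; the names `ns_to_ms` and `peers_missing` are hypothetical and not part of stellar-core, and the values are passed in by hand rather than parsed from the JSON response.

```bash
# Sketch only: helper names are hypothetical.

# Convert a nanosecond value (e.g. p75SCPFirstToSelfLatencyNs) to
# milliseconds, rounded to three decimal places.
ns_to_ms() {
  awk -v ns="$1" 'BEGIN { printf "%.3f\n", ns / 1000000 }'
}

# Given a total peer count (e.g. numTotalInboundPeers) and the number of
# peers actually listed in the response, print how many are still missing.
peers_missing() {
  local total="$1" listed="$2"
  local missing=$(( total - listed ))
  if (( missing > 0 )); then
    echo "$missing"
  else
    echo 0
  fi
}

# ns_to_ms 121042      prints 0.121
# peers_missing 40 25  prints 15
```

A non-zero result from `peers_missing` suggests that another request to the same node is needed to see the rest of its peers.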

## Quorum Health
