incorporate most recent changes from stellar-core gh admin guide #590

Merged
merged 1 commit into from
May 28, 2024
31 changes: 27 additions & 4 deletions network/core-node/admin-guide/monitoring.mdx
@@ -174,15 +174,25 @@ There is a survey mechanism in the overlay that allows a validator to request co

By default, a node will relay or respond to a survey message if the message originated from a node in the receiving node's transitive quorum. This behavior can be overridden by setting the `SURVEYOR_KEYS` field in the config file to a more restrictive set of nodes to relay or respond to.
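
The relay rule above can be sketched as a small shell function. This is an illustrative model only, not stellar-core's implementation: the function name `survey_permitted`, its argument layout, and the space-separated key lists are hypothetical, and it assumes that a non-empty `SURVEYOR_KEYS` replaces the transitive-quorum check entirely.

```bash
#!/usr/bin/env bash
# Illustrative model only (hypothetical helper, not part of stellar-core).
# Decide whether a node should relay or respond to a survey message.
#   $1 surveyor       key of the node that originated the survey
#   $2 self           key of the receiving node
#   $3 surveyor_keys  space-separated SURVEYOR_KEYS entries ("" if unset)
#   $4 quorum         space-separated keys in the receiver's transitive quorum
survey_permitted() {
  local surveyor=$1 self=$2 surveyor_keys=$3 quorum=$4
  local allowed key
  if [ -n "$surveyor_keys" ]; then
    allowed=$surveyor_keys   # explicit override from the config file
  else
    allowed=$quorum          # default: the transitive quorum
  fi
  for key in $allowed; do
    # A literal "$self" entry stands for the node's own key.
    if [ "$key" = '$self' ]; then key=$self; fi
    if [ "$key" = "$surveyor" ]; then return 0; fi
  done
  return 1
}

# A node whose config has SURVEYOR_KEYS=["$self"] ignores surveys from others.
if survey_permitted GBUI GBBN '$self' 'GBUI GDEX'; then
  echo "relay/respond"
else
  echo "ignore"
fi
```

With the sample arguments, the only allowed surveyor is the node itself, so a survey originating from another node is ignored.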

The survey works in two phases: the collecting phase, and the reporting phase. During the collecting phase, nodes record information about themselves and their peers, such as the number of messages sent to a given peer. During the reporting phase, the surveyor requests the results of the collecting phase from nodes on the network.

The surveyor begins the collecting phase by broadcasting a `TimeSlicedSurveyStartCollectingMessage`. The surveyor ends the collecting phase and initiates the reporting phase by broadcasting a `TimeSlicedSurveyStopCollectingMessage`. These start/stop collecting messages ensure that the collecting phase is roughly equal in length for all nodes that are present for its full duration.

During the reporting phase, the surveyor sends `TimeSlicedSurveyRequestMessage`s to individual nodes to gather the information the node recorded during the collecting phase.

#### Example Survey Command

In this example, we have three nodes `GBBN`, `GDEX`, and `GBUI` (we'll refer to them by the first four letters of their public keys). We will execute the commands below from `GBUI`. Note that `GBBN` has `SURVEYOR_KEYS=["$self"]` in its config file, so `GBBN` will not relay or respond to any survey messages.

```bash
-# 1. Request connection information from the `GBBN` node
-stellar-core http-command 'surveytopology?duration=1000&node=GBBNXPPGDFDUQYH6RT5VGPDSOWLZEXXFD3ACUPG5YXRHLTATTUKY42CL'
-# 2. Request connection information from the `GDEX` node
-stellar-core http-command 'surveytopology?duration=1000&node=GDEXJV6XKKLDUWKTSXOOYVOYWZGVNIKKQ7GVNR5FOV7VV5K4MGJT5US4'
+# 1. Begin the surveyor collecting phase
+stellar-core http-command 'startsurveycollecting?nonce=1234'
+# 2. Stop the surveyor collecting phase, and begin the reporting phase
+stellar-core http-command 'stopsurveycollecting?nonce=1234'
+# 3. Request survey results from the `GBBN` node
+stellar-core http-command 'surveytopologytimesliced?node=GBBNXPPGDFDUQYH6RT5VGPDSOWLZEXXFD3ACUPG5YXRHLTATTUKY42CL&inboundpeerindex=0&outboundpeerindex=0'
+# 4. Request survey results from the `GDEX` node
+stellar-core http-command 'surveytopologytimesliced?node=GDEXJV6XKKLDUWKTSXOOYVOYWZGVNIKKQ7GVNR5FOV7VV5K4MGJT5US4&inboundpeerindex=0&outboundpeerindex=0'
 # 3. Retrieve and display the results of issued survey commands
 stellar-core http-command 'getsurveyresult'
```
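
When surveying more than a couple of nodes, the reporting-phase requests can be generated in a loop. The following is a minimal sketch, assuming the `surveytopologytimesliced` parameters shown in the example; the helper names `survey_request_query` and `request_survey_results` and the `DRY_RUN` switch are hypothetical and exist only for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; only the endpoint and parameter names come from the
# example above.

# Build the query string for one reporting-phase request.
#   $1 node public key, $2 inbound peer index, $3 outbound peer index
survey_request_query() {
  local node=$1 inbound_index=${2:-0} outbound_index=${3:-0}
  printf 'surveytopologytimesliced?node=%s&inboundpeerindex=%s&outboundpeerindex=%s\n' \
    "$node" "$inbound_index" "$outbound_index"
}

# Send one request per node passed as an argument.
# With DRY_RUN=1 the commands are printed instead of executed.
request_survey_results() {
  local node
  for node in "$@"; do
    if [ "${DRY_RUN:-0}" = 1 ]; then
      echo "stellar-core http-command '$(survey_request_query "$node")'"
    else
      stellar-core http-command "$(survey_request_query "$node")"
    fi
  done
}

# Print the commands for the two nodes used in the example.
DRY_RUN=1
request_survey_results \
  GBBNXPPGDFDUQYH6RT5VGPDSOWLZEXXFD3ACUPG5YXRHLTATTUKY42CL \
  GDEXJV6XKKLDUWKTSXOOYVOYWZGVNIKKQ7GVNR5FOV7VV5K4MGJT5US4
```

Unset `DRY_RUN` to run the commands against a local node. Both peer indices default to 0, mirroring the example.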
@@ -237,6 +247,12 @@ Once the responses are received, the `getsurveyresult` command will return a res
"numTotalOutboundPeers": 0,
"maxInboundPeerCount": 64,
"maxOutboundPeerCount": 8,
"addedAuthenticatedPeers": 0,
"droppedAuthenticatedPeers": 0,
"p75SCPFirstToSelfLatencyNs": 121042,
"p75SCPSelfToOtherLatencyNs": 112452,
"lostSyncCount": 0,
"isValidator": false,
"outboundPeers": null
}
}
@@ -251,6 +267,7 @@ Some notable fields from this `getsurveyresult` endpoint are:
- `badResponseNodes`: List of nodes that sent a malformed response
- `topology`: Map of nodes to connection information
- `inboundPeers`/`outboundPeers`: Lists of connection information, one entry per peer
- `averageLatencyMs`: Average latency with this peer in milliseconds.
- `bytesRead`: The total number of bytes read from this peer.
- `bytesWritten`: The total number of bytes written to this peer.
- `duplicateFetchBytesRecv`: The number of bytes received that were duplicate transaction sets and quorum sets.
@@ -268,6 +285,12 @@ Some notable fields from this `getsurveyresult` endpoint are:
- `version`: stellar-core version.
- `numTotalInboundPeers`/`numTotalOutboundPeers`: The total number of inbound and outbound peers this node is connected to. The response will have a random subset of 25 connected peers per direction (inbound/outbound). These fields tell you whether peers are missing from the response, so you can send another request to get another random subset of nodes.
- `maxInboundPeerCount`/`maxOutboundPeerCount`: The maximum number of inbound and outbound peer connections this node allows. These fields correspond to the stellar-core configurations `MAX_ADDITIONAL_PEER_CONNECTIONS` and `TARGET_PEER_CONNECTIONS`, respectively.
- `addedAuthenticatedPeers`: The number of authenticated peers added.
- `droppedAuthenticatedPeers`: The number of authenticated peers dropped.
- `p75SCPFirstToSelfLatencyNs`: 75th percentile latency, in nanoseconds, for this node to hear about new SCP messages.
- `p75SCPSelfToOtherLatencyNs`: 75th percentile latency, in nanoseconds, for other nodes to hear this node's SCP messages.
- `lostSyncCount`: The number of times this node lost sync.
- `isValidator`: Whether this node is a validator.
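
Two of the fields above often need a small amount of arithmetic before they are useful: the `...LatencyNs` values are in nanoseconds, and the `numTotal...Peers` counts have to be compared with the number of peers actually present in the response. The helpers below are a hedged sketch; `ns_to_ms` and `peers_missing` are hypothetical names, and the peer counts are expected to be extracted from the JSON by whatever tool you already use.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for reading survey results; not part of stellar-core.

# Convert a nanosecond latency to milliseconds with three decimal places.
ns_to_ms() {
  LC_ALL=C awk -v ns="$1" 'BEGIN { printf "%.3f\n", ns / 1000000 }'
}

# Print how many peers are not covered by the response.
#   $1 total peers reported (e.g. numTotalInboundPeers)
#   $2 peers actually present in the response (e.g. entries in inboundPeers)
peers_missing() {
  local total=$1 received=$2
  local missing=$(( total - received ))
  if [ "$missing" -gt 0 ]; then
    echo "$missing"
  else
    echo 0
  fi
}

# Sample values taken from the response fragment shown earlier.
echo "p75 first-to-self latency: $(ns_to_ms 121042) ms"
echo "p75 self-to-other latency: $(ns_to_ms 112452) ms"
# Made-up counts: 30 peers reported in total, 25 present in the response.
echo "inbound peers still missing: $(peers_missing 30 25)"
```

A non-zero result from `peers_missing` suggests that another request to the same node is needed to see the remaining peers.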

## Quorum Health
