docs: High availability explanation page (#940)
bschimke95 authored Jan 10, 2025
1 parent 6087231 commit 37d86c1
Showing 2 changed files with 45 additions and 0 deletions.
44 changes: 44 additions & 0 deletions docs/src/snap/explanation/high-availability.md
@@ -0,0 +1,44 @@
# High availability

High availability (HA) is a core feature of {{ product }}, ensuring that
a Kubernetes cluster remains operational and resilient, even when nodes or
critical components encounter failures. This capability is crucial for
maintaining continuous service for applications and workloads running in
production environments.

HA is automatically enabled in {{ product }} for clusters with three or
more nodes, regardless of the deployment method. By distributing key
components across multiple nodes, HA reduces the risk of downtime and service
interruptions, offering built-in redundancy and fault tolerance.
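On a cluster deployed from the snap, you can verify this from any node. A
minimal check, assuming the `k8s` CLI that ships with the snap (the exact
output fields may vary between releases):

```bash
# Report cluster status; once three or more nodes have joined, the
# datastore section typically lists additional voter members for HA.
sudo k8s status
```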

## Key components of a highly available cluster

A highly available Kubernetes cluster exhibits the following characteristics:

### 1. **Multiple nodes for redundancy**

Having multiple nodes in the cluster ensures workload distribution and
redundancy. If one node fails, its workloads are automatically rescheduled
onto other available nodes without disrupting services. This node-level
redundancy minimizes the impact of hardware or network failures.
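As an illustration of node-level redundancy, a workload can ask the scheduler
to spread its replicas across distinct nodes. The manifest below is a
hypothetical sketch (the `web` name and `nginx` image are placeholders, and it
assumes the built-in `k8s kubectl` wrapper):

```bash
# Hypothetical example: run three replicas spread across nodes, so losing
# one node leaves two replicas serving traffic while the third reschedules.
sudo k8s kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: web
      containers:
        - name: web
          image: nginx
EOF
```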

### 2. **Control plane redundancy**

The control plane manages the cluster's state and operations. For high
availability, the control plane components, such as the API server, scheduler,
and controller-manager, are distributed across multiple nodes. This prevents a
single point of failure from rendering the cluster inoperable.

### 3. **Highly available datastore**

By default, {{ product }} uses **dqlite** to manage the Kubernetes
cluster state. Dqlite uses the Raft consensus algorithm for leader
election and voting, ensuring reliable data replication and failover.
When a leader node fails, a new leader is elected seamlessly
without administrative intervention. This mechanism allows the cluster to
remain operational even in the event of node failures. More details on
replication and leader election can be found in
the [dqlite replication documentation][Dqlite-replication].
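Note that Raft needs a majority of voters (floor(n/2) + 1) to elect a leader
and commit writes, which is why HA begins at three nodes: a three-node cluster
keeps quorum with one node down (2 of 3 votes), while a five-node cluster
tolerates two simultaneous failures (3 of 5).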

<!-- LINKS -->
[Dqlite-replication]: https://dqlite.io/docs/explanation/replication
1 change: 1 addition & 0 deletions docs/src/snap/explanation/index.md
@@ -18,6 +18,7 @@ channels
clustering
ingress
epa
high-availability
security
cis
```
