rfc: distributed coordinator #1078
# RFC 009: Distributed coordinator

## Background

The Contrast Coordinator is a stateful service with a storage backend that can't be shared.
If the Coordinator pod restarts, it loses access to the secret seed and needs to be recovered manually.
This leads to predictable outages on AKS, where the nodes are replaced periodically.

## Requirements

* High-availability: node failure, pod migration, etc. must not impact the availability of the `set` or `verify` flow.
* Auto-recovery: node failure, pod migration, etc. must not require a manual recovery step.
  A newly started Coordinator that has recovered peers must eventually recover automatically.
* Consistency: the Coordinator state must be strongly consistent.

## Design

### Persistency changes

The Coordinator uses a generic key-value store interface to read from and write to the persistent store.
In order to allow distributed read and write requests, we only need to swap the implementation for one that can be used concurrently.

```golang
type Store interface {
    Get(key string) ([]byte, error)
    Set(key string, value []byte) error
    CompareAndSwap(key string, oldVal, newVal []byte) error
    Watch(key string, callback func(newVal []byte)) error // <-- New in RFC009.
}
```

There are no special requirements for `Get` and `Set`, other than basic consistency guarantees (i.e., `Get` should return what was `Set` at some point in the past).
We can use Kubernetes resources and their `GET` / `PUT` semantics to implement them.
`CompareAndSwap` needs to do an atomic update, which is supported by the `ObjectMeta.resourceVersion`[^1] field.
We also add a new `Watch` method that facilitates reacting to manifest updates.
This can be implemented as a no-op or via `inotify(7)` on the existing file-backed store, and with the native watch mechanisms for Kubernetes objects.
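
For illustration, a ConfigMap-backed `CompareAndSwap` could delegate the atomicity check to the API server roughly as follows. This is a sketch, not the actual implementation: the `k8sStore` type and the `configMapName` / `dataKey` helpers (sketched after the ConfigMap example below) are assumed names, and error handling is simplified.

```golang
// Sketch of a ConfigMap-backed CompareAndSwap; k8sStore, configMapName and
// dataKey are illustrative names.
package store

import (
    "bytes"
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

type k8sStore struct {
    client    kubernetes.Interface
    namespace string
}

func (s *k8sStore) CompareAndSwap(key string, oldVal, newVal []byte) error {
    ctx := context.Background()
    cms := s.client.CoreV1().ConfigMaps(s.namespace)
    cm, err := cms.Get(ctx, configMapName(key), metav1.GetOptions{})
    if err != nil {
        return fmt.Errorf("getting ConfigMap for %q: %w", key, err)
    }
    if !bytes.Equal([]byte(cm.Data[dataKey(key)]), oldVal) {
        return fmt.Errorf("compare-and-swap of %q: current value differs from expected value", key)
    }
    if cm.Data == nil {
        cm.Data = map[string]string{}
    }
    cm.Data[dataKey(key)] = string(newVal)
    // cm still carries the resourceVersion from the Get above, so the API server
    // rejects this update with a conflict if the object changed concurrently.
    _, err = cms.Update(ctx, cm, metav1.UpdateOptions{})
    return err
}
```

`Get` and `Set` map to plain reads and updates of the same objects, as described above.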

[RFC 004](004-recovery.md#kubernetes-objects) contains an implementation sketch that uses custom resource definitions.
However, in order to keep the implementation simple, we implement the KV store in content-addressed `ConfigMaps`.
A few considerations on using Kubernetes objects for storage:

1. Kubernetes object names need to be valid DNS labels, limiting the length to 63 characters.
   We work around that by truncating the name and storing a map of full hashes to content.
2. The `latest` transition is not content-addressable.
   We store it under a fixed name for now (`transitions/latest`), which limits us to one Coordinator deployment per namespace.
   Should a use case arise, we could make that name configurable.
3. Etcd has a limit of 1.5 MiB per value.
   This is more than enough for policies, which are the largest objects we store right now at around 100 KiB.
   However, we should make sure that the collision probability is low enough that not too many policies end up in the same ConfigMap.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: "contrast-store-$(dirname)-$(truncated basename)"
  labels:
    app.kubernetes.io/managed-by: "contrast.edgeless.systems"
data:
  "$(basename)": "$(data)"
```
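
The mapping from store keys to ConfigMap names and data keys could then look like the following sketch. Helper names and the truncation length are assumptions; the truncation only exists to stay within the 63-character DNS-label limit, while the full basename is kept as the data key.

```golang
// Illustrative helpers matching the ConfigMap template above; names and the
// truncation length are assumptions.
package store

import (
    "path"
    "strings"
)

const truncatedLen = 16 // keeps "contrast-store-<dirname>-<prefix>" well below 63 characters

// configMapName builds "contrast-store-$(dirname)-$(truncated basename)".
func configMapName(key string) string {
    dir := strings.ReplaceAll(path.Dir(key), "/", "-")
    base := path.Base(key)
    if len(base) > truncatedLen {
        base = base[:truncatedLen]
    }
    return "contrast-store-" + dir + "-" + base
}

// dataKey is the full basename, so values whose truncated names collide are
// still stored under distinct keys in the same ConfigMap.
func dataKey(key string) string {
    return path.Base(key)
}
```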

[^1]: <https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#concurrency-control-and-consistency>

### Recovery mode

The Coordinator is said to be in *recovery mode* if its state object (containing manifest and seed engine) is unset.
Recovery mode is entered in the following situations:

* The Coordinator starts up in recovery mode.
* A watch event arrives for the latest manifest, and that manifest is not the current one.
* Syncing the latest manifest during an API call reveals a new manifest.

Recovery mode is exited after the state has been set by one of:

* A successful [peer recovery](#peer-recovery).
* A successful recovery through the `userapi.Recover` RPC.
* A successful initial `userapi.SetManifest` RPC.

### Service changes

Coordinators are going to recover themselves from peers that are already recovered (or set).
This means that we need to be able to distinguish Coordinators that have secrets from Coordinators that have none.
Since we're running in Kubernetes, we can leverage the built-in probe types.

We add a few new paths to the existing HTTP server for metrics, supporting the following checks (returning status code 503 on failure; a handler sketch follows the list):

* `/probe/startup`: returns 200 when all ports are serving and [peer recovery](#peer-recovery) was attempted once.
* `/probe/liveness`: returns 200 unless the Coordinator is recovered but can't read the transaction store.
* `/probe/readiness`: returns 200 if the Coordinator is not in recovery mode.
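
A minimal handler sketch for these probes, assuming a hypothetical `Coordinator` type with `portsServing`, `recoveryAttempted`, `inRecoveryMode`, and `storeReadable` helpers:

```golang
// Illustrative probe handlers; the Coordinator type and its helper methods are assumptions.
package coordinator

import "net/http"

type Coordinator struct{ /* ... */ }

// Placeholders standing in for the real state tracking.
func (c *Coordinator) portsServing() bool      { return true }
func (c *Coordinator) recoveryAttempted() bool { return true }
func (c *Coordinator) inRecoveryMode() bool    { return false }
func (c *Coordinator) storeReadable() bool     { return true }

func registerProbes(mux *http.ServeMux, c *Coordinator) {
    probe := func(ok func() bool) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            if !ok() {
                w.WriteHeader(http.StatusServiceUnavailable) // 503 on failure
                return
            }
            w.WriteHeader(http.StatusOK)
        }
    }
    // Startup: all ports serving and peer recovery attempted at least once.
    mux.Handle("/probe/startup", probe(func() bool { return c.portsServing() && c.recoveryAttempted() }))
    // Liveness: only fails if the Coordinator is recovered but can't read the transaction store.
    mux.Handle("/probe/liveness", probe(func() bool { return c.inRecoveryMode() || c.storeReadable() }))
    // Readiness: ready iff the Coordinator is not in recovery mode.
    mux.Handle("/probe/readiness", probe(func() bool { return !c.inRecoveryMode() }))
}
```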

Using these probes, we expose the Coordinators in different ways, depending on the audience (a service sketch follows the list):

* A `coordinator-ready` service backed by ready Coordinators should be used by initializers and verifying clients.
* A `coordinator-peers` headless service backed by ready Coordinators should be used by Coordinators to discover peers to recover from.
  Using a headless service allows getting the individual IPs of ready Coordinators, as opposed to a `ClusterIP` service that just drops requests when there are no backends.
* The `coordinator` service now accepts unready endpoints (`publishNotReadyAddresses=true`) and should be used for `set`.
  The idea here is backwards compatibility with documented workflows, see [Alternatives considered](#alternatives-considered).
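
Expressed as client-go objects, two of the services could look roughly like this sketch; selector labels and port numbers are assumptions, not taken from the actual deployment.

```golang
// Illustrative service definitions; labels and ports are assumptions.
package coordinator

import (
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

var selector = map[string]string{"app.kubernetes.io/name": "coordinator"} // assumed label

// coordinator-peers: headless and limited to ready endpoints, used for peer discovery via DNS.
var coordinatorPeers = &corev1.Service{
    ObjectMeta: metav1.ObjectMeta{Name: "coordinator-peers"},
    Spec: corev1.ServiceSpec{
        ClusterIP: corev1.ClusterIPNone, // headless: DNS returns the IPs of ready pods
        Selector:  selector,
        Ports:     []corev1.ServicePort{{Name: "meshapi", Port: 7777}}, // port is an assumption
    },
}

// coordinator: also routes to unready endpoints so that `set` keeps working.
// The coordinator-ready service would look like coordinatorPeers, but with a regular ClusterIP.
var coordinatorService = &corev1.Service{
    ObjectMeta: metav1.ObjectMeta{Name: "coordinator"},
    Spec: corev1.ServiceSpec{
        PublishNotReadyAddresses: true,
        Selector:                 selector,
        Ports:                    []corev1.ServicePort{{Name: "userapi", Port: 1313}}, // port is an assumption
    },
}
```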

### Manifest changes

In order to allow Coordinators to recover from their peers, we need to define which workloads are allowed to recover.
There are two reasons why we want to make this explicit:

1. Allowing a Coordinator with a different policy to recover is a prerequisite for automatic updates.
2. The Coordinator policy can't be embedded into the Coordinator during build, and deriving it at runtime is inconvenient.

Thus, we introduce a new field to the manifest that specifies roles available to the verified identity.
For now, the only known role is `coordinator`, but this could be extended in the future (for example: delegate CAs).

```golang
type PolicyEntry struct {
    SANs             []string
    WorkloadSecretID string   `json:",omitempty"`
    Roles            []string `json:",omitempty"` // <-- New in RFC009.
}
```

During `generate`, we append the value of the `contrast.edgeless.systems/pod-role` annotation to the policy entry.
If there is no coordinator among the resources, we add a policy entry for the embedded coordinator policy.
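
A sketch of what that could look like during `generate`; the function signature and surrounding types are assumptions, only the annotation key is taken from the text above.

```golang
// Illustrative: copy the pod-role annotation into the policy entry during `generate`.
package generate

// PolicyEntry mirrors the manifest type shown above, repeated here to keep the sketch self-contained.
type PolicyEntry struct {
    SANs             []string
    WorkloadSecretID string   `json:",omitempty"`
    Roles            []string `json:",omitempty"`
}

// applyPodRole is an assumed helper name.
func applyPodRole(entry *PolicyEntry, podAnnotations map[string]string) {
    if role, ok := podAnnotations["contrast.edgeless.systems/pod-role"]; ok && role != "" {
        entry.Roles = append(entry.Roles, role)
    }
}
```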

### Peer recovery

The peer recovery process is attempted by Coordinators that are in [Recovery mode](#recovery-mode), at the following times (a wiring sketch follows the list):

1. Once after startup, to speed up initialization.
2. Once as a callback to the `Watch` event.
3. Periodically from a goroutine, to reconcile missed watch events.
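
The wiring of those three triggers could look roughly like this, reusing the `Store` interface from above and the hypothetical `Coordinator` type from the probe sketch; the watched key, the reconciliation interval, and the method names are assumptions.

```golang
// Illustrative wiring of the three peer-recovery triggers.
// tryPeerRecovery is sketched at the end of this section and is expected to be
// a no-op unless the Coordinator is in recovery mode.
package coordinator

import (
    "context"
    "time"
)

func (c *Coordinator) runPeerRecovery(ctx context.Context, store Store) error {
    // 1. Once after startup, to speed up initialization.
    c.tryPeerRecovery(ctx)

    // 2. Once as a callback to the Watch event (key name is an assumption).
    if err := store.Watch("transitions/latest", func(_ []byte) {
        c.tryPeerRecovery(ctx)
    }); err != nil {
        return err
    }

    // 3. Periodically, to reconcile missed watch events (interval is an assumption).
    ticker := time.NewTicker(5 * time.Minute)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-ticker.C:
            c.tryPeerRecovery(ctx)
        }
    }
}
```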

The process starts with a DNS request for the `coordinator-peers` name to discover ready Coordinators.
For each of the ready Coordinators, the recovering Coordinator calls a new `meshapi.Recover` method.

```proto
service MeshAPI {
  rpc NewMeshCert(NewMeshCertRequest) returns (NewMeshCertResponse);
  rpc Recover(RecoverRequest) returns (RecoverResponse); // <-- New in RFC009.
}

message RecoverRequest {}

message RecoverResponse {
  bytes Seed = 1;
  bytes Salt = 2;
  bytes RootCAKey = 3;
  bytes RootCACert = 4;
  bytes MeshCAKey = 5;
  bytes MeshCACert = 6;
  bytes LatestManifest = 7;
}
```

> **Review comment on lines +145 to +146:**
> I think we need to explain the security of the HA part a bit more.
> When we allow directly setting the mesh components of the Coordinator during automatic recovery, we lose protection against the Kubernetes admin/workload owner, as they can redirect traffic to themselves and "provision" a Coordinator with the values above.
> The simplest case to think about is one Coordinator needing to be recovered and the workload owner answering the recovery call from this Coordinator.
> The next time the user verifies the deployment, they see a valid chain of manifests and therefore trust the new MeshCACert.
> This allows the workload owner to man-in-the-middle the TLS connection from the data owner to the application.
> While we have excluded this threat model from our current recovery, I think HA and auto-recovery can be implemented while securing against this threat model as well, e.g., by only allowing recovery of Coordinators that have the same hashes.
> Of course, this breaks the upgrade process, but I think we have to drop Coordinator upgrades anyway in this threat model.

> **Reply:**
> Good catch, thanks! We should aim to support recovery from heterogeneous Coordinators if they are explicitly allowed by the manifest.
> An upgrading workload owner could first set a manifest including new Coordinator policies, then deploy new Coordinators, then remove old Coordinators, then remove old Coordinator policies from the manifest.
> A data owner would need to verify not only the current manifest, but also the history of allowed Coordinators.
> I think what we need are the following invariants:
> (A) should be covered sufficiently by the current proposal text, and we could modify it to achieve (B) as follows:

> **Reply:**
> I think that is a nice summary and solution.
> Though, I think we could omit the roles from the manifest if we needed to simplify it, since in the eyes of the data owner all code of all components inside the mesh/deployment must be trusted anyway.
> That is, the workload owner should never be able to impersonate any component of the deployment; otherwise, all guarantees of shielding the data owner against the workload owner are broken.
> This way, the verification of the endpoint that recovers the Coordinator could be simplified: the recovering Coordinator is first "initialized" by the running Coordinator, like any other workload.
> Then the application logic notices that it was recovered and requests the locally kept keys of the running Coordinator.
> The endpoint that is called here resides behind mTLS client authentication.
> The big trade-off here, I think, is that due to the asynchronous/split nature there is more potential for future changes to break the security; on the other hand, the two parts are conceptually quite simple.

When this method is called, the serving Coordinator:

0. Queries the current state and attaches it to the context (this already happens automatically for all `meshapi` calls).
1. Verifies that the client identity is allowed to recover by the state's manifest (i.e., has the `coordinator` role).
2. Fills the response with values from the current state (a handler sketch follows the list).
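
Assuming standard protoc-gen-go-grpc generated types for `MeshAPI`, the serving side could be sketched as follows. The state accessors, role lookup, and seed-engine getters are assumptions; only the message and field names come from the proto definition above.

```golang
// Illustrative server-side sketch; state handling and helper names are assumptions.
// meshapi refers to the generated package for the proto above (import path omitted).
package coordinator

import (
    "context"

    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)

type meshAPIServer struct {
    meshapi.UnimplementedMeshAPIServer // assumes standard protoc-gen-go-grpc output
}

func (s *meshAPIServer) Recover(ctx context.Context, _ *meshapi.RecoverRequest) (*meshapi.RecoverResponse, error) {
    // Step 0: the current state was already attached to the context by the meshapi middleware.
    state := stateFromContext(ctx) // assumed accessor

    // Step 1: the verified client identity needs the "coordinator" role in the state's manifest.
    if !state.Manifest.HasRole(peerIdentityFromContext(ctx), "coordinator") { // assumed helpers
        return nil, status.Error(codes.PermissionDenied, "peer is not allowed to recover")
    }

    // Step 2: fill the response from the current state (remaining fields filled analogously).
    return &meshapi.RecoverResponse{
        Seed:           state.SeedEngine.Seed(),
        Salt:           state.SeedEngine.Salt(),
        MeshCAKey:      state.SeedEngine.MeshCAKey(),
        MeshCACert:     state.SeedEngine.MeshCACert(),
        LatestManifest: state.RawManifest,
    }, nil
}
```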

After receiving the `RecoverResponse`, the client Coordinator:

1. Sets up a temporary `SeedEngine` with the received parameters.
2. Verifies that the received manifest is the current latest.
   If not, it stops and enters recovery mode again.
3. Updates the state with the current manifest and the temporary seed engine.

If the recovery was successful, the client Coordinator leaves the peer recovery process.
Otherwise, it continues with the next available peer, or fails the process if none are left.
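
Finally, a sketch of the client-side loop implementing the steps above; this is the `tryPeerRecovery` referenced in the wiring sketch earlier, and `callPeerRecover`, `newSeedEngine`, `manifestMatches`, and the state fields are assumptions.

```golang
// Illustrative client-side peer recovery; helper functions and state handling are assumptions.
package coordinator

import (
    "context"
    "fmt"
    "net"
)

func (c *Coordinator) tryPeerRecovery(ctx context.Context) error {
    // Discover ready peers through the headless coordinator-peers service.
    ips, err := net.DefaultResolver.LookupHost(ctx, "coordinator-peers")
    if err != nil {
        return fmt.Errorf("discovering peers: %w", err)
    }
    for _, ip := range ips {
        resp, err := c.callPeerRecover(ctx, ip) // assumed helper wrapping the meshapi.Recover RPC
        if err != nil {
            continue // try the next ready peer
        }
        // 1. Temporary seed engine from the received parameters.
        se, err := newSeedEngine(resp) // assumed constructor
        if err != nil {
            continue
        }
        // 2. The received manifest must be the current latest; otherwise, stop and
        //    stay in recovery mode.
        latest, err := c.store.Get("transitions/latest")
        if err != nil || !manifestMatches(resp.LatestManifest, latest) { // assumed comparison
            return fmt.Errorf("received manifest is not the current latest")
        }
        // 3. Commit the current manifest and the temporary seed engine to the state.
        c.setState(resp.LatestManifest, se)
        return nil
    }
    return fmt.Errorf("peer recovery failed: no recoverable peer among %d candidates", len(ips))
}
```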

## Open issues

* TODO(burgerdev): inconsistent state if `userapi.Recover` is called while a recovered Coordinator exists. Fix candidates:
  * Store the mesh cert alongside the manifest.
  * Sign the latest transition with the mesh key (and re-sign on `userapi.Recover`).

## Alternatives considered

* TODO(burgerdev): different service structure
* TODO(burgerdev): push vs. pull
* TODO(burgerdev): active mesh CA key distribution
* TODO(burgerdev): nested CAs, reuse `NewMeshCert`

> **Review comment:**
> Is the coordinator state non-sensitive?

> **Reply:**
> It does not need to be CC-secure, if that's what you mean.
> (See contrast/rfc/004-recovery.md, lines 38 to 40 at d4892d3.)