rfc: distributed coordinator #1078
# RFC 009: Distributed coordinator

## Background

The Contrast Coordinator is a stateful service with a storage backend that can't be shared.
If the Coordinator pod restarts, it loses access to the secret seed and needs to be recovered manually.
This leads to predictable outages on AKS, where the nodes are replaced periodically.

## Requirements

* High-availability: node failure, pod migration, etc. must not impact the availability of the `set` or `verify` flow.
* Auto-recovery: node failure, pod migration, etc. must not require a manual recovery step.
  A newly started Coordinator that has recovered peers must eventually recover automatically.
* Consistency: the Coordinator state must be strongly consistent.

## Design

### Persistency changes

The Coordinator uses a generic key-value store interface to read from and write to the persistent store.
In order to allow distributed read and write requests, we only need to swap the implementation for one that can be used concurrently.

```golang
type Store interface {
    Get(key string) ([]byte, error)
    Set(key string, value []byte) error
    CompareAndSwap(key string, oldVal, newVal []byte) error
    Watch(key string, callback func(newVal []byte)) error // <-- New in RFC009.
}
```

There are no special requirements for `Get` and `Set`, other than basic consistency guarantees (i.e., `Get` should return what was `Set` at some point in the past).
We can use Kubernetes resources and their `GET` / `PUT` semantics to implement them.
`CompareAndSwap` needs to do an atomic update, which is supported by the `ObjectMeta.resourceVersion`[^1] field.
We also add a new `Watch` method that facilitates reacting to manifest updates.
This can be implemented as a no-op or via `inotify(7)` on the existing file-backed store, and with the native watch mechanisms for Kubernetes objects.
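
For illustration, a ConfigMap-backed `CompareAndSwap` could delegate the atomicity check to the API server roughly as follows. This is a sketch, not the actual implementation: the `k8sStore` type and the `configMapName` / `dataKey` helpers (sketched after the ConfigMap example below) are assumed names, and error handling is simplified.

```golang
// Sketch of a ConfigMap-backed CompareAndSwap; k8sStore, configMapName and
// dataKey are illustrative names.
package store

import (
    "bytes"
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

type k8sStore struct {
    client    kubernetes.Interface
    namespace string
}

func (s *k8sStore) CompareAndSwap(key string, oldVal, newVal []byte) error {
    ctx := context.Background()
    cms := s.client.CoreV1().ConfigMaps(s.namespace)
    cm, err := cms.Get(ctx, configMapName(key), metav1.GetOptions{})
    if err != nil {
        return fmt.Errorf("getting ConfigMap for %q: %w", key, err)
    }
    if !bytes.Equal([]byte(cm.Data[dataKey(key)]), oldVal) {
        return fmt.Errorf("compare-and-swap of %q: current value differs from expected value", key)
    }
    if cm.Data == nil {
        cm.Data = map[string]string{}
    }
    cm.Data[dataKey(key)] = string(newVal)
    // cm still carries the resourceVersion from the Get above, so the API server
    // rejects this update with a conflict if the object changed concurrently.
    _, err = cms.Update(ctx, cm, metav1.UpdateOptions{})
    return err
}
```

`Get` and `Set` map to plain reads and updates of the same objects, as described above.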

[RFC 004](004-recovery.md#kubernetes-objects) contains an implementation sketch that uses custom resource definitions.
However, in order to keep the implementation simple, we implement the KV store in content-addressed `ConfigMaps`.
A few considerations on using Kubernetes objects for storage:

1. Kubernetes object names need to be valid DNS labels, limiting the length to 63 characters.
   We work around that by truncating the name and storing a map of full hashes to content.
2. The `latest` transition is not content-addressable.
   We store it under a fixed name for now (`transitions/latest`), which limits us to one Coordinator deployment per namespace.
   Should a use case arise, we could make that name configurable.
3. Etcd has a limit of 1.5 MiB per value.
   This is more than enough for policies, which are the largest objects we store right now at around 100 KiB.
   However, we should make sure that the collision probability is low enough that not too many policies end up in the same ConfigMap.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: "contrast-store-$(dirname)-$(truncated basename)"
  labels:
    app.kubernetes.io/managed-by: "contrast.edgeless.systems"
data:
  "$(basename)": "$(data)"
```
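
The mapping from store keys to ConfigMap names and data keys could then look like the following sketch. Helper names and the truncation length are assumptions; the truncation only exists to stay within the 63-character DNS-label limit, while the full basename is kept as the data key.

```golang
// Illustrative helpers matching the ConfigMap template above; names and the
// truncation length are assumptions.
package store

import (
    "path"
    "strings"
)

const truncatedLen = 16 // keeps "contrast-store-<dirname>-<prefix>" well below 63 characters

// configMapName builds "contrast-store-$(dirname)-$(truncated basename)".
func configMapName(key string) string {
    dir := strings.ReplaceAll(path.Dir(key), "/", "-")
    base := path.Base(key)
    if len(base) > truncatedLen {
        base = base[:truncatedLen]
    }
    return "contrast-store-" + dir + "-" + base
}

// dataKey is the full basename, so values whose truncated names collide are
// still stored under distinct keys in the same ConfigMap.
func dataKey(key string) string {
    return path.Base(key)
}
```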

[^1]: <https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#concurrency-control-and-consistency>

### Recovery mode

The Coordinator is said to be in *recovery mode* if its state object (containing manifest and seed engine) is unset.
Recovery mode is entered in the following situations:

* The Coordinator starts up in recovery mode.
* A watch event arrives for the latest manifest, and that manifest is not the current one.
* Syncing the latest manifest during an API call reveals a new manifest.

Recovery mode is exited after the state has been set by one of:

* A successful [peer recovery](#peer-recovery).
* A successful recovery through the `userapi.Recover` RPC.
* A successful initial `userapi.SetManifest` RPC.

### Service changes

Coordinators are going to recover themselves from peers that are already recovered (or set).
This means that we need to be able to distinguish Coordinators that have secrets from Coordinators that have none.
Since we're running in Kubernetes, we can leverage the built-in probe types.

We add a few new paths to the existing HTTP server for metrics, supporting the following checks (returning status code 503 on failure; a handler sketch follows the list):

* `/probe/startup`: returns 200 when all ports are serving and [peer recovery](#peer-recovery) was attempted once.
* `/probe/liveness`: returns 200 unless the Coordinator is recovered but can't read the transaction store.
* `/probe/readiness`: returns 200 if the Coordinator is not in recovery mode.
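
A minimal handler sketch for these probes, assuming a hypothetical `Coordinator` type with `portsServing`, `recoveryAttempted`, `inRecoveryMode`, and `storeReadable` helpers:

```golang
// Illustrative probe handlers; the Coordinator type and its helper methods are assumptions.
package coordinator

import "net/http"

type Coordinator struct{ /* ... */ }

// Placeholders standing in for the real state tracking.
func (c *Coordinator) portsServing() bool      { return true }
func (c *Coordinator) recoveryAttempted() bool { return true }
func (c *Coordinator) inRecoveryMode() bool    { return false }
func (c *Coordinator) storeReadable() bool     { return true }

func registerProbes(mux *http.ServeMux, c *Coordinator) {
    probe := func(ok func() bool) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            if !ok() {
                w.WriteHeader(http.StatusServiceUnavailable) // 503 on failure
                return
            }
            w.WriteHeader(http.StatusOK)
        }
    }
    // Startup: all ports serving and peer recovery attempted at least once.
    mux.Handle("/probe/startup", probe(func() bool { return c.portsServing() && c.recoveryAttempted() }))
    // Liveness: only fails if the Coordinator is recovered but can't read the transaction store.
    mux.Handle("/probe/liveness", probe(func() bool { return c.inRecoveryMode() || c.storeReadable() }))
    // Readiness: ready iff the Coordinator is not in recovery mode.
    mux.Handle("/probe/readiness", probe(func() bool { return !c.inRecoveryMode() }))
}
```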

Using these probes, we expose the Coordinators in different ways, depending on the audience (a service sketch follows the list):

* A `coordinator-ready` service backed by ready Coordinators should be used by initializers and verifying clients.
* A `coordinator-peers` headless service backed by ready Coordinators should be used by Coordinators to discover peers to recover from.
  Using a headless service allows getting the individual IPs of ready Coordinators, as opposed to a `ClusterIP` service that just drops requests when there are no backends.
* The `coordinator` service now accepts unready endpoints (`publishNotReadyAddresses=true`) and should be used for `set`.
  The idea here is backwards compatibility with documented workflows, see [Alternatives considered](#alternatives-considered).
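
Expressed as client-go objects, two of the services could look roughly like this sketch; selector labels and port numbers are assumptions, not taken from the actual deployment.

```golang
// Illustrative service definitions; labels and ports are assumptions.
package coordinator

import (
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

var selector = map[string]string{"app.kubernetes.io/name": "coordinator"} // assumed label

// coordinator-peers: headless and limited to ready endpoints, used for peer discovery via DNS.
var coordinatorPeers = &corev1.Service{
    ObjectMeta: metav1.ObjectMeta{Name: "coordinator-peers"},
    Spec: corev1.ServiceSpec{
        ClusterIP: corev1.ClusterIPNone, // headless: DNS returns the IPs of ready pods
        Selector:  selector,
        Ports:     []corev1.ServicePort{{Name: "meshapi", Port: 7777}}, // port is an assumption
    },
}

// coordinator: also routes to unready endpoints so that `set` keeps working.
// The coordinator-ready service would look like coordinatorPeers, but with a regular ClusterIP.
var coordinatorService = &corev1.Service{
    ObjectMeta: metav1.ObjectMeta{Name: "coordinator"},
    Spec: corev1.ServiceSpec{
        PublishNotReadyAddresses: true,
        Selector:                 selector,
        Ports:                    []corev1.ServicePort{{Name: "userapi", Port: 1313}}, // port is an assumption
    },
}
```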

### Manifest changes

In order to allow Coordinators to recover from their peers, we need to define which workloads are allowed to recover.
There are two reasons why we want to make this explicit:

1. Allowing a Coordinator with a different policy to recover is a prerequisite for automatic updates.
2. The Coordinator policy can't be embedded into the Coordinator during build, and deriving it at runtime is inconvenient.

Thus, we introduce a new field to the manifest that specifies roles available to the verified identity.
For now, the only known role is `coordinator`, but this could be extended in the future (for example: delegate CAs).

```golang
type PolicyEntry struct {
    SANs             []string
    WorkloadSecretID string   `json:",omitempty"`
    Roles            []string `json:",omitempty"` // <-- New in RFC009.
}
```

During `generate`, we append the value of the `contrast.edgeless.systems/pod-role` annotation to the policy entry.
If there is no coordinator among the resources, we add a policy entry for the embedded coordinator policy.
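
A sketch of what that could look like during `generate`; the function signature and surrounding types are assumptions, only the annotation key is taken from the text above.

```golang
// Illustrative: copy the pod-role annotation into the policy entry during `generate`.
package generate

// PolicyEntry mirrors the manifest type shown above, repeated here to keep the sketch self-contained.
type PolicyEntry struct {
    SANs             []string
    WorkloadSecretID string   `json:",omitempty"`
    Roles            []string `json:",omitempty"`
}

// applyPodRole is an assumed helper name.
func applyPodRole(entry *PolicyEntry, podAnnotations map[string]string) {
    if role, ok := podAnnotations["contrast.edgeless.systems/pod-role"]; ok && role != "" {
        entry.Roles = append(entry.Roles, role)
    }
}
```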

### Peer recovery

The peer recovery process is attempted by Coordinators that are in [Recovery mode](#recovery-mode), at the following times (a wiring sketch follows the list):

1. Once after startup, to speed up initialization.
2. Once as a callback to the `Watch` event.
3. Periodically from a goroutine, to reconcile missed watch events.
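
The wiring of those three triggers could look roughly like this, reusing the `Store` interface from above and the hypothetical `Coordinator` type from the probe sketch; the watched key, the reconciliation interval, and the method names are assumptions.

```golang
// Illustrative wiring of the three peer-recovery triggers.
// tryPeerRecovery is sketched at the end of this section and is expected to be
// a no-op unless the Coordinator is in recovery mode.
package coordinator

import (
    "context"
    "time"
)

func (c *Coordinator) runPeerRecovery(ctx context.Context, store Store) error {
    // 1. Once after startup, to speed up initialization.
    c.tryPeerRecovery(ctx)

    // 2. Once as a callback to the Watch event (key name is an assumption).
    if err := store.Watch("transitions/latest", func(_ []byte) {
        c.tryPeerRecovery(ctx)
    }); err != nil {
        return err
    }

    // 3. Periodically, to reconcile missed watch events (interval is an assumption).
    ticker := time.NewTicker(5 * time.Minute)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-ticker.C:
            c.tryPeerRecovery(ctx)
        }
    }
}
```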

The process starts with a DNS request for the `coordinator-peers` name to discover ready Coordinators.
For each of the ready Coordinators, the recovering Coordinator calls a new `meshapi.Recover` method.

```proto
service MeshAPI {
  rpc NewMeshCert(NewMeshCertRequest) returns (NewMeshCertResponse);
  rpc Recover(RecoverRequest) returns (RecoverResponse); // <-- New in RFC009.
}

message RecoverRequest {}

message RecoverResponse {
  bytes Seed = 1;
  bytes Salt = 2;
  bytes RootCAKey = 3;
  bytes RootCACert = 4;
  bytes MeshCAKey = 5;
  bytes MeshCACert = 6;
  bytes LatestManifest = 7;
}
```

> **Review comment on lines +145 to +146:**
> I think we need to explain the security of the HA part a bit more.
> When we allow directly setting the mesh components of the Coordinator during automatic recovery, we lose protection against the Kubernetes admin/workload owner, as they can redirect traffic to themselves and "provision" a Coordinator with the values above.
> The simplest case to think about is one Coordinator needing to be recovered and the workload owner answering the recovery call from this Coordinator.
> The next time the user verifies the deployment, they see a valid chain of manifests and therefore trust the new MeshCACert.
> This allows the workload owner to man-in-the-middle the TLS connection from the data owner to the application.
> While we have excluded this threat model from our current recovery, I think HA and auto-recovery can be implemented while securing against this threat model as well, e.g., by only allowing recovery of Coordinators that have the same hashes.
> Of course, this breaks the upgrade process, but I think we have to drop Coordinator upgrades anyway in this threat model.

> **Reply:**
> Good catch, thanks! We should aim to support recovery from heterogeneous Coordinators if they are explicitly allowed by the manifest.
> An upgrading workload owner could first set a manifest including new Coordinator policies, then deploy new Coordinators, then remove old Coordinators, then remove old Coordinator policies from the manifest.
> A data owner would need to verify not only the current manifest, but also the history of allowed Coordinators.
> I think what we need are the following invariants:
> (A) should be covered sufficiently by the current proposal text, and we could modify it to achieve (B) as follows:

> **Reply:**
> I think that is a nice summary and solution.
> Though, I think we could omit the roles from the manifest if we needed to simplify it, since in the eyes of the data owner all code of all components inside the mesh/deployment must be trusted anyway.
> That is, the workload owner should never be able to impersonate any component of the deployment; otherwise, all guarantees of shielding the data owner against the workload owner are broken.
> This way, the verification of the endpoint that recovers the Coordinator could be simplified: the recovering Coordinator is first "initialized" by the running Coordinator, like any other workload.
> Then the application logic notices that it was recovered and requests the locally kept keys of the running Coordinator.
> The endpoint that is called here resides behind mTLS client authentication.
> The big trade-off here, I think, is that due to the asynchronous/split nature there is more potential for future changes to break the security; on the other hand, the two parts are conceptually quite simple.

When this method is called, the serving Coordinator:

0. Queries the current state and attaches it to the context (this already happens automatically for all `meshapi` calls).
1. Verifies that the client identity is allowed to recover by the state's manifest (i.e., has the `coordinator` role).
2. Fills the response with values from the current state (a handler sketch follows the list).
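
Assuming standard protoc-gen-go-grpc generated types for `MeshAPI`, the serving side could be sketched as follows. The state accessors, role lookup, and seed-engine getters are assumptions; only the message and field names come from the proto definition above.

```golang
// Illustrative server-side sketch; state handling and helper names are assumptions.
// meshapi refers to the generated package for the proto above (import path omitted).
package coordinator

import (
    "context"

    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)

type meshAPIServer struct {
    meshapi.UnimplementedMeshAPIServer // assumes standard protoc-gen-go-grpc output
}

func (s *meshAPIServer) Recover(ctx context.Context, _ *meshapi.RecoverRequest) (*meshapi.RecoverResponse, error) {
    // Step 0: the current state was already attached to the context by the meshapi middleware.
    state := stateFromContext(ctx) // assumed accessor

    // Step 1: the verified client identity needs the "coordinator" role in the state's manifest.
    if !state.Manifest.HasRole(peerIdentityFromContext(ctx), "coordinator") { // assumed helpers
        return nil, status.Error(codes.PermissionDenied, "peer is not allowed to recover")
    }

    // Step 2: fill the response from the current state (remaining fields filled analogously).
    return &meshapi.RecoverResponse{
        Seed:           state.SeedEngine.Seed(),
        Salt:           state.SeedEngine.Salt(),
        MeshCAKey:      state.SeedEngine.MeshCAKey(),
        MeshCACert:     state.SeedEngine.MeshCACert(),
        LatestManifest: state.RawManifest,
    }, nil
}
```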

After receiving the `RecoverResponse`, the client Coordinator:

1. Sets up a temporary `SeedEngine` with the received parameters.
2. Verifies that the received manifest is the current latest.
   If not, it stops and enters recovery mode again.
3. Updates the state with the current manifest and the temporary seed engine.

If the recovery was successful, the client Coordinator leaves the peer recovery process.
Otherwise, it continues with the next available peer, or fails the process if none are left.
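
Finally, a sketch of the client-side loop implementing the steps above; this is the `tryPeerRecovery` referenced in the wiring sketch earlier, and `callPeerRecover`, `newSeedEngine`, `manifestMatches`, and the state fields are assumptions.

```golang
// Illustrative client-side peer recovery; helper functions and state handling are assumptions.
package coordinator

import (
    "context"
    "fmt"
    "net"
)

func (c *Coordinator) tryPeerRecovery(ctx context.Context) error {
    // Discover ready peers through the headless coordinator-peers service.
    ips, err := net.DefaultResolver.LookupHost(ctx, "coordinator-peers")
    if err != nil {
        return fmt.Errorf("discovering peers: %w", err)
    }
    for _, ip := range ips {
        resp, err := c.callPeerRecover(ctx, ip) // assumed helper wrapping the meshapi.Recover RPC
        if err != nil {
            continue // try the next ready peer
        }
        // 1. Temporary seed engine from the received parameters.
        se, err := newSeedEngine(resp) // assumed constructor
        if err != nil {
            continue
        }
        // 2. The received manifest must be the current latest; otherwise, stop and
        //    stay in recovery mode.
        latest, err := c.store.Get("transitions/latest")
        if err != nil || !manifestMatches(resp.LatestManifest, latest) { // assumed comparison
            return fmt.Errorf("received manifest is not the current latest")
        }
        // 3. Commit the current manifest and the temporary seed engine to the state.
        c.setState(resp.LatestManifest, se)
        return nil
    }
    return fmt.Errorf("peer recovery failed: no recoverable peer among %d candidates", len(ips))
}
```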

## Open issues

* TODO(burgerdev): inconsistent state if `userapi.Recover` is called while a recovered Coordinator exists. Fix candidates:
  * Store the mesh cert alongside the manifest.
  * Sign the latest transition with the mesh key (and re-sign on `userapi.Recover`).

## Alternatives considered

* TODO(burgerdev): different service structure
* TODO(burgerdev): push vs. pull
* TODO(burgerdev): active mesh CA key distribution
* TODO(burgerdev): nested CAs, reuse `NewMeshCert`

> **Review comment:**
> Is the coordinator state non-sensitive?

> **Reply:**
> It does not need to be CC-secure, if that's what you mean.
> (See contrast/rfc/004-recovery.md, lines 38 to 40 at d4892d3.)