Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rfc: distributed coordinator #1078

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from
Draft

rfc: distributed coordinator #1078

wants to merge 1 commit into from

Conversation

burgerdev
Copy link
Contributor

No description provided.

@burgerdev burgerdev added the no changelog PRs not listed in the release notes label Dec 17, 2024
Comment on lines +145 to +146
bytes MeshCAKey = 5;
bytes MeshCACert = 6;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to explain the security of the HA part a bit more. I think when we allow to directly set the Mesh components of the Coordinator during automatic recovery we loose protection against the Kubernetes admin/workload owner as they can redirect to themself and "provision" a coordinator with the values above.

The simplest case to think about is one Coordinator needing to be recovered and the workload owner answering the recovery call from this coordinator. The next time the user verifies the deployment, it sees a valid chain of manifests and therefore trusts the new MestCACert. This allows the workload owner to man-in-the-middle the TLS connection from the data owner to the application.

While we have excluded this threat model from our current recovery, I think HA and auto-recovery can be implemented while securing against this threat model as well e.g., via only allowing recovery of coordinator that have the same hashes. Of course, this breaks the upgrade process, but I think we have to drop coordinator upgrades anyway in this threat model.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks!

We should aim to support recovery from heterogeneous coordinators if they are explicitly allowed by the manifest. An upgrading workload owner could first set a manifest including new coordinator policies, then deploy new coordinators, then remove old coordinators, then remove old coordinator policies from the manifest. A data owner would need to verify not only the current manifest, but also the history of allowed coordinators.

I think what we need are the following invariants:

  • (A) A coordinator with current manifest M only sends key material to pods that have the coordinator role in M.
  • (B) A coordinator with current manifest M uses only key material that it generated locally or that was received from a pod with the coordinator role in M.

(A) should be covered sufficiently by the current proposal text, and we could modify it to achieve (B) as follows:

  1. Load the existing latest transition from the store, keeping the signature around but not checking it yet.
  2. Fetch the corresponding manifest, but don't set the state yet.
  3. Create a validator from the temporary manifest's reference values.
  4. Connect to the serving coordinator and validate its reference values.
  5. Check that the serving coordinator's policy corresponds to a coordinator role in the temp manifest.
  6. Receive the RecoverResponse.
  7. Verify the signature from (1).
  8. Initialize the state with received seed, keys, certs and the temp manifest.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is a nice summary and solution. Though, I think we could omit the roles from the manifest if we'd need to simplify it, since in the eyes of the data owner all code of all components inside the mesh/deployment must be trusted anyway. I.e., the workload owner should never be able to impersonate any component from the deployment, then all guarantees of shielding the data owner against the workload owner are broken.

This way the verification of the endpoint that recovers the coordinator could be simplified so that the recovering coordinator is first "initialized" by the running coordinator, like any other workload. Then the application logic notices that it was recovered and requests the locally kept keys of the running coordinator. The endpoint that is called here resides behind mTLS client auth.

The big trade off here, I think, is due to the asynchronous/split nature there is more potential that future changes break the security on the other hand I think the two parts are conceptually quite simple.


## Background

The Contrast Coordinator is a stateful service with a backend storage that can't be shared.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the coordinator state non-sensitive?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not need to be CC-secure, if that's what you mean.

The list of state transitions needs to be checked for integrity.
Otherwise, an attacker that can manipulate the transition objects can set arbitrary manifests.
Therefore, we sign each state transition with a key derived from the secret seed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no changelog PRs not listed in the release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants