rfc: distributed coordinator #1078
Conversation
bytes MeshCAKey = 5;
bytes MeshCACert = 6;
I think we need to explain the security of the HA part a bit more. If we allow the Mesh components of the Coordinator to be set directly during automatic recovery, we lose protection against the Kubernetes admin/workload owner, because they can redirect the recovery to themselves and "provision" a coordinator with the values above.
The simplest case to think about is one Coordinator needing to be recovered and the workload owner answering the recovery call from this coordinator. The next time the user verifies the deployment, they see a valid chain of manifests and therefore trust the new MeshCACert. This allows the workload owner to man-in-the-middle the TLS connection from the data owner to the application.
While we have excluded this threat model from our current recovery, I think HA and auto-recovery can be implemented while securing against this threat model as well, e.g. by only allowing recovery of coordinators that have the same hashes. Of course, this breaks the upgrade process, but I think we have to drop coordinator upgrades anyway in this threat model.
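A minimal sketch of the "same hashes only" restriction, assuming a hypothetical serving coordinator that knows its own policy hash and can extract the peer's policy hash from the attested connection (the names and types below are illustrative, not the actual Contrast API):

```go
// Hypothetical check on the serving coordinator before answering a recovery
// request: only hand out mesh CA key material if the recovering peer runs
// under exactly the same policy hash as we do. This intentionally rules out
// recovering into a different (e.g. upgraded) coordinator image.
package recovery

import (
	"bytes"
	"fmt"
)

// RecoveryRequest carries the peer's policy hash as extracted from the
// attestation report validated during the aTLS handshake (illustrative).
type RecoveryRequest struct {
	PeerPolicyHash []byte
}

func authorizeRecovery(ownPolicyHash []byte, req *RecoveryRequest) error {
	if !bytes.Equal(ownPolicyHash, req.PeerPolicyHash) {
		return fmt.Errorf("peer policy hash %x does not match own policy hash %x: refusing to share mesh CA key", req.PeerPolicyHash, ownPolicyHash)
	}
	return nil
}
```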
Good catch, thanks!
We should aim to support recovery from heterogeneous coordinators if they are explicitly allowed by the manifest. An upgrading workload owner could first set a manifest including new coordinator policies, then deploy new coordinators, then remove old coordinators, then remove old coordinator policies from the manifest. A data owner would need to verify not only the current manifest, but also the history of allowed coordinators.
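As a rough illustration of the data owner's side, here is a sketch that walks the manifest history and requires every policy that ever held the coordinator role to be explicitly trusted; the reduced `Manifest` type is a hypothetical placeholder, not the real Contrast structure:

```go
package verify

import "fmt"

// Manifest is a hypothetical, reduced view of a manifest: policy hash -> role.
type Manifest struct {
	Roles map[string]string // policy hash (hex) -> role, e.g. "coordinator"
}

// verifyCoordinatorHistory checks that every policy which ever held the
// coordinator role in any manifest of the history is one the data owner
// explicitly trusts. Otherwise an old manifest could have allowed an
// attacker-controlled coordinator to take custody of the mesh CA key.
func verifyCoordinatorHistory(history []Manifest, trusted map[string]bool) error {
	for i, m := range history {
		for hash, role := range m.Roles {
			if role == "coordinator" && !trusted[hash] {
				return fmt.Errorf("manifest %d allows untrusted coordinator policy %s", i, hash)
			}
		}
	}
	return nil
}
```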
I think what we need are the following invariants:
- (A) A coordinator with current manifest `M` only sends key material to pods that have the `coordinator` role in `M`.
- (B) A coordinator with current manifest `M` uses only key material that it generated locally or that was received from a pod with the `coordinator` role in `M`.
(A) should be covered sufficiently by the current proposal text, and we could modify it to achieve (B) as follows:
1. Load the existing latest transition from the store, keeping the signature around but not checking it yet.
2. Fetch the corresponding manifest, but don't set the state yet.
3. Create a validator from the temporary manifest's reference values.
4. Connect to the serving coordinator and validate its reference values.
5. Check that the serving coordinator's policy corresponds to a `coordinator` role in the temp manifest.
6. Receive the `RecoverResponse`.
7. Verify the signature from (1).
8. Initialize the state with the received seed, keys, certs and the temp manifest.
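A sketch of how a recovering coordinator could implement these steps, with the store, manifest, attestation and RPC types stubbed out as hypothetical interfaces purely to show the ordering of checks:

```go
package recovery

import (
	"context"
	"fmt"
)

// Illustrative stand-ins for the real store, manifest, attestation and RPC types.
type Transition struct {
	ManifestHash string
	Signature    []byte
}

type Manifest interface {
	ReferenceValues() any
	HasCoordinatorRole(policyHash string) bool
}

type RecoverResponse struct {
	Seed, MeshCAKey, MeshCACert []byte
}

type Store interface {
	LatestTransition() (*Transition, error)
	Manifest(hash string) (Manifest, error)
}

type Peer interface {
	PolicyHash() string
	Recover(ctx context.Context) (*RecoverResponse, error)
}

func recoverFromPeer(
	ctx context.Context,
	store Store,
	dial func(ctx context.Context, refValues any) (Peer, error),
	verifySig func(seed []byte, t *Transition) error,
	initState func(resp *RecoverResponse, m Manifest) error,
) error {
	// 1. Load the latest transition; keep the signature, don't check it yet.
	transition, err := store.LatestTransition()
	if err != nil {
		return fmt.Errorf("loading latest transition: %w", err)
	}
	// 2. Fetch the corresponding manifest, but don't set any state yet.
	manifest, err := store.Manifest(transition.ManifestHash)
	if err != nil {
		return fmt.Errorf("loading manifest: %w", err)
	}
	// 3 + 4. Build a validator from the temp manifest's reference values and
	// connect to the serving coordinator, validating it during the handshake.
	peer, err := dial(ctx, manifest.ReferenceValues())
	if err != nil {
		return fmt.Errorf("connecting to serving coordinator: %w", err)
	}
	// 5. The serving coordinator's policy must have the coordinator role.
	if !manifest.HasCoordinatorRole(peer.PolicyHash()) {
		return fmt.Errorf("peer %s has no coordinator role in the temp manifest", peer.PolicyHash())
	}
	// 6. Receive the RecoverResponse.
	resp, err := peer.Recover(ctx)
	if err != nil {
		return fmt.Errorf("receiving recovery response: %w", err)
	}
	// 7. Verify the signature from step 1 using the recovered seed.
	if err := verifySig(resp.Seed, transition); err != nil {
		return fmt.Errorf("verifying transition signature: %w", err)
	}
	// 8. Only now initialize the state with seed, keys, certs and manifest.
	return initState(resp, manifest)
}
```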
I think that is a nice summary and solution. Though I think we could omit the roles from the manifest if we needed to simplify it, since in the eyes of the data owner all code of all components inside the mesh/deployment must be trusted anyway. That is, the workload owner must never be able to impersonate any component of the deployment; otherwise all guarantees of shielding the data owner against the workload owner are broken.
This way, the verification of the endpoint that recovers the coordinator could be simplified: the recovering coordinator is first "initialized" by the running coordinator, like any other workload. Then the application logic notices that it was recovered and requests the locally kept keys of the running coordinator. The endpoint that is called here resides behind mTLS client auth.
The big trade-off here, I think, is that due to the asynchronous/split nature there is more potential for future changes to break the security; on the other hand, the two parts are conceptually quite simple.
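A rough sketch of the key handover half of this split flow, under the assumption that the endpoint simply requires a client certificate issued by the mesh CA and that some check marks the peer certificate as belonging to a coordinator (handler names and the certificate check below are hypothetical):

```go
package handover

import (
	"crypto/tls"
	"crypto/x509"
	"encoding/json"
	"net/http"
)

// keyHandoverHandler serves the locally kept mesh CA key material, but only
// on connections that presented a client certificate (enforced by the TLS
// config below) which identifies the peer as a coordinator.
func keyHandoverHandler(meshCAKey, meshCACert []byte, isCoordinatorCert func(*x509.Certificate) bool) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.TLS == nil || len(r.TLS.PeerCertificates) == 0 {
			http.Error(w, "client certificate required", http.StatusUnauthorized)
			return
		}
		if !isCoordinatorCert(r.TLS.PeerCertificates[0]) {
			http.Error(w, "not a coordinator", http.StatusForbidden)
			return
		}
		_ = json.NewEncoder(w).Encode(map[string][]byte{
			"meshCAKey":  meshCAKey,
			"meshCACert": meshCACert,
		})
	})
}

// newServer wires the handler into an HTTPS server that demands and verifies
// client certificates against the mesh CA pool.
func newServer(addr string, meshCAPool *x509.CertPool, h http.Handler) *http.Server {
	return &http.Server{
		Addr:    addr,
		Handler: h,
		TLSConfig: &tls.Config{
			ClientAuth: tls.RequireAndVerifyClientCert,
			ClientCAs:  meshCAPool,
		},
	}
}
```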
## Background

The Contrast Coordinator is a stateful service with a backend storage that can't be shared.
Is the coordinator state non-sensitive?
It does not need to be CC-secure, if that's what you mean.
Lines 38 to 40 in d4892d3

The list of state transitions needs to be checked for integrity.
Otherwise, an attacker that can manipulate the transition objects can set arbitrary manifests.
Therefore, we sign each state transition with a key derived from the secret seed.
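For illustration, a minimal sketch of deriving a dedicated signing key from the secret seed with HKDF and protecting a serialized transition with an HMAC; the actual derivation scheme and signature format in Contrast may differ:

```go
package transitions

import (
	"crypto/hmac"
	"crypto/sha256"
	"io"

	"golang.org/x/crypto/hkdf"
)

// deriveTransitionSigningKey derives a dedicated MAC key from the secret seed
// so the key used for transition integrity is domain-separated from other
// seed-derived secrets. (Illustrative; the real derivation may differ.)
func deriveTransitionSigningKey(seed []byte) ([]byte, error) {
	key := make([]byte, 32)
	kdf := hkdf.New(sha256.New, seed, nil, []byte("contrast/transition-signing"))
	if _, err := io.ReadFull(kdf, key); err != nil {
		return nil, err
	}
	return key, nil
}

// signTransition MACs the serialized transition object.
func signTransition(key, transitionBytes []byte) []byte {
	mac := hmac.New(sha256.New, key)
	mac.Write(transitionBytes)
	return mac.Sum(nil)
}

// verifyTransition checks the MAC in constant time; a failure means the
// transition list was tampered with and must not be trusted.
func verifyTransition(key, transitionBytes, signature []byte) bool {
	return hmac.Equal(signature, signTransition(key, transitionBytes))
}
```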