External Certificate Manager #135

Open
wants to merge 1 commit into
base: main
196 changes: 196 additions & 0 deletions 087-external-certificate-manager.md
@@ -0,0 +1,196 @@
# External Certificate Manager

This proposal aims to allow Strimzi users to use an external certificate manager, specifically [cert-manager](https://cert-manager.io/), to manage certificates.

## Current situation

There are two different categories of certificates that Strimzi handles:
* The term "cluster" refers to certificates that are issued for the Strimzi components:
  * ZooKeeper nodes
  * Kafka nodes
  * Cluster, User and Topic operators
  * Cruise Control
  * Kafka Exporter
* The term "clients" refers to certificates that are issued for user applications using the User Operator, or through another external mechanism chosen by the user.

For both categories, to provide a secure, TLS-enabled setup by default when deploying Kafka clusters, Strimzi integrated its own CA operations into the Cluster Operator.
The Cluster Operator accomplishes this by using openssl to generate self-signed root CA certificates and private keys which it then uses to directly sign end-entity (EE) certificates.
Each CA certificate has a path length constraint (pathlen) of zero, which means it cannot sign any intermediate CA certificate.
A cluster CA and clients CA are generated.
These CAs are only used for Kafka clusters and a unique instance of each CA is used for each Kafka cluster.

In addition to Strimzi fully managing the certificates as described above, there are options for users to partially manage the certificates:
* Users can [install and use their own CA certificate and private keys](https://strimzi.io/docs/operators/latest/deploying#installing-your-own-ca-certificates-str), instead of using the defaults generated by the Cluster Operator.
When using this option, both the CA certificate and private key must be provided, and Strimzi still issues the end-entity (EE) certificates that are presented by the components.
* Users can [provide custom listener certificates](https://strimzi.io/docs/operators/latest/deploying#proc-installing-certs-per-listener-str) for TLS encryption.
This option only affects how user applications connect to Kafka.
It does not change how the Strimzi components connect to Kafka, or how the Kafka brokers connect to each other.

None of the existing options allow the certificate management to be done completely separately from Strimzi.

## Motivation

Strimzi's primary purpose is to provide a way to run Apache Kafka clusters on Kubernetes.
Although it is nice that it can manage certificates, it would be beneficial if the certificates could be managed by a dedicated certificate manager, such as [cert-manager](https://cert-manager.io/).
This is a feature that is often requested, especially because many organizations have specific compliance requirements with regard to certificates, for example:
* Requiring that CA private keys are not shared.
* Requiring that self-signed certificates cannot be used.
Member

I don't think this really helps because the CA will be anyway bootstrapped as self-signed as it is today in most cases, and there is not much we can do about it.


## Proposal

Strimzi will be updated to allow users to specify that certificates should be issued by an external certificate manager, rather than issued by the Cluster Operator.
Contributor

Suggested change
- Strimzi will be updated to allow users to specify that certificates should be issued by an external certificate manager, rather than issued by the Cluster Operator.
+ Strimzi will be updated to allow users to specify that certificates should be issued by an external certificate manager instead of the Cluster Operator.

This proposal specifically describes how this would work for cert-manager; however, the user API for configuration will be written in a way that does not prevent other external certificate managers from being added in the future.

The proposal makes a few assumptions:
* Strimzi will not be responsible for installing cert-manager, but we will document the versions of cert-manager that we have tested with.
Contributor

Suggested change
- * Strimzi will not be responsible for installing cert-manager, but we will document the versions of cert-manager that we have tested with.
+ * Strimzi will not be responsible for installing cert-manager, but we will document the supported versions of cert-manager that we have tested with.

* Strimzi will not be responsible for creating `Issuer` or `ClusterIssuer` custom resources.
Member

Guess this is because we want to keep for self-signed certs our current way and not to add another option that will mostly add just support burden?

Contributor Author

This is because there are lots of different issuers that work with cert-manager. So rather than Strimzi having to actively support all the different types, I've proposed that the user creates the Issuer or ClusterIssuer and handles supplying a Secret with the trusted certificates for the issuer they have chosen. That way Strimzi can work with any cert-manager issuer integrations.

Contributor

Will we need to provide any guidelines on conventions in the docs?

Member

I think we just need to mention it in the docs properly, as users could be confused by other projects where the cert-manager integration works without creating any Issuer (afaiu the operator creates a self-signed Issuer when one is not created by users).

* Strimzi will create `Certificate` custom resources and will allow the user to influence the contents of these resources by exposing options in the `Kafka` custom resource.
* Strimzi will not directly interact with the lower level `CertificateRequest` and `CertificateSigningRequest` resources.
* When Strimzi creates a `Certificate` custom resource, cert-manager will issue the certificate within a reasonable amount of time such that Strimzi can wait during the reconciliation.
Member

we are talking about generating the CA certificate right? Can we make it explicit here?

Contributor

Any potential issues with time-outs and retries need to be mentioned here?

* Users will provide to the Strimzi Cluster Operator the CA certificates it must trust for the current issuer via a Kubernetes Secret.

### API

The existing `spec.clusterCa` and `spec.clientsCa` fields will be extended to add a new property `certificateIssuer`:

```yaml
spec:
  clusterCa:
    validityDays: <integer> # notBefore=now, notAfter=now + validityDays
    generateCertificateAuthority: <boolean>
    generateSecretOwnerReference: <boolean>
    renewalDays: <integer> # days before notAfter when we should start renewal
    certificateExpirationPolicy: <renew-certificate|replace-key>
    certificateIssuer:
      type: <internal|cert-manager.io> # (1)
      issuerRef: # (2)
        name: <string>
        kind: <Issuer|ClusterIssuer>
        group: <string> # cert-manager.io by default
```

Contributor

The issuerRef element looks to be cert-manager specific rather than something relevant to any external cert manager which could become an issue when supporting other external managers (as mentioned above).
Perhaps the certificateIssuer should instead have a certManager specific sub-element, and add different similar elements in the future that are specific to other certificate managers, i.e. something like:

clusterCa:
    certificateIssuer:
        certManager:
            issuerRef:
                name: <string>
                kind: <Issuer|ClusterIssuer>
                group: <string> # cert-manager.io by default
        someOtherManager: <-- future addition -->
            managerSpecificConfig:
                ...
        oneOf:
        - properties
            certManager{}
            someOtherManager{}

Or alternatively just allow a map of values to be specified, but that would be less user friendly

Contributor

  1. The property certificateIssuer.issuerRef will only be used by Strimzi if certificateIssuer.type is set to cert-manager.io.

From the above phrase it looks like the intention is to make issuerRef cert-manager specific.

Comment on lines +58 to +69

Member

Should the type separation happen already at the clusterCa level? It would seem more logical to me.

Contributor Author

@scholzj Do you mean like this?

spec:
  clusterCa:
    validityDays: <integer> # notBefore=now, notAfter=now + validityDays
    generateCertificateAuthority: <boolean>
    generateSecretOwnerReference: <boolean>
    renewalDays: <integer> # days before notAfter when we should start renewal
    certificateExpirationPolicy: <renew-certificate|replace-key>
    certificateIssuerType: <internal|cert-manager.io> # (1)
    certManagerIssuerRef: # (2)
      name: <string>
      kind: <Issuer|ClusterIssuer>
      group: <string> # cert-manager.io by default

Member
@scholzj, Jan 11, 2025

Yeah. Although I would maybe call it just type instead of certificateIssuerType. It would also default to internal when not set (I hope that can be implemented in the Java api classes).

I think that would create a better abstraction as not all the fields in the CA configuration might be applicable to all issuer types.

1. If the `certificateIssuer` and `type` properties are not set it will default to `internal` and will use the existing behaviour, allowing backwards compatibility.
The option `cert-manager.io` will only be valid if `generateCertificateAuthority` is set to `false`.
2. The property `certificateIssuer.issuerRef` will only be used by Strimzi if `certificateIssuer.type` is set to `cert-manager.io`.
The `name`, `kind`, and `group` properties will be copied over into the `Certificate` custom resource Strimzi creates.
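
As a rough illustration of the two notes above, a `Kafka` resource using the proposed API might look like the sketch below. The `certificateIssuer` block follows the shape proposed in this revision and is not an existing Strimzi API; the cluster name `my-cluster` and the issuer name `my-issuer` are hypothetical.

```yaml
# Sketch only: certificateIssuer is the API proposed in this document.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster # hypothetical cluster name
spec:
  clusterCa:
    generateCertificateAuthority: false # required when type is cert-manager.io
    certificateIssuer:
      type: cert-manager.io
      issuerRef:
        name: my-issuer # hypothetical Issuer created by the user
        kind: Issuer
        group: cert-manager.io
  clientsCa:
    generateCertificateAuthority: false
    certificateIssuer:
      type: cert-manager.io
      issuerRef:
        name: my-issuer
        kind: Issuer
        group: cert-manager.io
  # kafka, entityOperator and the other required sections are omitted for brevity
```

Omitting `certificateIssuer` entirely would keep the existing `internal` behaviour.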

### User steps

To make use of this new option the user will have to:

1. Install cert-manager.
2. Create an `Issuer` or `ClusterIssuer` custom resource.
3. Create a `Secret` containing the CAs for Strimzi to trust.
Member

which CAs are we talking about here? Aren't the cluster CA and clients CA being generated via cert-manager (it relates to my previous question I guess) ... I am confused :-/

Contributor

If the secret is referenced in the Issuer or ClusterIssuer, should we create it first?

Users can optionally use [trust-manager](https://cert-manager.io/docs/trust/trust-manager/) to create this Secret, but they are responsible for installing trust-manager, creating the `Bundle` CR and annotating the resulting Secret with the Strimzi cert annotation.
Comment on lines +83 to +84
Member

Maybe we should instead (or next to this) describe the expectation of how the Secret should look? IIRC, trust-manager creates a Secret with all CAs bundled into a single file? Is that supported / expected?

Contributor

I agree that this needs some clarification. Maybe a sequence diagram with the PKI generation process including user, CO, trust-manager and cert-manager would also help.

4. Create a `Kafka` resource with `clusterCa.certificateIssuer` and/or `clientsCa.certificateIssuer` configured.

Notes:
* The `Secret` that contains the CAs will be the same `Secret` currently used, so either `<CLUSTER_NAME>-cluster-ca-cert` or `<CLUSTER_NAME>-clients-ca-cert`.
Comment on lines +87 to +88
Member

I think we should again aim here for a separation of the Strimzi used Secrets and the user-provided Secrets. I.e. the user should provide it in some custom secret and we should copy it ourselves into <CLUSTER_NAME>-cluster-ca-cert or <CLUSTER_NAME>-clients-ca-cert if needed.


### Handling trust rollout

Similar to today Strimzi will use the notion of a "generation" to determine whether to roll the cluster to pick up changes in either the `<CLUSTER_NAME>-cluster-ca-cert` or `<CLUSTER_NAME>-clients-ca-cert`.
When the user creates the CA cert Secret they must add the `strimzi.io/ca-cert-generation` annotation to it.
Member

  • Does this work with the trust-manager suggested earlier?
  • Can we work around it and manage the generation ourselves? E.g. based on the hash of the user-provided certificate detect changes and bump the generation ourselves?
  • What is the impact on the strimzi.io/ca-key-generation given we now do not have the private key secret?

If the user updates the Secret to change the certificates included they must increment the annotation to inform Strimzi it has changed.
Similar to today Strimzi will put the annotation on the pods (Kafka, ZooKeeper etc) to be able to spot when the generation has been changed.
Comment on lines +92 to +95
Contributor

Suggested change
- Similar to today Strimzi will use the notion of a "generation" to determine whether to roll the cluster to pick up changes in either the `<CLUSTER_NAME>-cluster-ca-cert` or `<CLUSTER_NAME>-clients-ca-cert`.
- When the user creates the CA cert Secret they must add the `strimzi.io/ca-cert-generation` annotation to it.
- If the user updates the Secret to change the certificates included they must increment the annotation to inform Strimzi it has changed.
- Similar to today Strimzi will put the annotation on the pods (Kafka, ZooKeeper etc) to be able to spot when the generation has been changed.
+ Strimzi will use the current process to determine whether to roll the cluster to pick up changes in the `<CLUSTER_NAME>-cluster-ca-cert` or `<CLUSTER_NAME>-clients-ca-cert` secrets.
+ When a user creates the CA certificate secret, they must add the `strimzi.io/ca-cert-generation` annotation.
+ Strimzi adds this annotation to the pods (Kafka, ZooKeeper, etc.) and uses it to detect when the secrets have changed.
+ If the user updates the secret to change the certificates, they must increment the annotation to inform Strimzi of the change.

Contributor

do we need to consider maintenance time windows at all?
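
To make the trust rollout text above more concrete, here is a minimal sketch of a user-provided cluster CA certificate Secret. The `strimzi.io/ca-cert-generation` annotation and the Secret name come from the proposal; the labels and the `ca.crt` key are assumptions based on how Strimzi documents user-supplied CA certificates today, and `my-cluster` is a hypothetical cluster name.

```yaml
# Sketch only: labels and data key are assumed to match today's custom CA layout.
apiVersion: v1
kind: Secret
metadata:
  name: my-cluster-cluster-ca-cert
  labels:
    strimzi.io/kind: Kafka
    strimzi.io/cluster: my-cluster
  annotations:
    # Start at "0" and increment each time the trusted certificates change
    strimzi.io/ca-cert-generation: "0"
type: Opaque
data:
  ca.crt: <base64-encoded PEM of the CA certificate(s) for the chosen issuer>
```

Changing `ca.crt` without incrementing the annotation would, under the proposal as written, not trigger a rolling update.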


### Issuing certificates
Contributor

Suggested change
- ### Issuing certificates
+ ### Issuing end-entity certificates


If the user has enabled this feature, when Strimzi needs to issue a certificate, instead of using the existing internal mechanism it will create a `Certificate` custom resource.
Member

Maybe we should make it clear that this is where we control the CN / SANs of the certificates?

Contributor

I think it would be very useful to have this implementation behind an interface and support a mechanism for loading alternative implementations for other external certificate managers. This would allow users to integrate with other external certificate managers.

Strimzi will wait during the reconciliation loop for the `Certificate` status to indicate that the certificate has been issued before continuing.
Member

How long will we wait? The usual operation timeout?

When issuing cluster certificates (e.g for Kafka etc), once the certificate has been issued, Strimzi will annotate the cert-manager provided Secret with the `strimzi.io/server-cert-hash` annotation with the value being the hash of the certificate in the Secret.
Contributor

issuing cluster certificates (e.g for Kafka etc) - I wonder if it would be useful to include the secret names for these certificates as an example.

Member

We do not control that Secret, so we should likely not annotate it / rely on its annotation. We need to carry it on the internal Secrets maybe?

Similar to today, Strimzi will also add this hash annotation to the pods to track whether they are mounting the latest version of the Secret.
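
For orientation, a `Certificate` resource that Strimzi might create for a Kafka broker could look roughly like the sketch below. The field names are from the cert-manager `cert-manager.io/v1` API; everything else is an assumption, including the resource and Secret names, the DNS names, and the idea that `validityDays` and `renewalDays` would map to `duration` and `renewBefore`. The proposal does not specify any of these.

```yaml
# Sketch only: names and values are hypothetical, not defined by the proposal.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-cluster-kafka-0
  namespace: kafka
spec:
  secretName: my-cluster-kafka-0-cert # Secret that cert-manager writes the key and certificate to
  commonName: my-cluster-kafka
  dnsNames:
    - my-cluster-kafka-bootstrap.kafka.svc
    - my-cluster-kafka-0.my-cluster-kafka-brokers.kafka.svc
  duration: 8760h    # assumed mapping from validityDays: 365
  renewBefore: 720h  # assumed mapping from renewalDays: 30
  issuerRef:         # copied from certificateIssuer.issuerRef in the Kafka resource
    name: my-issuer
    kind: Issuer
    group: cert-manager.io
```

Strimzi would then wait for the `Ready` condition in the `Certificate` status before continuing the reconciliation.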

### Tracking changes to cluster end-entity certificates

Cert-manager will be responsible for renewing all end-entity certificates.
When a certificate is renewed cert-manager will update the related Secret.

For cluster certificates (e.g. for Kafka etc), Strimzi will track and handle these changes using the `strimzi.io/server-cert-hash` annotation.
During the reconciliation loop, even if all cluster end-entity certificates have been issued, Strimzi will patch the certificate Secrets with the correct `strimzi.io/server-cert-hash` annotation.
The value of this annotation can then be compared with the value on the pods to determine whether the pods need to be restarted to pick up a new Secret.
Comment on lines +109 to +111
Member

Not sure I follow the need to annotate the Secrets here. Normally, during the reconciliation:

  • You take the hash of the certificate
  • Use the Hash to annotate the Pod in the Deployment / StrimziPodSet
  • Either Kubernetes or Strimzi takes care of rolling the pod based on the Pod annotations being different


For user certificates (issued by the User Operator), the user will be responsible for making sure their applications notice cert-manager renewing the certificates and are updated to use the new certificate.
Member

I guess UO and the Clients CA deserves more attention here? Will the UO issue the Certificate resources? How will it keep the certificate Secrets? Or will type: tls-external be used here?


## Affected/not affected projects

This affects the Cluster Operator and User Operator.

## Compatibility

This feature will be optional and not disabled by default.
Contributor

Suggested change
- This feature will be optional and not disabled by default.
+ This feature will be optional and enabled by default.


### Migrating to this feature

To start using this feature in an existing Kafka cluster the user must:
Contributor

What about a new cluster?

1. Install cert-manager and create an `Issuer`.
2. Pause reconciliation for their Kafka cluster.
3. Update the `<CLUSTER_NAME>-cluster-ca-cert` and/or `<CLUSTER_NAME>-clients-ca-cert` Secrets to:
   1. contain the CA(s) for the `Issuer` (keeping the old CA cert in the Secret still)
Member

I think this part is really important - (keeping the old CA cert in the Secret still) - and should be made more prominent? We should also cover the point when it should be removed.

   2. increment the `strimzi.io/ca-cert-generation` annotation
4. Update the `Kafka` resource to configure the `clusterCa.certificateIssuer` and/or `clientsCa.certificateIssuer` sections.
5. Resume reconciliation.

When using this feature for the ClientsCa:
* On the next Cluster Operator reconciliation Strimzi will roll the Kafka pods to trust the new CA cert.
* On the next User Operator reconciliation Strimzi will create `Certificate` resources for all the existing `KafkaUser` custom resources.

When using this feature for the ClusterCa:
* On the next reconciliation Strimzi will first roll the pods once to trust the new CA cert.
* Strimzi will create `Certificate` resources for all the components and wait for the certificates to be issued.
* Strimzi will roll the pods to use the new certificates.

Once all the pods have been rolled the user can update the `<CLUSTER_NAME>-cluster-ca-cert` and/or `<CLUSTER_NAME>-clients-ca-cert` Secrets to remove the old CA cert.
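
Steps 2 and 3 of the migration above might look roughly like the following fragments. The `strimzi.io/pause-reconciliation` annotation is the existing Strimzi mechanism for pausing reconciliation. The key name used for the old CA certificate is an assumption borrowed from the convention Strimzi documents today for replacing user-supplied CA certificates; the proposal itself does not define the key layout, and `my-cluster` is a hypothetical cluster name.

```yaml
# Step 2 (fragment): pause reconciliation on the Kafka resource
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
  annotations:
    strimzi.io/pause-reconciliation: "true"
---
# Step 3 (fragment): trust both the issuer CA and the old Strimzi CA
apiVersion: v1
kind: Secret
metadata:
  name: my-cluster-cluster-ca-cert
  annotations:
    strimzi.io/ca-cert-generation: "1" # incremented from the previous value
data:
  ca.crt: <base64-encoded PEM of the CA for the cert-manager Issuer>
  # Assumed key naming for the retained old CA certificate
  ca-<old-cert-expiry-timestamp>.crt: <base64-encoded PEM of the old CA certificate>
```

Once every pod has been rolled onto cert-manager issued certificates, the old entry can be removed, as described above.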

### Stopping using this feature

To revert to user managed CAs the user will:
Contributor

Suggested change
- To revert to user managed CAs the user will:
+ To revert to user managed CAs the user must:

1. Pause reconciliation for their Kafka cluster.
2. Update the `<CLUSTER_NAME>-cluster-ca-cert` and/or `<CLUSTER_NAME>-clients-ca-cert` Secrets to:
   1. contain their public CA cert (keeping the old cert-manager one)
   2. increment the `strimzi.io/ca-cert-generation` annotation
3. Create the `<CLUSTER_NAME>-cluster-ca` and/or `<CLUSTER_NAME>-clients-ca` private key Secrets
4. Update the `Kafka` resource to change the `clusterCa.certificateIssuer` and/or `clientsCa.certificateIssuer` `type` to `internal`.
5. Resume reconciliation.

When using this feature for the ClientsCa:
* On the next Cluster Operator reconciliation Strimzi will roll the Kafka pods to trust the new CA cert.
* On the next User Operator reconciliation Strimzi will issue new certificates for all the existing `KafkaUser` custom resources.

When using this feature for the ClusterCa:
* On the next reconciliation Strimzi will first roll the pods once to trust the new CA cert.
* Strimzi will issue new certificates for all the components.
* Strimzi will roll the pods to use the new certificates.

Once all the pods have been rolled the user can update the `<CLUSTER_NAME>-cluster-ca-cert` and/or `<CLUSTER_NAME>-clients-ca-cert` Secrets to remove the old CA cert.
The user is responsible for removing the old `Certificate` resources and uninstalling cert-manager.

Notes:
* Today we do not document how to go from using user managed CAs to Strimzi managed CAs.
Contributor

Maybe we should?

Contributor Author

My feeling was that it wouldn't be a common scenario for customers to move from a mechanism where they have more control, to a mechanism where they have less control.

Member

I'm not sure it is even supported to go from custom CA to Strimzi CA. At least I do not remember anyone trying it / testing it. In the long term, I think the idea would be also to get rid of the internal CA management - so not sure this would be an issue from my point of view.

For this reason I have not included how to go from cert-manager CAs to Strimzi managed CAs.

## Rejected alternatives

### Letting Strimzi infer the CA cert to trust

Certain issuers will include the CA cert to trust in the Secret for a specific certificate.
Strimzi could use this cert instead of requiring the user to provide one.
However, this is not recommended.
On the cert-manager [website](https://cert-manager.io/docs/trust/) they explicitly state:
"When configuring the client you should independently choose and fetch the CA certificates that you want to trust.
Download the CA out of band and store it in a Secret or ConfigMap separate from the Secret containing the server's private key and certificate."
To keep to this best practice and also allow Strimzi to have the same behaviour for all issuers I have chosen to require the user to provide the CA certs to trust up-front.

### Using lower-level cert-manager CRs

Strimzi could keep control of when to renew/replace certificates/keys and instead use the lower-level custom resources such as `CertificateRequest`.
I chose not to do this since part of the motivation for this feature is to offload certificate management to a dedicated tool.

### Strimzi only interacting with Secrets

Strimzi could avoid interacting with cert-manager custom resources entirely and instead deal only with the resulting Secrets.
This could work for the ClientsCa, however we already provide the option for users to configure listener certificates, so there is no need for an alternative option.
For the ClusterCa the certificates needed are complex, since there are multiple different nodes and network connections.
It would be very complex for the user to hand-craft the right certificates, and would also restrict their ability to scale up the cluster,
since they would need to create the new certificates up front.
For these reasons it makes sense for Strimzi to create the `Certificate` custom resources.