Re: [DISCUSS] KIP-1288 SSL Hot Reload for Kafka Clients

Skander Soltane Mon, 16 Mar 2026 06:53:43 -0700

Hello,

I've updated the KIP.
If there are no further comments, when can we proceed to the vote?


KIP: https://cwiki.apache.org/confluence/x/to08G

Regards,
Skander


On Mon, Mar 9, 2026 at 1:14 AM Jakub Scholz <[email protected]> wrote:

> *> That said, I’m not sure I fully understand why you mentioned that this
> would not work with mTLS and would only be useful for reloading server
> certificates.*
>
> So imagine the following scenario:
> * You use mTLS on the internal or control plane listeners used by the Kafka
> nodes to talk with each other.
> * That means that the truststore needs to contain the CA that is used to
> sign all the server certificates of the other Kafka nodes. And the keystore
> needs to have a server/client key that is signed by a CA in the truststore
> of the other Kafka nodes.
> * Without that, the communication within your cluster would fall apart.
>
> So, when I need to move to use a new CA, what do I need to do?
> * First, roll the new CA alongside the old CA into the truststore of all
> the Kafka nodes
> * Once I know for certain that all of the nodes trust both the old and new CA,
> I can roll out the new server certificates.
> * At this point, the cluster still works, because nodes using the old
> server certificate are trusted through the old CA, and nodes using the new
> server certificate are trusted through the new CA.
> * Only once I'm 100% sure that all Kafka nodes use the new server
> certificate can I remove the old CA from the truststores.
>
> Doing this step by step is important because at any point in time things
> still work, and a random restart or similar event will not break
> anything.
>
> I do not think this is necessarily a rare scenario. Using private CAs is
> common - especially with mTLS. And I do think there is a demand for
> short-lived CAs, for example because certificate revocation is hard.
> Sure, they won't be as short-lived as 100 minutes, but they might be valid
> for, say, 15 days.
>
> Obviously, as I said, not everyone might need this. So while that might be
> a limitation for some users, others would not care.
>
> *> How about a metric for the sha256 hash of the contents of the
> truststore?*
> *> Since the hash is 256 bit wide, we can split it into 4x64bit (long)
> chunks and have 4 "tags" on the metric, one for each chunk. That way we
> limit the cardinality of the metric to 4. What do you think?*
>
> I think that would work, yes. We could query the metrics through JMX or
> something to get the value and compare it. That would allow us to integrate
> it.
>
> Thanks & Regards
> Jakub
>
>
> On Thu, Mar 5, 2026 at 11:23 AM Skander Soltane <[email protected]> wrote:
>
> >  Hello Jakub, Gaurav,
> >
> > Thank you both for your feedback.
> >
> > On Wed, Mar 4, 2026 at 8:23 PM Gaurav Narula <[email protected]> wrote:
> >
> > > Hi Skander and Jakub,
> > >
> > > Please find my comments inline
> > >
> > > > On 4 Mar 2026, at 17:58, Jakub Scholz <[email protected]> wrote:
> > > >
> > > > Hi Skander,
> > > >
> > > > Thanks for the KIP. Here are some of my thoughts on it ...
> > > >
> > > > I think using a poller instead of the WatchService is a good choice. In
> > > > the previous KIP (KIP-1119), this was my main concern about why it would
> > > > not work.
> > > >
> > > > However, are you sure that Files.getLastModifiedTime() will work on
> > > > Kubernetes with something like a mounted ConfigMap or Secret? The file
> > > > itself is a symlink, and its dates do not change when a Secret is
> > > > updated, at least when observed with something like bash's stat command.
> > > > Only the dates of the file that the symlink points to change. So, off
> > > > the top of my head, I'm not sure which timestamp Java would give you (I
> > > > haven't tried it, to be honest - I'm just wondering if you did and if it
> > > > really works). If the timestamp doesn't work, maybe one can just read
> > > > the content of the file and store some checksum to compare it with in
> > > > the next check?
> > >
> > > GN1: I think `Files.getLastModifiedTime()` takes optional LinkOption
> > > arguments, and if none is passed it follows symlinks.
> > > We should be fine as long as the timestamp for the file that the symlink
> > > points to is updated.
> > >
> >
> > I think Gaurav is correct. According to the JavaDoc of getLastModifiedTime,
> > “By default, symbolic links are followed and the file attribute of the
> > final target of the link is read.”
> > In the Kubernetes setup I used to validate my work on the Kafka client, the
> > PKCS#12 keystore and truststore are mounted via a volume, but they are
> > actually generated from Vault Secrets Operator (VSO) secrets exposed in
> > another volume. A sidecar container is responsible for creating the stores
> > from the PEM files mounted by VSO and regenerating them whenever VSO rotates
> > the certificates.
> > That said, you raise a valid point: if the stores were mounted directly
> > from Kubernetes Secrets or ConfigMaps, would relying on getLastModifiedTime
> > (which follows the symbolic link to its final target) still be reliable?
> > This needs to be validated.
> > If it proves reliable in that scenario, all the better. Otherwise, I can
> > switch to computing and comparing a checksum of the files instead and
> > update the KIP accordingly.
> >
> > >
> > > > The other part of my comments in KIP-1119 was more about the usability
> > > > for something like Strimzi. I do not think the debounce interval really
> > > > solves the issue for us. With Kafka, you have a distributed system with:
> > > > * Multiple controllers
> > > > * Multiple brokers
> > > > * Additional components (e.g., an Operator, Cruise Control, etc.)
> > > >
> > > > So when I need to, for example, roll out a new Certificate Authority,
> > > > and I use mTLS authentication, I have to:
> > > > * First, roll out the trust to the new CA to all the components
> > > > * Only once all components trust the new CA, I can start rolling out
> > > > the new server/user certificates
> > > > * Once the new user and server certificates are used by all
> > > > components, I can remove the old CA
> > > >
> > > > But the debounce interval works only locally within a single Kafka
> > > > node. So while it allows me to safely reload the certificates within
> > > > the node, which is good, it does not help me with the understanding of
> > > > the state on the other nodes. To be able to orchestrate the whole
> > > > system, I need a way to find out if it has been reloaded in order to
> > > > proceed with the next steps. For example, open a TCP connection and
> > > > sniff the actual TLS configuration. But that is pretty ugly, and leaves
> > > > a mess in the logs and so on.
> > > >
> > > > Don't get me wrong. I think this is a useful KIP, and I guess that in
> > > > many cases - especially when running things manually - it would work
> > > > fine. It would also work fine for reloading server certificates only,
> > > > without mTLS. That is a useful feature as well, with CAs such as Let's
> > > > Encrypt shortening the validity period of their server certificates.
> > > >
> > > > But for an automated solution like Strimzi, the main missing feature
> > > > for the hot-reloading of certificates is not about the auto-reload being
> > > > done by Kafka. It is an API that would tell us what the current state
> > > > of the system is in order to orchestrate more complicated things.
> > >
> > > GN2: I think that's a good point and perhaps a pain shared by a few, as
> > > usually CAs are very long-lived (on the order of years).
> > > I do agree it would be useful to have an "API" to see the state of the
> > > system. How about a metric for the sha256 hash of the contents of the
> > > truststore?
> > > Since the hash is 256 bit wide, we can split it into 4x64bit (long)
> > > chunks and have 4 "tags" on the metric, one for each chunk. That way we
> > > limit the cardinality of the metric to 4. What do you think?
> > >
> > Jakub, I see your point about the limitations in setups like Strimzi.
> > However, as Gaurav mentioned, in most cases the CA tends to be long-lived.
> > In our setup we use mTLS: client certificates are short-lived (around 100
> > minutes), while server certificates have a longer lifetime. In practice, CA
> > updates are relatively infrequent.
> > That said, I’m not sure I fully understand why you mentioned that this
> > would not work with mTLS and would only be useful for reloading server
> > certificates. Also, for server certificate reloading, isn’t that already
> > addressed by KIP-687 <https://cwiki.apache.org/confluence/x/lyfZCQ>?
> >
> > Gaurav, thank you for the suggestion. I like the idea of exposing a
> > metric. Jakub, do you think it could effectively be used as an “API” to
> > check the current state of the truststore?
> > Regards,
> > Skander
> >
> >
> > > Regards,
> > > Gaurav
> > >
> > > >
> > > > Thanks & Regards
> > > > Jakub
> > > >
> > > > On Sat, Feb 21, 2026 at 3:58 PM Skander Soltane <[email protected]> wrote:
> > > >
> > > >> Hi all,
> > > >>
> > > > >> I'd like to start a discussion on a new KIP for SSL hot reload on
> > > > >> the client side.
> > > >>
> > > > >> You can find the KIP here:
> > > > >>
> > > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1288%3A+SSL+Hot+Reload+for+Kafka+Clients
> > > >>
> > > >> I also drafted a PR implementing the KIP as I imagined it:
> > > >> https://github.com/apache/kafka/pull/21488
> > > >>
> > > > >> I'd love to hear your thoughts, especially on the polling approach
> > > > >> vs WatchService, the debounce mechanism, and whether the registry
> > > > >> design makes sense to you.
> > > >>
> > > > >> Thank you!
> > > >> Skander
> > > >>
> > >
> > >
> >
>
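
For reference, the two checksum ideas raised in this thread (comparing a content checksum between polls instead of relying on getLastModifiedTime, and exposing the truststore's SHA-256 as four 64-bit chunks) could be sketched roughly as below. This is only an illustrative sketch under stated assumptions: the class TruststoreFingerprint and its method names are hypothetical and are not taken from the KIP or the draft PR, and how the four chunks would actually be attached to a metric is left open.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only; not part of Kafka or of KIP-1288.
final class TruststoreFingerprint {

    // SHA-256 over the raw bytes of the store file. Files.readAllBytes opens the
    // path normally, so a symlinked mount is read through to its target.
    static byte[] sha256(Path store) throws IOException {
        try {
            return MessageDigest.getInstance("SHA-256").digest(Files.readAllBytes(store));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Split the 32-byte digest into four big-endian 64-bit chunks.
    static long[] toLongs(byte[] digest) {
        if (digest.length != 32) {
            throw new IllegalArgumentException("expected a 32-byte SHA-256 digest");
        }
        ByteBuffer buf = ByteBuffer.wrap(digest); // ByteBuffer is big-endian by default
        long[] chunks = new long[4];
        for (int i = 0; i < chunks.length; i++) {
            chunks[i] = buf.getLong();
        }
        return chunks;
    }

    // True when the store content differs from what was seen on the previous poll.
    static boolean changed(byte[] previousDigest, byte[] currentDigest) {
        return !MessageDigest.isEqual(previousDigest, currentDigest);
    }
}
```

A poller could keep the last digest and call changed(...) on each tick, and an orchestrator could compare the four chunks reported by every node to confirm that they have all loaded the same truststore content.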
