Hello, I've updated the KIP. If there are no further comments, when can we proceed to the vote?
KIP: https://cwiki.apache.org/confluence/x/to08G

Regards,
Skander

On Mon, Mar 9, 2026 at 1:14 AM Jakub Scholz <[email protected]> wrote:

> *> That said, I’m not sure I fully understand why you mentioned that this
> would not work with mTLS and would only be useful for reloading server
> certificates.*
>
> So imagine the following scenario:
> * You use mTLS on the internal or control plane listeners used by the Kafka
>   nodes to talk with each other.
> * That means that the truststore needs to contain the CA that is used to
>   sign all the server certificates of the other Kafka nodes. And the keystore
>   needs to have a server/client key that is signed by a CA in the truststore
>   of the other Kafka nodes.
> * Without that, the communication within your cluster would fall apart.
>
> So, when I need to move to a new CA, what do I need to do?
> * First, roll the new CA alongside the old CA into the truststore of all
>   the Kafka nodes.
> * Once I know for certain that all of the nodes trust both the old and new
>   CA, I can roll out the new server certificates.
> * At this point, the cluster still works, because the nodes using the old
>   server certificate are trusted through the old CA, and the new server
>   certificates are trusted through the new CA.
> * Only once I'm 100% sure that all Kafka nodes use the new server
>   certificate can I remove the old CA from the truststores.
>
> Doing this step by step is important because at any point in time things
> still work fine, and a random restart and so on will not break anything.
>
> I do not think this is necessarily a rare scenario. Using private CAs is
> common - especially with mTLS. And I do think there is a demand for
> short-lived CAs, for example because certificate revocation is hard.
> Sure, they won't have a 100-minute lifetime. But they might have, for
> example, a 15-day lifetime.
>
> Obviously, as I said, not everyone might need this. So while that might be
> a limitation for some users, others would not care.
>
> *> How about a metric for the sha256 hash of the contents of the
> truststore?*
> *> Since the hash is 256 bit wide, we can split it into 4x64bit (long)
> chunks and have 4 "tags" on the metric, one for each chunk. That way we
> limit the cardinality of the metric to 4. What do you think?*
>
> I think that would work, yes. We could query the metrics through JMX or
> something to get the value and compare it. That would allow us to integrate
> it.
>
> Thanks & Regards
> Jakub
>
>
> On Thu, Mar 5, 2026 at 11:23 AM Skander Soltane <[email protected]>
> wrote:
>
> > Hello Jakub, Gaurav,
> >
> > Thank you both for your feedback.
> >
> > On Wed, Mar 4, 2026 at 8:23 PM Gaurav Narula <[email protected]> wrote:
> >
> > > Hi Skander and Jakub,
> > >
> > > Please find my comments inline
> > >
> > > > On 4 Mar 2026, at 17:58, Jakub Scholz <[email protected]> wrote:
> > > >
> > > > Hi Skander,
> > > >
> > > > Thanks for the KIP. Here are some of my thoughts on it ...
> > > >
> > > > I think using a poller instead of the WatchService is a good choice.
> > > > In the previous KIP (KIP-1119), this was my main concern about why it
> > > > would not work.
> > > >
> > > > However, are you sure that Files.getLastModifiedTime() will work on
> > > > Kubernetes with something like a mounted ConfigMap or Secret? The file
> > > > itself is a symlink, and its dates do not change when a Secret is
> > > > updated. At least when observed with something like the stat command
> > > > in bash. Only the dates of the file that the symlink points to change.
> > > > So, off the top of my head, I'm not sure which timestamp Java would
> > > > give you (I haven't tried it, to be honest - I'm just wondering if you
> > > > did and if it really works). If the timestamp doesn't work, maybe one
> > > > can just read the content of the file and store some checksum to
> > > > compare it with in the next check?
> > >
> > > GN1: I think `Files.getLastModifiedTime()` has an overload for accepting
> > > LinkOption and if none is passed it follows symlinks.
> > > We should be fine as long as the timestamp for the file that the symlink
> > > points to is updated.
> > >
> >
> > I think Gaurav is correct, according to the JavaDoc of getLastModifiedTime,
> > “By default, symbolic links are followed and the file attribute of the
> > final target of the link is read.”
> > In the Kubernetes setup I used to validate my work on the Kafka client, the
> > PKCS#12 keystore and truststore are mounted via a volume, but they are
> > actually generated from Vault Secrets Operator (VSO) secrets exposed in
> > another volume. A sidecar container is responsible for creating the stores
> > from the PEM files mounted by VSO and regenerating them whenever VSO
> > rotates the certificates.
> > That said, you raise a valid point: if the stores were mounted directly
> > from Kubernetes Secrets or ConfigMaps, would relying on getLastModifiedTime
> > (which follows symbolic links to the final target) still be reliable? This
> > needs to be validated.
> > If it proves reliable in that scenario, all the better. Otherwise, I can
> > switch to computing and comparing a checksum of the files instead and
> > update the KIP accordingly.
> >
> > >
> > > > The other part of my comments in KIP-1119 was more about the usability
> > > > for something like Strimzi. I do not think the debounce interval really
> > > > solves the issue for us. With Kafka, you have a distributed system with:
> > > > * Multiple controllers
> > > > * Multiple brokers
> > > > * Additional components (e.g., an Operator, Cruise Control, etc.)
> > > >
> > > > So when I need to, for example, roll out a new Certificate Authority,
> > > > and I use mTLS authentication, I have to:
> > > > * First, roll out the trust to the new CA to all the components
> > > > * Only once all components trust the new CA, I can start rolling out
> > > >   the new server/user certificates
> > > > * Once the new user and server certificates are used by all
> > > >   components, I can remove the old CA
> > > >
> > > > But the debounce interval works only locally within a single Kafka
> > > > node. So while it allows me to safely reload the certificates within
> > > > the node, which is good, it does not help me understand the state of
> > > > the other nodes. To be able to orchestrate the whole system, I need a
> > > > way to find out whether a node has reloaded in order to proceed with
> > > > the next steps. For example, I could open a TCP connection and sniff
> > > > the actual TLS configuration. But that is pretty ugly, and leaves a
> > > > mess in the logs and so on.
> > > >
> > > > Don't get me wrong. I think this is a useful KIP, and I guess that in
> > > > many cases - especially when running things manually - it would work
> > > > fine. It would also work fine for reloading server certificates only,
> > > > without mTLS. Which is a useful feature as well, with CAs such as
> > > > Let's Encrypt shortening the validity period of their server
> > > > certificates.
> > > >
> > > > But for an automated solution like Strimzi, the main missing feature
> > > > for the hot-reloading of certificates is not about the auto-reload
> > > > being done by Kafka. It is an API that would tell us the current state
> > > > of the system so that we can orchestrate more complicated things.
> > >
> > > GN2: I think that's a good point and perhaps a pain shared by a few as
> > > usually CAs are very long lived (of the order of years).
> > > I do agree it would be useful to have an "API" to see the state of the
> > > system. How about a metric for the sha256 hash of the contents of the
> > > truststore?
> > > Since the hash is 256 bit wide, we can split it into 4x64bit (long)
> > > chunks and have 4 "tags" on the metric, one for each chunk. That way we
> > > limit the cardinality of the metric to 4. What do you think?
> > >
> >
> > Jakub, I see your point about the limitations in setups like Strimzi.
> > However, as Gaurav mentioned, in most cases the CA tends to be long-lived.
> > In our setup we use mTLS: client certificates are short-lived (around 100
> > minutes), while server certificates have a longer lifetime. In practice, CA
> > updates are relatively infrequent.
> > That said, I’m not sure I fully understand why you mentioned that this
> > would not work with mTLS and would only be useful for reloading server
> > certificates. Also, for server certificate reloading, isn’t that already
> > addressed by KIP-687 <https://cwiki.apache.org/confluence/x/lyfZCQ>?
> >
> > Gaurav, thank you for the suggestion, I like the idea of exposing a
> > metric. Jakub, do you think it could effectively be used as an “API” to
> > check the current state of the truststore?
> > Regards,
> > Skander
> >
> > >
> > > Regards,
> > > Gaurav
> > >
> > > >
> > > > Thanks & Regards
> > > > Jakub
> > > >
> > > > On Sat, Feb 21, 2026 at 3:58 PM Skander Soltane <[email protected]>
> > > > wrote:
> > > >
> > > >> Hi all,
> > > >>
> > > >> I'd like to start a discussion on a new KIP for SSL hot reload on the
> > > >> client side.
> > > >>
> > > >> You can find the KIP here:
> > > >>
> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1288%3A+SSL+Hot+Reload+for+Kafka+Clients
> > > >>
> > > >> I also drafted a PR implementing the KIP as I imagined it:
> > > >> https://github.com/apache/kafka/pull/21488
> > > >>
> > > >> I'd love to hear your thoughts, especially on the polling approach vs
> > > >> WatchService, the debounce mechanism, and whether the registry design
> > > >> makes sense to you.
> > > >>
> > > >> Thank you!
> > > >> Skander
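
For reference, two ideas from this thread can be sketched roughly as below: detecting a store change by content checksum instead of by timestamp, and splitting the truststore's SHA-256 hash into four 64-bit values that could be exposed as a metric. This is only an illustration under stated assumptions. The class and method names (`TruststoreFingerprint`, `sha256`, `toLongChunks`, `hasChanged`) are hypothetical and come from neither the KIP nor the PR, and the big-endian chunk order is an arbitrary choice.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Kafka, the KIP, or the PR.
final class TruststoreFingerprint {

    // Digest seen on the previous check; null until the first check.
    private byte[] lastDigest;

    // SHA-256 over raw bytes.
    static byte[] sha256(byte[] content) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(content);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a mandatory algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // SHA-256 of a store file. Files.readAllBytes follows symlinks by default,
    // so a symlinked mount is read through to its current target.
    static byte[] sha256(Path store) throws IOException {
        return sha256(Files.readAllBytes(store));
    }

    // Splits a 32-byte digest into four 64-bit values (big-endian, assumed order).
    static long[] toLongChunks(byte[] digest) {
        if (digest.length != 32) {
            throw new IllegalArgumentException("expected a 32-byte SHA-256 digest");
        }
        ByteBuffer buffer = ByteBuffer.wrap(digest); // big-endian by default
        long[] chunks = new long[4];
        for (int i = 0; i < chunks.length; i++) {
            chunks[i] = buffer.getLong();
        }
        return chunks;
    }

    // Returns true when the content hashes differently from the previous call.
    // The very first call always returns true because nothing was seen before.
    boolean hasChanged(byte[] content) {
        byte[] digest = sha256(content);
        boolean changed = !Arrays.equals(digest, lastDigest);
        lastDigest = digest;
        return changed;
    }
}
```

A poller could call `hasChanged(Files.readAllBytes(path))` on each tick and trigger a reload only when it returns true. An orchestrator could then compare the four values reported by each node against the values computed from the store it rolled out. How the four values would be registered as metrics is left open here, as it is in the thread.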
