Hi all,

We recently ran into a bug that was supposedly fixed in https://issues.apache.org/jira/browse/KAFKA-8190 (we were running Kafka 2.1.1). We use SSL for interbroker communication as well as for communication with clients. The new certificate was rendered under the same filename, which triggers the aforementioned bug, so no actual reload took place.

In line with the bug description, our interbroker communication broke once the old certificate expired, seemingly whenever an SSL channel had to be created anew. For example, a newly created topic could not be replicated across the cluster. Existing connections, however, kept working, probably because their SSL channel was already established.

What's strange is that newly established client connections also worked. Since our clients do validate the server certificate, this makes me think that the new SSL channels facing clients somehow picked up the correct certificate, while the interbroker ones did not. Then again, that frankly doesn't make a lot of sense, as they all use the same port. I'm still trying to reproduce the bug and don't quite understand what was going on.
But that made me think: could we introduce a metric that reports the certificate expiry dates of all existing SSLContexts? Do you think such a metric would make sense at all? The reasoning is that Kafka users would then be able to monitor exactly what is loaded in memory, instead of relying on the correctness of the dynamic reloading logic. Maybe it would even make sense to report all the public certificate data? This would help if the reloading logic runs into errors again in the future.

Thanks in advance,
Pavel.
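
P.S. To make the idea a bit more concrete, here is a rough sketch in Java of the kind of data I mean. This is not Kafka code: `CertExpiry` and `expiryByAlias` are names I made up for illustration. It also assumes the broker still holds a reference to the in-memory `KeyStore` it built the `SSLContext` from, since `SSLContext` itself does not expose its certificates.

```java
import java.security.KeyStore;
import java.security.KeyStoreException;
import java.security.cert.Certificate;
import java.security.cert.X509Certificate;
import java.time.Instant;
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Kafka: collects the expiry time of every
// X.509 certificate held in an already-loaded (in-memory) KeyStore.
final class CertExpiry {
    private CertExpiry() {}

    // Returns alias -> notAfter for each X.509 entry; other entry types are skipped.
    static Map<String, Instant> expiryByAlias(KeyStore keyStore) throws KeyStoreException {
        Map<String, Instant> result = new TreeMap<>();
        for (String alias : Collections.list(keyStore.aliases())) {
            Certificate cert = keyStore.getCertificate(alias);
            if (cert instanceof X509Certificate) {
                result.put(alias, ((X509Certificate) cert).getNotAfter().toInstant());
            }
        }
        return result;
    }
}
```

A gauge could then expose, per alias, the expiry as epoch milliseconds via `Instant.toEpochMilli()`, so that alerting can fire before the certificate that is actually loaded expires.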