Hi Colin,

Thanks for the explanation. Yes, I can see the sense in that, although it was certainly a surprise for me. I’ll just have to be a bit more careful about instantiating AdminClients in the future.

Cheers,
Michael
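A minimal sketch of what that carefulness can look like in practice, assuming the standard Java AdminClient API: close the client as soon as the work is done (which stops the internal thread that polls the brokers for metadata), and if a long-lived client is genuinely needed, stretch the refresh interval via metadata.max.age.ms. The bootstrap address and interval below are placeholders, and SSL client settings are omitted.

    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;

    public class AdminClientUsage {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9094"); // placeholder
            // Slow down the background metadata refresh for a long-lived client (10 minutes here)
            props.put("metadata.max.age.ms", "600000");

            // try-with-resources closes the client, and with it the internal thread
            // that periodically fetches cluster metadata, as soon as the work is done.
            try (AdminClient admin = AdminClient.create(props)) {
                admin.describeCluster().nodes().get().forEach(node ->
                        System.out.println(node.idString() + " -> " + node.host()));
            }
        }
    }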
> On 13 Aug 2019, at 1:36 am, Colin McCabe <cmcc...@apache.org> wrote:
>
> Hi Michael,
>
> The NetworkClient periodically fetches metadata so that it always knows what the cluster topology is. This also helps it to have some open connections when needed to reduce the latency of operations.
>
> To be fair, we haven’t thought very much about optimizing this since the overhead seems to be low. The update interval can also be configured.
>
> Best,
> Colin
>
> On Sun, Aug 11, 2019, at 16:43, Michael Carter wrote:
>> Hi Rajini,
>>
>> I just thought I'd let you know I've sorted through my issue now. It turns out that I was doing something silly. The SSL handshake errors weren't caused by the brokers talking to themselves but were instead being initiated from an AdminClient I had running in a separate process. Ironically, it was the same AdminClient I was using to rotate the certificates in the first place. I was naively assuming that because I wasn't asking the AdminClient to do anything at that exact moment, it wouldn't be contacting the brokers. Instead, it seems the AdminClient has an internal thread running that is periodically trying to get metadata information from the brokers.
>> Although now I'm wondering, why is this continual requesting necessary? Is there some reason that work isn't deferred until the AdminClient is specifically asked to do something?
>>
>> Cheers,
>> Michael
>>
>>> On 5 Aug 2019, at 3:33 pm, Michael Carter <michael.car...@instaclustr.com> wrote:
>>>
>>> Hi Rajini,
>>>
>>> Thanks for the good suggestions. I’ve investigated further and confirmed that the issue continues to occur even with hostname verification turned off.
>>> I put in the ssl debug option and got some interesting information. I’ve attached an example of what the logs output when SSL errors start occurring.
>>>
>>> From the debug output I can see that the internal IP addresses do form part of the alternative names on the certificate, which is good.
>>> What it shows is that the verification is failing because it is using an out-of-date certificate (one whose validity period has expired).
>>>
>>> It is interesting that in the attached example, the broker appears to be trying to connect to itself, and one end of the connection has the new, valid certificate, while the other end has the old, out-of-date certificate, even though my understanding is that it should be one and the same certificate.
>>> Does this suggest that there is some network component inside Kafka that hasn’t responded to the reconfiguration of the inter-broker keystore property?
>>>
>>> Cheers,
>>> Michael
>>> <debug-handshake.log>
>>>
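One way to confirm what a keystore file on disk actually contains, independent of what the broker is serving, is to read it with the JDK's KeyStore API and print each certificate's validity window. This is only a sketch; the path, password and keystore type ("PKCS12" vs "JKS") are placeholders that would need to match the real setup, and keytool -list -v gives the same information.

    import java.io.FileInputStream;
    import java.security.KeyStore;
    import java.security.cert.X509Certificate;
    import java.util.Collections;

    public class KeystoreExpiryCheck {
        public static void main(String[] args) throws Exception {
            // Placeholder path, password and type for the broker's inter-broker keystore
            KeyStore ks = KeyStore.getInstance("PKCS12");
            try (FileInputStream in = new FileInputStream("/etc/kafka/ssl/interbroker.keystore.p12")) {
                ks.load(in, "changeit".toCharArray());
            }
            // Print the validity window of every certificate entry in the keystore
            for (String alias : Collections.list(ks.aliases())) {
                X509Certificate cert = (X509Certificate) ks.getCertificate(alias);
                if (cert != null) {
                    System.out.println(alias + ": notBefore=" + cert.getNotBefore()
                            + " notAfter=" + cert.getNotAfter());
                }
            }
        }
    }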
>>>> On 2 Aug 2019, at 7:04 pm, Rajini Sivaram <rajinisiva...@gmail.com> wrote:
>>>>
>>>> Hi Michael,
>>>>
>>>> Thanks for the logs.
>>>>
>>>> When the keystore is reconfigured in a stable cluster, we wouldn't expect to see any failures even if there is an issue with the certs, since existing connections continue to work without re-authenticating. When a broker is restarted, there could be controller re-election, leader elections etc. causing new connections on all brokers, so failures can crop up on all brokers.
>>>>
>>>> When we reconfigure the inter-broker keystore, we validate the keystore against the truststore on each broker. Since we are seeing handshake failures within a broker (e.g. controller connections), it suggests that the validation performed by the actual connection is failing while our validation during reconfiguration succeeded. The main difference between the two is hostname verification - we disable hostname verification when validating configs, but for actual connections this seems to be enabled in your configs. Can you verify that your certs contain the host IP address used in the advertised listeners list (e.g. 10.224.59.10)? In order to rule out hostname issues, you could rerun your tests with `listener.name.interbroker.ssl.endpoint.identification.algorithm=`, i.e. an empty string to disable hostname verification.
>>>>
>>>> If the issue still occurs with hostname verification disabled, it will be useful to run the brokers with KAFKA_OPTS="-Djavax.net.debug=ssl,handshake". That would show the certs used as well as more details about the handshake failure.
>>>>
>>>> Regards,
>>>>
>>>> Rajini
>>>>
>>>> On Fri, Aug 2, 2019 at 8:49 AM Michael Carter <michael.car...@instaclustr.com> wrote:
>>>>>
>>>>> Hi Rajini,
>>>>>
>>>>> Thanks for your response. I've attached some logs from a three-node cluster that I was doing this on. For the purpose of this test I'm using certificates with ten minutes' validity. The original certificate was valid from roughly 05:32 to 05:42, I rotated them every five minutes, and stopped the third broker at 05:48 (five or so minutes after the original certificates would have expired). At this point the errors started to appear in the logs in volume. A lot of them are errors connecting to the restarted broker, but not all of them. Some of them are the first broker (which hasn't been restarted) apparently having trouble connecting to itself. I restarted the other brokers at about 05:49 but the errors continued.
>>>>>
>>>>> To answer your questions:
>>>>> 1) All brokers are dynamically updated with new certs and this works fine without any errors (this just indicates that config update succeeded)
>>>>>
>>>>> - Yes, that is correct.
>>>>>
>>>>> 2) If one broker is restarted, handshake failures appear in the logs and these are logged from SocketServer in the restarted broker. (this could indicate that some client-side cert was not updated dynamically)
>>>>>
>>>>> - The errors tend to appear in all the brokers' logs. I can also get the same thing just by shutting down one of the brokers. It doesn't actually need to be restarted, since the SslAuthenticationExceptions will appear in the brokers that weren't shut down or restarted regardless. In fact, I can get a message saying "Failed authentication with <IP address> (SSL handshake failed)" when no broker is running on that IP address at all. Some of the messages indicate a failure of a broker connecting to itself (e.g. "Failed authentication with /10.224.59.10 (SSL handshake failed)" in the logs of the broker running on 10.224.59.10)
>>>>>
>>>>> 3) Once all brokers are restarted, there are no more handshake failures (this is the actual confirmation that certs from 1) work)
>>>>>
>>>>> - Unfortunately no, the errors persist even after this. Sometimes it will recover from this situation if I run another dynamic certificate rotation. But restarting the brokers doesn't seem to clear it.
>>>>> It might be tempting to think that there is something wrong with the new certificates, but they're generated in exactly the same way as the original certificates (except for the validity time period), and they appear to be served correctly after the dynamic rotation. (I'm checking with this command: openssl s_client -debug -connect 10.224.59.10:9094 2>/dev/null </dev/null | openssl x509 -noout -dates ). Also, subsequent updates with similar certificates can sometimes fix the problem.
>>>>>
>>>>> The fact that it can persist after a restart is what really sticks out to me. Is there any state stored outside of Kafka that could be getting in the way? Does Zookeeper store anything other than the filename of the keystore?
>>>>>
>>>>> Cheers,
>>>>> Michael
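On the question of what is stored where: in 2.3 the per-broker dynamic values are persisted in ZooKeeper (under /config/brokers/<broker-id>), while the keystore file contents stay on the broker's local disk. The currently applied value and its source can be read back with describeConfigs. A minimal sketch, assuming the Java AdminClient API; the bootstrap address and broker id are placeholders:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.Config;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class ShowKeystoreConfig {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9094"); // placeholder

            try (AdminClient admin = AdminClient.create(props)) {
                // Per-broker dynamic configs live under the broker id ("1" is a placeholder)
                ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "1");
                Config config = admin.describeConfigs(Collections.singleton(broker))
                                     .all().get().get(broker);
                ConfigEntry entry = config.get("listener.name.interbroker.ssl.keystore.location");
                // source() shows whether the value is DYNAMIC_BROKER_CONFIG, STATIC_BROKER_CONFIG, etc.
                System.out.println(entry == null ? "not set"
                        : entry.value() + " (" + entry.source() + ")");
            }
        }
    }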
>>>>>> On 1 Aug 2019, at 7:10 pm, Rajini Sivaram <rajinisiva...@gmail.com> wrote:
>>>>>>
>>>>>> Hi Michael,
>>>>>>
>>>>>> Thank you for reporting this. It is possible that we missed out some client-side update in KAFKA-8336. Would it be possible for you to provide logs from multiple brokers around the time of a broker restart when handshake failures started occurring, so that we can try to work out which client connection is failing? Corresponding to the handshake failure entry from SocketServer in the restarted broker, there should be a failed client connection on another broker. If that has a stack trace, it would be useful.
>>>>>>
>>>>>> Also, just to confirm, are these correct:
>>>>>>
>>>>>> 1) All brokers are dynamically updated with new certs and this works fine without any errors (this just indicates that config update succeeded)
>>>>>> 2) If one broker is restarted, handshake failures appear in the logs and these are logged from SocketServer in the restarted broker. (this could indicate that some client-side cert was not updated dynamically)
>>>>>> 3) Once all brokers are restarted, there are no more handshake failures (this is the actual confirmation that certs from 1) work)
>>>>>>
>>>>>> Thank you,
>>>>>>
>>>>>> Rajini
>>>>>>
>>>>>> On Thu, Aug 1, 2019 at 12:40 AM Michael Carter <michael.car...@instaclustr.com> wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I'm having an issue with dynamic configuration of inter-broker SSL certificates (in Kafka 2.3.0) that I'm hoping someone can give me insight on. I've previously posted something similar on the Users email list, but I think it will need some help from developers experienced with how the networking code in Kafka works.
>>>>>>>
>>>>>>> I'm trying to use SSL two-way authentication for inter-broker communication, with short-lived SSL certificates, rotating them frequently without needing to do a broker restart. So, on each broker in my cluster, I periodically generate a new certificate keystore file and set the "listener.name.interbroker.ssl.keystore.location" broker config property. (I'm using inter.broker.listener.name=INTERBROKER)
>>>>>>>
>>>>>>> Setting this property works fine, and everything appears ok. And manually connecting to the inter-broker listener shows it's correctly serving the new certificate. But if I ever restart a broker after the original certificate has expired (the one the broker started up with, which is no longer configured anywhere), then communication failures between brokers start to occur. My logs fill up with messages like this:
>>>>>>>
>>>>>>> [2019-07-22 03:57:43,605] INFO [SocketServer brokerId=1] Failed authentication with 10.224.70.3 (SSL handshake failed) (org.apache.kafka.common.network.Selector)
>>>>>>>
>>>>>>> A little bit of extra logging injected into the code tells me that the failures are caused by the out-of-date SSL certificates being used. So it seems there are some network components inside Kafka still stuck on the old settings.
>>>>>>> This sounds similar to the behaviour described in KAFKA-8336 (https://issues.apache.org/jira/browse/KAFKA-8336), but this is marked as fixed in 2.3.0.
>>>>>>>
>>>>>>> I've confirmed that all the SslChannelBuilders and SslFactories appear to be reconfigured correctly when the dynamic setting is set. I've tried closing all existing KafkaChannels on a reconfiguration event, in order to force them to re-open with the new certificates, but the problem persists.
>>>>>>>
>>>>>>> Does anyone have any idea what other components may be hanging around, using the old certificates?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Michael
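For completeness, the dynamic update described above (setting listener.name.interbroker.ssl.keystore.location for a single broker without a restart) can be issued with the Java AdminClient's incrementalAlterConfigs, available in 2.3. A minimal sketch; the bootstrap address, broker id and keystore path are placeholders, SSL client settings for the admin connection are omitted, and the keystore/key password configs would need to be included in the same call if they change along with the file:

    import java.util.Collection;
    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class RotateInterbrokerKeystore {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9094"); // placeholder

            try (AdminClient admin = AdminClient.create(props)) {
                // Target one broker's dynamic config (broker id "1" is a placeholder)
                ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "1");
                AlterConfigOp setKeystore = new AlterConfigOp(
                        new ConfigEntry("listener.name.interbroker.ssl.keystore.location",
                                "/etc/kafka/ssl/interbroker-new.keystore.jks"), // placeholder path
                        AlterConfigOp.OpType.SET);
                Map<ConfigResource, Collection<AlterConfigOp>> updates =
                        Collections.singletonMap(broker, Collections.singletonList(setKeystore));
                // Blocks until the broker has validated and applied the new keystore
                admin.incrementalAlterConfigs(updates).all().get();
            }
        }
    }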