Hi Colin,

Thanks for the explanation. Yes, I can see the sense in that, although it was certainly a surprise for me. I’ll just have to be a bit more careful about instantiating AdminClients in the future.

Cheers,
Michael
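A minimal sketch of what that carefulness can look like in practice, assuming the standard Java AdminClient API: close the client as soon as the work is done (which stops the internal thread that polls the brokers for metadata), and if a long-lived client is genuinely needed, stretch the refresh interval via metadata.max.age.ms. The bootstrap address and interval below are placeholders, and SSL client settings are omitted.

    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;

    public class AdminClientUsage {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9094"); // placeholder
            // Slow down the background metadata refresh for a long-lived client (10 minutes here)
            props.put("metadata.max.age.ms", "600000");

            // try-with-resources closes the client, and with it the internal thread
            // that periodically fetches cluster metadata, as soon as the work is done.
            try (AdminClient admin = AdminClient.create(props)) {
                admin.describeCluster().nodes().get().forEach(node ->
                        System.out.println(node.idString() + " -> " + node.host()));
            }
        }
    }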
> On 13 Aug 2019, at 1:36 am, Colin McCabe <cmcc...@apache.org> wrote:
>
> Hi Michael,
>
> The NetworkClient periodically fetches metadata so that it always knows what the cluster topology is. This also helps it to have some open connections when needed to reduce the latency of operations.
>
> To be fair, we haven’t thought very much about optimizing this since the overhead seems to be low. The update interval can also be configured.
>
> Best,
> Colin
>
> On Sun, Aug 11, 2019, at 16:43, Michael Carter wrote:
>> Hi Rajini,
>>
>> I just thought I'd let you know I've sorted through my issue now. It turns out that I was doing something silly. The SSL handshake errors weren't caused by the brokers talking to themselves but were instead being initiated from an AdminClient I had running in a separate process. Ironically, it was the same AdminClient I was using to rotate the certificates in the first place. I was naively assuming that because I wasn't asking the AdminClient to do anything at that exact moment, it wouldn't be contacting the brokers. Instead, it seems the AdminClient has an internal thread running that is periodically trying to get metadata information from the brokers.
>> Although now I'm wondering, why is this continual requesting necessary? Is there some reason that work isn't deferred until the AdminClient is specifically asked to do something?
>>
>> Cheers,
>> Michael
>>
>>> On 5 Aug 2019, at 3:33 pm, Michael Carter <michael.car...@instaclustr.com> wrote:
>>>
>>> Hi Rajini,
>>>
>>> Thanks for the good suggestions. I’ve investigated further and confirmed that the issue continues to occur even with hostname verification turned off.
>>> I put in the ssl debug option and got some interesting information. I’ve attached an example of what the logs output when SSL errors start occurring.
>>>
>>> From the debug output I can see that the internal IP addresses do form part of the alternative names on the certificate, which is good.
>>> What it shows is that the verification is failing because it is using an out-of-date certificate (one whose validity period has expired).
>>>
>>> It is interesting that in the attached example, the broker appears to be trying to connect to itself, and one end of the connection has the new, valid certificate, while the other end has the old, out-of-date certificate, even though my understanding is that it should be one and the same certificate.
>>> Does this suggest that there is some network component inside Kafka that hasn’t responded to the reconfiguration of the inter-broker keystore property?
>>>
>>> Cheers,
>>> Michael
>>> <debug-handshake.log>
>>>
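One way to confirm what a keystore file on disk actually contains, independent of what the broker is serving, is to read it with the JDK's KeyStore API and print each certificate's validity window. This is only a sketch; the path, password and keystore type ("PKCS12" vs "JKS") are placeholders that would need to match the real setup, and keytool -list -v gives the same information.

    import java.io.FileInputStream;
    import java.security.KeyStore;
    import java.security.cert.X509Certificate;
    import java.util.Collections;

    public class KeystoreExpiryCheck {
        public static void main(String[] args) throws Exception {
            // Placeholder path, password and type for the broker's inter-broker keystore
            KeyStore ks = KeyStore.getInstance("PKCS12");
            try (FileInputStream in = new FileInputStream("/etc/kafka/ssl/interbroker.keystore.p12")) {
                ks.load(in, "changeit".toCharArray());
            }
            // Print the validity window of every certificate entry in the keystore
            for (String alias : Collections.list(ks.aliases())) {
                X509Certificate cert = (X509Certificate) ks.getCertificate(alias);
                if (cert != null) {
                    System.out.println(alias + ": notBefore=" + cert.getNotBefore()
                            + " notAfter=" + cert.getNotAfter());
                }
            }
        }
    }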
>>>> On 2 Aug 2019, at 7:04 pm, Rajini Sivaram <rajinisiva...@gmail.com> wrote:
>>>>
>>>> Hi Michael,
>>>>
>>>> Thanks for the logs.
>>>>
>>>> When the keystore is reconfigured in a stable cluster, we wouldn't expect to see any failures even if there is an issue with the certs, since existing connections continue to work without re-authenticating. When a broker is restarted, there could be controller re-election, leader elections etc. causing new connections on all brokers, so failures can crop up on all brokers.
>>>>
>>>> When we reconfigure the inter-broker keystore, we validate the keystore against the truststore on each broker. Since we are seeing handshake failures within a broker (e.g. controller connections), it suggests that the validation performed by the actual connection is failing while our validation during reconfiguration succeeded. The main difference between the two is hostname verification - we disable hostname verification when validating configs, but for actual connections this seems to be enabled in your configs. Can you verify that your certs contain the host IP address used in the advertised listeners list (e.g. 10.224.59.10)? In order to rule out hostname issues, you could rerun your tests with `listener.name.interbroker.ssl.endpoint.identification.algorithm=`, i.e. an empty string to disable hostname verification.
>>>>
>>>> If the issue still occurs with hostname verification disabled, it will be useful to run the brokers with KAFKA_OPTS="-Djavax.net.debug=ssl,handshake". That would show the certs used as well as more details about the handshake failure.
>>>>
>>>> Regards,
>>>>
>>>> Rajini
>>>>
>>>> On Fri, Aug 2, 2019 at 8:49 AM Michael Carter <michael.car...@instaclustr.com> wrote:
>>>>>
>>>>> Hi Rajini,
>>>>>
>>>>> Thanks for your response. I've attached some logs from a three-node cluster that I was doing this on. For the purpose of this test I'm using certificates with ten minutes' validity. The original certificate was valid from roughly 05:32 to 05:42, I rotated them every five minutes, and stopped the third broker at 05:48 (five or so minutes after the original certificates would have expired). At this point the errors started to appear in the logs in volume. A lot of them are errors connecting to the restarted broker, but not all of them. Some of them are the first broker (which hasn't been restarted) apparently having trouble connecting to itself. I restarted the other brokers at about 05:49 but the errors continued.
>>>>>
>>>>> To answer your questions:
>>>>> 1) All brokers are dynamically updated with new certs and this works fine without any errors (this just indicates that config update succeeded)
>>>>>
>>>>> - Yes, that is correct.
>>>>>
>>>>> 2) If one broker is restarted, handshake failures appear in the logs and these are logged from SocketServer in the restarted broker. (this could indicate that some client-side cert was not updated dynamically)
>>>>>
>>>>> - The errors tend to appear in all the brokers' logs. I can also get the same thing just by shutting down one of the brokers. It doesn't actually need to be restarted, since the SslAuthenticationExceptions will appear in the brokers that weren't shut down or restarted regardless. In fact, I can get a message saying "Failed authentication with <IP address> (SSL handshake failed)" when no broker is running on that IP address at all. Some of the messages indicate a failure of a broker connecting to itself (e.g. "Failed authentication with /10.224.59.10 (SSL handshake failed)" in the logs of the broker running on 10.224.59.10)
>>>>>
>>>>> 3) Once all brokers are restarted, there are no more handshake failures (this is the actual confirmation that certs from 1) work)
>>>>>
>>>>> - Unfortunately no, the errors persist even after this. Sometimes it will recover from this situation if I run another dynamic certificate rotation. But restarting the brokers doesn't seem to clear it.
>>>>> It might be tempting to think that there is something wrong with the new certificates, but they're generated in exactly the same way as the original certificates (except for the validity time period), and they appear to be served correctly after the dynamic rotation. (I'm checking with this command: openssl s_client -debug -connect 10.224.59.10:9094 2>/dev/null </dev/null | openssl x509 -noout -dates ). Also, subsequent updates with similar certificates can sometimes fix the problem.
>>>>>
>>>>> The fact that it can persist after a restart is what really sticks out to me. Is there any state stored outside of Kafka that could be getting in the way? Does Zookeeper store anything other than the filename of the keystore?
>>>>>
>>>>> Cheers,
>>>>> Michael
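On the question of what is stored where: in 2.3 the per-broker dynamic values are persisted in ZooKeeper (under /config/brokers/<broker-id>), while the keystore file contents stay on the broker's local disk. The currently applied value and its source can be read back with describeConfigs. A minimal sketch, assuming the Java AdminClient API; the bootstrap address and broker id are placeholders:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.Config;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class ShowKeystoreConfig {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9094"); // placeholder

            try (AdminClient admin = AdminClient.create(props)) {
                // Per-broker dynamic configs live under the broker id ("1" is a placeholder)
                ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "1");
                Config config = admin.describeConfigs(Collections.singleton(broker))
                                     .all().get().get(broker);
                ConfigEntry entry = config.get("listener.name.interbroker.ssl.keystore.location");
                // source() shows whether the value is DYNAMIC_BROKER_CONFIG, STATIC_BROKER_CONFIG, etc.
                System.out.println(entry == null ? "not set"
                        : entry.value() + " (" + entry.source() + ")");
            }
        }
    }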
>>>>>> On 1 Aug 2019, at 7:10 pm, Rajini Sivaram <rajinisiva...@gmail.com> wrote:
>>>>>>
>>>>>> Hi Michael,
>>>>>>
>>>>>> Thank you for reporting this. It is possible that we missed out some client-side update in KAFKA-8336. Would it be possible for you to provide logs from multiple brokers around the time of a broker restart when handshake failures started occurring, so that we can try to work out which client connection is failing? Corresponding to the handshake failure entry from SocketServer in the restarted broker, there should be a failed client connection on another broker. If that has a stack trace, it would be useful.
>>>>>>
>>>>>> Also, just to confirm, are these correct:
>>>>>>
>>>>>> 1) All brokers are dynamically updated with new certs and this works fine without any errors (this just indicates that config update succeeded)
>>>>>> 2) If one broker is restarted, handshake failures appear in the logs and these are logged from SocketServer in the restarted broker. (this could indicate that some client-side cert was not updated dynamically)
>>>>>> 3) Once all brokers are restarted, there are no more handshake failures (this is the actual confirmation that certs from 1) work)
>>>>>>
>>>>>> Thank you,
>>>>>>
>>>>>> Rajini
>>>>>>
>>>>>> On Thu, Aug 1, 2019 at 12:40 AM Michael Carter <michael.car...@instaclustr.com> wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I'm having an issue with dynamic configuration of inter-broker SSL certificates (in Kafka 2.3.0) that I'm hoping someone can give me insight on. I've previously posted something similar on the Users email list, but I think it will need some help from developers experienced with how the networking code in Kafka works.
>>>>>>>
>>>>>>> I'm trying to use SSL two-way authentication for inter-broker communication, with short-lived SSL certificates, rotating them frequently without needing to do a broker restart. So, on each broker in my cluster, I periodically generate a new certificate keystore file and set the "listener.name.interbroker.ssl.keystore.location" broker config property. (I'm using inter.broker.listener.name=INTERBROKER)
>>>>>>>
>>>>>>> Setting this property works fine, and everything appears ok. And manually connecting to the inter-broker listener shows it's correctly serving the new certificate. But if I ever restart a broker after the original certificate has expired (the one the broker started up with, which is no longer configured anywhere), then communication failures between brokers start to occur. My logs fill up with messages like this:
>>>>>>>
>>>>>>> [2019-07-22 03:57:43,605] INFO [SocketServer brokerId=1] Failed authentication with 10.224.70.3 (SSL handshake failed) (org.apache.kafka.common.network.Selector)
>>>>>>>
>>>>>>> A little bit of extra logging injected into the code tells me that the failures are caused by the out-of-date SSL certificates being used. So it seems there are some network components inside Kafka still stuck on the old settings.
>>>>>>> This sounds similar to the behaviour described in KAFKA-8336 (https://issues.apache.org/jira/browse/KAFKA-8336), but this is marked as fixed in 2.3.0.
>>>>>>>
>>>>>>> I've confirmed that all the SslChannelBuilders and SslFactories appear to be reconfigured correctly when the dynamic setting is set. I've tried closing all existing KafkaChannels on a reconfiguration event, in order to force them to re-open with the new certificates, but the problem persists.
>>>>>>>
>>>>>>> Does anyone have any idea what other components may be hanging around, using the old certificates?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Michael
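For completeness, the dynamic update described above (setting listener.name.interbroker.ssl.keystore.location for a single broker without a restart) can be issued with the Java AdminClient's incrementalAlterConfigs, available in 2.3. A minimal sketch; the bootstrap address, broker id and keystore path are placeholders, SSL client settings for the admin connection are omitted, and the keystore/key password configs would need to be included in the same call if they change along with the file:

    import java.util.Collection;
    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class RotateInterbrokerKeystore {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9094"); // placeholder

            try (AdminClient admin = AdminClient.create(props)) {
                // Target one broker's dynamic config (broker id "1" is a placeholder)
                ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "1");
                AlterConfigOp setKeystore = new AlterConfigOp(
                        new ConfigEntry("listener.name.interbroker.ssl.keystore.location",
                                "/etc/kafka/ssl/interbroker-new.keystore.jks"), // placeholder path
                        AlterConfigOp.OpType.SET);
                Map<ConfigResource, Collection<AlterConfigOp>> updates =
                        Collections.singletonMap(broker, Collections.singletonList(setKeystore));
                // Blocks until the broker has validated and applied the new keystore
                admin.incrementalAlterConfigs(updates).all().get();
            }
        }
    }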