[ 
https://issues.apache.org/jira/browse/KAFKA-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on KAFKA-4959 started by Onur Karaman.
-------------------------------------------
> remove controller concurrent access to non-threadsafe NetworkClient, 
> Selector, and SSLEngine
> --------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-4959
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4959
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Onur Karaman
>            Assignee: Onur Karaman
>
> This brought down a cluster by causing continuous controller moves.
> ZkClient's ZkEventThread and a RequestSendThread can concurrently use objects 
> that aren't thread-safe:
> * Selector
> * NetworkClient
> * SSLEngine (this was the big one for us; we enable SSL for inter-broker 
> communication).
> As per the "Concurrency Notes" section from the [SSLEngine 
> javadoc|https://docs.oracle.com/javase/7/docs/api/javax/net/ssl/SSLEngine.html]:
> bq. two threads must not attempt to call the same method (either wrap() or 
> unwrap()) concurrently
> SSLEngine.wrap gets called in:
> * SslTransportLayer.write
> * SslTransportLayer.handshake
> * SslTransportLayer.close
> It turns out that the ZkEventThread and RequestSendThread can concurrently 
> call SSLEngine.wrap (see the sketch after this list):
> * ZkEventThread calls SslTransportLayer.close from 
> ControllerChannelManager.removeExistingBroker
> * RequestSendThread can call SslTransportLayer.write or 
> SslTransportLayer.handshake from NetworkClient.poll
> Suppose the controller moves for whatever reason. The former controller could 
> have had a RequestSendThread that was in the middle of sending out messages 
> to the cluster when the ZkEventThread began executing 
> KafkaController.onControllerResignation, which calls 
> ControllerChannelManager.shutdown, which sequentially cleans up the 
> controller-to-broker queue and connection for every broker in the cluster. 
> This cleanup includes the call to ControllerChannelManager.removeExistingBroker 
> mentioned earlier, causing the concurrent call to SSLEngine.wrap. This 
> concurrent call throws a BufferOverflowException, which 
> ControllerChannelManager.removeExistingBroker catches, so 
> ControllerChannelManager.shutdown moves on to cleaning up the next 
> controller-to-broker queue and connection, skipping the remaining cleanup 
> steps such as clearing the queue, stopping the RequestSendThread, and removing 
> the entry from its brokerStateInfo map.
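> The control flow above can be summarized with a simplified, hypothetical 
> rendering (the real ControllerChannelManager is Scala; the names and structure 
> here only approximate it). The point is that the per-broker catch swallows the 
> BufferOverflowException, so everything after the failing close is skipped for 
> that broker while shutdown proceeds to the next one.
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
> 
> // Hypothetical sketch, not the actual Kafka implementation.
> class ControllerChannelManagerSketch {
>     private final Map<Integer, Object> brokerStateInfo = new HashMap<>();
> 
>     void shutdown() {
>         // Sequentially clean up every controller-to-broker queue and connection.
>         for (Integer brokerId : new HashMap<>(brokerStateInfo).keySet()) {
>             removeExistingBroker(brokerId);
>         }
>     }
> 
>     private void removeExistingBroker(int brokerId) {
>         try {
>             closeConnection(brokerId);        // can hit the concurrent SSLEngine.wrap
>             clearQueue(brokerId);             // skipped once the exception is thrown
>             stopRequestSendThread(brokerId);  // skipped
>             brokerStateInfo.remove(brokerId); // skipped
>         } catch (Throwable t) {
>             // Swallowing the exception here is what leaves the queue, the
>             // RequestSendThread, and the brokerStateInfo entry behind.
>         }
>     }
> 
>     private void closeConnection(int brokerId) { /* Selector.close -> SslTransportLayer.close */ }
>     private void clearQueue(int brokerId) { }
>     private void stopRequestSendThread(int brokerId) { }
> }
> {code}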
> Because the cleanup fails out of Selector.close, the sensors corresponding to 
> that broker connection are never cleaned up. Any later attempt at initializing 
> an identical Selector will result in a sensor collision and therefore cause 
> Selector initialization to throw an exception. In other words, any later 
> attempt by this broker to become controller again will fail during 
> initialization. When controller initialization fails, the controller deletes 
> the /controller znode and lets another broker take over.
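> The sensor collision itself can be demonstrated against the kafka-clients 
> Metrics API in isolation (a sketch with made-up sensor and metric names, not 
> the Selector's actual ones, and assuming Metrics rejects duplicate metric 
> registrations): a metric left behind by the first Selector makes a later 
> registration under the same name throw.
> {code:java}
> import org.apache.kafka.common.MetricName;
> import org.apache.kafka.common.metrics.Metrics;
> import org.apache.kafka.common.metrics.Sensor;
> import org.apache.kafka.common.metrics.stats.Avg;
> 
> public class SensorCollisionSketch {
>     public static void main(String[] args) {
>         Metrics metrics = new Metrics();
> 
>         // First "Selector" registers a per-connection metric and is never
>         // cleaned up because its close failed partway through.
>         Sensor first = metrics.sensor("node-1.bytes-sent");
>         MetricName name = metrics.metricName("byte-rate", "sketch-group");
>         first.add(name, new Avg());
> 
>         // A later, identical "Selector" tries to register the same metric name.
>         Sensor second = metrics.sensor("node-1.bytes-sent.retry");
>         // Expected to throw IllegalArgumentException: metric already exists.
>         second.add(name, new Avg());
>     }
> }
> {code}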
> Now suppose the controller moves enough times such that every broker hits the 
> BufferOverflowException concurrency issue. We're now guaranteed to fail 
> controller initialization due to the sensor collision on every controller 
> transition, so the controller will move across brokers continuously.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
