[
https://issues.apache.org/jira/browse/KAFKA-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Work on KAFKA-4959 started by Onur Karaman.
-------------------------------------------
> remove controller concurrent access to non-threadsafe NetworkClient,
> Selector, and SSLEngine
> --------------------------------------------------------------------------------------------
>
> Key: KAFKA-4959
> URL: https://issues.apache.org/jira/browse/KAFKA-4959
> Project: Kafka
> Issue Type: Bug
> Reporter: Onur Karaman
> Assignee: Onur Karaman
>
> This brought down a cluster by causing continuous controller moves.
> ZkClient's ZkEventThread and a RequestSendThread can concurrently use objects
> that aren't thread-safe:
> * Selector
> * NetworkClient
> * SSLEngine (this was the big one for us. We turn on SSL for interbroker
> communication).
> As per the "Concurrency Notes" section from the [SSLEngine
> javadoc|https://docs.oracle.com/javase/7/docs/api/javax/net/ssl/SSLEngine.html]:
> bq. two threads must not attempt to call the same method (either wrap() or
> unwrap()) concurrently
> SSLEngine.wrap gets called in:
> * SslTransportLayer.write
> * SslTransportLayer.handshake
> * SslTransportLayer.close
> It turns out that the ZkEventThread and RequestSendThread can concurrently
> call SSLEngine.wrap:
> * ZkEventThread calls SslTransportLayer.close from
> ControllerChannelManager.removeExistingBroker
> * RequestSendThread can call SslTransportLayer.write or
> SslTransportLayer.handshake from NetworkClient.poll
> Suppose the controller moves for whatever reason. The former controller could
> have had a RequestSendThread who was in the middle of sending out messages to
> the cluster while the ZkEventThread began executing
> KafkaController.onControllerResignation, which calls
> ControllerChannelManager.shutdown, which sequentially cleans up the
> controller-to-broker queue and connection for every broker in the cluster.
> This cleanup includes the call to
> ControllerChannelManager.removeExistingBroker as mentioned earlier, causing
> the concurrent call to SSLEngine.wrap. This concurrent call throws a
> BufferOverflowException which ControllerChannelManager.removeExistingBroker
> catches so the ControllerChannelManager.shutdown moves onto cleaning up the
> next controller-to-broker queue and connection, skipping the cleanup steps
> such as clearing the queue, stopping the RequestSendThread, and removing the
> entry from its brokerStateInfo map.
> By failing out of the Selector.close, the sensors corresponding to the broker
> connection has not been cleaned up. Any later attempt at initializing an
> identical Selector will result in a sensor collision and therefore cause
> Selector initialization to throw an exception. In other words, any later
> attempts by this broker to become controller again will fail on
> initialization. When controller initialization fails, the controller deletes
> the /controller znode and lets another broker take over.
> Now suppose the controller moves enough times such that every broker hits the
> BufferOverflowException concurrency issue. We're now guaranteed to fail
> controller initialization due to the sensor collision on every controller
> transition, so the controller will move across brokers continuously.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)