Onur Karaman created KAFKA-4959:
-----------------------------------

             Summary: remove controller concurrent access to non-threadsafe 
NetworkClient, Selector, and SSLEngine
                 Key: KAFKA-4959
                 URL: https://issues.apache.org/jira/browse/KAFKA-4959
             Project: Kafka
          Issue Type: Bug
            Reporter: Onur Karaman
            Assignee: Onur Karaman


This brought down a cluster by causing continuous controller moves.

ZkClient's ZkEventThread and a RequestSendThread can concurrently use objects 
that aren't thread-safe:
* Selector
* NetworkClient
* SSLEngine (this was the big one for us, since we turn on SSL for inter-broker 
communication)

As per the "Concurrency Notes" section from the [SSLEngine 
javadoc|https://docs.oracle.com/javase/7/docs/api/javax/net/ssl/SSLEngine.html]:
bq. two threads must not attempt to call the same method (either wrap() or 
unwrap()) concurrently
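
For illustration, here is a minimal, self-contained Java sketch (hypothetical code, not 
taken from Kafka; the class and thread names are made up) of the pattern that contract 
forbids: two threads, standing in for the ZkEventThread and the RequestSendThread, 
calling wrap() on the same SSLEngine with no external synchronization.

{code:java}
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;
import java.nio.ByteBuffer;

// Hypothetical sketch: two threads share one SSLEngine and call wrap() concurrently,
// which the SSLEngine javadoc's "Concurrency Notes" forbids. Results are undefined;
// in the controller's case this surfaced as a BufferOverflowException.
public class ConcurrentWrapSketch {
    public static void main(String[] args) throws Exception {
        SSLEngine engine = SSLContext.getDefault().createSSLEngine();
        engine.setUseClientMode(true);
        engine.beginHandshake();

        Runnable wrapTask = () -> {
            ByteBuffer src = ByteBuffer.allocate(0);
            ByteBuffer dst = ByteBuffer.allocate(engine.getSession().getPacketBufferSize());
            try {
                engine.wrap(src, dst);  // same method, same engine, two threads: unsafe
            } catch (Exception e) {
                System.err.println(Thread.currentThread().getName() + ": " + e);
            }
        };

        Thread zkEventAnalogue = new Thread(wrapTask, "zk-event-thread-analogue");
        Thread senderAnalogue = new Thread(wrapTask, "request-send-thread-analogue");
        zkEventAnalogue.start();
        senderAnalogue.start();
        zkEventAnalogue.join();
        senderAnalogue.join();
    }
}
{code}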

SSLEngine.wrap gets called in:
* SslTransportLayer.write
* SslTransportLayer.handshake
* SslTransportLayer.close

It turns out that the ZkEventThread and RequestSendThread can concurrently call 
SSLEngine.wrap:
* ZkEventThread calls SslTransportLayer.close from 
ControllerChannelManager.removeExistingBroker
* RequestSendThread can call SslTransportLayer.write or 
SslTransportLayer.handshake from NetworkClient.poll

Suppose the controller moves for whatever reason. The former controller could have had a 
RequestSendThread that was still in the middle of sending out messages to the cluster 
when the ZkEventThread began executing KafkaController.onControllerResignation, which 
calls ControllerChannelManager.shutdown, which sequentially cleans up the 
controller-to-broker queue and connection for every broker in the cluster. This cleanup 
includes the call to ControllerChannelManager.removeExistingBroker mentioned earlier, 
causing the concurrent call to SSLEngine.wrap. That concurrent call throws a 
BufferOverflowException, which ControllerChannelManager.removeExistingBroker catches, so 
ControllerChannelManager.shutdown simply moves on to the next controller-to-broker queue 
and connection. The remaining cleanup steps for the failed broker are skipped: clearing 
its queue, stopping its RequestSendThread, and removing its entry from the 
brokerStateInfo map.
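
To make that control flow concrete, here is a condensed, hypothetical Java model of the 
shutdown path (the real code is Scala in ControllerChannelManager; BrokerState, 
closeConnection, and the other names below are simplified stand-ins, not the actual 
identifiers):

{code:java}
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical, self-contained model of the shutdown path described above.
public class ShutdownSketch {

    static class BrokerState {
        final ConcurrentLinkedQueue<String> messageQueue = new ConcurrentLinkedQueue<>();
        volatile boolean senderRunning = true;
        final boolean closeThrows;  // stands in for the racy SslTransportLayer.close

        BrokerState(boolean closeThrows) { this.closeThrows = closeThrows; }

        void closeConnection() {
            if (closeThrows) {
                // stands in for SSLEngine.wrap racing with the RequestSendThread
                throw new java.nio.BufferOverflowException();
            }
        }
    }

    static final Map<Integer, BrokerState> brokerStateInfo = new LinkedHashMap<>();

    static void removeExistingBroker(int brokerId) {
        try {
            BrokerState state = brokerStateInfo.get(brokerId);
            state.closeConnection();           // throws for the affected broker...
            state.messageQueue.clear();        // ...so the queue is never cleared,
            state.senderRunning = false;       // ...the sender is never stopped,
            brokerStateInfo.remove(brokerId);  // ...and the stale entry remains
        } catch (Throwable t) {
            // the exception is only logged; shutdown() just moves on to the next broker
            System.err.println("Error while removing broker by the controller: " + t);
        }
    }

    static void shutdown() {
        for (Integer brokerId : new ArrayList<>(brokerStateInfo.keySet())) {
            removeExistingBroker(brokerId);
        }
    }

    public static void main(String[] args) {
        brokerStateInfo.put(1, new BrokerState(true));   // hits the concurrency bug
        brokerStateInfo.put(2, new BrokerState(false));
        shutdown();
        System.out.println("stale brokerStateInfo entries after shutdown: " + brokerStateInfo.keySet());
    }
}
{code}

Running this leaves broker 1's queue uncleared, its sender still "running", and its entry 
still in brokerStateInfo, which is exactly the half-cleaned state the next controller 
transition trips over.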

Because Selector.close fails partway through, the sensors corresponding to that broker 
connection are never cleaned up. Any later attempt to initialize an identical Selector 
then hits a sensor collision, which causes Selector initialization to throw an 
exception. In other words, any later attempt by this broker to become controller again 
fails during initialization. When controller initialization fails, the controller 
deletes the /controller znode and lets another broker take over.
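
The collision itself can be sketched against the client metrics API (this assumes the 
org.apache.kafka.common.metrics classes from the clients jar; the sensor and metric 
names below are made up and not the ones Selector actually registers):

{code:java}
import org.apache.kafka.common.MetricName;
import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.metrics.stats.Avg;

// Hypothetical sketch of the sensor/metric collision: if the metrics registered by the
// first Selector are never removed (because Selector.close failed), a later Selector
// that tries to register the same metric name gets an IllegalArgumentException.
public class SensorCollisionSketch {
    public static void main(String[] args) {
        Metrics metrics = new Metrics();
        MetricName name = metrics.metricName("connection-close-rate", "controller-channel-metrics");

        Sensor first = metrics.sensor("connections-closed-broker-1");
        first.add(name, new Avg());  // registered by the first Selector, never cleaned up

        // A later, identically configured Selector registers the same metric name again.
        // Since the old metric still exists, this throws IllegalArgumentException,
        // failing Selector construction and, with it, controller initialization.
        Sensor second = metrics.sensor("connections-closed-broker-1-retry");
        second.add(name, new Avg());
    }
}
{code}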

Now suppose the controller moves enough times that every broker hits the 
BufferOverflowException concurrency issue. We're then guaranteed to fail controller 
initialization due to the sensor collision on every controller transition, so the 
controller will move across brokers continuously.



