[ https://issues.apache.org/jira/browse/KAFKA-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Work on KAFKA-4959 started by Onur Karaman.
-------------------------------------------
> remove controller concurrent access to non-threadsafe NetworkClient, Selector, and SSLEngine
> --------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-4959
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4959
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Onur Karaman
>            Assignee: Onur Karaman
>
> This brought down a cluster by causing continuous controller moves.
>
> ZkClient's ZkEventThread and a RequestSendThread can concurrently use
> objects that aren't thread-safe:
> * Selector
> * NetworkClient
> * SSLEngine (this was the big one for us; we turn on SSL for interbroker
>   communication)
>
> As per the "Concurrency Notes" section from the [SSLEngine
> javadoc|https://docs.oracle.com/javase/7/docs/api/javax/net/ssl/SSLEngine.html]:
> bq. two threads must not attempt to call the same method (either wrap() or
> unwrap()) concurrently
>
> SSLEngine.wrap gets called in:
> * SslTransportLayer.write
> * SslTransportLayer.handshake
> * SslTransportLayer.close
>
> It turns out that the ZkEventThread and RequestSendThread can concurrently
> call SSLEngine.wrap:
> * ZkEventThread calls SslTransportLayer.close from
>   ControllerChannelManager.removeExistingBroker
> * RequestSendThread can call SslTransportLayer.write or
>   SslTransportLayer.handshake from NetworkClient.poll
>
> Suppose the controller moves for whatever reason. The former controller
> could have had a RequestSendThread that was in the middle of sending out
> messages to the cluster while the ZkEventThread began executing
> KafkaController.onControllerResignation, which calls
> ControllerChannelManager.shutdown, which sequentially cleans up the
> controller-to-broker queue and connection for every broker in the cluster.
> This cleanup includes the call to
> ControllerChannelManager.removeExistingBroker as mentioned earlier, causing
> the concurrent call to SSLEngine.wrap. This concurrent call throws a
> BufferOverflowException, which ControllerChannelManager.removeExistingBroker
> catches, so ControllerChannelManager.shutdown moves on to cleaning up the
> next controller-to-broker queue and connection, skipping the remaining
> cleanup steps for that broker, such as clearing the queue, stopping the
> RequestSendThread, and removing the entry from its brokerStateInfo map.
>
> By failing out of Selector.close, the sensors corresponding to the broker
> connection have not been cleaned up. Any later attempt at initializing an
> identical Selector will result in a sensor collision and therefore cause
> Selector initialization to throw an exception. In other words, any later
> attempt by this broker to become controller again will fail on
> initialization. When controller initialization fails, the controller deletes
> the /controller znode and lets another broker take over.
>
> Now suppose the controller moves enough times that every broker hits the
> BufferOverflowException concurrency issue. We're now guaranteed to fail
> controller initialization due to the sensor collision on every controller
> transition, so the controller will move across brokers continuously.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
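
The race described in the issue comes from one thread closing a connection while another thread may still be writing to it. A minimal Java sketch of one way to avoid that, not necessarily the change that will be made for KAFKA-4959: stop and join the sending thread first, and only then close the connection from the calling thread. Every name here (UnsafeConnection, SendThread, removeBroker, demo) is hypothetical and stands in for the Kafka classes mentioned above; none of it is Kafka code.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class OrderedShutdownSketch {

    // Hypothetical stand-in for a non-thread-safe object such as SSLEngine.
    // It records whether two threads were ever inside it at the same time.
    static final class UnsafeConnection {
        private final AtomicInteger callersInside = new AtomicInteger();
        private volatile boolean overlapSeen = false;
        private volatile boolean closed = false;

        private void enter() {
            if (callersInside.incrementAndGet() > 1) overlapSeen = true;
        }

        private void exit() {
            callersInside.decrementAndGet();
        }

        void write(String request) {
            enter();
            try { /* a real transport layer would call SSLEngine.wrap here */ }
            finally { exit(); }
        }

        void close() {
            enter();
            try { closed = true; }
            finally { exit(); }
        }

        boolean overlapSeen() { return overlapSeen; }
        boolean isClosed() { return closed; }
    }

    // Hypothetical analogue of a per-broker request send thread.
    static final class SendThread extends Thread {
        private final BlockingQueue<String> queue;
        private final UnsafeConnection connection;
        private volatile boolean running = true;

        SendThread(BlockingQueue<String> queue, UnsafeConnection connection) {
            this.queue = queue;
            this.connection = connection;
        }

        @Override
        public void run() {
            while (running) {
                try {
                    String request = queue.poll(10, TimeUnit.MILLISECONDS);
                    if (request != null) connection.write(request);
                } catch (InterruptedException e) {
                    // Woken by shutdown(); the loop condition decides whether to exit.
                }
            }
        }

        // Stops the loop and waits until the thread has really finished.
        void shutdown() {
            running = false;
            interrupt();
            try {
                join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while stopping sender", e);
            }
        }
    }

    // Cleanup in an order that keeps the connection single-threaded.
    static void removeBroker(SendThread sender, BlockingQueue<String> queue,
                             UnsafeConnection connection) {
        sender.shutdown();   // 1. after this, the sender can no longer touch the connection
        queue.clear();       // 2. drop requests that were never sent
        connection.close();  // 3. close on the calling thread only
    }

    // Runs the scenario once; true if the connection was closed without overlapping access.
    static boolean demo() {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        UnsafeConnection connection = new UnsafeConnection();
        SendThread sender = new SendThread(queue, connection);
        sender.start();
        for (int i = 0; i < 1000; i++) queue.add("request-" + i);
        removeBroker(sender, queue, connection);
        return connection.isClosed() && !connection.overlapSeen()
                && !sender.isAlive() && queue.isEmpty();
    }
}
```

The point of the ordering is that join() returns only after the sender has exited, so close() cannot overlap with write(). Closing first, as in the sequence the issue describes, leaves a window in which both threads are inside the connection.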