Ewen Cheslack-Postava created KAFKA-2459:
--------------------------------------------

             Summary: Connection backoff/blackout period should start when a connection is disconnected, not when the connection attempt was initiated
                 Key: KAFKA-2459
                 URL: https://issues.apache.org/jira/browse/KAFKA-2459
             Project: Kafka
          Issue Type: Bug
          Components: clients, consumer, producer
    Affects Versions: 0.8.2.1
            Reporter: Ewen Cheslack-Postava
            Assignee: Neha Narkhede


Currently the connection code for the new clients records the time at which a connection was initiated (NodeConnectionState.lastConnectMs) and uses it to compute blackout periods for nodes. During a blackout period, connections to the node are not attempted and the node is not considered a candidate for leastLoadedNode.

However, when a connection attempt takes longer than the blackout/backoff period (default 10ms), this results in incorrect behavior. If a broker is unavailable and does not explicitly reject the connection, so that the client instead waits for a connection timeout (e.g. due to firewall settings), then the backoff period will already have elapsed by the time the attempt fails. The node is then immediately considered ready for a new connection attempt and eligible for selection by leastLoadedNode for metadata updates.

I think it should be easy to reproduce and verify this problem manually by using tc to introduce enough latency to make connection failures take > 10ms.

The correct behavior would be to use the disconnection event to mark the end of the last connection attempt and then wait for the backoff period to elapse after that.

See http://mail-archives.apache.org/mod_mbox/kafka-users/201508.mbox/%3CCAJY8EofpeU4%2BAJ%3Dw91HDUx2RabjkWoU00Z%3DcQ2wHcQSrbPT4HA%40mail.gmail.com%3E for the original description of the problem.

This is related to KAFKA-1843, because leastLoadedNode will currently keep choosing the same node if this blackout period is not handled correctly, but it is a much smaller issue.
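
To illustrate the difference between the two ways of measuring the blackout period, here is a minimal sketch in Java. It is not the actual client code: the class BackoffSketch, its methods, and the node name "broker-1" are hypothetical and only model the timing logic described above. It also assumes a simplified state that ignores whether a node is currently connecting or connected.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the timing logic only; not Kafka's real connection-state code.
final class BackoffSketch {
    private final long backoffMs;
    // Per node: when the most recent connection attempt was initiated.
    private final Map<String, Long> lastConnectMs = new HashMap<>();
    // Per node: when the most recent connection attempt ended (disconnect event).
    private final Map<String, Long> lastDisconnectMs = new HashMap<>();

    BackoffSketch(long backoffMs) {
        this.backoffMs = backoffMs;
    }

    void connecting(String node, long nowMs) {
        lastConnectMs.put(node, nowMs);
    }

    void disconnected(String node, long nowMs) {
        lastDisconnectMs.put(node, nowMs);
    }

    // Behavior described in the report: blackout measured from the start of the attempt.
    boolean isBlackedOutFromAttemptStart(String node, long nowMs) {
        Long start = lastConnectMs.get(node);
        return start != null && nowMs - start < backoffMs;
    }

    // Proposed behavior: blackout measured from the disconnect event.
    boolean isBlackedOutFromDisconnect(String node, long nowMs) {
        Long end = lastDisconnectMs.get(node);
        return end != null && nowMs - end < backoffMs;
    }

    public static void main(String[] args) {
        BackoffSketch s = new BackoffSketch(10);   // 10ms backoff, as in the report
        s.connecting("broker-1", 0);               // attempt starts at t=0
        s.disconnected("broker-1", 3000);          // attempt times out at t=3000ms

        // One millisecond after the failure:
        System.out.println(s.isBlackedOutFromAttemptStart("broker-1", 3001)); // false
        System.out.println(s.isBlackedOutFromDisconnect("broker-1", 3001));   // true
        // Once the backoff has elapsed after the disconnect:
        System.out.println(s.isBlackedOutFromDisconnect("broker-1", 3010));   // false
    }
}
```

With the attempt-start measurement, the node is already outside the blackout window the moment the slow attempt fails, so it can be retried or picked again immediately. Measuring from the disconnect keeps it blacked out for the full backoff period.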