[ https://issues.apache.org/jira/browse/KAFKA-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eno Thereska reassigned KAFKA-2459:
-----------------------------------

    Assignee: Eno Thereska  (was: Manikumar Reddy)

> Connection backoff/blackout period should start when a connection is
> disconnected, not when the connection attempt was initiated
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-2459
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2459
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, consumer, producer
>    Affects Versions: 0.8.2.1
>            Reporter: Ewen Cheslack-Postava
>            Assignee: Eno Thereska
>
> Currently the connection code for new clients marks the time when a
> connection was initiated (NodeConnectionState.lastConnectMs) and then uses
> this to compute blackout periods for nodes, during which connections will not
> be attempted and the node is not considered a candidate for leastLoadedNode.
> However, in cases where the connection attempt takes longer than the
> blackout/backoff period (default 10ms), this results in incorrect behavior.
> If a broker is not available and, for example, the broker does not explicitly
> reject the connection, instead waiting for a connection timeout (e.g. due to
> firewall settings), then the backoff period will have already elapsed and the
> node will immediately be considered ready for a new connection attempt and a
> node to be selected by leastLoadedNode for metadata updates. I think it
> should be easy to reproduce and verify this problem manually by using tc to
> introduce enough latency to make connection failures take > 10ms.
> The correct behavior would use the disconnection event to mark the end of the
> last connection attempt and then wait for the backoff period to elapse after
> that.
> See
> http://mail-archives.apache.org/mod_mbox/kafka-users/201508.mbox/%3CCAJY8EofpeU4%2BAJ%3Dw91HDUx2RabjkWoU00Z%3DcQ2wHcQSrbPT4HA%40mail.gmail.com%3E
> for the original description of the problem.
> This is related to KAFKA-1843 because leastLoadedNode currently will
> consistently choose the same node if this blackout period is not handled
> correctly, but is a much smaller issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
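
For readers skimming the issue, the difference between the two blackout calculations described above can be illustrated with a small, self-contained sketch. This is not Kafka's actual client code: the class `BackoffSketch`, the `NodeState` holder, and both method names are hypothetical and exist only to show the arithmetic. The 10 ms backoff is the default quoted in the issue, and the 50 ms connection timeout is an assumed value.

```java
// Hypothetical sketch; names are illustrative and do not mirror Kafka's classes.
final class BackoffSketch {

    // Timestamps (in milliseconds) recorded for one node.
    static final class NodeState {
        final long lastConnectAttemptMs; // when the connection attempt was initiated
        final long lastDisconnectMs;     // when the failure or disconnect was observed

        NodeState(long lastConnectAttemptMs, long lastDisconnectMs) {
            this.lastConnectAttemptMs = lastConnectAttemptMs;
            this.lastDisconnectMs = lastDisconnectMs;
        }
    }

    // Behavior reported in the issue: blackout is measured from the attempt start.
    static boolean isBlackedOutFromAttempt(NodeState s, long backoffMs, long nowMs) {
        return nowMs - s.lastConnectAttemptMs < backoffMs;
    }

    // Behavior proposed in the issue: blackout is measured from the disconnect.
    static boolean isBlackedOutFromDisconnect(NodeState s, long backoffMs, long nowMs) {
        return nowMs - s.lastDisconnectMs < backoffMs;
    }

    public static void main(String[] args) {
        long backoffMs = 10;                  // default backoff quoted in the issue
        NodeState s = new NodeState(0, 50);   // attempt at t=0, assumed timeout at t=50

        // At t=50 the attempt-based window has long expired, so the node
        // looks ready again the instant the failure is noticed.
        System.out.println(isBlackedOutFromAttempt(s, backoffMs, 50));    // false
        // The disconnect-based window has only just started.
        System.out.println(isBlackedOutFromDisconnect(s, backoffMs, 50)); // true
        // It ends once the full backoff has elapsed after the disconnect.
        System.out.println(isBlackedOutFromDisconnect(s, backoffMs, 60)); // false
    }
}
```

With the attempt-based check, any failure that takes longer than the backoff to surface leaves no effective blackout at all, which matches the immediate-retry symptom reported above.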