[ https://issues.apache.org/jira/browse/KAFKA-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949651#comment-14949651 ]
ASF GitHub Bot commented on KAFKA-2459:
---------------------------------------

GitHub user enothereska opened a pull request:

    https://github.com/apache/kafka/pull/290

    KAFKA-2459: connection backoff, timeouts and retries

    This fix applies to three JIRAs, since they are all connected.

    KAFKA-2459 Connection backoff/blackout period should start when a
    connection is disconnected, not when the connection attempt was initiated

    Backoff when connection is disconnected

    KAFKA-2615 Poll() method is broken wrt time

    Added Time through the NetworkClient API. Minimal change.

    KAFKA-1843 Metadata fetch/refresh in new producer should handle all node
    connection states gracefully

    I’ve partially addressed this for a specific failure case in the JIRA.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/enothereska/kafka trunk

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/kafka/pull/290.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #290

----
commit 90c0085a76374fafe6fa62c18e3d24504852e687
Author: Eno Thereska <eno.there...@gmail.com>
Date:   2015-10-07T00:06:49Z

    Commits to fix timing issues in three JIRAs

commit ee66491fb36d55527d156afda90c3addc3eb3175
Author: Eno Thereska <eno.there...@gmail.com>
Date:   2015-10-07T00:07:21Z

    Merge remote-tracking branch 'apache-kafka/trunk' into trunk

commit 17a373733e414456475217248cbc7b0bc98fda40
Author: Eno Thereska <eno.there...@gmail.com>
Date:   2015-10-07T15:15:19Z

    Merge remote-tracking branch 'apache-kafka/trunk' into trunk

commit eb5fbf458a5b455ae8b3c8b3ebf32524f5a3ab3e
Author: Eno Thereska <eno.there...@gmail.com>
Date:   2015-10-07T16:20:45Z

    Removed debug messages

commit 041baae45012cf8f99afd2c8b5d9a8099a8a928b
Author: Eno Thereska <eno.there...@gmail.com>
Date:   2015-10-07T17:35:12Z

    Pick a node, but not one that is blacked out

commit 69679d7e61d36f76d2ea1dd1fcc0a1192c9b50d6
Author: Eno Thereska <eno.there...@gmail.com>
Date:   2015-10-08T17:18:02Z

    Removed unneeded checks

commit 3ce5e151396575f45d1f022720f454ac36653d0d
Author: Eno Thereska <eno.there...@gmail.com>
Date:   2015-10-08T17:18:18Z

    Merge remote-tracking branch 'apache-kafka/trunk' into trunk

commit 76e6a0d8ab3fe847b28edde2e0072e7fe06484ff
Author: Eno Thereska <eno.there...@gmail.com>
Date:   2015-10-08T23:35:41Z

    More efficient implementation of nodesEverSeen

----

> Connection backoff/blackout period should start when a connection is
> disconnected, not when the connection attempt was initiated
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-2459
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2459
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, consumer, producer
>    Affects Versions: 0.8.2.1
>            Reporter: Ewen Cheslack-Postava
>            Assignee: Eno Thereska
>
> Currently the connection code for new clients marks the time when a
> connection was initiated (NodeConnectionState.lastConnectMs) and then uses
> this to compute blackout periods for nodes, during which connections will not
> be attempted and the node is not considered a candidate for leastLoadedNode.
> However, in cases where the connection attempt takes longer than the
> blackout/backoff period (default 10ms), this results in incorrect behavior.
> If a broker is not available and, for example, the broker does not explicitly
> reject the connection, instead waiting for a connection timeout (e.g. due to
> firewall settings), then the backoff period will have already elapsed and the
> node will immediately be considered ready for a new connection attempt and a
> node to be selected by leastLoadedNode for metadata updates. I think it
> should be easy to reproduce and verify this problem manually by using tc to
> introduce enough latency to make connection failures take > 10ms.
> The correct behavior would use the disconnection event to mark the end of the
> last connection attempt and then wait for the backoff period to elapse after
> that.
> See
> http://mail-archives.apache.org/mod_mbox/kafka-users/201508.mbox/%3CCAJY8EofpeU4%2BAJ%3Dw91HDUx2RabjkWoU00Z%3DcQ2wHcQSrbPT4HA%40mail.gmail.com%3E
> for the original description of the problem.
> This is related to KAFKA-1843 because leastLoadedNode currently will
> consistently choose the same node if this blackout period is not handled
> correctly, but is a much smaller issue.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
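
The timing argument in the issue description can be made concrete with a small Java sketch. It is only an illustration under stated assumptions: the class and method names below are hypothetical and are not Kafka's NetworkClient or NodeConnectionState API, and the 50 ms connection timeout is an assumed value. The 10 ms backoff is the default quoted in the issue.

```java
// Hypothetical sketch: names are illustrative, not Kafka's actual client API.
final class BackoffSketch {

    /** Reported behavior: the blackout window is measured from when the attempt was initiated. */
    static boolean isBlackedOutFromAttemptStart(long lastConnectAttemptMs, long backoffMs, long nowMs) {
        return nowMs - lastConnectAttemptMs < backoffMs;
    }

    /** Proposed behavior: the blackout window is measured from the disconnection event. */
    static boolean isBlackedOutFromDisconnect(long lastDisconnectMs, long backoffMs, long nowMs) {
        return nowMs - lastDisconnectMs < backoffMs;
    }

    public static void main(String[] args) {
        long backoffMs = 10;       // default backoff quoted in the issue
        long attemptStartMs = 0;   // connection attempt initiated
        long disconnectMs = 50;    // assumed: the attempt times out 50 ms later
        long nowMs = 51;           // 1 ms after the failure is observed

        // Measured from the attempt start, the window has already elapsed,
        // so the node is immediately eligible for another attempt.
        System.out.println(isBlackedOutFromAttemptStart(attemptStartMs, backoffMs, nowMs)); // false

        // Measured from the disconnect, the node stays blacked out until t = 60 ms.
        System.out.println(isBlackedOutFromDisconnect(disconnectMs, backoffMs, nowMs)); // true
    }
}
```

With these numbers, any connection failure that takes longer than the backoff period leaves no effective blackout at all under the first rule, which matches the symptom described in the issue.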