[ https://issues.apache.org/jira/browse/KAFKA-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990238#comment-14990238 ]
ASF GitHub Bot commented on KAFKA-2743:
---------------------------------------

GitHub user ewencp opened a pull request:

    https://github.com/apache/kafka/pull/422

    KAFKA-2743: Make forwarded task reconfiguration requests asynchronous, run on a separate thread, and backoff before retrying when they fail.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ewencp/kafka task-reconfiguration-async-with-backoff

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/kafka/pull/422.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #422

----
commit 8a30a78b9222ed8fec5143a41db5cf8e6e9efbc7
Author: Ewen Cheslack-Postava <m...@ewencp.org>
Date:   2015-11-03T05:30:32Z

    KAFKA-2743: Make forwarded task reconfiguration requests asynchronous, run on a separate thread, and backoff before retrying when they fail.

----

> Forwarding task reconfigurations in Copycat can deadlock with rebalances and has no backoff
> -------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-2743
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2743
>             Project: Kafka
>          Issue Type: Bug
>          Components: copycat
>            Reporter: Ewen Cheslack-Postava
>            Assignee: Ewen Cheslack-Postava
>             Fix For: 0.9.0.0
>
>
> There are two issues with the way we're currently forwarding task reconfigurations. First, the forwarding is performed synchronously in the DistributedHerder's main processing loop. If node A forwards a task reconfiguration and node B has started a rebalance process, we can end up with distributed deadlock because node A will be blocking on the HTTP request in the thread that would otherwise handle heartbeating and rebalancing. Second, currently we just retry aggressively with no backoff. In some cases the node that is currently thought to be the leader will legitimately be down (it shut down and the node sending the request didn't rebalance yet), so we need some backoff to avoid unnecessarily hammering the network and producing the huge log files that result from constant errors.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
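The pattern the pull request describes, running the forwarded request on a separate thread and backing off exponentially between retries so the main herder loop stays free for heartbeating and rebalancing, can be sketched roughly as below. This is an illustrative sketch only, not Kafka's actual code: the names `AsyncForwarder`, `forwardTaskConfigs`, and `backoffMs` are hypothetical, and the real HTTP call is stood in for by a `BooleanSupplier`.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: forward a task reconfiguration request on a
// dedicated thread, retrying with exponential backoff, so the caller
// (the herder's main loop) never blocks on the HTTP request.
class AsyncForwarder {
    private final ExecutorService forwardExecutor = Executors.newSingleThreadExecutor();
    private final long initialBackoffMs;
    private final long maxBackoffMs;

    AsyncForwarder(long initialBackoffMs, long maxBackoffMs) {
        this.initialBackoffMs = initialBackoffMs;
        this.maxBackoffMs = maxBackoffMs;
    }

    // Exponential backoff: initial * 2^attempt, capped at maxBackoffMs.
    long backoffMs(int attempt) {
        long delay = initialBackoffMs << Math.min(attempt, 20);
        return Math.min(delay, maxBackoffMs);
    }

    // sendRequest stands in for the HTTP call to the current leader and
    // returns true on success. Submitting it to forwardExecutor keeps the
    // caller's thread free to handle heartbeating and rebalancing, which
    // avoids the distributed deadlock described in the issue.
    void forwardTaskConfigs(BooleanSupplier sendRequest, int maxAttempts) {
        forwardExecutor.submit(() -> {
            for (int attempt = 0; attempt < maxAttempts; attempt++) {
                if (sendRequest.getAsBoolean())
                    return; // forwarded successfully
                try {
                    Thread.sleep(backoffMs(attempt)); // back off before retrying
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });
    }

    void shutdown() throws InterruptedException {
        forwardExecutor.shutdown();
        forwardExecutor.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

The backoff cap matters for the second problem in the issue: if the presumed leader has legitimately shut down, retries still happen, but at a bounded rate rather than hammering the network and flooding the logs.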