[ 
https://issues.apache.org/jira/browse/KAFKA-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14904194#comment-14904194
 ] 

Brian Sung-jin Hong commented on KAFKA-2459:
--------------------------------------------

What is the status of this issue? We run Kafka on AWS EC2, and we hit this 
problem whenever a Kafka instance is terminated.

I tried to fix this myself, but from reading the code it looks like it needs a 
fair amount of refactoring to work. If no one is working on this issue, could 
anyone give me some guidance on how to proceed?

> Connection backoff/blackout period should start when a connection is 
> disconnected, not when the connection attempt was initiated
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-2459
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2459
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, consumer, producer 
>    Affects Versions: 0.8.2.1
>            Reporter: Ewen Cheslack-Postava
>            Assignee: Manikumar Reddy
>
> Currently the connection code for new clients marks the time when a 
> connection was initiated (NodeConnectionState.lastConnectMs) and then uses 
> this to compute blackout periods for nodes, during which connections will not 
> be attempted and the node is not considered a candidate for leastLoadedNode.
> However, in cases where the connection attempt takes longer than the 
> blackout/backoff period (default 10ms), this results in incorrect behavior. 
> If a broker is not available and, for example, the broker does not explicitly 
> reject the connection, instead waiting for a connection timeout (e.g. due to 
> firewall settings), then the backoff period will have already elapsed, so the 
> node will immediately be considered ready for a new connection attempt and a 
> candidate for leastLoadedNode to select for metadata updates. I think it 
> should be easy to reproduce and verify this problem manually by using tc to 
> introduce enough latency to make connection failures take > 10ms.
> The correct behavior would use the disconnection event to mark the end of the 
> last connection attempt and then wait for the backoff period to elapse after 
> that.
> See 
> http://mail-archives.apache.org/mod_mbox/kafka-users/201508.mbox/%3CCAJY8EofpeU4%2BAJ%3Dw91HDUx2RabjkWoU00Z%3DcQ2wHcQSrbPT4HA%40mail.gmail.com%3E
>  for the original description of the problem.
> This is related to KAFKA-1843 because leastLoadedNode currently will 
> consistently choose the same node if this blackout period is not handled 
> correctly, but is a much smaller issue.
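
To make the difference described above concrete, here is a minimal sketch of 
the two ways of computing the blackout. It is not the actual Kafka client 
code: the class BackoffSketch, its methods, and the lastDisconnectMs map are 
hypothetical names invented for illustration. Only lastConnectMs and the 10ms 
default backoff come from the issue description.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; NOT Kafka's real connection-state class. */
class BackoffSketch {
    private final long reconnectBackoffMs;
    // Time each connection attempt was initiated (as in the issue text).
    private final Map<String, Long> lastConnectMs = new HashMap<>();
    // Assumed extra field: time each connection was last disconnected.
    private final Map<String, Long> lastDisconnectMs = new HashMap<>();

    BackoffSketch(long reconnectBackoffMs) {
        this.reconnectBackoffMs = reconnectBackoffMs;
    }

    void connecting(String node, long nowMs) {
        lastConnectMs.put(node, nowMs);
    }

    void disconnected(String node, long nowMs) {
        lastDisconnectMs.put(node, nowMs);
    }

    /** Behavior described as current: blackout measured from initiation. */
    boolean isBlackedOutFromConnect(String node, long nowMs) {
        Long t = lastConnectMs.get(node);
        return t != null && nowMs - t < reconnectBackoffMs;
    }

    /** Behavior described as correct: blackout measured from disconnect. */
    boolean isBlackedOutFromDisconnect(String node, long nowMs) {
        Long t = lastDisconnectMs.get(node);
        return t != null && nowMs - t < reconnectBackoffMs;
    }
}
```

With a 10ms backoff, an attempt started at t=0 that only fails at t=30000 
(for example, on a connection timeout) is no longer blacked out at t=30001 
under the first method, but still is under the second.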



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
