Ewen Cheslack-Postava created KAFKA-2459:
--------------------------------------------

             Summary: Connection backoff/blackout period should start when a 
connection is disconnected, not when the connection attempt was initiated
                 Key: KAFKA-2459
                 URL: https://issues.apache.org/jira/browse/KAFKA-2459
             Project: Kafka
          Issue Type: Bug
          Components: clients, consumer, producer 
    Affects Versions: 0.8.2.1
            Reporter: Ewen Cheslack-Postava
            Assignee: Neha Narkhede


Currently the connection code for new clients marks the time when a connection 
was initiated (NodeConnectionState.lastConnectMs) and then uses this to compute 
blackout periods for nodes, during which connections will not be attempted and 
the node is not considered a candidate for leastLoadedNode.

However, when a connection attempt takes longer than the blackout/backoff 
period (default 10ms), this results in incorrect behavior. If a broker is not 
available and does not explicitly reject the connection, so that the client 
instead waits for a connection timeout (e.g. due to firewall settings), then 
the backoff period will already have elapsed by the time the attempt fails. 
The node will immediately be considered ready for a new connection attempt and 
eligible for selection by leastLoadedNode for metadata updates. I think it 
should be easy to reproduce and verify this problem manually by using tc to 
introduce enough latency to make connection failures take > 10ms.

The correct behavior would use the disconnection event to mark the end of the 
last connection attempt and then wait for the backoff period to elapse after 
that.
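
To make the difference concrete, here is a minimal sketch of the two ways of 
computing the blackout. It is not the actual client code: the class 
BackoffSketch, the field lastDisconnectMs, and both isBlackedOut... methods are 
hypothetical names used only for illustration. Only lastConnectMs corresponds 
to a field mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; names other than lastConnectMs are assumptions.
public class BackoffSketch {

    // Per-node bookkeeping, loosely modelled on NodeConnectionState.
    static final class NodeState {
        long lastConnectMs;     // when the connection attempt was initiated
        long lastDisconnectMs;  // when the attempt failed or the connection dropped
        boolean disconnected;
    }

    private final long reconnectBackoffMs;
    private final Map<String, NodeState> nodes = new HashMap<>();

    public BackoffSketch(long reconnectBackoffMs) {
        this.reconnectBackoffMs = reconnectBackoffMs;
    }

    // Record that a connection attempt to the node started at nowMs.
    public void connecting(String node, long nowMs) {
        NodeState state = nodes.computeIfAbsent(node, n -> new NodeState());
        state.lastConnectMs = nowMs;
        state.disconnected = false;
    }

    // Record that the node was disconnected (or the attempt failed) at nowMs.
    public void disconnected(String node, long nowMs) {
        NodeState state = nodes.computeIfAbsent(node, n -> new NodeState());
        state.lastDisconnectMs = nowMs;
        state.disconnected = true;
    }

    // Behavior described in this issue: backoff measured from attempt start.
    public boolean isBlackedOutFromAttempt(String node, long nowMs) {
        NodeState state = nodes.get(node);
        return state != null && state.disconnected
                && nowMs - state.lastConnectMs < reconnectBackoffMs;
    }

    // Proposed behavior: backoff measured from the disconnection event.
    public boolean isBlackedOutFromDisconnect(String node, long nowMs) {
        NodeState state = nodes.get(node);
        return state != null && state.disconnected
                && nowMs - state.lastDisconnectMs < reconnectBackoffMs;
    }
}
```

With a 10ms backoff, an attempt started at t=0 that times out at t=30 is 
already outside the blackout at t=31 under the first check, but remains 
blacked out until t=40 under the second.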

See 
http://mail-archives.apache.org/mod_mbox/kafka-users/201508.mbox/%3CCAJY8EofpeU4%2BAJ%3Dw91HDUx2RabjkWoU00Z%3DcQ2wHcQSrbPT4HA%40mail.gmail.com%3E
 for the original description of the problem.

This is related to KAFKA-1843 because leastLoadedNode will currently keep 
choosing the same node if this blackout period is not handled correctly, but 
it is a much smaller issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
