[ 
https://issues.apache.org/jira/browse/KAFKA-764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048552#comment-16048552
 ] 

Robert P. Thille commented on KAFKA-764:
----------------------------------------

I believe we saw this issue, or something very similar.  
During a load test, our 3-node Kafka cluster got into a confused state: 
Brokers 0 and 1 were healthy and were listed in /brokers/ids/X in ZK. Broker 2 
was connected to ZK but was not listed in /brokers/ids/2, and brokers 0 and 1 
had no connections to broker 2. 
Broker 2 kept accepting new messages produced to it for hours.  
Eventually it did rejoin the cluster, but the messages published to it in the 
meantime were lost, as brokers 0 and 1 seemingly outvoted broker 2 about the 
partitions.

> Race Condition in Broker Registration after ZooKeeper disconnect
> ----------------------------------------------------------------
>
>                 Key: KAFKA-764
>                 URL: https://issues.apache.org/jira/browse/KAFKA-764
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.7.1
>            Reporter: Bob Cotton
>
> When running our ZooKeepers in VMware, occasionally all the keepers 
> simultaneously pause long enough for the Kafka clients to time out and then 
> the keepers simultaneously un-pause.
> When this happens, the zk clients disconnect from ZooKeeper. When ZooKeeper 
> comes back, ZkUtils.createEphemeralPathExpectConflict finds a node carrying 
> its own broker id, does not re-register the broker id node, and the function 
> call succeeds. ZooKeeper then figures out that the broker disconnected from 
> the keeper and deletes the ephemeral node *after* allowing the consumer to 
> read the data in the /brokers/ids/x node.  The broker then goes on to 
> register all the 
> topics, etc.  When consumers connect, they see topic nodes associated with 
> the broker but they can't find the broker node to get connection information 
> for the broker, sending them into a rebalance loop until they reach 
> rebalance.retries.max and fail.
> This might also be a ZooKeeper issue, but the desired behavior in the 
> disconnect case might be to explicitly delete and recreate the broker node 
> if it is found.
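
To make the race in the description above concrete, here is a minimal sketch in Java. It does not use the real ZooKeeper client or Kafka's ZkUtils; FakeZk is a hypothetical in-memory stand-in that only tracks which session owns each ephemeral path, and registerKeepExisting / registerDeleteAndRecreate are illustrative names, not Kafka functions. It assumes the stale node belongs to the broker's previous session and that ZooKeeper cleans that session up only after the broker has re-registered.

```java
import java.util.HashMap;
import java.util.Map;

public class BrokerRegistrationSketch {

    /** Hypothetical stand-in for ZooKeeper: ephemeral path -> owning session id. */
    static final class FakeZk {
        private final Map<String, Long> ephemeralOwner = new HashMap<>();

        boolean exists(String path) {
            return ephemeralOwner.containsKey(path);
        }

        void createEphemeral(String path, long sessionId) {
            if (ephemeralOwner.putIfAbsent(path, sessionId) != null) {
                throw new IllegalStateException("node exists: " + path);
            }
        }

        void delete(String path) {
            ephemeralOwner.remove(path);
        }

        /** Session expiry removes every ephemeral node that session owned. */
        void expireSession(long sessionId) {
            ephemeralOwner.values().removeIf(owner -> owner == sessionId);
        }
    }

    /** Behavior as described in the report: an existing node counts as success. */
    static void registerKeepExisting(FakeZk zk, String path, long sessionId) {
        if (!zk.exists(path)) {
            zk.createEphemeral(path, sessionId);
        }
    }

    /** Suggested behavior: drop any existing node and recreate it under the current session. */
    static void registerDeleteAndRecreate(FakeZk zk, String path, long sessionId) {
        if (zk.exists(path)) {
            zk.delete(path);
        }
        zk.createEphemeral(path, sessionId);
    }

    public static void main(String[] args) {
        String path = "/brokers/ids/2";
        long oldSession = 1L;
        long newSession = 2L;

        FakeZk zk = new FakeZk();
        zk.createEphemeral(path, oldSession);        // node left over from before the pause
        registerKeepExisting(zk, path, newSession);  // sees the stale node, does nothing
        zk.expireSession(oldSession);                // late cleanup deletes the node
        System.out.println("keep-existing registered: " + zk.exists(path));

        zk = new FakeZk();
        zk.createEphemeral(path, oldSession);
        registerDeleteAndRecreate(zk, path, newSession);
        zk.expireSession(oldSession);                // nothing left for the old session to delete
        System.out.println("delete-and-recreate registered: " + zk.exists(path));
    }
}
```

In this model the first strategy leaves the broker unregistered once the old session is cleaned up, while the second keeps the node because it is owned by the new session. A real client would also have to handle the window between the delete and the create, which this sketch ignores.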



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
