[ https://issues.apache.org/jira/browse/KAFKA-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14620659#comment-14620659 ]

Flavio Junqueira commented on KAFKA-2300:
-----------------------------------------

My assessment so far has focused on the original problem reported in this 
issue: the controller gets stuck because of some spurious state. From the 
description and the comments, 
ControllerBrokerRequestBatch.updateMetadataRequestMap isn't empty at the time 
a broker tries to re-join, which causes the processing of 
KafkaController.onBrokerStartup to fail: the call to 
ControllerBrokerRequestBatch.newBatch (which verifies that three data 
structures, including updateMetadataRequestMap, are empty) throws an 
exception.
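
For reference, the check that throws looks roughly like this. This is a 
simplified sketch, not the literal ControllerChannelManager.scala source; the 
class name RequestBatchSketch and the String payloads are placeholders.

{code:scala}
import scala.collection.mutable

// Sketch of ControllerBrokerRequestBatch: newBatch refuses to start while
// any of the three request maps still holds entries from a previous batch.
class RequestBatchSketch {
  val leaderAndIsrRequestMap = mutable.Map.empty[Int, String]
  val stopReplicaRequestMap = mutable.Map.empty[Int, String]
  val updateMetadataRequestMap = mutable.Map.empty[Int, String]

  def newBatch(): Unit = {
    if (updateMetadataRequestMap.nonEmpty)
      throw new IllegalStateException(
        "Controller to broker state change requests batch is not empty while " +
          s"creating a new one. Some UpdateMetadata state changes " +
          s"$updateMetadataRequestMap might be lost")
    // ...analogous checks for leaderAndIsrRequestMap and stopReplicaRequestMap...
    leaderAndIsrRequestMap.clear()
    stopReplicaRequestMap.clear()
    updateMetadataRequestMap.clear()
  }
}
{code}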

From the description, the update metadata requests have -2 for the leader, 
which is LeaderDuringDelete. Such requests are put into the update metadata 
map via a call from onTopicDeletion to controller.sendUpdateMetadataRequest. 
Interestingly, sendUpdateMetadataRequest adds to the update metadata map by 
calling brokerRequestBatch.addUpdateMetadataRequestForBrokers and right after 
invokes brokerRequestBatch.sendRequestsToBrokers. The latter is supposed to 
clear the update metadata map, which makes me think that an exception could 
have interrupted the flow, leaving updateMetadataRequestMap uncleared.
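
To make that failure mode concrete, here's a self-contained sketch reusing 
the RequestBatchSketch class from the snippet above. The failMidway flag and 
the string payload are made-up stand-ins for whatever actually goes wrong 
during the send; they are not the real code.

{code:scala}
object FlowSketch {
  // Mirrors the add-then-send order described above: there is no try/finally,
  // so a throw in the send leaves the batch dirty.
  def sendUpdateMetadataRequest(batch: RequestBatchSketch, brokers: Seq[Int],
                                failMidway: Boolean): Unit = {
    batch.newBatch()
    // addUpdateMetadataRequestForBrokers: queue one request per broker
    brokers.foreach(b => batch.updateMetadataRequestMap(b) = "Leader:-2 (LeaderDuringDelete)")
    sendRequestsToBrokers(batch, failMidway)
  }

  // Stand-in for the real send logic: it is supposed to drain the maps and
  // then clear them, but an exception partway through skips the clear.
  def sendRequestsToBrokers(batch: RequestBatchSketch, failMidway: Boolean): Unit = {
    if (failMidway)
      throw new RuntimeException("simulated channel error")
    batch.updateMetadataRequestMap.clear()
  }

  def main(args: Array[String]): Unit = {
    val batch = new RequestBatchSketch
    try sendUpdateMetadataRequest(batch, Seq(1, 2, 3), failMidway = true)
    catch { case e: RuntimeException => println(s"send failed: ${e.getMessage}") }
    // The failed send left entries behind, so this next newBatch throws the
    // same IllegalStateException seen in the controller log.
    batch.newBatch()
  }
}
{code}

If that is indeed what happened, clearing the maps on the error path (or 
wrapping the send in a try/finally) would stop a single failed send from 
wedging every subsequent batch.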

I wanted to ask if anyone has anything to add to or correct in the analysis 
so far, and also whether [~kharriger] would be able to post the whole log so 
that I can have a look. Ultimately, it'd be great to avoid leaving the 
controller unable to make progress because of some bad state.

> Error in controller log when broker tries to rejoin cluster
> -----------------------------------------------------------
>
>                 Key: KAFKA-2300
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2300
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1
>            Reporter: Johnny Brown
>            Assignee: Flavio Junqueira
>
> Hello Kafka folks,
> We are having an issue where a broker attempts to join the cluster after 
> being restarted, but is never added to the ISR for its assigned partitions. 
> This is a three-node cluster, and the controller is broker 2.
> When broker 1 starts, we see the following message in broker 2's 
> controller.log.
> {{
> [2015-06-23 13:57:16,535] ERROR [BrokerChangeListener on Controller 2]: Error while handling broker changes (kafka.controller.ReplicaStateMachine$BrokerChangeListener)
> java.lang.IllegalStateException: Controller to broker state change requests batch is not empty while creating a new one. Some UpdateMetadata state changes Map(2 -> Map([prod-sver-end,1] -> (LeaderAndIsrInfo:(Leader:-2,ISR:1,LeaderEpoch:0,ControllerEpoch:165),ReplicationFactor:1),AllReplicas:1)), 1 -> Map([prod-sver-end,1] -> (LeaderAndIsrInfo:(Leader:-2,ISR:1,LeaderEpoch:0,ControllerEpoch:165),ReplicationFactor:1),AllReplicas:1)), 3 -> Map([prod-sver-end,1] -> (LeaderAndIsrInfo:(Leader:-2,ISR:1,LeaderEpoch:0,ControllerEpoch:165),ReplicationFactor:1),AllReplicas:1))) might be lost
>   at kafka.controller.ControllerBrokerRequestBatch.newBatch(ControllerChannelManager.scala:202)
>   at kafka.controller.KafkaController.sendUpdateMetadataRequest(KafkaController.scala:974)
>   at kafka.controller.KafkaController.onBrokerStartup(KafkaController.scala:399)
>   at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ReplicaStateMachine.scala:371)
>   at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:359)
>   at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:359)
>   at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
>   at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply$mcV$sp(ReplicaStateMachine.scala:358)
>   at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:357)
>   at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:357)
>   at kafka.utils.Utils$.inLock(Utils.scala:535)
>   at kafka.controller.ReplicaStateMachine$BrokerChangeListener.handleChildChange(ReplicaStateMachine.scala:356)
>   at org.I0Itec.zkclient.ZkClient$7.run(ZkClient.java:568)
>   at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
> }}
> {{prod-sver-end}} is a topic we previously deleted. It seems some remnant of 
> it persists in the controller's memory, causing an exception which interrupts 
> the state change triggered by the broker startup.
> Has anyone seen something like this? Any idea what's happening here? Any 
> information would be greatly appreciated.
> Thanks,
> Johnny


