[ 
https://issues.apache.org/jira/browse/KAFKA-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15272070#comment-15272070
 ] 
Flavio Junqueira commented on KAFKA-3173:
-----------------------------------------

[~junrao] True, {{onControllerFailover}} holds the lock when it runs, and it is 
the only place where we call {{PartitionStateMachine.startup()}}. The confusing 
part is that the lock is acquired a few hops up the call path, but it does look 
like the additional lock isn't necessary. Also, I'm wondering if we even need 
that controller lock. All the zk events are processed on the ZkClient event 
thread, and there is just one. The runs I was trying to put together had 
concurrent zk events being triggered, which was causing the potential problems 
I raised above. If there is any chance of internal threads other than the 
ZkClient event thread racing on this state, then the lock is needed; otherwise 
it isn't.
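
To illustrate the single-thread argument, here is a minimal sketch. It is not 
Kafka or ZkClient code: a single-thread executor stands in for the ZkClient 
event thread, and {{SingleEventThreadSketch}} and {{process}} are hypothetical 
names. The handlers update unsynchronized state, which is only safe as long as 
no other thread touches it.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.collection.mutable

// Hypothetical model: a single-thread executor plays the role of the one
// event thread that delivers all zk events.
object SingleEventThreadSketch {
  def process(events: Int): (Int, Set[String]) = {
    // Unsynchronized state, touched only by the event handlers below.
    var handled = 0
    val threadNames = mutable.Set.empty[String]

    val eventThread = Executors.newSingleThreadExecutor()
    (1 to events).foreach { _ =>
      eventThread.execute(new Runnable {
        def run(): Unit = {
          handled += 1 // no lock: handlers run one at a time
          threadNames += Thread.currentThread().getName
        }
      })
    }
    // Wait for all queued handlers to finish before reading the state.
    eventThread.shutdown()
    eventThread.awaitTermination(10, TimeUnit.SECONDS)
    (handled, threadNames.toSet)
  }
}
```

If any other thread (a timer, a request handler) also wrote to {{handled}}, the 
count could be wrong without a lock, which is the case where the controller 
lock would still be needed.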

I don't think we need the change I proposed, so I'll go ahead and close the PR, 
but we can't resolve this issue until we determine the cases in which we can 
get a dirty batch, preventing the controller from sending further requests. We 
need more info on this. One of the possibilities given what I've seen in other 
logs is simply that there is a transient error while sending a message to a 
broker in {{ControllerBrokerRequestBatch.sendRequestsToBrokers}}, but we are 
currently not logging the exception. I was hoping that the originator of the 
call would log it, but that isn't happening. Perhaps one thing we can do for 
the upcoming release is to log the exception, so that we have it if we observe 
the problem again.
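
To make the suspected failure mode concrete, here is a simplified sketch. It is 
a hypothetical model, not the actual {{ControllerBrokerRequestBatch}}: the 
class {{SketchRequestBatch}}, its {{send}} and {{logError}} parameters, and the 
demo object are all assumptions made for illustration. A send failure leaves 
the batch non-empty, the next {{newBatch()}} is rejected, and logging the 
exception at the point of failure is what would reveal the root cause.

```scala
import java.io.IOException
import scala.collection.mutable

// Hypothetical stand-in for a controller-to-broker request batch.
class SketchRequestBatch(send: (Int, String) => Unit,
                         logError: (String, Throwable) => Unit) {
  private val pending = mutable.Map.empty[Int, mutable.Buffer[String]]

  // Refuses to start a new batch while requests from the last one remain.
  def newBatch(): Unit = {
    if (pending.nonEmpty)
      throw new IllegalStateException(
        s"Request batch is not empty while creating a new one: $pending")
  }

  def add(brokerId: Int, request: String): Unit =
    pending.getOrElseUpdate(brokerId, mutable.Buffer.empty[String]) += request

  def sendRequestsToBrokers(): Unit = {
    try {
      pending.foreach { case (brokerId, requests) =>
        requests.foreach(request => send(brokerId, request))
      }
      pending.clear() // only reached if every send succeeded
    } catch {
      case e: Exception =>
        // Log the cause here; otherwise only the later "not empty" error shows up.
        logError("Error while sending requests to brokers", e)
        throw e
    }
  }
}

object DirtyBatchDemo {
  def run(): (Boolean, List[String]) = {
    val logged = mutable.Buffer.empty[String]
    val batch = new SketchRequestBatch(
      send = (brokerId, _) =>
        if (brokerId == 0) throw new IOException("transient send failure"),
      logError = (msg, e) => { logged += s"$msg: ${e.getMessage}"; () }
    )
    batch.newBatch()
    batch.add(0, "LeaderAndIsr(foo-0)")
    try batch.sendRequestsToBrokers() catch { case _: IOException => () }
    // The batch is now dirty, so the next newBatch() is rejected.
    val rejected =
      try { batch.newBatch(); false }
      catch { case _: IllegalStateException => true }
    (rejected, logged.toList)
  }
}
```

In this model, the {{IllegalStateException}} from {{newBatch()}} is only a 
symptom; the logged {{IOException}} is what points at the underlying failure.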

> Error while moving some partitions to OnlinePartition state 
> ------------------------------------------------------------
>
>                 Key: KAFKA-3173
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3173
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.9.0.0
>            Reporter: Flavio Junqueira
>            Assignee: Flavio Junqueira
>            Priority: Critical
>             Fix For: 0.10.0.0
>
>         Attachments: KAFKA-3173-race-repro.patch
>
>
> We observed another instance of the problem reported in KAFKA-2300, but this 
> time the error appeared in the partition state machine. In KAFKA-2300, we 
> haven't cleaned up the state in {{PartitionStateMachine}} and 
> {{ReplicaStateMachine}} as we do in {{KafkaController}}.
> Here is the stack trace:
> {noformat}
> [2016-01-29 15:26:51,393] ERROR [Partition state machine on Controller 0]: Error while moving some partitions to OnlinePartition state (kafka.controller.PartitionStateMachine)
> java.lang.IllegalStateException: Controller to broker state change requests batch is not empty while creating a new one. Some LeaderAndIsr state changes Map(0 -> Map(foo-0 -> (LeaderAndIsrInfo:(Leader:0,ISR:0,LeaderEpoch:0,ControllerEpoch:1),ReplicationFactor:1),AllReplicas:0))) might be lost
>         at kafka.controller.ControllerBrokerRequestBatch.newBatch(ControllerChannelManager.scala:254)
>         at kafka.controller.PartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:144)
>         at kafka.controller.KafkaController.onNewPartitionCreation(KafkaController.scala:517)
>         at kafka.controller.KafkaController.onNewTopicCreation(KafkaController.scala:504)
>         at kafka.controller.PartitionStateMachine$TopicChangeListener$$anonfun$handleChildChange$1.apply$mcV$sp(PartitionStateMachine.scala:437)
>         at kafka.controller.PartitionStateMachine$TopicChangeListener$$anonfun$handleChildChange$1.apply(PartitionStateMachine.scala:419)
>         at kafka.controller.PartitionStateMachine$TopicChangeListener$$anonfun$handleChildChange$1.apply(PartitionStateMachine.scala:419)
>         at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:262)
>         at kafka.controller.PartitionStateMachine$TopicChangeListener.handleChildChange(PartitionStateMachine.scala:418)
>         at org.I0Itec.zkclient.ZkClient$10.run(ZkClient.java:842)
>         at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
