[ https://issues.apache.org/jira/browse/KAFKA-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15272070#comment-15272070 ]
Flavio Junqueira commented on KAFKA-3173:
-----------------------------------------

[~junrao] True, {{onControllerFailover}} holds the lock when it runs, and it is the only place where we call {{PartitionStateMachine.startup()}}. The confusing part is that the lock is acquired a few hops up the call path, but it does look like the additional lock isn't necessary.

Also, I'm wondering whether we even need that controller lock. All the zk events are processed by the ZkClient event thread, and there is just one. The runs I was trying to put together had concurrent zk events being triggered, which was causing the potential problems I raised above. If there is any chance of internal threads other than the ZkClient event thread racing, then the lock is needed; otherwise it isn't.

I don't think we need the change I proposed, so I'll go ahead and close the PR, but we can't resolve this issue until we determine the cases in which we can end up with a dirty batch, which prevents the controller from sending further requests. We need more info on this. One possibility, given what I've seen in other logs, is simply a transient error while sending a message to a broker in {{ControllerBrokerRequestBatch.sendRequestsToBrokers}}, but we are currently not logging the exception. I was hoping that the originator of the call would log it, but that isn't happening. Perhaps one thing we can do for the upcoming release is to log the exception so that we have it if we observe the problem again.
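To make that concrete, here is a minimal sketch of the kind of logging being proposed. This is a toy stand-in for {{ControllerBrokerRequestBatch}}, not the real class: the class name, the String payload, and the direct use of slf4j are all simplifications; the point is only that a failure in the send loop gets logged together with the in-flight batch before it propagates, so a later "batch is not empty" error can be traced back to its cause.

{code:scala}
import org.slf4j.LoggerFactory

import scala.collection.mutable

// Toy stand-in for ControllerBrokerRequestBatch, used only to illustrate the
// proposed logging. Names and payload types are simplified.
class RequestBatchSketch {
  private val log = LoggerFactory.getLogger(classOf[RequestBatchSketch])

  // brokerId -> pending LeaderAndIsr payload (simplified to a String here)
  private val leaderAndIsrRequests = mutable.Map.empty[Int, String]

  // Mirrors the check that produced the IllegalStateException in the trace below.
  def newBatch(): Unit = {
    if (leaderAndIsrRequests.nonEmpty)
      throw new IllegalStateException(
        "Controller to broker state change requests batch is not empty while creating a new one. " +
          s"Some LeaderAndIsr state changes $leaderAndIsrRequests might be lost")
  }

  def addLeaderAndIsrRequest(brokerId: Int, payload: String): Unit =
    leaderAndIsrRequests += brokerId -> payload

  // `send` stands in for the actual wire call to a broker.
  def sendRequestsToBrokers(send: (Int, String) => Unit): Unit = {
    try {
      leaderAndIsrRequests.foreach { case (brokerId, payload) => send(brokerId, payload) }
      leaderAndIsrRequests.clear()
    } catch {
      case e: Throwable =>
        // The logging being proposed: record the failure and the unsent batch so
        // that a later "batch is not empty" error can be explained from the logs.
        log.error(s"Failed to send controller requests; unsent batch: $leaderAndIsrRequests", e)
        throw e
    }
  }
}
{code}

Whether we should also clear the batch on failure (so the controller can keep going) is a separate question; the sketch above only adds the logging suggested here.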
> Error while moving some partitions to OnlinePartition state
> ------------------------------------------------------------
>
>                 Key: KAFKA-3173
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3173
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.9.0.0
>            Reporter: Flavio Junqueira
>            Assignee: Flavio Junqueira
>            Priority: Critical
>             Fix For: 0.10.0.0
>
>         Attachments: KAFKA-3173-race-repro.patch
>
>
> We observed another instance of the problem reported in KAFKA-2300, but this
> time the error appeared in the partition state machine. In KAFKA-2300, we
> haven't cleaned up the state in {{PartitionStateMachine}} and
> {{ReplicaStateMachine}} as we do in {{KafkaController}}.
> Here is the stack trace:
> {noformat}
> [2016-01-29 15:26:51,393] ERROR [Partition state machine on Controller 0]: Error while moving some partitions to OnlinePartition state (kafka.controller.PartitionStateMachine)
> java.lang.IllegalStateException: Controller to broker state change requests batch is not empty while creating a new one. Some LeaderAndIsr state changes Map(0 -> Map(foo-0 -> (LeaderAndIsrInfo:(Leader:0,ISR:0,LeaderEpoch:0,ControllerEpoch:1),ReplicationFactor:1),AllReplicas:0))) might be lost
>         at kafka.controller.ControllerBrokerRequestBatch.newBatch(ControllerChannelManager.scala:254)
>         at kafka.controller.PartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:144)
>         at kafka.controller.KafkaController.onNewPartitionCreation(KafkaController.scala:517)
>         at kafka.controller.KafkaController.onNewTopicCreation(KafkaController.scala:504)
>         at kafka.controller.PartitionStateMachine$TopicChangeListener$$anonfun$handleChildChange$1.apply$mcV$sp(PartitionStateMachine.scala:437)
>         at kafka.controller.PartitionStateMachine$TopicChangeListener$$anonfun$handleChildChange$1.apply(PartitionStateMachine.scala:419)
>         at kafka.controller.PartitionStateMachine$TopicChangeListener$$anonfun$handleChildChange$1.apply(PartitionStateMachine.scala:419)
>         at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:262)
>         at kafka.controller.PartitionStateMachine$TopicChangeListener.handleChildChange(PartitionStateMachine.scala:418)
>         at org.I0Itec.zkclient.ZkClient$10.run(ZkClient.java:842)
>         at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
> {noformat}
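For reference, the {{kafka.utils.CoreUtils$.inLock}} frame in the trace above is where the ZK listener takes the controller lock discussed in the comment. A minimal sketch of that acquire/try/finally pattern (an illustration under those assumptions, not the verbatim {{CoreUtils}} code):

{code:scala}
import java.util.concurrent.locks.{Lock, ReentrantLock}

object LockSketch {
  // Run `block` while holding `lock`, releasing it even if the block throws.
  def inLock[T](lock: Lock)(block: => T): T = {
    lock.lock()
    try block
    finally lock.unlock()
  }

  def main(args: Array[String]): Unit = {
    val controllerLock = new ReentrantLock()
    // Each ZK event handler wraps its work like this, so handlers cannot mutate
    // controller state concurrently even if dispatched on different threads.
    inLock(controllerLock) {
      println("holding the controller lock")
    }
  }
}
{code}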