[ https://issues.apache.org/jira/browse/KAFKA-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189024#comment-15189024 ]
Flavio Junqueira commented on KAFKA-3173:
-----------------------------------------

I have investigated the two races further. The first one is there, but it turns out to be harmless because we check {{hasStarted}} before adding any message to the batch. Consequently, the batch is not left dirty. We should still fix it to avoid the ugly exception, but it is less critical.

The second race is a real problem. I have been able to reproduce it, and it can cause either the startup to fail or the zk listener event to be skipped. Here is the output from the repro:

{noformat}
[2016-03-10 09:58:21,257] ERROR [Partition state machine on Controller 0]:  (kafka.controller.PartitionStateMachine:100)
java.lang.Exception
    at kafka.controller.PartitionStateMachine$$anonfun$handleStateChanges$3.apply(PartitionStateMachine.scala:158)
    at kafka.controller.PartitionStateMachine$$anonfun$handleStateChanges$3.apply(PartitionStateMachine.scala:158)
    at kafka.utils.Logging$class.error(Logging.scala:100)
    at kafka.controller.PartitionStateMachine.error(PartitionStateMachine.scala:44)
    at kafka.controller.PartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:158)
    at kafka.controller.KafkaController.onNewPartitionCreation(KafkaController.scala:518)
    at kafka.controller.KafkaController.onNewTopicCreation(KafkaController.scala:505)
    at kafka.controller.PartitionStateMachine$TopicChangeListener$$anonfun$handleChildChange$1.apply$mcV$sp(PartitionStateMachine.scala:455)
    at kafka.controller.PartitionStateMachine$TopicChangeListener$$anonfun$handleChildChange$1.apply(PartitionStateMachine.scala:437)
    at kafka.controller.PartitionStateMachine$TopicChangeListener$$anonfun$handleChildChange$1.apply(PartitionStateMachine.scala:437)
    at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:255)
    at kafka.controller.PartitionStateMachine$TopicChangeListener.handleChildChange(PartitionStateMachine.scala:436)
    at org.I0Itec.zkclient.ZkClient$10.run(ZkClient.java:842)
    at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
[2016-03-10 09:58:21,447] ERROR [Partition state machine on Controller 0]: Error while moving some partitions to the online state (kafka.controller.PartitionStateMachine:103)
java.lang.IllegalStateException: Controller to broker state change requests batch is not empty while creating a new one. Some LeaderAndIsr state changes Map(1 -> Map(topic1-0 -> (LeaderAndIsrInfo:(Leader:1,ISR:1,LeaderEpoch:0,ControllerEpoch:1),ReplicationFactor:1),AllReplicas:1))) might be lost
    at kafka.controller.ControllerBrokerRequestBatch.newBatch(ControllerChannelManager.scala:254)
    at kafka.controller.PartitionStateMachine.triggerOnlinePartitionStateChange(PartitionStateMachine.scala:126)
    at kafka.controller.PartitionStateMachine.startup(PartitionStateMachine.scala:71)
    at kafka.controller.ControllerFailoverTest.testStartupRace(ControllerFailoverTest.scala:119)
{noformat}

The first exception is induced, just to show which call is concurrent with the call to startup. The second exception is due to the batch being dirty when I call startup on {{partitionStateMachine}}. It can also happen the other way around, in which case the topic update can fail. Wrapping the call to {{triggerOnlinePartitionStateChange}} with the controller lock solves the issue.

Unfortunately, I had to instrument the code to trigger the race. It is hard to test these cases without being invasive, so I'm inclined not to add test cases for this. I'll post the changes I used to repro the two issues I've mentioned. Note that they are written as test cases, but they don't actually fail because the current code catches the illegal state exception.
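
To illustrate why taking the controller lock around the whole state change helps, here is a minimal standalone sketch. It is not the Kafka code or the actual patch: {{RequestBatch}}, {{stateChange}}, {{runBoth}} and {{controllerLock}} are hypothetical names, and {{inLock}} is a local helper that only mirrors the shape of the {{CoreUtils.inLock}} call visible in the stack trace. The assumption is that both the startup path and the zk listener path run a full newBatch/add/send cycle on one shared batch.

```scala
import java.util.concurrent.atomic.AtomicInteger
import java.util.concurrent.locks.ReentrantLock
import scala.collection.mutable

object ControllerLockSketch {
  // Hypothetical stand-in for the controller-to-broker request batch: it
  // refuses to start a new batch while unsent entries are still pending.
  final class RequestBatch {
    private val pending = mutable.Map.empty[String, Int]

    def newBatch(): Unit =
      if (pending.nonEmpty)
        throw new IllegalStateException(
          s"batch is not empty while creating a new one: $pending")

    def add(partition: String, leader: Int): Unit = pending(partition) = leader

    // "Sends" the batch by clearing it; returns how many entries were pending.
    def send(): Int = { val n = pending.size; pending.clear(); n }
  }

  // Local helper: run `body` while holding `lock`, always releasing it.
  def inLock[T](lock: ReentrantLock)(body: => T): T = {
    lock.lock()
    try body finally lock.unlock()
  }

  val controllerLock = new ReentrantLock()
  val batch = new RequestBatch

  // One complete newBatch/add/send cycle under the lock, so no other thread
  // can observe the batch between add() and send().
  def stateChange(partition: String, leader: Int): Int =
    inLock(controllerLock) {
      batch.newBatch()
      batch.add(partition, leader)
      batch.send()
    }

  // Runs two threads (think "startup" and "listener") against the shared
  // batch and returns how many cycles hit the IllegalStateException.
  def runBoth(iterations: Int): Int = {
    val failures = new AtomicInteger(0)
    def worker(partition: String): Thread = new Thread(new Runnable {
      def run(): Unit =
        for (_ <- 1 to iterations)
          try stateChange(partition, leader = 1)
          catch { case _: IllegalStateException => failures.incrementAndGet() }
    })
    val threads = Seq(worker("topic1-0"), worker("topic2-0"))
    threads.foreach(_.start())
    threads.foreach(_.join())
    failures.get
  }
}
```

With every cycle wrapped in {{inLock}}, neither thread should ever find the batch dirty. Dropping the lock from one of the two paths is what would allow the interleaving shown in the repro output.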

> Error while moving some partitions to OnlinePartition state
> ------------------------------------------------------------
>
> Key: KAFKA-3173
> URL: https://issues.apache.org/jira/browse/KAFKA-3173
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.9.0.0
> Reporter: Flavio Junqueira
> Assignee: Flavio Junqueira
> Priority: Critical
> Fix For: 0.10.0.0
>
> We observed another instance of the problem reported in KAFKA-2300, but this time the error appeared in the partition state machine. In KAFKA-2300, we haven't cleaned up the state in {{PartitionStateMachine}} and {{ReplicaStateMachine}} as we do in {{KafkaController}}.
> Here is the stack trace:
> {noformat}
> 2016-01-29 15:26:51,393] ERROR [Partition state machine on Controller 0]: Error while moving some partitions to OnlinePartition state (kafka.controller.PartitionStateMachine)
> java.lang.IllegalStateException: Controller to broker state change requests batch is not empty while creating a new one.
> Some LeaderAndIsr state changes Map(0 -> Map(foo-0 -> (LeaderAndIsrInfo:(Leader:0,ISR:0,LeaderEpoch:0,ControllerEpoch:1),ReplicationFactor:1),AllReplicas:0))) might be lost
>     at kafka.controller.ControllerBrokerRequestBatch.newBatch(ControllerChannelManager.scala:254)
>     at kafka.controller.PartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:144)
>     at kafka.controller.KafkaController.onNewPartitionCreation(KafkaController.scala:517)
>     at kafka.controller.KafkaController.onNewTopicCreation(KafkaController.scala:504)
>     at kafka.controller.PartitionStateMachine$TopicChangeListener$$anonfun$handleChildChange$1.apply$mcV$sp(PartitionStateMachine.scala:437)
>     at kafka.controller.PartitionStateMachine$TopicChangeListener$$anonfun$handleChildChange$1.apply(PartitionStateMachine.scala:419)
>     at kafka.controller.PartitionStateMachine$TopicChangeListener$$anonfun$handleChildChange$1.apply(PartitionStateMachine.scala:419)
>     at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:262)
>     at kafka.controller.PartitionStateMachine$TopicChangeListener.handleChildChange(PartitionStateMachine.scala:418)
>     at org.I0Itec.zkclient.ZkClient$10.run(ZkClient.java:842)
>     at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
> {noformat}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)