[ 
https://issues.apache.org/jira/browse/KAFKA-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189024#comment-15189024
 ] 
Flavio Junqueira commented on KAFKA-3173:
-----------------------------------------

I have investigated the two races further. The first one is real, but it turns 
out to be harmless because we check {{hasStarted}} before adding any message to 
the batch, so the batch is not left dirty. We should still fix it to avoid the 
ugly exception, but it is less critical.

The second race is a real problem. I have been able to reproduce it, and it can 
cause either the startup to fail or the ZooKeeper listener event to be skipped. 
Here is the output from the repro:

{noformat}
[2016-03-10 09:58:21,257] ERROR [Partition state machine on Controller 0]:  (kafka.controller.PartitionStateMachine:100)
java.lang.Exception
        at kafka.controller.PartitionStateMachine$$anonfun$handleStateChanges$3.apply(PartitionStateMachine.scala:158)
        at kafka.controller.PartitionStateMachine$$anonfun$handleStateChanges$3.apply(PartitionStateMachine.scala:158)
        at kafka.utils.Logging$class.error(Logging.scala:100)
        at kafka.controller.PartitionStateMachine.error(PartitionStateMachine.scala:44)
        at kafka.controller.PartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:158)
        at kafka.controller.KafkaController.onNewPartitionCreation(KafkaController.scala:518)
        at kafka.controller.KafkaController.onNewTopicCreation(KafkaController.scala:505)
        at kafka.controller.PartitionStateMachine$TopicChangeListener$$anonfun$handleChildChange$1.apply$mcV$sp(PartitionStateMachine.scala:455)
        at kafka.controller.PartitionStateMachine$TopicChangeListener$$anonfun$handleChildChange$1.apply(PartitionStateMachine.scala:437)
        at kafka.controller.PartitionStateMachine$TopicChangeListener$$anonfun$handleChildChange$1.apply(PartitionStateMachine.scala:437)
        at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:255)
        at kafka.controller.PartitionStateMachine$TopicChangeListener.handleChildChange(PartitionStateMachine.scala:436)
        at org.I0Itec.zkclient.ZkClient$10.run(ZkClient.java:842)
        at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
[2016-03-10 09:58:21,447] ERROR [Partition state machine on Controller 0]: Error while moving some partitions to the online state (kafka.controller.PartitionStateMachine:103)
java.lang.IllegalStateException: Controller to broker state change requests batch is not empty while creating a new one. Some LeaderAndIsr state changes Map(1 -> Map(topic1-0 -> (LeaderAndIsrInfo:(Leader:1,ISR:1,LeaderEpoch:0,ControllerEpoch:1),ReplicationFactor:1),AllReplicas:1))) might be lost
        at kafka.controller.ControllerBrokerRequestBatch.newBatch(ControllerChannelManager.scala:254)
        at kafka.controller.PartitionStateMachine.triggerOnlinePartitionStateChange(PartitionStateMachine.scala:126)
        at kafka.controller.PartitionStateMachine.startup(PartitionStateMachine.scala:71)
        at kafka.controller.ControllerFailoverTest.testStartupRace(ControllerFailoverTest.scala:119)
{noformat}

The first exception is induced deliberately, just to show which call runs 
concurrently with the call to startup. The second exception is due to the batch 
being dirty when I call startup on {{partitionStateMachine}}. The race can also 
happen the other way around, in which case the topic update fails. Wrapping the 
call to {{triggerOnlinePartitionStateChange}} with the controller lock solves 
the issue.
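
To illustrate why taking the controller lock on the startup path closes the 
window, here is a minimal, self-contained sketch. It is not the actual patch: 
{{RequestBatch}}, {{StateMachine}} and {{raceFailures}} are hypothetical 
stand-ins, and the local {{inLock}} helper is only modeled on the 
{{CoreUtils.inLock}} call visible in the stack trace above. The assumption is 
that both the listener path and the startup path go through the same 
newBatch/add/send sequence on a shared batch.

```scala
import java.util.concurrent.atomic.AtomicInteger
import java.util.concurrent.locks.ReentrantLock
import scala.collection.mutable

object StartupRaceSketch {
  // Local helper (assumption: same shape as CoreUtils.inLock): run `body` while holding `lock`.
  def inLock[T](lock: ReentrantLock)(body: => T): T = {
    lock.lock()
    try body finally lock.unlock()
  }

  // Hypothetical stand-in for the request batch: newBatch() refuses to start
  // while requests from an unfinished batch are still pending.
  final class RequestBatch {
    private val pending = mutable.Buffer.empty[String]
    def newBatch(): Unit =
      if (pending.nonEmpty)
        throw new IllegalStateException(s"batch is not empty while creating a new one: $pending")
    def add(request: String): Unit = pending += request
    def send(): Unit = pending.clear()
  }

  // Hypothetical stand-in for the partition state machine.
  final class StateMachine(controllerLock: ReentrantLock) {
    private val batch = new RequestBatch

    // One complete state change; the batch is dirty between add() and send().
    private def stateChange(request: String): Unit = {
      batch.newBatch()
      batch.add(request)
      Thread.sleep(1) // widen the window in which the batch is dirty
      batch.send()
    }

    // Listener path: holds the controller lock for the whole state change.
    def onTopicChange(topic: String): Unit =
      inLock(controllerLock) { stateChange(topic) }

    // Startup path: takes the same lock, so it never observes a dirty batch.
    def startup(): Unit =
      inLock(controllerLock) { stateChange("startup") }
  }

  // Runs both paths concurrently and counts IllegalStateExceptions.
  def raceFailures(rounds: Int): Int = {
    val machine = new StateMachine(new ReentrantLock)
    val failures = new AtomicInteger(0)
    val listener = new Thread(new Runnable {
      def run(): Unit =
        for (i <- 1 to rounds)
          try machine.onTopicChange(s"topic$i")
          catch { case _: IllegalStateException => failures.incrementAndGet() }
    })
    listener.start()
    for (_ <- 1 to rounds)
      try machine.startup()
      catch { case _: IllegalStateException => failures.incrementAndGet() }
    listener.join()
    failures.get
  }
}
```

With both paths under the lock, {{StartupRaceSketch.raceFailures(50)}} should 
report no failures. Removing {{inLock}} from {{startup}} in this sketch lets 
the two threads interleave between {{add}} and {{send}}, which is the kind of 
dirty-batch failure shown in the second exception.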

Unfortunately, I had to instrument the code to trigger the race. It is hard to 
test these cases without being invasive, so I'm inclined not to add test cases 
for this. I'll post the changes I used to reproduce the two issues I've 
mentioned. Note that they are written as test cases, but they don't actually 
fail because the current code catches the illegal state exception.

> Error while moving some partitions to OnlinePartition state 
> ------------------------------------------------------------
>
>                 Key: KAFKA-3173
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3173
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.9.0.0
>            Reporter: Flavio Junqueira
>            Assignee: Flavio Junqueira
>            Priority: Critical
>             Fix For: 0.10.0.0
>
>
> We observed another instance of the problem reported in KAFKA-2300, but this 
> time the error appeared in the partition state machine. In KAFKA-2300, we 
> did not clean up the state in {{PartitionStateMachine}} and 
> {{ReplicaStateMachine}} as we do in {{KafkaController}}.
> Here is the stack trace:
> {noformat}
> [2016-01-29 15:26:51,393] ERROR [Partition state machine on Controller 0]: Error while moving some partitions to OnlinePartition state (kafka.controller.PartitionStateMachine)
> java.lang.IllegalStateException: Controller to broker state change requests batch is not empty while creating a new one. Some LeaderAndIsr state changes Map(0 -> Map(foo-0 -> (LeaderAndIsrInfo:(Leader:0,ISR:0,LeaderEpoch:0,ControllerEpoch:1),ReplicationFactor:1),AllReplicas:0))) might be lost
>         at kafka.controller.ControllerBrokerRequestBatch.newBatch(ControllerChannelManager.scala:254)
>         at kafka.controller.PartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:144)
>         at kafka.controller.KafkaController.onNewPartitionCreation(KafkaController.scala:517)
>         at kafka.controller.KafkaController.onNewTopicCreation(KafkaController.scala:504)
>         at kafka.controller.PartitionStateMachine$TopicChangeListener$$anonfun$handleChildChange$1.apply$mcV$sp(PartitionStateMachine.scala:437)
>         at kafka.controller.PartitionStateMachine$TopicChangeListener$$anonfun$handleChildChange$1.apply(PartitionStateMachine.scala:419)
>         at kafka.controller.PartitionStateMachine$TopicChangeListener$$anonfun$handleChildChange$1.apply(PartitionStateMachine.scala:419)
>         at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:262)
>         at kafka.controller.PartitionStateMachine$TopicChangeListener.handleChildChange(PartitionStateMachine.scala:418)
>         at org.I0Itec.zkclient.ZkClient$10.run(ZkClient.java:842)
>         at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
