[ 
https://issues.apache.org/jira/browse/KAFKA-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15688295#comment-15688295
 ] 

James Cheng commented on KAFKA-1120:
------------------------------------

I believe we ran into this today.

{noformat}
core@core04 $ grep brokers controller.log.2016-11-22-22
[2016-11-22 22:50:32,883] INFO [Controller 4]: Currently active brokers in the 
cluster: Set(1, 3, 4, 5) (kafka.controller.KafkaController)
[2016-11-22 22:50:32,883] INFO [Controller 4]: Currently shutting brokers in 
the cluster: Set() (kafka.controller.KafkaController)
[2016-11-22 22:51:44,601] INFO [BrokerChangeListener on Controller 4]: Broker 
change listener fired for path /brokers/ids with children 1,2,3,4,5 
(kafka.controller.ReplicaStateMachine$BrokerChangeListener)
[2016-11-22 22:51:44,607] INFO [BrokerChangeListener on Controller 4]: Newly 
added brokers: 2, deleted brokers: , all live brokers: 1,2,3,4,5 
(kafka.controller.ReplicaStateMachine$BrokerChangeListener)
[2016-11-22 22:55:18,831] DEBUG [Controller 4]: All shutting down brokers: 1 
(kafka.controller.KafkaController)
[2016-11-22 22:55:18,831] DEBUG [Controller 4]: Live brokers: 5,2,3,4 
(kafka.controller.KafkaController)
[2016-11-22 22:57:11,791] INFO [BrokerChangeListener on Controller 4]: Broker 
change listener fired for path /brokers/ids with children 2,3,4,5 
(kafka.controller.ReplicaStateMachine$BrokerChangeListener)
[2016-11-22 22:57:11,980] INFO [BrokerChangeListener on Controller 4]: Newly 
added brokers: , deleted brokers: 1, all live brokers: 2,3,4,5 
(kafka.controller.ReplicaStateMachine$BrokerChangeListener)
[2016-11-22 22:57:11,985] INFO [Controller 4]: Removed ArrayBuffer(1) from list 
of shutting down brokers. (kafka.controller.KafkaController)
[2016-11-22 22:57:43,133] INFO [BrokerChangeListener on Controller 4]: Broker 
change listener fired for path /brokers/ids with children 1,2,3,4,5 
(kafka.controller.ReplicaStateMachine$BrokerChangeListener)
[2016-11-22 22:57:43,359] INFO [BrokerChangeListener on Controller 4]: Newly 
added brokers: 1, deleted brokers: , all live brokers: 1,2,3,4,5 
(kafka.controller.ReplicaStateMachine$BrokerChangeListener)
[2016-11-22 22:57:50,218] DEBUG [Controller 4]: All shutting down brokers: 1 
(kafka.controller.KafkaController)
[2016-11-22 22:57:50,218] DEBUG [Controller 4]: Live brokers: 5,2,3,4 
(kafka.controller.KafkaController)
[2016-11-22 22:58:01,668] DEBUG [Controller 4]: All shutting down brokers: 1 
(kafka.controller.KafkaController)
[2016-11-22 22:58:01,668] DEBUG [Controller 4]: Live brokers: 5,2,3,4 
(kafka.controller.KafkaController)
core@core04 $
{noformat}

At 2016-11-22 22:57:11,791, broker 1 went away, and the controller noticed it.
At 2016-11-22 22:57:43,133, broker 1 came back, and the controller noticed it.
At 2016-11-22 22:57:50,218, the controller still listed broker 1 as shutting 
down and reported only 5,2,3,4 as live brokers. It doesn't seem to know about 
broker 1, even though broker 1 is running.

> Controller could miss a broker state change 
> --------------------------------------------
>
>                 Key: KAFKA-1120
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1120
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.8.1
>            Reporter: Jun Rao
>
> When the controller is in the middle of processing a task (e.g., preferred 
> leader election, broker change), it holds a controller lock. During this 
> time, a broker could have de-registered and re-registered itself in ZK. After 
> the controller finishes processing the current task, it will start processing 
> the logic in the broker change listener. However, it will see no broker 
> change and therefore won't do anything to the restarted broker. This broker 
> will be in a weird state since the controller doesn't inform it to become the 
> leader of any partition. Yet, the cached metadata in other brokers could 
> still list that broker as the leader for some partitions. Client requests 
> routed to that broker will then get a TopicOrPartitionNotExistException. This 
> broker will continue to be in this bad state until it's restarted again.
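
The race in the description above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Kafka's controller code: FakeZk, diff_by_id, diff_by_registration and simulate are hypothetical names, and the "registration id" is an assumed stand-in for a per-registration marker such as the creation zxid (czxid) of the broker's ephemeral znode. It shows why comparing only broker ids can miss a broker that de-registered and re-registered while the controller was busy, and how comparing a per-registration marker might expose the bounce.

```python
from dataclasses import dataclass, field


@dataclass
class FakeZk:
    """Hypothetical stand-in for /brokers/ids: broker id -> registration id.

    The registration id is an assumption standing in for a per-registration
    marker (for example a znode's czxid); it grows on every registration.
    """
    next_zxid: int = 1
    brokers: dict = field(default_factory=dict)

    def register(self, broker_id):
        self.brokers[broker_id] = self.next_zxid
        self.next_zxid += 1

    def deregister(self, broker_id):
        del self.brokers[broker_id]


def diff_by_id(cached, current):
    """Compare only broker ids, as a plain children-list diff would."""
    new = sorted(set(current) - set(cached))
    dead = sorted(set(cached) - set(current))
    return new, dead, []


def diff_by_registration(cached, current):
    """Also flag brokers whose registration id changed (they bounced)."""
    new, dead, _ = diff_by_id(cached, current)
    bounced = sorted(
        b for b in current if b in cached and cached[b] != current[b]
    )
    return new, dead, bounced


def simulate():
    zk = FakeZk()
    for b in (1, 2, 3, 4, 5):
        zk.register(b)
    cached = dict(zk.brokers)   # controller's view before it takes the lock

    # The controller is busy holding its lock; broker 1 bounces meanwhile.
    zk.deregister(1)
    zk.register(1)

    current = dict(zk.brokers)  # what the listener reads once the lock is free
    return diff_by_id(cached, current), diff_by_registration(cached, current)


if __name__ == "__main__":
    by_id, by_reg = simulate()
    print("id-only diff (new, dead, bounced):     ", by_id)
    print("registration diff (new, dead, bounced):", by_reg)
```

In this sketch the id-only diff comes back empty, so nothing would be done for the restarted broker, while the registration-aware diff reports broker 1 as bounced. Whether the controller should then treat a bounced broker as a failure followed by a startup is a design choice the sketch does not settle.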



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
