[ 
https://issues.apache.org/jira/browse/KAFKA-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ding Haifeng updated KAFKA-1600:
--------------------------------

    Attachment: kafka_failure_logs.tar.gz

Guozhang and Neha, thanks for the reply.

The attachment contains controller.log and server.log from 2 of the 10 brokers.
The broker with broker.id=6 is the misbehaving controller.

controller.log on the other brokers is empty for that period, which also shows
that controller failover did not happen. server.log from the other brokers is
much the same as from these two and is not attached.

Some key moments that may help in understanding the logs:
14:30:50 - a new topic "user_action_log_from_history" was created.
16:04:51 - topic "user_action_log_from_history" was deleted.
16:04:56 - the last line in controller.log on broker 6 was written. The
ActiveControllerCount metric also dropped to 0 at that point.
16:28:48 - another broker (broker.id=1) was restarted manually but failed to
start. Some topic partitions on broker 1 lost their leader and have not been
readable or writable since then.

What happened afterwards:
We didn’t fully understand what was wrong at the time. To bring the production
system back to work ASAP, we created another Kafka cluster and switched to it.
In the post-mortem analysis we found the clues above and opened this issue. I
hope it helps. Please contact me if you need any other information.
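
For anyone who wants to check a cluster for the same condition, here is a
minimal sketch of the comparison we made by hand: the broker id registered in
the /kafka/controller znode versus the ActiveControllerCount reported by each
broker. The function names and status strings are hypothetical, and the znode
payload format (JSON with a "brokerid" field, or a bare integer) is an
assumption. Fetching the znode data and the JMX values is left out.

```python
import json
from typing import Dict, Optional


def registered_controller(znode_data: Optional[str]) -> Optional[int]:
    """Return the broker id stored in the controller znode, or None.

    Assumption: the payload is either JSON with a "brokerid" field or a
    bare integer. Anything else is treated as "no controller registered".
    """
    if not znode_data:
        return None
    try:
        parsed = json.loads(znode_data)
        if isinstance(parsed, dict):
            return int(parsed["brokerid"])
        return int(parsed)
    except (ValueError, KeyError, TypeError):
        return None


def diagnose(znode_data: Optional[str], active_counts: Dict[int, int]) -> str:
    """Classify controller state from the znode and per-broker metrics.

    active_counts maps broker id -> ActiveControllerCount read from JMX.
    """
    registered = registered_controller(znode_data)
    # Brokers that currently claim to be the active controller.
    active = sorted(b for b, count in active_counts.items() if count > 0)

    if registered is None:
        return "unregistered"   # znode missing; an election should follow
    if active == [registered]:
        return "healthy"        # znode owner is the only active controller
    if not active:
        return "stuck"          # znode held, but nobody acts as controller
    return "inconsistent"       # active controller(s) differ from the znode


if __name__ == "__main__":
    # The situation from this issue: broker 6 holds the znode, all metrics 0.
    znode = '{"version":1,"brokerid":6,"timestamp":"0"}'
    counts = {broker_id: 0 for broker_id in range(1, 11)}
    print(diagnose(znode, counts))
```

In our case the result would have been "stuck" from 16:04:56 onwards, which
might be a useful condition to alert on.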


> Controller failover not working correctly.
> ------------------------------------------
>
>                 Key: KAFKA-1600
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1600
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 0.8.1
>         Environment: Linux 3.2.0-4-amd64 #1 SMP Debian 3.2.46-1 x86_64 
> GNU/Linux
> java version "1.7.0_03"
>            Reporter: Ding Haifeng
>            Assignee: Neha Narkhede
>         Attachments: kafka_failure_logs.tar.gz
>
>
> We are running a 10-node Kafka 0.8.1 cluster and experienced the following
> failure.
> At some point, broker A stopped acting as controller. We saw this when the
> kafka.controller - KafkaController - ActiveControllerCount JMX metric
> dropped from 1 to 0.
> Meanwhile, broker A was still running and still registered in the
> zookeeper /kafka/controller node, so no other broker could be elected as the
> new controller.
> From then on the cluster was running without a controller. Producers and
> consumers still worked, but functions requiring a controller, such as leader
> election for new topics and leader failover, no longer worked.
> A forced restart of broker A could trigger a controller election and bring
> the cluster back to a correct state.
> These are our brief observations. I can provide more information if
> needed.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
