[ https://issues.apache.org/jira/browse/KAFKA-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15233288#comment-15233288 ]

Robert Christ commented on KAFKA-3042:
--------------------------------------

Hello again,

We have managed to reproduce the problem and have a snapshot of the logs.
The tarball is about a gigabyte.  What should I do with it?

We spent a couple of hours reproducing it and lots of exciting things seemed to happen.
I've added a basic timeline of our attempts to reproduce. The pattern is essentially to
hard-kill the controller until bad things happen. On the first pass the controller only
moved once and we did not see any issues. On the second pass, the controller moved once
and seemed to work fine, but 10 minutes later the controller moved twice more and we saw
the "zkVersion" symptom, though only briefly, and the broker managed to recover. The
third time, the controller bounced many times and we ended up seeing the problem on
broker 1 and broker 2 for a sustained period, until we started restarting brokers to
try to recover the cluster.

The logs starting at 00:35 should be where we initiated the shutdown and were able to
reproduce the problem.
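For context, here is a rough sketch (not the exact commands we ran) of the kind of watch
our zkCli.sh session gave us on Kafka's /controller znode, written against the Python
kazoo client; the ZooKeeper connect string is just a placeholder:

# Rough sketch only: a kazoo-based equivalent of watching the controller via zkCli.sh.
# The connect string is a placeholder; /controller is Kafka's controller znode.
from kazoo.client import KazooClient
import time

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

@zk.DataWatch("/controller")
def on_controller_change(data, stat):
    # Fires on every controller move; data is JSON like {"version":1,"brokerid":4,...}
    ts = time.strftime("%H:%M:%S")
    print(ts, "controller znode:", data, "version:", stat.version if stat else None)

while True:          # keep the session open while brokers are being hard-killed
    time.sleep(1)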

controller hard kill (5) 22:42:19
4 took controller
first try did not reproduce

controller hard kill (4) 22:48:10
2 took controller

1 took controller (22:59:06)
4 took controller (23:00:52)

1 showing cached zkVersion messages (23:00:13)
1 no longer showing cached zkVersion messages (23:01:10)

controller hard kill (4) 00:35:26
3 took controller 00:36:27
2 took controller 00:37:30
3 took controller 00:37:54
1 took controller 00:39:07

my zkCli.sh session, which I was using to watch the controller, exited here, so I was
disconnected for a minute

3 took controller 00:39:57
5 took controller 00:40:54

broker 1 has zkVersion problem
broker 2 has zkVersion problem



broker 1 controlled shutdown to fix (00:56:46)
broker 2 controlled shutdown to fix (00:58:54)

broker 2 appears to be shut down but hasn't exited yet
hard kill broker 2 (01:00:38)

reverting to the 60000ms timeout and restarting all the brokers (01:05:34)
broker 1 first (about a minute before 01:07:21)


> updateIsr should stop after failed several times due to zkVersion issue
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-3042
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3042
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1
>         Environment: jdk 1.7
> centos 6.4
>            Reporter: Jiahongchao
>         Attachments: controller.log, server.log.2016-03-23-01, 
> state-change.log
>
>
> Sometimes one broker may repeatedly log
> "Cached zkVersion 54 not equal to that in zookeeper, skip updating ISR"
> I think this is because the broker considers itself the leader when in fact it is
> a follower.
> So after several failed tries, it needs to find out who the leader is
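
To make the suggestion in the issue description concrete, here is a minimal sketch of
the bounded-retry idea; it is not Kafka's actual updateIsr code, and the helper name is
made up. It uses kazoo's conditional set, which raises BadVersionError when the supplied
(cached) version is stale -- the same mismatch the "Cached zkVersion ... not equal" log
message reports:

# Minimal sketch of the bounded-retry idea, NOT Kafka's actual updateIsr code.
from kazoo.exceptions import BadVersionError

def update_isr_with_bound(zk, path, new_state_bytes, cached_version, max_attempts=3):
    for _ in range(max_attempts):
        try:
            stat = zk.set(path, new_state_bytes, version=cached_version)
            return stat.version                    # conditional write succeeded
        except BadVersionError:
            pass                                   # cached zkVersion is stale
    # After several failures, stop spinning: re-read the znode and let the caller
    # work out who the real leader is instead of retrying with stale state forever.
    data, stat = zk.get(path)
    return None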



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
