0.8.1.1 fixed an issue where the controller could hang during a soft failure.

Out-of-sync replicas will catch up from the leader automatically once restarted.
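
A quick way to confirm that they have caught up after the restart (assuming
the 0.8.1 tools; the zookeeper connect string below is a placeholder) is to
list any partitions that are still missing replicas:

  bin/kafka-topics.sh --describe --zookeeper <zk-connect> --under-replicated-partitions

If that prints nothing, every replica is back in the ISR.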

Thanks,

Jun


On Mon, May 12, 2014 at 12:37 PM, Ian Friedman <i...@flurry.com> wrote:

> You're probably right that it has to be GC, given the CMS activity I saw in
> the log, but I didn't see a single concurrent mode failure, so it bothers me
> that we had this happen anyway... also, the ZK timeout is set to a very
> large number... I dunno, seems weird.
>
> I will see what I can do about getting 0.8.1.1 deployed... how do you
> expect it to address this problem?
>
> --Ian
>
>
> On May 12, 2014, at 10:49 AM, Jun Rao <jun...@gmail.com> wrote:
>
> Could you try 0.8.1.1? It fixed some bugs related to the controller hanging
> during shutdown.
>
> For ZK session expiration, the #1 cause is GC. We have also seen transient
> issues with the underlying storage (e.g., a RAID controller reset) cause ZK
> session expiration.
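>
> If you want the broker to ride out longer pauses, the relevant broker setting
> is zookeeper.session.timeout.ms (6000 ms by default in 0.8.x, if I remember
> correctly); a minimal sketch with an illustrative value:
>
>   # server.properties (illustrative value)
>   zookeeper.session.timeout.ms=30000
>
> Turning on GC logging (e.g. -verbose:gc -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps -Xloggc:<gc log file>) also makes it easy to correlate
> pauses with session expirations.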
>
> As for the data loss in the producer, you are probably using ack=1, which
> could lead to data loss during leader failover. It's just that in this case,
> the failover window is forever due to some bugs.
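>
> For reference, with the 0.8.x producer this is the request.required.acks
> setting; a minimal sketch of a producer config (the broker list is a
> placeholder) that waits on the full ISR instead of just the leader:
>
>   # producer.properties (illustrative values)
>   metadata.broker.list=broker1:9092,broker2:9092
>   request.required.acks=-1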
>
> Thanks,
>
> Jun
>
>
> On Sun, May 11, 2014 at 10:14 PM, Ian Friedman <i...@flurry.com> wrote:
>
>> Jun - We're using 0.8.1
>> The timestamps in the last few entries in controller.log seem to
>> correspond to when the trouble started in server.log
>>
>> controller.log:
>>
>> [2014-05-08 19:01:27,693] INFO [SessionExpirationListener on 1], ZK
>> expired; shut down all controller components and try to re-elect
>> (kafka.controller.KafkaController$SessionExpirationListener)
>>
>> [2014-05-08 19:01:27,694] INFO [delete-topics-thread], Shutting down
>> (kafka.controller.TopicDeletionManager$DeleteTopicsThread)
>>
>> server.log:
>>
>> [2014-05-08 19:01:27,382] INFO Closing socket connection to /10.10.13.3.
>> (kafka.network.Processor)
>>
>> [2014-05-08 19:01:27,385] INFO Closing socket connection to /10.10.57.2.
>> (kafka.network.Processor)
>>
>> [2014-05-08 19:01:27,385] INFO Closing socket connection to /10.10.57.2.
>> (kafka.network.Processor)
>>
>> [2014-05-08 19:01:29,886] INFO Partition [callbackServiceTopic-Medium,27]
>> on broker 1: Shrinking ISR for partition [callbackServiceTopic-Medium,27]
>> from 1,2,3 to 1 (kafka.cluster.Partition)
>>
>> [2014-05-08 19:01:30,109] ERROR Conditional update of path
>> /brokers/topics/callbackServiceTopic-Medium/partitions/27/state with data
>> {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]}
>> and expected version 9 failed due to
>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
>> BadVersion for
>> /brokers/topics/callbackServiceTopic-Medium/partitions/27/state
>> (kafka.utils.ZkUtils$)
>>
>> [2014-05-08 19:01:30,109] INFO Partition [callbackServiceTopic-Medium,27]
>> on broker 1: Cached zkVersion [9] not equal to that in zookeeper, skip
>> updating ISR (kafka.cluster.Partition)
>>
>> [2014-05-08 19:01:30,109] INFO Partition [callbackServiceTopic-High,3] on
>> broker 1: Shrinking ISR for partition [callbackServiceTopic-High,3] from
>> 1,2,3 to 1 (kafka.cluster.Partition)
>>
>> [2014-05-08 19:01:30,111] ERROR Conditional update of path
>> /brokers/topics/callbackServiceTopic-High/partitions/3/state with data
>> {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]}
>> and expected version 9 failed due to
>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
>> BadVersion for /brokers/topics/callbackServiceTopic-High/partitions/3/state
>> (kafka.utils.ZkUtils$)
>>
>> I'm also concerned about why this happened at all... I do see slightly
>> elevated CMS activity in the GC log, but I don't see any Full GCs, and the
>> whole thing was over within a minute... it's troubling that something like
>> this could make a broker fail this way!
>>
>> One of the aftereffects we found from leaving the broker running in this
>> state is that it was still allowing producers with cached metadata to
>> produce to it, so we lost a whole bunch of messages before I shut it down.
>> What's up with that??
>>
>>
>> --Ian
>>
>>
>> On Sun, May 11, 2014 at 11:39 PM, Jun Rao <jun...@gmail.com> wrote:
>>
>>> It seems that broker 1 had a soft failure (potentially due to GC).
>>> However, somehow it didn't receive the latest LeaderAndIsrRequest from the
>>> controller. Which version of Kafka are you using? In the controller log, do
>>> you see broker 1 being added back as a live broker?
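>>>
>>> A quick way to check (assuming the usual 0.8.x controller log phrasing) is
>>> to grep the controller log for the broker change events, e.g.:
>>>
>>>   grep -i "newly added brokers" controller.log
>>>
>>> Broker 1 should show up in the "all live brokers" list once its ZK session
>>> is re-established.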
>>>
>>> Thanks,
>>>
>>> Jun
>>>
>>>
>>> On Fri, May 9, 2014 at 10:21 AM, Ian Friedman <i...@flurry.com> wrote:
>>>
>>>> This seems similar to behavior we're seeing. At some point one of our
>>>> brokers (id 1) just gives up and starts throwing those errors, and
>>>> kafka-topics no longer lists it in the ISR. However, the logs for that
>>>> broker say something very odd:
>>>>
>>>> [2014-05-09 10:16:00,248] INFO Partition [callbackServiceTopic-High,8]
>>>> on broker 1: Cached zkVersion [10] not equal to that in zookeeper, skip
>>>> updating ISR (kafka.cluster.Partition)
>>>> [2014-05-09 10:16:00,248] INFO Partition [callbackServiceTopic,3] on
>>>> broker 1: Shrinking ISR for partition [callbackServiceTopic,3] from 1,2,3
>>>> to 1 (kafka.cluster.Partition)
>>>> [2014-05-09 10:16:00,251] ERROR Conditional update of path
>>>> /brokers/topics/callbackServiceTopic/partitions/3/state with data
>>>> {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]}
>>>> and expected version 9 failed due to
>>>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
>>>> BadVersion for /brokers/topics/callbackServiceTopic/partitions/3/state
>>>> (kafka.utils.ZkUtils$)
>>>> [2014-05-09 10:16:00,251] INFO Partition [callbackServiceTopic,3] on
>>>> broker 1: Cached zkVersion [9] not equal to that in zookeeper, skip
>>>> updating ISR (kafka.cluster.Partition)
>>>> [2014-05-09 10:16:00,251] INFO Partition [callbackServiceTopic-High,31]
>>>> on broker 1: Shrinking ISR for partition [callbackServiceTopic-High,31]
>>>> from 1,2,3 to 1 (kafka.cluster.Partition)
>>>> [2014-05-09 10:16:00,255] ERROR Conditional update of path
>>>> /brokers/topics/callbackServiceTopic-High/partitions/31/state with data
>>>> {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]}
>>>> and expected version 9 failed due to
>>>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
>>>> BadVersion for
>>>> /brokers/topics/callbackServiceTopic-High/partitions/31/state
>>>> (kafka.utils.ZkUtils$)
>>>> [2014-05-09 10:16:00,255] INFO Partition [callbackServiceTopic-High,31]
>>>> on broker 1: Cached zkVersion [9] not equal to that in zookeeper, skip
>>>> updating ISR (kafka.cluster.Partition)
>>>> [2014-05-09 10:16:00,255] INFO Partition [callbackServiceTopic-Low,3]
>>>> on broker 1: Shrinking ISR for partition [callbackServiceTopic-Low,3] from
>>>> 1,2,3 to 1 (kafka.cluster.Partition)
>>>> [2014-05-09 10:16:00,258] ERROR Conditional update of path
>>>> /brokers/topics/callbackServiceTopic-Low/partitions/3/state with data
>>>> {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]}
>>>> and expected version 9 failed due to
>>>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
>>>> BadVersion for /brokers/topics/callbackServiceTopic-Low/partitions/3/state
>>>> (kafka.utils.ZkUtils$)
>>>>
>>>> etc. And these errors continue every few seconds.
>>>>
>>>> kafka-topics.sh --describe output:
>>>>
>>>> Topic:callbackServiceTopic-High PartitionCount:50 ReplicationFactor:3 Configs:
>>>>         Topic: callbackServiceTopic-High  Partition: 0  Leader: 2  Replicas: 3,1,2  Isr: 2,3
>>>>         Topic: callbackServiceTopic-High  Partition: 1  Leader: 2  Replicas: 1,2,3  Isr: 2,3
>>>>         Topic: callbackServiceTopic-High  Partition: 2  Leader: 2  Replicas: 2,3,1  Isr: 2,3
>>>>
>>>> I went and looked at one of those ZNodes in the zkCLI and found this:
>>>>
>>>> [zk: localhost:2181(CONNECTED) 2] get
>>>> /brokers/topics/callbackServiceTopic-High/partitions/31/state
>>>>
>>>> {"controller_epoch":5,"leader":2,"version":1,"leader_epoch":5,"isr":[2,3]}
>>>>
>>>> What does the version number there represent and how does it get out of
>>>> sync? Should I restart broker 1? Is the fact that broker 1 is behind in
>>>> leader_epoch significant?
>>>>
>>>> Still trying to figure out Kafka operations :(
>>>> —Ian
>>>>
>>>>
>>>> On Apr 24, 2014, at 9:26 PM, 陈小军 <chenxiao...@nhn.com> wrote:
>>>>
>>>> > I didn't do any partition reassignment.
>>>> >
>>>> > This phenomenon happens when the broker hits the following error:
>>>> >
>>>> > [hadoop@nelo76 libs]$ [2014-03-14 12:11:44,310] INFO Partition
>>>> [nelo2-normal-logs,0] on broker 0: Shrinking ISR for partition
>>>> [nelo2-normal-logs,0] from 0,1 to 0 (kafka.cluster.Partition)
>>>> > [2014-03-14 12:11:44,313] ERROR Conditional update of path
>>>> /brokers/topics/nelo2-normal-logs/partitions/0/state with data
>>>> {"controller_epoch":4,"leader":0,"version":1,"leader_epoch":5,"isr":[0]}
>>>> and expected version 7 failed due to
>>>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
>>>> BadVersion for /brokers/topics/nelo2-normal-logs/partitions/0/state
>>>> (kafka.utils.ZkUtils$)
>>>> > [2014-03-14 12:11:44,313] INFO Partition [nelo2-normal-logs,0] on
>>>> broker 0: Cached zkVersion [7] not equal to that in zookeeper, skip
>>>> updating ISR (kafka.cluster.Partition)
>>>> > [2014-03-14 12:11:44,313] INFO Partition [nelo2-symbolicated-logs,1]
>>>> on broker 0: Shrinking ISR for partition [nelo2-symbolicated-logs,1] from
>>>> 0,2 to 0 (kafka.cluster.Partition)
>>>> > [2014-03-14 12:11:44,315] ERROR Conditional update of path
>>>> /brokers/topics/nelo2-symbolicated-logs/partitions/1/state with data
>>>> {"controller_epoch":4,"leader":0,"version":1,"leader_epoch":6,"isr":[0]}
>>>> and expected version 8 failed due to
>>>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
>>>> BadVersion for /brokers/topics/nelo2-symbolicated-logs/partitions/1/state
>>>> (kafka.utils.ZkUtils$)
>>>> > [2014-03-14 12:11:44,315] INFO Partition [nelo2-symbolicated-logs,1]
>>>> on broker 0: Cached zkVersion [8] not equal to that in zookeeper, skip
>>>> updating ISR (kafka.cluster.Partition)
>>>> > [2014-03-14 12:11:44,316] INFO Partition [nelo2-crash-logs,1] on
>>>> broker 0: Shrinking ISR for partition [nelo2-crash-logs,1] from 0,1 to 0
>>>> (kafka.cluster.Partition)
>>>> > [2014-03-14 12:11:44,318] ERROR Conditional update of path
>>>> /brokers/topics/nelo2-crash-logs/partitions/1/state with data
>>>> {"controller_epoch":4,"leader":0,"version":1,"leader_epoch":5,"isr":[0]}
>>>> and expected version 7 failed due to
>>>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
>>>> BadVersion for /brokers/topics/nelo2-crash-logs/partitions/1/state
>>>> (kafka.utils.ZkUtils$)
>>>> > [2014-03-14 12:11:44,318] INFO Partition [nelo2-crash-logs,1] on
>>>> broker 0: Cached zkVersion [7] not equal to that in zookeeper, skip
>>>> updating ISR (kafka.cluster.Partition)
>>>> > Best Regards
>>>> > Jerry
>>>> > -----Original Message-----
>>>> > From: "Jun Rao" <jun...@gmail.com>
>>>> > To: "users@kafka.apache.org" <users@kafka.apache.org>; "陈小军"
>>>> > <chenxiao...@nhn.com>;
>>>> > Cc:
>>>> > Sent: 2014-04-25 (Friday) 02:12:02
>>>> > Subject: Re: question about isr
>>>> >
>>>> > Interesting. Which version of Kafka are you using? Were you doing some
>>>> > partition reassignment?
>>>> >
>>>> > Thanks, Jun
>>>> >
>>>> >
>>>> > On Wed, Apr 23, 2014 at 11:14 PM, 陈小军 &lt;chenxiao...@nhn.com&gt;
>>>> wrote:
>>>> >
>>>> > Hi Team,
>>>> >
>>>> >   I found a strange phenomenon with the ISR list in my Kafka cluster.
>>>> >
>>>> >
>>>> >
>>>> >   When I use the tool that Kafka provides to get the topic information,
>>>> > it shows the ISR list as follows, which seems OK:
>>>> >
>>>> >
>>>> >
>>>> >     [irt...@xseed171.kdev bin]$ ./kafka-topics.sh --describe --zookeeper
>>>> >     10.96.250.215:10013,10.96.250.216:10013,10.96.250.217:10013/nelo2-kafka
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > Topic:nelo2-normal-logs PartitionCount:3 ReplicationFactor:2 Configs:
>>>> >        Topic: nelo2-normal-logs  Partition: 0  Leader: 3  Replicas: 3,0  Isr: 0,3
>>>> >        Topic: nelo2-normal-logs  Partition: 1  Leader: 0  Replicas: 0,1  Isr: 0,1
>>>> >        Topic: nelo2-normal-logs  Partition: 2  Leader: 1  Replicas: 1,3  Isr: 1,3
>>>> >
>>>> >
>>>> >
>>>> >  But when I use an SDK to get the metadata from the broker, the ISR is
>>>> > different:
>>>> >
>>>> > metadata: { size: 246, correlationId: 0, brokerNum: -1, nodeId: 1,
>>>> >   host: 'xseed171.kdev.nhnsystem.com', port: 9093, topicNum: 0,
>>>> >   topicError: 0, topic: 'nelo2-normal-logs', partitionNum: 2,
>>>> >   errorCode: 0, partition: 0, leader: 3, replicasNum: 2,
>>>> >   replicas: [ 3, 0 ], isrNum: 2, isr: [ 0, 3 ] }
>>>> >
>>>> > metadata: { size: 246, correlationId: 0, brokerNum: -1, nodeId: 1,
>>>> >   host: 'xseed171.kdev.nhnsystem.com', port: 9093, topicNum: 0,
>>>> >   topicError: 0, topic: 'nelo2-normal-logs', partitionNum: 1,
>>>> >   errorCode: 0, partition: 1, leader: 0, replicasNum: 2,
>>>> >   replicas: [ 0, 1 ], isrNum: 2, isr: [ 0, 1 ] }
>>>> >
>>>> > metadata: { size: 246, correlationId: 0, brokerNum: -1, nodeId: 1,
>>>> >   host: 'xseed171.kdev.nhnsystem.com', port: 9093, topicNum: 0,
>>>> >   topicError: 0, topic: 'nelo2-normal-logs', partitionNum: 0,
>>>> >   errorCode: 0, partition: 2, leader: 1, replicasNum: 2,
>>>> >   replicas: [ 1, 3 ], isrNum: 1, isr: [ 1 ] }
>>>> >
>>>> >
>>>> >
>>>> > I also used another SDK and got the same result. I checked the logs
>>>> > from Kafka, and it seems the SDK result is right and the tool gets the
>>>> > wrong result. Why does this happen?
>>>> >
>>>> >
>>>> >
>>>> > [2014-04-24 14:53:57,705] TRACE Broker 3 cached leader info
>>>> (LeaderAndIsrInfo:(Leader:0,ISR:0,1,LeaderEpoch:7,ControllerEpoch:9),ReplicationFactor:2),AllReplicas:0,1)
>>>> for partition [nelo2-normal-logs,1] in response to UpdateMetadata request
>>>> sent by controller 0 epoch 10 with correlation id 13 (state.change.logger)
>>>> >
>>>> >
>>>> > [2014-04-24 14:53:57,705] TRACE Broker 3 cached leader info
>>>> (LeaderAndIsrInfo:(Leader:1,ISR:1,LeaderEpoch:9,ControllerEpoch:10),ReplicationFactor:2),AllReplicas:1,3)
>>>> for partition [nelo2-normal-logs,2] in response to UpdateMetadata request
>>>> sent by controller 0 epoch 10 with correlation id 13 (state.change.logger)
>>>> >
>>>> >
>>>> > [2014-04-24 14:53:57,705] TRACE Broker 3 cached leader info
>>>> (LeaderAndIsrInfo:(Leader:3,ISR:0,3,LeaderEpoch:10,ControllerEpoch:10),ReplicationFactor:2),AllReplicas:3,0)
>>>> for partition [nelo2-normal-logs,0] in response to UpdateMetadata request
>>>> sent by controller 0 epoch 10 with correlation id 13 (state.change.logger)
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > Thanks~!
>>>> >
>>>> >
>>>> >
>>>> > Best Regards
>>>> >
>>>> > Jerry
>>>> >
>>>> >
>>>> >
>>>> >
>>>>
>>>>
>>>
>>
>
>
