Hard to say, but if your producers keep producing data and everything is working well, then you probably don't need to.
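If you want to double-check, ZooKeeper records the current controller in the /controller znode, so reading it should show a single broker id. A rough, untested sketch using the kazoo Python client (the library choice and the ZooKeeper address are my assumptions, not something from this thread):

    # Untested sketch: read the controller znode to see which broker id currently holds the role.
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1.example.com:2181")   # assumption: your ZooKeeper ensemble
    zk.start()
    data, _ = zk.get("/controller")                  # JSON along the lines of {"version":1,"brokerid":25,...}
    print(data.decode("utf-8"))
    zk.stop()

If more than one broker still claims the role after the bounce, that would point back at the earlier double-controller state.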
On 4/21/15, 5:34 PM, "Wesley Chow" <w...@chartbeat.com> wrote:

>There is only one broker that thinks it's the controller right now. The
>double controller situation happened earlier this morning. Do the other
>brokers have to be bounced after the controller situation is fixed? I did
>not do that for all brokers.
>
>Wes
>
> On Apr 21, 2015 8:25 PM, "Jiangjie Qin" <j...@linkedin.com.invalid> wrote:
>
>> Yes, it should be broker 25, thread 0, from the log.
>> This needs to be resolved; you might need to bounce both of the brokers
>> that think they are the controller. The new controller should then be
>> able to continue the partition reassignment.
>>
>> From: Wes Chow <w...@chartbeat.com>
>> Reply-To: "users@kafka.apache.org" <users@kafka.apache.org>
>> Date: Tuesday, April 21, 2015 at 1:29 PM
>> To: "users@kafka.apache.org" <users@kafka.apache.org>
>> Subject: Re: partition reassignment stuck
>>
>> Quick clarification: you say broker 0, but do you actually mean broker 25?
>> 25, one of the replicas for the partition, is currently the one having
>> trouble getting into sync, and 28 is the leader for the partition.
>>
>> Unfortunately, the logs have rotated off so I can't get to what happened
>> around then. However, there was a period of a few hours where we had
>> two brokers that both believed they were controllers. I'm not sure why I
>> didn't think of this before.
>>
>> ZooKeeper data appears to be inconsistent at the moment.
>> /brokers/topics/click_engage says that partition 116's replica set is
>> [4, 7, 25]. /brokers/topics/click_engage/partitions/116/state says the
>> leader is 28 and the ISR is [28, 15]. Does this need to be resolved, and
>> if so, how?
>>
>> Thanks,
>> Wes
>>
>> Jiangjie Qin <j...@linkedin.com.INVALID>
>> April 21, 2015 at 2:24 PM
>>
>> This means that broker 0 thought broker 28 was the leader for that
>> partition, but broker 28 has actually already received a StopReplicaRequest
>> from the controller and stopped serving as a replica for that partition.
>> This can happen transiently; broker 0 will be able to find the new
>> leader for the partition once it receives a LeaderAndIsrRequest from the
>> controller with the new leader information. If these messages keep getting
>> logged for a long time, then there might be an issue.
>> Can you check around timestamp [2015-04-21 12:15:36,585] on broker 28 to
>> see if there is an error log? The error log might not have the partition
>> info included.
>>
>> From: Wes Chow <w...@chartbeat.com>
>> Reply-To: "users@kafka.apache.org" <users@kafka.apache.org>
>> Date: Tuesday, April 21, 2015 at 10:50 AM
>> To: "users@kafka.apache.org" <users@kafka.apache.org>
>> Subject: Re: partition reassignment stuck
>>
>> Not for that particular partition, but I am seeing these errors on 28:
>>
>> kafka.common.NotAssignedReplicaException: Leader 28 failed to record
>> follower 25's position 0 for partition [click_engage,116] since the replica
>> 25 is not recognized to be one of the assigned replicas for partition
>> [click_engage,116]
>>     at kafka.cluster.Partition.updateLeaderHWAndMaybeExpandIsr(Partition.scala:231)
>>     at kafka.server.ReplicaManager.recordFollowerPosition(ReplicaManager.scala:432)
>>     at kafka.server.KafkaApis$$anonfun$maybeUpdatePartitionHw$2.apply(KafkaApis.scala:460)
>>     at kafka.server.KafkaApis$$anonfun$maybeUpdatePartitionHw$2.apply(KafkaApis.scala:458)
>>     at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:176)
>>     at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:345)
>>     at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:345)
>>     at kafka.server.KafkaApis.maybeUpdatePartitionHw(KafkaApis.scala:458)
>>     at kafka.server.KafkaApis.handleFetchRequest(KafkaApis.scala:424)
>>     at kafka.server.KafkaApis.handle(KafkaApis.scala:186)
>>     at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:42)
>>     at java.lang.Thread.run(Thread.java:745)
>>
>> What does this mean?
>>
>> Thanks!
>> Wes
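As an aside on the ZooKeeper inconsistency Wes describes above (replica set [4, 7, 25] in the topic znode vs. leader 28 / ISR [28, 15] in the partition state znode), an untested kazoo sketch like the following dumps both znodes for partition 116 so they can be compared in one place. The paths and partition number come from his message; the client library and host are assumptions:

    # Untested sketch: compare the assigned replica list with the leader/ISR state for one partition.
    import json
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1.example.com:2181")   # assumption: your ZooKeeper ensemble
    zk.start()
    assignment, _ = zk.get("/brokers/topics/click_engage")
    state, _ = zk.get("/brokers/topics/click_engage/partitions/116/state")
    # In 0.8.x the topic znode looks like {"version":1,"partitions":{"116":[4,7,25],...}}
    print("assigned replicas:", json.loads(assignment)["partitions"]["116"])
    print("leader/ISR state: ", json.loads(state))
    zk.stop()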
>>
>> Jiangjie Qin <j...@linkedin.com.INVALID>
>> April 21, 2015 at 1:19 PM
>>
>> Those 00000000000000000000.index files are for different partitions and
>> are expected to be generated when new replicas are assigned to the broker.
>> We would want to know what caused the UnknownException. Did you see any
>> error log on broker 28?
>>
>> Jiangjie (Becket) Qin
>>
>> Wes Chow <w...@chartbeat.com>
>> April 21, 2015 at 12:16 PM
>>
>> I started a partition reassignment (this is a 0.8.1.1 cluster) some time
>> ago and it seems to be stuck. Partitions are no longer getting moved
>> around, but the cluster seems to be operational otherwise. The stuck
>> nodes have a lot of 00000000000000000000.index files, and their logs show
>> errors like:
>>
>> [2015-04-21 12:15:36,585] 3237789 [ReplicaFetcherThread-0-28] ERROR
>> kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-0-28], Error for
>> partition [pings,227] to broker 28:class kafka.common.UnknownException
>>
>> I'm at a loss as to what I should be looking at. Any ideas?
>>
>> Thanks,
>> Wes
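On the original question of the reassignment being stuck: kafka-reassign-partitions.sh --verify with the original reassignment JSON file is the supported way to check progress. As far as I recall, in 0.8.x the in-flight reassignment also lives in the /admin/reassign_partitions znode, which the controller deletes once it finishes, so the untested kazoo sketch below just peeks at that znode directly (the znode behaviour is my recollection; host and client library are assumptions):

    # Untested sketch: check whether the cluster still records a reassignment as in flight.
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1.example.com:2181")   # assumption: your ZooKeeper ensemble
    zk.start()
    if zk.exists("/admin/reassign_partitions"):
        pending, _ = zk.get("/admin/reassign_partitions")
        print("reassignment still pending:", pending.decode("utf-8"))
    else:
        print("no reassignment in flight")
    zk.stop()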