Quick clarification: you say broker 0, but do you actually mean broker
25? 25, one of the replicas for the partition, is currently the one
having trouble getting into sync, and 28 is the leader for the partition.
Unfortunately, the logs have rotated off, so I can't get to what happened
around then. However, there was a period of a few hours where we had
two brokers that both believed they were the controller. I'm not sure why
I didn't think of this before.
ZooKeeper data appears to be inconsistent at the moment.
/brokers/topics/click_engage says that partition 116's replica set is:
[4, 7, 25]. /brokers/topics/click_engage/partitions/116/state says the
leader is 28 and the ISR is [28, 15]. Does this need to be resolved, and
if so how?
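
The mismatch described above can be checked mechanically. The following is a minimal sketch, not a Kafka tool: it assumes the two znodes hold JSON shaped roughly like the hypothetical payloads below (only the broker ids are taken from this message; the field layout and the function name are illustrative assumptions) and reports any leader or ISR member that is missing from the assigned replica set.

```python
import json

# Hypothetical znode payloads. The JSON layout is an assumption made for
# illustration; only the broker ids come from the message above.
topic_znode = json.dumps({"partitions": {"116": [4, 7, 25]}})
state_znode = json.dumps({"leader": 28, "isr": [28, 15]})


def find_inconsistencies(topic_json, state_json, partition):
    """Return a list of problems for one partition (empty if consistent)."""
    assigned = set(json.loads(topic_json)["partitions"][str(partition)])
    state = json.loads(state_json)
    problems = []
    # The leader should be one of the assigned replicas.
    if state["leader"] not in assigned:
        problems.append(f"leader {state['leader']} is not an assigned replica")
    # Every ISR member should also be an assigned replica.
    stray = sorted(set(state["isr"]) - assigned)
    if stray:
        problems.append(f"ISR members {stray} are not assigned replicas")
    return problems


problems = find_inconsistencies(topic_znode, state_znode, 116)
for problem in problems:
    print(problem)
```

With the values quoted above, both checks fail: neither the leader (28) nor either ISR member (28, 15) appears in [4, 7, 25].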
Thanks,
Wes
Jiangjie Qin <mailto:j...@linkedin.com.INVALID>
April 21, 2015 at 2:24 PM
This means that broker 0 thought broker 28 was the leader for that
partition, but broker 28 had actually already received a
StopReplicaRequest from the controller and stopped serving as a replica
for that partition.
This might happen transiently, but broker 0 will be able to find the
new leader for the partition once it receives a LeaderAndIsrRequest from
the controller with the new leader information. If these messages
keep getting logged for a long time, then there might be an issue.
Can you check the logs on broker 28 around the timestamp
[2015-04-21 12:15:36,585] to see if there is an error log? The error log
might not include partition info.
From: Wes Chow <w...@chartbeat.com <mailto:w...@chartbeat.com>>
Reply-To: "users@kafka.apache.org <mailto:users@kafka.apache.org>"
<users@kafka.apache.org <mailto:users@kafka.apache.org>>
Date: Tuesday, April 21, 2015 at 10:50 AM
To: "users@kafka.apache.org <mailto:users@kafka.apache.org>"
<users@kafka.apache.org <mailto:users@kafka.apache.org>>
Subject: Re: partition reassignment stuck
Not for that particular partition, but I am seeing these errors on 28:
kafka.common.NotAssignedReplicaException: Leader 28 failed to record
follower 25's position 0 for partition [click_engage,116] since the
replica 25 is not recognized to be one of the assigned replicas for
partition [click_engage,116]
at kafka.cluster.Partition.updateLeaderHWAndMaybeExpandIsr(Partition.scala:231)
at kafka.server.ReplicaManager.recordFollowerPosition(ReplicaManager.scala:432)
at kafka.server.KafkaApis$$anonfun$maybeUpdatePartitionHw$2.apply(KafkaApis.scala:460)
at kafka.server.KafkaApis$$anonfun$maybeUpdatePartitionHw$2.apply(KafkaApis.scala:458)
at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:176)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:345)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:345)
at kafka.server.KafkaApis.maybeUpdatePartitionHw(KafkaApis.scala:458)
at kafka.server.KafkaApis.handleFetchRequest(KafkaApis.scala:424)
at kafka.server.KafkaApis.handle(KafkaApis.scala:186)
at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:42)
at java.lang.Thread.run(Thread.java:745)
What does this mean?
Thanks!
Wes
Jiangjie Qin <mailto:j...@linkedin.com.INVALID>
April 21, 2015 at 1:19 PM
Those 00000000000000000000.index files are for different partitions, and
they should be generated when new replicas are assigned to the broker.
We might want to know what caused the UnknownException. Did you see any
error logs on broker 28?
Jiangjie (Becket) Qin
Wes Chow <mailto:w...@chartbeat.com>
April 21, 2015 at 12:16 PM
I started a partition reassignment (this is an 8.1.1 cluster) some time
ago and it seems to be stuck. Partitions are no longer getting moved
around, but it seems like the cluster is operational otherwise. The
stuck nodes have a lot of 00000000000000000000.index files, and their
logs show errors like:
[2015-04-21 12:15:36,585] 3237789 [ReplicaFetcherThread-0-28] ERROR
kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-0-28],
Error for partition [pings,227] to broker 28:class
kafka.common.UnknownException
I'm at a loss as to what I should be looking at. Any ideas?
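
When scanning many such lines, it can help to pull out the fields that matter. The sketch below parses the error line quoted above with a regular expression. It assumes the fetcher thread name has the form ReplicaFetcherThread-<fetcher id>-<source broker id>, so the "0" would be a fetcher index rather than a broker id; that reading, the pattern, and the function name are assumptions for illustration, not something confirmed in this thread.

```python
import re

# The error line from the message above, joined back onto one line.
line = (
    "[2015-04-21 12:15:36,585] 3237789 [ReplicaFetcherThread-0-28] ERROR "
    "kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-0-28], "
    "Error for partition [pings,227] to broker 28:class "
    "kafka.common.UnknownException"
)

# Assumed layout: timestamp, thread name, then the partition/broker/error text.
PATTERN = re.compile(
    r"\[(?P<timestamp>[^\]]+)\].*?"
    r"\[ReplicaFetcherThread-(?P<fetcher_id>\d+)-(?P<source_broker>\d+)\].*?"
    r"Error for partition \[(?P<topic>[^,\]]+),(?P<partition>\d+)\] "
    r"to broker (?P<broker>\d+):class (?P<error>[\w.]+)"
)


def parse_fetcher_error(text):
    """Return the named fields as a dict of strings, or None if no match."""
    match = PATTERN.search(text)
    return match.groupdict() if match else None


fields = parse_fetcher_error(line)
print(fields)
```

Grouping the parsed lines by topic, partition, and broker would show whether the errors are confined to one leader or spread across the cluster.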
Thanks,
Wes