Same here. We started running into similar situations almost weekly ever since we
increased the number of partitions on some topics from 6 to 15 and added 3 brokers
to our Kafka cluster.
Last night I stopped all producers and consumers, restarted the brokers and
zookeepers, and then restarted the producers/consumers. This morning I see an
endless loop of shrinking ISR -> "cached zkVersion not equal to that in
zookeeper, skip updating ISR" again:

  [2015-11-07 11:55:47,260] INFO Partition [Wmt_Saturday_234,10] on broker 0: Shrinking ISR for partition [Wmt_Saturday_234,10] from 0,1 to 0 (kafka.cluster.Partition)
  [2015-11-07 11:55:47,267] INFO Partition [Wmt_Saturday_234,10] on broker 0: Cached zkVersion [10] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)

On Tue, Oct 20, 2015 at 9:45 AM, Shaun Senecal <shaun.sene...@lithium.com> wrote:

> I can't say this is the same issue, but it sounds similar to a situation
> we experienced with Kafka 0.8.2.[1-2]. After restarting a broker, the
> cluster would never really recover (ISRs constantly changing, replication
> failing, etc.). We found the only way to fully recover the cluster was to
> stop all producers and consumers, restart the Kafka cluster, and then, once
> the cluster was back up, restart the producers/consumers. Obviously that's
> not acceptable for a production cluster, but it was the only thing we could
> find that would get us going again.
>
>
> Shaun
>
> ________________________________________
> From: Szymon Sobczak <szymon.sobc...@getbase.com>
> Sent: October 19, 2015 9:52 PM
> To: users@kafka.apache.org
> Cc: Big Data
> Subject: It's 5.41am, we're after 20+ hours of debugging our prod cluster.
> See NotAssignedReplicaException and UnknownException errors. Help?
>
> Hi!
>
> We're running a 5-machine production Kafka cluster on version 0.8.1.1.
> Yesterday we had some disk problems on one of the replicas and decided to
> replace that node with a clean one. That's when we started experiencing
> many different problems:
>
> - partition replicas are still assigned to the old node and we can't
>   remove it from the replica list
> - replicas are lagging behind, and most of the topics have only one ISR
> - most of the leaders are on a single node
> - CPU load on the machines is constantly high
>
> We've tried to rebalance the cluster by moving the leaders, decreasing the
> number of replicas, and a few other things, but none of it seems to help.
> In the meantime I've noticed some very weird errors in kafka.log.
>
> For partition 0 of topic product_templates, with the following description:
>
>   Topic:product_templates  PartitionCount:2  ReplicationFactor:3  Configs:
>     Topic: product_templates  Partition: 0  Leader: 135  Replicas: 135,163,68  Isr: 135,68,163
>     Topic: product_templates  Partition: 1  Leader: 155  Replicas: 163,68,164  Isr: 155,68,164
>
> On machine 135 (which is the leader of product_templates,0), kafka.log
> shows:
>
>   kafka.common.NotAssignedReplicaException: Leader 135 failed to record
>   follower 155's position 0 for partition [product_templates,0] since the
>   replica 155 is not recognized to be one of the assigned replicas
>   68,163,135 for partition [product_templates,0]
>
> And the complementary error on 155, which is NOT a replica of
> product_templates,0:
>
>   ERROR [ReplicaFetcherThread-0-135] 2015-10-20 04:41:47,011 Logging.scala
>   kafka.server.ReplicaFetcherThread [ReplicaFetcherThread-0-135], Error for
>   partition [product_templates,0] to broker 135:class
>   kafka.common.UnknownException
>
> Both of these happen for multiple topics, on multiple machines, and every
> single one happens multiple times per second...
>
> How should we approach this? Any help is appreciated!
>
> Thanks!
> Szymon.
>
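In case it helps anyone hitting the same loop: a rough way to see what ZooKeeper
currently holds for a partition (leader, ISR, and the znode's dataVersion, which
is the zkVersion the broker's cache is being compared against) is to read the
state znodes with the zookeeper-shell.sh that ships with Kafka. The ZooKeeper
host/port below is just a placeholder for our setup, and the commands assume you
run them from the Kafka install directory:

  # who ZooKeeper currently thinks the controller is
  bin/zookeeper-shell.sh zk1:2181 get /controller

  # assigned replicas per partition (should match "Replicas" in --describe)
  bin/zookeeper-shell.sh zk1:2181 get /brokers/topics/Wmt_Saturday_234

  # current leader/ISR for one partition; the Stat printed after the JSON
  # includes dataVersion, i.e. the zkVersion the brokers compare against
  bin/zookeeper-shell.sh zk1:2181 get /brokers/topics/Wmt_Saturday_234/partitions/10/state

If the state znode disagrees with what a broker is logging, one less drastic step
that has been suggested for this kind of flapping is restarting only the broker
that /controller points at, so that a fresh controller gets elected and pushes
out new metadata, rather than bouncing the whole cluster.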
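And on the "replicas are still assigned to the old node" part of Szymon's
original mail: the only supported way I know of on 0.8.x to change a partition's
replica list is kafka-reassign-partitions.sh with a hand-written JSON file. A
minimal sketch follows; the target replica list [163,68,135] and the ZooKeeper
address are made up for illustration, the point is just the shape of the file
and the --execute / --verify steps:

  cat > reassign.json <<'EOF'
  {"version": 1,
   "partitions": [
     {"topic": "product_templates", "partition": 1, "replicas": [163,68,135]}
   ]}
  EOF

  # kick off the reassignment
  bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
    --reassignment-json-file reassign.json --execute

  # re-run until every partition reports "completed successfully"
  bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
    --reassignment-json-file reassign.json --verify

Once the replica lists no longer reference the dead broker, running
bin/kafka-preferred-replica-election.sh --zookeeper zk1:2181 should move leaders
back to the preferred (first-listed) replica of each partition instead of
leaving most of them piled up on a single node.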