What I tried so far:

- reassigning the leader to another machine:
  - found a partition where the leader was not the first replica and the error appeared
  - ran the kafka-preferred-replica-election.sh script for that partition (the exact command is sketched at the end of this mail)
  - checked the logs of the new leader - the same NotAssignedReplicaException errors started appearing there
  - checked the logs of the stubborn non-replica - the same UnknownException was still appearing, but it now referenced the new leader
- adding the stubborn follower to Replicas:
  - ran the kafka-reassign-partitions.sh script to add it to Replicas (the JSON and commands are also sketched at the end of this mail)
  - ran kafka-topics.sh --describe to make sure it was added - it was
  - checked the logs of the stubborn non-replica - the same UnknownException was still appearing
  - checked the leader logs - now I see bigger errors (http://pastebin.com/uSRrXa8A), related to another partition and causing the entire request to fail

Now I cannot undo adding 155 to the replica list. I ran kafka-reassign-partitions.sh again with the original description of the partition, and running --verify now returns:

Status of partition reassignment:
ERROR: Assigned replicas (135,163,68,155) don't match the list of replicas for reassignment (135,163,68) for partition [product_templates,0]
Reassignment of partition [product_templates,0] failed

Why can this fail?

Thanks for looking!
S.

On Mon, Oct 19, 2015 at 9:52 PM, Szymon Sobczak <szymon.sobc...@getbase.com> wrote:

> Hi!
>
> We're running a 5-machine production Kafka cluster on version 0.8.1.1.
> Yesterday we had some disk problems on one of the replicas and decided
> to replace that node with a clean one. That's when we started
> experiencing many different problems:
>
> - partition replicas are still assigned to the old node and we can't
>   remove it from the replica list
> - replicas are lagging behind, most of the topics have only one ISR
> - most of the leaders are on a single node
> - CPU load on the machines is constantly high
>
> We've tried to rebalance the cluster by moving the leaders, decreasing
> the number of replicas, and a few other things, but it doesn't seem to
> help. In the meantime I've noticed very weird errors in kafka.log.
>
> For partition 0 of topic product_templates with the following
> description:
>
> Topic:product_templates  PartitionCount:2  ReplicationFactor:3  Configs:
>   Topic: product_templates  Partition: 0  Leader: 135  Replicas: 135,163,68  Isr: 135,68,163
>   Topic: product_templates  Partition: 1  Leader: 155  Replicas: 163,68,164  Isr: 155,68,164
>
> On machine 135 (which is the leader of product_templates,0) I see this
> in kafka.log:
>
> kafka.common.NotAssignedReplicaException: Leader 135 failed to record
> follower 155's position 0 for partition [product_templates,0] since the
> replica 155 is not recognized to be one of the assigned replicas
> 68,163,135 for partition [product_templates,0]
>
> And the complementary error on 155 - which is NOT a replica of
> product_templates,0:
>
> ERROR [ReplicaFetcherThread-0-135] 2015-10-20 04:41:47,011 Logging.scala
> kafka.server.ReplicaFetcherThread [ReplicaFetcherThread-0-135], Error for
> partition [product_templates,0] to broker 135:class
> kafka.common.UnknownException
>
> Both of those happen for multiple topics, on multiple machines. Every
> single one happens multiple times per second...
>
> How should we approach this? Any help is appreciated!
>
> Thanks!
> Szymon.
>
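P.S. For completeness, the preferred replica election for that single partition was triggered roughly like this (the file name and the ZooKeeper connect string below are placeholders, not our exact values):

    election.json:
    {"partitions":
      [{"topic": "product_templates", "partition": 0}]
    }

    # run against the cluster's ZooKeeper ensemble
    bin/kafka-preferred-replica-election.sh --zookeeper zk1:2181 \
      --path-to-json-file election.json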
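And the reassignment plus the attempted undo looked roughly like this (again, the file names and ZooKeeper address are placeholders):

    add-155.json (adds broker 155 to the replica set of product_templates,0):
    {"version": 1,
     "partitions": [{"topic": "product_templates", "partition": 0, "replicas": [135, 163, 68, 155]}]
    }

    revert.json (the original replica set, used in the attempt to undo):
    {"version": 1,
     "partitions": [{"topic": "product_templates", "partition": 0, "replicas": [135, 163, 68]}]
    }

    # first run (adding 155)
    bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
      --reassignment-json-file add-155.json --execute

    # attempted undo, and the verify that now fails
    bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
      --reassignment-json-file revert.json --execute
    bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
      --reassignment-json-file revert.json --verify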