What I tried so far:

- reassigning the leader to another machine:
  - found a partition where the leader was not the first replica and the error appeared
  - ran the kafka-preferred-replica-election.sh script for that partition (the exact command is sketched at the end of this mail)
  - checked the logs of the new leader - the same NotAssignedReplicaException errors started appearing there
  - checked the logs of the stubborn non-replica - the same UnknownException was still appearing, but it now referenced the new leader
- adding the stubborn follower to Replicas:
  - ran the kafka-reassign-partitions.sh script to add it to Replicas (the JSON and commands are also sketched at the end of this mail)
  - ran kafka-topics.sh --describe to make sure it was added - it was
  - checked the logs of the stubborn non-replica - the same UnknownException was still appearing
  - checked the leader logs - now I see bigger errors (http://pastebin.com/uSRrXa8A), related to another partition and causing the entire request to fail

Now I cannot undo adding 155 to the replica list. I ran kafka-reassign-partitions.sh again with the original description of the partition, and running --verify now returns:

Status of partition reassignment:
ERROR: Assigned replicas (135,163,68,155) don't match the list of replicas for reassignment (135,163,68) for partition [product_templates,0]
Reassignment of partition [product_templates,0] failed

Why can this fail?

Thanks for looking!
S.

On Mon, Oct 19, 2015 at 9:52 PM, Szymon Sobczak <szymon.sobc...@getbase.com> wrote:

> Hi!
>
> We're running a 5-machine production Kafka cluster on version 0.8.1.1.
> Yesterday we had some disk problems on one of the replicas and decided
> to replace that node with a clean one. That's when we started
> experiencing many different problems:
>
> - partition replicas are still assigned to the old node and we can't
>   remove it from the replica list
> - replicas are lagging behind, most of the topics have only one ISR
> - most of the leaders are on a single node
> - CPU load on the machines is constantly high
>
> We've tried to rebalance the cluster by moving the leaders, decreasing
> the number of replicas, and a few other things, but it doesn't seem to
> help. In the meantime I've noticed very weird errors in kafka.log.
>
> For partition 0 of topic product_templates with the following
> description:
>
> Topic:product_templates  PartitionCount:2  ReplicationFactor:3  Configs:
>   Topic: product_templates  Partition: 0  Leader: 135  Replicas: 135,163,68  Isr: 135,68,163
>   Topic: product_templates  Partition: 1  Leader: 155  Replicas: 163,68,164  Isr: 155,68,164
>
> On machine 135 (which is the leader of product_templates,0) I see this
> in kafka.log:
>
> kafka.common.NotAssignedReplicaException: Leader 135 failed to record
> follower 155's position 0 for partition [product_templates,0] since the
> replica 155 is not recognized to be one of the assigned replicas
> 68,163,135 for partition [product_templates,0]
>
> And the complementary error on 155 - which is NOT a replica of
> product_templates,0:
>
> ERROR [ReplicaFetcherThread-0-135] 2015-10-20 04:41:47,011 Logging.scala
> kafka.server.ReplicaFetcherThread [ReplicaFetcherThread-0-135], Error for
> partition [product_templates,0] to broker 135:class
> kafka.common.UnknownException
>
> Both of those happen for multiple topics, on multiple machines. Every
> single one happens multiple times per second...
>
> How should we approach this? Any help is appreciated!
>
> Thanks!
> Szymon.
>
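P.S. For completeness, the preferred replica election for that single partition was triggered roughly like this (the file name and the ZooKeeper connect string below are placeholders, not our exact values):

    election.json:
    {"partitions":
      [{"topic": "product_templates", "partition": 0}]
    }

    # run against the cluster's ZooKeeper ensemble
    bin/kafka-preferred-replica-election.sh --zookeeper zk1:2181 \
      --path-to-json-file election.json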
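And the reassignment plus the attempted undo looked roughly like this (again, the file names and ZooKeeper address are placeholders):

    add-155.json (adds broker 155 to the replica set of product_templates,0):
    {"version": 1,
     "partitions": [{"topic": "product_templates", "partition": 0, "replicas": [135, 163, 68, 155]}]
    }

    revert.json (the original replica set, used in the attempt to undo):
    {"version": 1,
     "partitions": [{"topic": "product_templates", "partition": 0, "replicas": [135, 163, 68]}]
    }

    # first run (adding 155)
    bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
      --reassignment-json-file add-155.json --execute

    # attempted undo, and the verify that now fails
    bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
      --reassignment-json-file revert.json --execute
    bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
      --reassignment-json-file revert.json --verify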