Hi! We're running a 5-machine production Kafka cluster on version 0.8.1.1. Yesterday we had some disk problems on one of the brokers and decided to replace that node with a clean one. That's when we started experiencing a number of different problems:
- partition replicas are still assigned to the old node and we can't remove it from the replica list
- replicas are lagging behind, and most of the topics have only one ISR
- most of the leaders are on a single node
- CPU load on the machines is constantly high

We've tried to rebalance the cluster by moving the leaders, decreasing the number of replicas, and a few other things, but nothing seems to help (a rough sketch of the reassignment we attempted is at the end of this post). In the meantime I've noticed very strange errors in kafka.log. They concern, for example, partition 0 of topic product_templates, which is described as follows:

    Topic:product_templates  PartitionCount:2  ReplicationFactor:3  Configs:
        Topic: product_templates  Partition: 0  Leader: 135  Replicas: 135,163,68  Isr: 135,68,163
        Topic: product_templates  Partition: 1  Leader: 155  Replicas: 163,68,164  Isr: 155,68,164

On machine 135 (which is the leader of product_templates,0) I see in kafka.log:

    kafka.common.NotAssignedReplicaException: Leader 135 failed to record follower 155's position 0 for partition [product_templates,0] since the replica 155 is not recognized to be one of the assigned replicas 68,163,135 for partition [product_templates,0]

And the complementary error on 155, which is NOT a replica of product_templates,0:

    ERROR [ReplicaFetcherThread-0-135] 2015-10-20 04:41:47,011 Logging.scala kafka.server.ReplicaFetcherThread [ReplicaFetcherThread-0-135], Error for partition [product_templates,0] to broker 135:class kafka.common.UnknownException

Both of those errors occur for multiple topics, on multiple machines, and each one is logged multiple times per second...

How should we approach this? Any help is appreciated!

Thanks!
Szymon.
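
P.S. For reference, the reassignment we attempted looked roughly like the following (the ZooKeeper address, file name and target broker IDs here are only illustrative, not our exact values). First a JSON file describing the desired replica assignment, intended to move the partition off the dead broker:

    {"version": 1,
     "partitions": [
       {"topic": "product_templates", "partition": 0, "replicas": [163, 68, 164]}
     ]}

and then:

    bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
        --reassignment-json-file reassign.json --execute
    bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
        --reassignment-json-file reassign.json --verify

    # afterwards, to move leaders back to the preferred replicas:
    bin/kafka-preferred-replica-election.sh --zookeeper zk1:2181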