Same here. We started running into similar situations almost weekly ever since we
increased the number of partitions on some topics from 6 to 15 and added 3 brokers
to our Kafka cluster.
Last night I stopped all producers and consumers, restarted the brokers and
zookeepers, and then restarted the producers/consumers. This morning I see an
endless loop of shrinking ISR -> "cached zkVersion not equal to that in
zookeeper, skip updating ISR" again:

  [2015-11-07 11:55:47,260] INFO Partition [Wmt_Saturday_234,10] on broker 0: Shrinking ISR for partition [Wmt_Saturday_234,10] from 0,1 to 0 (kafka.cluster.Partition)
  [2015-11-07 11:55:47,267] INFO Partition [Wmt_Saturday_234,10] on broker 0: Cached zkVersion [10] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)

On Tue, Oct 20, 2015 at 9:45 AM, Shaun Senecal <shaun.sene...@lithium.com> wrote:

> I can't say this is the same issue, but it sounds similar to a situation
> we experienced with Kafka 0.8.2.[1-2]. After restarting a broker, the
> cluster would never really recover (ISRs constantly changing, replication
> failing, etc.). We found the only way to fully recover the cluster was to
> stop all producers and consumers, restart the Kafka cluster, and then, once
> the cluster was back up, restart the producers/consumers. Obviously that's
> not acceptable for a production cluster, but it was the only thing we could
> find that would get us going again.
>
>
> Shaun
>
> ________________________________________
> From: Szymon Sobczak <szymon.sobc...@getbase.com>
> Sent: October 19, 2015 9:52 PM
> To: users@kafka.apache.org
> Cc: Big Data
> Subject: It's 5.41am, we're after 20+ hours of debugging our prod cluster.
> See NotAssignedReplicaException and UnknownException errors. Help?
>
> Hi!
>
> We're running a 5-machine production Kafka cluster on version 0.8.1.1.
> Yesterday we had some disk problems on one of the replicas and decided to
> replace that node with a clean one. That's when we started experiencing
> many different problems:
>
> - partition replicas are still assigned to the old node and we can't
>   remove it from the replica list
> - replicas are lagging behind, and most of the topics have only one ISR
> - most of the leaders are on a single node
> - CPU load on the machines is constantly high
>
> We've tried to rebalance the cluster by moving the leaders, decreasing the
> number of replicas, and a few other things, but none of it seems to help.
> In the meantime I've noticed some very weird errors in kafka.log.
>
> For partition 0 of topic product_templates, with the following description:
>
>   Topic:product_templates  PartitionCount:2  ReplicationFactor:3  Configs:
>     Topic: product_templates  Partition: 0  Leader: 135  Replicas: 135,163,68  Isr: 135,68,163
>     Topic: product_templates  Partition: 1  Leader: 155  Replicas: 163,68,164  Isr: 155,68,164
>
> On machine 135 (which is the leader of product_templates,0), kafka.log
> shows:
>
>   kafka.common.NotAssignedReplicaException: Leader 135 failed to record
>   follower 155's position 0 for partition [product_templates,0] since the
>   replica 155 is not recognized to be one of the assigned replicas
>   68,163,135 for partition [product_templates,0]
>
> And the complementary error on 155, which is NOT a replica of
> product_templates,0:
>
>   ERROR [ReplicaFetcherThread-0-135] 2015-10-20 04:41:47,011 Logging.scala
>   kafka.server.ReplicaFetcherThread [ReplicaFetcherThread-0-135], Error for
>   partition [product_templates,0] to broker 135:class
>   kafka.common.UnknownException
>
> Both of these happen for multiple topics, on multiple machines, and every
> single one happens multiple times per second...
>
> How should we approach this? Any help is appreciated!
>
> Thanks!
> Szymon.
>
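In case it helps anyone hitting the same loop: a rough way to see what ZooKeeper
currently holds for a partition (leader, ISR, and the znode's dataVersion, which
is the zkVersion the broker's cache is being compared against) is to read the
state znodes with the zookeeper-shell.sh that ships with Kafka. The ZooKeeper
host/port below is just a placeholder for our setup, and the commands assume you
run them from the Kafka install directory:

  # who ZooKeeper currently thinks the controller is
  bin/zookeeper-shell.sh zk1:2181 get /controller

  # assigned replicas per partition (should match "Replicas" in --describe)
  bin/zookeeper-shell.sh zk1:2181 get /brokers/topics/Wmt_Saturday_234

  # current leader/ISR for one partition; the Stat printed after the JSON
  # includes dataVersion, i.e. the zkVersion the brokers compare against
  bin/zookeeper-shell.sh zk1:2181 get /brokers/topics/Wmt_Saturday_234/partitions/10/state

If the state znode disagrees with what a broker is logging, one less drastic step
that has been suggested for this kind of flapping is restarting only the broker
that /controller points at, so that a fresh controller gets elected and pushes
out new metadata, rather than bouncing the whole cluster.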
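And on the "replicas are still assigned to the old node" part of Szymon's
original mail: the only supported way I know of on 0.8.x to change a partition's
replica list is kafka-reassign-partitions.sh with a hand-written JSON file. A
minimal sketch follows; the target replica list [163,68,135] and the ZooKeeper
address are made up for illustration, the point is just the shape of the file
and the --execute / --verify steps:

  cat > reassign.json <<'EOF'
  {"version": 1,
   "partitions": [
     {"topic": "product_templates", "partition": 1, "replicas": [163,68,135]}
   ]}
  EOF

  # kick off the reassignment
  bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
    --reassignment-json-file reassign.json --execute

  # re-run until every partition reports "completed successfully"
  bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
    --reassignment-json-file reassign.json --verify

Once the replica lists no longer reference the dead broker, running
bin/kafka-preferred-replica-election.sh --zookeeper zk1:2181 should move leaders
back to the preferred (first-listed) replica of each partition instead of
leaving most of them piled up on a single node.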