Hi all,

This has happened a couple of times to me now in the past month, and I’m not 
entirely sure of the cause, although I have a suspicion.

Early this morning (UTC), it looks like one of my two brokers (id 21) lost its 
connection to Zookeeper for a very short period of time.  This caused the 
second broker (id 22) to quickly become the leader for all partitions.  Once 
broker 21 was able to re-establish its Zookeeper connection, it noticed that it 
had a stale ISR list, fetched the updated one, and started replicating 
from broker 22 for all partitions.  Broker 21 then quickly rejoined the ISR, 
but annoyingly (if expectedly), broker 22 remained the leader.  All of this 
happened in under a minute.

I’m wondering if https://issues.apache.org/jira/browse/KAFKA-766 is related.  
The current batch size on our producers is 6000 msgs or 1000 ms (I’ve been 
meaning to reduce this).  We do about 6000 msgs per second per producer, and 
have 10 partitions in the relevant topic.  A couple of days ago, we noticed 
flapping ISR Shrink/Expand logs, so I upped replica.lag.max.messages to 10000, 
so that it would surely be above our batch size.  I still occasionally see 
flapping ISR Shrinks/Expands, but hope that when I reduce the producer batch 
size, I will stop seeing these.
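
A rough sketch of the arithmetic behind that threshold, in case it helps 
frame the question.  It assumes the worst case, where every batch in flight 
lands on the same partition leader before the follower fetches again.  The 
function names and the concurrent_batches parameter are hypothetical, made up 
for illustration; only the 6000 and 10000 figures come from this mail.

```python
# Back-of-the-envelope check, not Kafka code: all names here are hypothetical.

def worst_case_burst(batch_size, concurrent_batches=1):
    """Messages that could hit one partition leader at once, assuming every
    batch in flight targets the same partition (a worst-case assumption)."""
    return batch_size * concurrent_batches


def may_exceed_lag(batch_size, replica_lag_max_messages, concurrent_batches=1):
    """True if such a burst alone could push a follower past the lag limit."""
    return worst_case_burst(batch_size, concurrent_batches) > replica_lag_max_messages


BATCH_SIZE = 6000   # producer batch size from this mail
LAG_MAX = 10000     # replica.lag.max.messages after the change

for n in (1, 2):
    print(n, may_exceed_lag(BATCH_SIZE, LAG_MAX, concurrent_batches=n))
```

With these numbers a single batch stays under the limit, but two batches 
arriving at the same partition together (12000 msgs) would not, which might 
be one reason the flapping has not gone away completely.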

Anyway, I’m not entirely sure what happened here.  Could flapping ISRs 
potentially cause this?

For reference, the relevant logs from my brokers and one Zookeeper node are here: 
https://gist.github.com/ottomata/9139443

Thanks!
-Andrew Otto

