Have you looked at https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-Whypartitionleadersmigratethemselvessometimes ?
Thanks,

Jun

On Sat, Feb 22, 2014 at 2:06 PM, Andrew Otto <o...@wikimedia.org> wrote:

> Yeah, I can do that, but I'd prefer if the first broker didn't drop out of
> the ISR in the first place. Just trying to figure out why it did...
>
>
> On Feb 21, 2014, at 11:30 PM, Jun Rao <jun...@gmail.com> wrote:
>
> > So, it sounds like you want the leader to be moved back to the failed
> > broker that has caught up. For now, you can use this tool (
> > https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools#Replicationtools-2.PreferredReplicaLeaderElectionTool
> > ). In the 0.8.1 release, we have an option to balance the leaders
> > automatically every configurable period of time.
> >
> > Thanks,
> >
> > Jun
> >
> >
> > On Fri, Feb 21, 2014 at 10:22 AM, Andrew Otto <o...@wikimedia.org> wrote:
> >
> >> Hi all,
> >>
> >> This has happened a couple of times to me now in the past month, and I'm
> >> not entirely sure of the cause, although I have a suspicion.
> >>
> >> Early this morning (UTC), it looks like one of my two brokers (id 21)
> >> lost its connection to Zookeeper for a very short period of time. This
> >> caused the second broker (id 22) to quickly become the leader for all
> >> partitions. Once broker 21 was able to re-establish its Zookeeper
> >> connection, it noticed that it had a stale ISR list, fetched the updated
> >> list, and started replicating from broker 22 for all partitions. Broker
> >> 21 then quickly rejoined the ISR, but, annoyingly (though expectedly),
> >> broker 22 remained the leader. All of this happened in under a minute.
> >>
> >> I'm wondering if https://issues.apache.org/jira/browse/KAFKA-766 is
> >> related. The current batch size on our producers is 6000 msgs or 1000 ms
> >> (I've been meaning to reduce this). We do about 6000 msgs per second per
> >> producer, and have 10 partitions in the relevant topic. A couple of days
> >> ago, we noticed flapping ISR Shrink/Expand logs, so I upped
> >> replica.lag.max.messages to 10000 so that it would surely be above our
> >> batch size. I still occasionally see flapping ISR Shrinks/Expands, but I
> >> hope that when I reduce the producer batch size, I will stop seeing these.
> >>
> >> Anyway, I'm not entirely sure what happened here. Could flapping ISRs
> >> potentially cause this?
> >>
> >> For reference, the relevant logs from my brokers and a zookeeper are here:
> >> https://gist.github.com/ottomata/9139443
> >>
> >> Thanks!
> >> -Andrew Otto
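
For reference, a minimal sketch of the two options Jun describes: manually triggering a preferred replica election with the bundled script, or (on 0.8.1 and later) letting the brokers rebalance leadership automatically. The ZooKeeper connect string and the interval/threshold values below are placeholders, not taken from this thread:

    # Move leadership back to each partition's preferred (first-assigned)
    # replica, e.g. broker 21 once it has rejoined the ISR.
    # zk1.example.com:2181 is a placeholder ZooKeeper connect string.
    bin/kafka-preferred-replica-election.sh --zookeeper zk1.example.com:2181

    # Or, on 0.8.1+, in server.properties (illustrative values):
    auto.leader.rebalance.enable=true
    leader.imbalance.check.interval.seconds=300
    leader.imbalance.per.broker.percentage=10

Note that the script only initiates the election by writing a request to ZooKeeper; the controller then performs the actual leader moves asynchronously.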
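
And a sketch of the config interplay behind the ISR flapping Andrew mentions: with the old async producer flushing up to 6000 messages per request, replica.lag.max.messages needs to stay comfortably above the batch size, otherwise a single large produce request can put a follower past the lag threshold and shrink the ISR (the behavior he suspects is related to KAFKA-766). Property names are the 0.8-era ones; the values mirror the numbers in the thread:

    # producer.properties (0.8 Scala producer)
    producer.type=async
    batch.num.messages=6000      # the batch size Andrew plans to reduce
    queue.buffering.max.ms=1000  # flush at least once per second

    # server.properties (broker)
    replica.lag.max.messages=10000  # keep well above the largest producer batch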