Hey Stephen, two things on that.

1) You need to figure out the root cause of the leader elections. It could be
that the brokers are hitting ZK timeouts and leader election is occurring as a
result. If so, you need to dig into why: look at all your logs, and look for
some type of flapping in your monitoring system metrics that matches the time
the leader change happens.

2) After this does happen, you can run bin/kafka-preferred-replica-election.sh
--zookeeper $zklist, which will make the preferred replicas the leaders again
for every topic in the entire cluster.
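
If you want to check whether you actually need to run it, here is a rough
sketch (not an official tool) that reads the text printed by the topic
describe command, in the same format as the "migration" listing quoted below,
and reports which partitions are off their preferred leader (the first broker
in the Replicas list) and which are under-replicated. The function names are
made up for illustration, and the line format is assumed from the output in
this thread; other Kafka versions may print it differently.

```python
import re
from collections import Counter

# One partition line of the topic describe output, in the format quoted
# further down this thread (assumed; other Kafka versions may differ).
PARTITION_RE = re.compile(
    r"Topic:\s*(\S+)\s+Partition:\s*(\d+)\s+Leader:\s*(-?\d+)\s+"
    r"Replicas:\s*([\d,]+)\s+Isr:\s*([\d,]*)"
)


def parse_describe(text):
    """Return a list of dicts, one per partition line in the describe output."""
    partitions = []
    for line in text.splitlines():
        match = PARTITION_RE.search(line)
        if match is None:
            continue  # e.g. the per-topic summary line
        topic, partition, leader, replicas, isr = match.groups()
        partitions.append({
            "topic": topic,
            "partition": int(partition),
            "leader": int(leader),
            "replicas": [int(b) for b in replicas.split(",") if b],
            "isr": [int(b) for b in isr.split(",") if b],
        })
    return partitions


def not_on_preferred_leader(partitions):
    """Partitions whose leader is not the first broker in the replica list."""
    return [p for p in partitions if p["leader"] != p["replicas"][0]]


def under_replicated(partitions):
    """Partitions whose ISR is missing at least one assigned replica."""
    return [p for p in partitions if set(p["isr"]) < set(p["replicas"])]


def leaders_per_broker(partitions):
    """Count how many partitions each broker currently leads."""
    return Counter(p["leader"] for p in partitions)
```

Run against the "migration" listing below, this would flag partitions 1 and 2
as not on their preferred leader and eight partitions as under-replicated. A
lopsided leaders_per_broker count is the skew Stephen describes.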

/*******************************************
 Joe Stein
 Founder, Principal Consultant
 Big Data Open Source Security LLC
 http://www.stealth.ly
 Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
********************************************/


On Fri, Sep 12, 2014 at 9:04 PM, Stephen Sprague <sprag...@gmail.com> wrote:

> i find this situation occurs frequently in my setup - only takes one day -
> and blam - the leader board is all skewed to a single one.  not really sure
> how to overcome that once it happens so if there is a solution out there i'd be
> interested.
>
> On Fri, Sep 12, 2014 at 12:50 PM, Cory Watson <gp...@keen.io> wrote:
>
> > What follows is a guess on my part, but here's what I *think* was
> > happening:
> >
> > We hit an OOM that seems to've killed some of the replica fetcher threads.
> > I had a mishmash of replicas that weren't making progress as determined by
> > the JMX stats for the replica. The thread for which the JMX attribute was
> > named was also not running in the JVM…
> >
> > We ended up having to roll through the cluster and increase the heap from
> > 1G to 4G. This was pretty brutal since neither our readers (storm spout) nor
> > our writers (python) dealt well with leadership changes.
> >
> > Upside is that things are hunky dory again. This was a failure on my part
> > to monitor the under replicated partitions, which would've detected this
> > far sooner.
> >
> > On Fri, Sep 12, 2014 at 12:42 PM, Kashyap Paidimarri <kashy...@gmail.com>
> > wrote:
> >
> > > We're seeing the same behaviour today on our cluster. It is not like a
> > > single broker went out of the cluster, rather a few partitions seem lazy
> > > on every broker.
> > >
> > > On Fri, Sep 12, 2014 at 9:31 PM, Cory Watson <gp...@keen.io> wrote:
> > >
> > > > I noticed this morning that a few of our partitions do not have their
> > > > full complement of ISRs:
> > > >
> > > > Topic:migration PartitionCount:16 ReplicationFactor:3 Configs:retention.bytes=32985348833280
> > > > Topic: migration Partition: 0 Leader: 1 Replicas: 1,4,5 Isr: 1,5,4
> > > > Topic: migration Partition: 1 Leader: 1 Replicas: 2,5,1 Isr: 1,5
> > > > Topic: migration Partition: 2 Leader: 1 Replicas: 3,1,2 Isr: 1,2
> > > > Topic: migration Partition: 3 Leader: 4 Replicas: 4,2,3 Isr: 4,2
> > > > Topic: migration Partition: 4 Leader: 5 Replicas: 5,3,4 Isr: 3,5,4
> > > > Topic: migration Partition: 5 Leader: 1 Replicas: 1,5,2 Isr: 1,5
> > > > Topic: migration Partition: 6 Leader: 2 Replicas: 2,1,3 Isr: 1,2
> > > > Topic: migration Partition: 7 Leader: 3 Replicas: 3,2,4 Isr: 2,4,3
> > > > Topic: migration Partition: 8 Leader: 4 Replicas: 4,3,5 Isr: 4,5
> > > > Topic: migration Partition: 9 Leader: 5 Replicas: 5,4,1 Isr: 1,5,4
> > > > Topic: migration Partition: 10 Leader: 1 Replicas: 1,2,3 Isr: 1,2
> > > > Topic: migration Partition: 11 Leader: 2 Replicas: 2,3,4 Isr: 2,3,4
> > > > Topic: migration Partition: 12 Leader: 3 Replicas: 3,4,5 Isr: 3,4,5
> > > > Topic: migration Partition: 13 Leader: 4 Replicas: 4,5,1 Isr: 1,5,4
> > > > Topic: migration Partition: 14 Leader: 5 Replicas: 5,1,2 Isr: 1,2,5
> > > > Topic: migration Partition: 15 Leader: 1 Replicas: 1,3,4 Isr: 1,4
> > > >
> > > > I'm a bit confused by partitions with only 2 ISRs, yet that same broker
> > > > is leading other healthy partitions.
> > > >
> > > > What is the appropriate way to kick a broker into re-syncing? I see lots
> > > > of chatter on docs and the mailing list about watching for this but from
> > > > what I can find it's supposed to come back in to sync. Mine aren't.
> > > >
> > > > I considered just restarting the affected brokers (3 and 2 in this
> > > > example) but thought I'd ask first.
> > > >
> > > > --
> > > > Cory Watson
> > > > Principal Infrastructure Engineer // Keen IO
> > > >
> > >
> > >
> > >
> > > --
> > > “The difference between ramen and varelse is not in the creature judged,
> > > but in the creature judging. When we declare an alien species to be ramen,
> > > it does not mean that *they* have passed a threshold of moral maturity. It
> > > means that *we* have.
> > >
> > >     —Demosthenes, *Letter to the Framlings*
> > > ”
> > >
> >
> >
> >
> > --
> > Cory Watson
> > Principal Infrastructure Engineer // Keen IO
> >
>
