I find this situation occurs frequently in my setup - it only takes one day -
and blam - leadership is all skewed to a single broker. Not really sure how
to recover once it happens, so if there is a solution out there I'd be
interested.
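One way to see how bad the skew and ISR shrinkage have gotten is to script it. Here's a quick sketch (function and variable names are mine, not from this thread) that parses `kafka-topics.sh --describe` output like the paste quoted below, reporting partitions whose ISR is smaller than the replica set and counting how many partitions each broker leads. For the skew itself, once replicas are back in sync, `kafka-preferred-replica-election.sh` (or the `auto.leader.rebalance.enable` broker setting) should move leadership back to each partition's preferred replica.

```python
import re
from collections import Counter

def audit_describe_output(text):
    """Parse `kafka-topics.sh --describe` output and report two things:
    partitions whose ISR is smaller than their replica set (with the
    missing brokers), and how many partitions each broker leads."""
    under_replicated = []
    leader_counts = Counter()
    for line in text.splitlines():
        m = re.search(
            r"Partition:\s*(\d+)\s+Leader:\s*(\d+)\s+"
            r"Replicas:\s*([\d,]+)\s+Isr:\s*([\d,]+)", line)
        if not m:
            continue
        partition, leader = int(m.group(1)), int(m.group(2))
        replicas = set(m.group(3).split(","))
        isr = set(m.group(4).split(","))
        leader_counts[leader] += 1
        if isr != replicas:
            # record which brokers dropped out of the ISR
            under_replicated.append((partition, sorted(replicas - isr)))
    return under_replicated, leader_counts
```

Feed it the output of `kafka-topics.sh --describe --topic migration` (piped from a shell); a heavily lopsided `leader_counts` is the skew described above.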

On Fri, Sep 12, 2014 at 12:50 PM, Cory Watson <gp...@keen.io> wrote:

> What follows is a guess on my part, but here's what I *think* was
> happening:
>
> We hit an OOM that seems to've killed some of the replica fetcher threads.
> I had a mishmash of replicas that weren't making progress as determined by
> the JMX stats for the replica. The thread for which the JMX attribute was
> named was also not running in the JVM…
>
> We ended up having to roll through the cluster and increase the heap from
> 1G to 4G. This was pretty brutal, since neither our readers (Storm spout) nor
> our writers (Python) dealt well with leadership changes.
>
> Upside is that things are hunky-dory again. This was a failure on my part
> to monitor under-replicated partitions, which would've caught this
> far sooner.
>
> On Fri, Sep 12, 2014 at 12:42 PM, Kashyap Paidimarri <kashy...@gmail.com>
> wrote:
>
> > We're seeing the same behaviour today on our cluster. It's not that a
> > single broker dropped out of the cluster; rather, a few partitions seem
> > to be lagging on every broker.
> >
> > On Fri, Sep 12, 2014 at 9:31 PM, Cory Watson <gp...@keen.io> wrote:
> >
> > > I noticed this morning that a few of our partitions do not have their
> > > full complement of ISRs:
> > >
> > > Topic:migration PartitionCount:16 ReplicationFactor:3
> > > Configs:retention.bytes=32985348833280
> > > Topic: migration Partition: 0 Leader: 1 Replicas: 1,4,5 Isr: 1,5,4
> > > Topic: migration Partition: 1 Leader: 1 Replicas: 2,5,1 Isr: 1,5
> > > Topic: migration Partition: 2 Leader: 1 Replicas: 3,1,2 Isr: 1,2
> > > Topic: migration Partition: 3 Leader: 4 Replicas: 4,2,3 Isr: 4,2
> > > Topic: migration Partition: 4 Leader: 5 Replicas: 5,3,4 Isr: 3,5,4
> > > Topic: migration Partition: 5 Leader: 1 Replicas: 1,5,2 Isr: 1,5
> > > Topic: migration Partition: 6 Leader: 2 Replicas: 2,1,3 Isr: 1,2
> > > Topic: migration Partition: 7 Leader: 3 Replicas: 3,2,4 Isr: 2,4,3
> > > Topic: migration Partition: 8 Leader: 4 Replicas: 4,3,5 Isr: 4,5
> > > Topic: migration Partition: 9 Leader: 5 Replicas: 5,4,1 Isr: 1,5,4
> > > Topic: migration Partition: 10 Leader: 1 Replicas: 1,2,3 Isr: 1,2
> > > Topic: migration Partition: 11 Leader: 2 Replicas: 2,3,4 Isr: 2,3,4
> > > Topic: migration Partition: 12 Leader: 3 Replicas: 3,4,5 Isr: 3,4,5
> > > Topic: migration Partition: 13 Leader: 4 Replicas: 4,5,1 Isr: 1,5,4
> > > Topic: migration Partition: 14 Leader: 5 Replicas: 5,1,2 Isr: 1,2,5
> > > Topic: migration Partition: 15 Leader: 1 Replicas: 1,3,4 Isr: 1,4
> > >
> > > I'm a bit confused: these partitions have only 2 ISRs, yet the broker
> > > missing from the ISR is leading other, healthy partitions.
> > >
> > > What is the appropriate way to kick a broker into re-syncing? I see lots
> > > of chatter in the docs and on the mailing list about watching for this,
> > > but from what I can find replicas are supposed to come back into sync on
> > > their own. Mine aren't.
> > >
> > > I considered just restarting the affected brokers (3 and 2 in this
> > > example) but thought I'd ask first.
> > > but thought I'd ask first.
> > >
> > > --
> > > Cory Watson
> > > Principal Infrastructure Engineer // Keen IO
> > >
> >
> >
> >
> > --
> > “The difference between ramen and varelse is not in the creature judged,
> > but in the creature judging. When we declare an alien species to be ramen,
> > it does not mean that *they* have passed a threshold of moral maturity.
> > It means that *we* have.”
> >
> >     —Demosthenes, *Letter to the Framlings*
> >
>
>
>
> --
> Cory Watson
> Principal Infrastructure Engineer // Keen IO
>
