What follows is a guess on my part, but here's what I *think* was happening:

We hit an OOM that seems to have killed some of the replica fetcher threads.
I had a mishmash of replicas that weren't making progress, as determined by
the per-replica JMX stats, and the fetcher thread that each of those JMX
attributes was named after was also no longer running in the JVM…
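
For anyone else digging into this, here's roughly what I was looking at. This
is just a sketch: it assumes JMX is exposed on port 9999 and an 0.8.x-style
MBean name, so adjust for your setup.

    # Are the fetcher threads actually alive on the broker?
    jstack <broker-pid> | grep ReplicaFetcherThread

    # Is replication making progress? MaxLag should trend toward 0.
    bin/kafka-run-class.sh kafka.tools.JmxTool \
      --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
      --object-name 'kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica'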

We ended up having to roll through the cluster and increase the heap from
1G to 4G. This was pretty brutal, since neither our readers (Storm spout) nor
our writers (Python) dealt well with leadership changes.
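
For reference, assuming a stock kafka-server-start.sh that honors
KAFKA_HEAP_OPTS, the change on each broker was roughly:

    export KAFKA_HEAP_OPTS="-Xms4g -Xmx4g"
    bin/kafka-server-start.sh config/server.properties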

The upside is that things are hunky-dory again. This was a failure on my part
to monitor the under-replicated partitions, which would've detected this
far sooner.
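
If it helps anyone else, these are the two checks we've since wired into
monitoring (again a sketch, assuming ZooKeeper on localhost and the same JMX
setup as above):

    # Lists any partition whose ISR is smaller than its replica set
    bin/kafka-topics.sh --zookeeper localhost:2181 --describe \
      --under-replicated-partitions

    # Per-broker count of under-replicated partitions; alert when it isn't 0
    bin/kafka-run-class.sh kafka.tools.JmxTool \
      --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
      --object-name 'kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions'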

On Fri, Sep 12, 2014 at 12:42 PM, Kashyap Paidimarri <kashy...@gmail.com>
wrote:

> We're seeing the same behaviour today on our cluster. It is not as though a
> single broker went out of the cluster; rather, a few partitions seem lazy on
> every broker.
>
> On Fri, Sep 12, 2014 at 9:31 PM, Cory Watson <gp...@keen.io> wrote:
>
> > I noticed this morning that a few of our partitions do not have their full
> > complement of ISRs:
> >
> > Topic:migration PartitionCount:16 ReplicationFactor:3
> > Configs:retention.bytes=32985348833280
> > Topic: migration Partition: 0 Leader: 1 Replicas: 1,4,5 Isr: 1,5,4
> > Topic: migration Partition: 1 Leader: 1 Replicas: 2,5,1 Isr: 1,5
> > Topic: migration Partition: 2 Leader: 1 Replicas: 3,1,2 Isr: 1,2
> > Topic: migration Partition: 3 Leader: 4 Replicas: 4,2,3 Isr: 4,2
> > Topic: migration Partition: 4 Leader: 5 Replicas: 5,3,4 Isr: 3,5,4
> > Topic: migration Partition: 5 Leader: 1 Replicas: 1,5,2 Isr: 1,5
> > Topic: migration Partition: 6 Leader: 2 Replicas: 2,1,3 Isr: 1,2
> > Topic: migration Partition: 7 Leader: 3 Replicas: 3,2,4 Isr: 2,4,3
> > Topic: migration Partition: 8 Leader: 4 Replicas: 4,3,5 Isr: 4,5
> > Topic: migration Partition: 9 Leader: 5 Replicas: 5,4,1 Isr: 1,5,4
> > Topic: migration Partition: 10 Leader: 1 Replicas: 1,2,3 Isr: 1,2
> > Topic: migration Partition: 11 Leader: 2 Replicas: 2,3,4 Isr: 2,3,4
> > Topic: migration Partition: 12 Leader: 3 Replicas: 3,4,5 Isr: 3,4,5
> > Topic: migration Partition: 13 Leader: 4 Replicas: 4,5,1 Isr: 1,5,4
> > Topic: migration Partition: 14 Leader: 5 Replicas: 5,1,2 Isr: 1,2,5
> > Topic: migration Partition: 15 Leader: 1 Replicas: 1,3,4 Isr: 1,4
> >
> > I'm a bit confused by partitions with only 2 ISRs, yet that same broker is
> > leading other healthy partitions.
> >
> > What is the appropriate way to kick a broker into re-syncing? I see lots of
> > chatter in the docs and on the mailing list about watching for this, but
> > from what I can find they're supposed to come back into sync on their own.
> > Mine aren't.
> >
> > I considered just restarting the affected brokers (3 and 2 in this example)
> > but thought I'd ask first.
> >
> > --
> > Cory Watson
> > Principal Infrastructure Engineer // Keen IO
> >
>
>
>
> --
> “The difference between ramen and varelse is not in the creature judged,
> but in the creature judging. When we declare an alien species to be ramen,
> it does not mean that *they* have passed a threshold of moral maturity. It
> means that *we* have.
>
>     —Demosthenes, *Letter to the Framlings*
> ”
>



-- 
Cory Watson
Principal Infrastructure Engineer // Keen IO
