I find this situation occurs frequently in my setup. It only takes one day and, blam, leadership is all skewed to a single broker. I'm not really sure how to recover once it happens, so if there is a solution out there I'd be interested.
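A commonly suggested fix for that skew, assuming a ZooKeeper-backed Kafka like the one in this thread, is to trigger a preferred replica election so leadership moves back to the first-listed replica of each partition. A minimal sketch (the ZooKeeper address is a placeholder):

    # Ask the controller to move each partition's leadership back to its
    # preferred (first-listed) replica. zk1:2181 is a placeholder address.
    bin/kafka-preferred-replica-election.sh --zookeeper zk1:2181

Setting auto.leader.rebalance.enable=true on the brokers is meant to do this automatically, though how aggressively it kicks in varies by version.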
On Fri, Sep 12, 2014 at 12:50 PM, Cory Watson <gp...@keen.io> wrote:

> What follows is a guess on my part, but here's what I *think* was
> happening:
>
> We hit an OOM that seems to've killed some of the replica fetcher
> threads. I had a mishmash of replicas that weren't making progress, as
> determined by the JMX stats for the replica. The thread for which the
> JMX attribute was named was also not running in the JVM…
>
> We ended up having to roll through the cluster and increase the heap
> from 1G to 4G. This was pretty brutal, since neither our readers (Storm
> spout) nor our writers (Python) dealt well with leadership changes.
>
> Upside is that things are hunky-dory again. This was a failure on my
> part to monitor the under-replicated partitions, which would've
> detected this far sooner.
>
> On Fri, Sep 12, 2014 at 12:42 PM, Kashyap Paidimarri <kashy...@gmail.com> wrote:
>
> > We're seeing the same behaviour today on our cluster. It is not as
> > though a single broker went out of the cluster; rather, a few
> > partitions seem lazy on every broker.
> >
> > On Fri, Sep 12, 2014 at 9:31 PM, Cory Watson <gp...@keen.io> wrote:
> >
> > > I noticed this morning that a few of our partitions do not have
> > > their full complement of ISRs:
> > >
> > > Topic:migration PartitionCount:16 ReplicationFactor:3 Configs:retention.bytes=32985348833280
> > >   Topic: migration Partition: 0  Leader: 1 Replicas: 1,4,5 Isr: 1,5,4
> > >   Topic: migration Partition: 1  Leader: 1 Replicas: 2,5,1 Isr: 1,5
> > >   Topic: migration Partition: 2  Leader: 1 Replicas: 3,1,2 Isr: 1,2
> > >   Topic: migration Partition: 3  Leader: 4 Replicas: 4,2,3 Isr: 4,2
> > >   Topic: migration Partition: 4  Leader: 5 Replicas: 5,3,4 Isr: 3,5,4
> > >   Topic: migration Partition: 5  Leader: 1 Replicas: 1,5,2 Isr: 1,5
> > >   Topic: migration Partition: 6  Leader: 2 Replicas: 2,1,3 Isr: 1,2
> > >   Topic: migration Partition: 7  Leader: 3 Replicas: 3,2,4 Isr: 2,4,3
> > >   Topic: migration Partition: 8  Leader: 4 Replicas: 4,3,5 Isr: 4,5
> > >   Topic: migration Partition: 9  Leader: 5 Replicas: 5,4,1 Isr: 1,5,4
> > >   Topic: migration Partition: 10 Leader: 1 Replicas: 1,2,3 Isr: 1,2
> > >   Topic: migration Partition: 11 Leader: 2 Replicas: 2,3,4 Isr: 2,3,4
> > >   Topic: migration Partition: 12 Leader: 3 Replicas: 3,4,5 Isr: 3,4,5
> > >   Topic: migration Partition: 13 Leader: 4 Replicas: 4,5,1 Isr: 1,5,4
> > >   Topic: migration Partition: 14 Leader: 5 Replicas: 5,1,2 Isr: 1,2,5
> > >   Topic: migration Partition: 15 Leader: 1 Replicas: 1,3,4 Isr: 1,4
> > >
> > > I'm a bit confused by partitions with only 2 ISRs, yet that same
> > > broker is leading other healthy partitions.
> > >
> > > What is the appropriate way to kick a broker into re-syncing? I see
> > > lots of chatter in the docs and on the mailing list about watching
> > > for this, but from what I can find it's supposed to come back into
> > > sync. Mine aren't.
> > >
> > > I considered just restarting the affected brokers (3 and 2 in this
> > > example) but thought I'd ask first.
> > >
> > > --
> > > Cory Watson
> > > Principal Infrastructure Engineer // Keen IO
>
> --
> Cory Watson
> Principal Infrastructure Engineer // Keen IO
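For the under-replicated-partition monitoring Cory mentions, two checks are commonly used; both of the sketches below use placeholder addresses:

    # List only partitions whose ISR is smaller than their replica list.
    # zk1:2181 is a placeholder ZooKeeper address.
    bin/kafka-topics.sh --zookeeper zk1:2181 --describe \
        --under-replicated-partitions

    # The same signal is exposed per broker over JMX; alerting when the
    # Value attribute of this MBean stays above zero would catch stalled
    # replica fetchers like the ones in this thread far sooner:
    #   kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions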