Re: Getting replicas back in sync

2014-09-12 Thread Joe Stein
Hey Stephen, two things on that. 1) You need to figure out the root cause that is making the leader election occur. It could be that the brokers are having ZK timeouts and leader election is occurring as a result... if so you need to dig into why (look at all your logs... You should look for some type of fl

Re: Getting replicas back in sync

2014-09-12 Thread Stephen Sprague
i find this situation occurs frequently in my setup - only takes one day - and blam - the leader board is all skewed to a single one. not really sure how to overcome that once it happens, so if there is a solution out there i'd be interested. On Fri, Sep 12, 2014 at 12:50 PM, Cory Watson wrote: > Wh

Re: Getting replicas back in sync

2014-09-12 Thread Cory Watson
What follows is a guess on my part, but here's what I *think* was happening: We hit an OOM that seems to've killed some of the replica fetcher threads. I had a mishmash of replicas that weren't making progress as determined by the JMX stats for the replica. The thread for which the JMX attribute w

Re: Getting replicas back in sync

2014-09-12 Thread Kashyap Paidimarri
We're seeing the same behaviour today on our cluster. It is not like a single broker went out of the cluster, rather a few partitions seem lazy on every broker. On Fri, Sep 12, 2014 at 9:31 PM, Cory Watson wrote: > I noticed this morning that a few of our partitions do not have their full > comp

Getting replicas back in sync

2014-09-12 Thread Cory Watson
I noticed this morning that a few of our partitions do not have their full complement of ISRs:

Topic:migration PartitionCount:16 ReplicationFactor:3 Configs:retention.bytes=32985348833280
Topic: migration Partition: 0 Leader: 1 Replicas: 1,4,5 Isr: 1,5,4
Topic: migration Partition: 1 Leader: 1 Rep
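
For anyone reading this thread later: spotting partitions that are short on ISRs by eye gets tedious with 16 partitions, so here is a rough sketch of how the describe output above could be scanned. It is only an illustration, not anything from the original poster. The function names and the second sample row are made up for the example; only the first sample row is taken from the output quoted above.

```python
import re

# Matches one partition row of the describe output quoted above. The summary
# row ("Topic:migration PartitionCount:16 ...") is skipped because it has no
# "Partition:" field.
PARTITION_ROW = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def parse_ids(csv_ids):
    """Turn a string such as '1,4,5' into a set of broker ids."""
    return {int(x) for x in csv_ids.split(",") if x}


def under_replicated(describe_output):
    """Return (topic, partition, missing_broker_ids) for each partition
    whose ISR does not contain every assigned replica."""
    results = []
    for m in PARTITION_ROW.finditer(describe_output):
        missing = parse_ids(m["replicas"]) - parse_ids(m["isr"])
        if missing:
            results.append((m["topic"], int(m["partition"]), sorted(missing)))
    return results


# First partition row is from the post; the second is HYPOTHETICAL, invented
# to show a partition whose ISR is missing a replica.
SAMPLE = """\
Topic:migration PartitionCount:16 ReplicationFactor:3 Configs:retention.bytes=32985348833280
Topic: migration Partition: 0 Leader: 1 Replicas: 1,4,5 Isr: 1,5,4
Topic: migration Partition: 1 Leader: 1 Replicas: 1,5,2 Isr: 1,5
"""

if __name__ == "__main__":
    for topic, partition, missing in under_replicated(SAMPLE):
        print(topic, partition, "missing from ISR:", missing)
```

The comparison uses sets, so the differing order of Replicas (1,4,5) and Isr (1,5,4) in partition 0 does not count as a mismatch. If your version of the topics tool supports it, describing with --under-replicated-partitions should give a similar list without any parsing.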