Re: ISR churn

2017-03-23 Thread Radu Radutiu
I see no errors related to zookeeper. I searched all the logs (kafka and zookeeper) on all 4 servers for all entries in the minute of the ISR change at 08:23:54. Here are the logs: Node n1 kafka_2.12-0.10.2.0/logs/state-change.log:[2017-03-23 08:23:55,151] TRACE Broker 1 cached leader info (Le

Re: ISR churn

2017-03-22 Thread David Garcia
Sure… there are two types of purgatories: Consumer and Producer. Consumer purgatory (for a partition leader) is a queue for pending requests for data (i.e. polling by some client for the respective partition). It’s basically a waiting area for poll requests. Generally speaking, the more consumer
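For reference, the purgatory sizes discussed here are exposed as broker JMX gauges. Below is a minimal Java sketch that reads the Produce and Fetch purgatory sizes; the host/port (localhost:9999) and the surrounding class are assumptions for illustration, not from the thread:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class PurgatorySizeCheck {
        public static void main(String[] args) throws Exception {
            // Assumes the broker was started with JMX enabled on port 9999 (e.g. JMX_PORT=9999).
            JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            for (String op : new String[]{"Produce", "Fetch"}) {
                // Standard DelayedOperationPurgatory gauge exposed by the broker.
                ObjectName gauge = new ObjectName(
                    "kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=" + op);
                System.out.println(op + " purgatory size: " + mbsc.getAttribute(gauge, "Value"));
            }
            connector.close();
        }
    }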

Re: ISR churn

2017-03-22 Thread Jun MA
Hi David, I checked our cluster, and the producer purgatory size is mostly under 3. But I don’t quite understand this metric; could you please explain it a little bit? Thanks, Jun > On Mar 22, 2017, at 3:07 PM, David Garcia wrote: > > producer purgatory size

Re: ISR churn

2017-03-22 Thread James Cheng
Marcos, Radu, Are either of you seeing messages saying "Cached zkVersion [...] not equal to that in zookeeper"? If so, you may be hitting https://issues.apache.org/jira/browse/KAFKA-3042 Not sure if that helps you, since I haven't been able i

Re: ISR churn

2017-03-22 Thread David Garcia
Look at producer purgatory size. Anything greater than 10 is bad (in my experience). Keeping it under 4 seemed to help us (i.e. if a broker is getting slammed with writes, use the rebalance tools or add a new broker). Also check network latency and/or adjust the timeout for ISR checking. If on AW
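For context, the broker setting that governs how long a lagging or idle follower may remain in the ISR is replica.lag.time.max.ms. A server.properties sketch, with the 0.10.x defaults shown as examples rather than recommendations:

    # How long a follower may go without catching up (or fetching) before it is
    # dropped from the ISR. Raising this can reduce ISR churn caused by transient
    # network latency, at the cost of slower detection of genuinely dead replicas.
    replica.lag.time.max.ms=10000

    # Broker-to-ZooKeeper session timeout; a value too low for the environment
    # can also contribute to ISR and controller churn.
    zookeeper.session.timeout.ms=6000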

Re: ISR churn

2017-03-22 Thread Jun MA
Let me know if this fixes your issue! I’d be really interested to know: based on what information should we decide to expand the cluster - bytes per second or the number of partitions on each broker? And what is the limit? > On Mar 22, 2017, at 11:46 AM, Marcos Juarez wrote: > > We're seeing the

Re: ISR churn

2017-03-22 Thread Jeff Widman
To manually fail over the controller, just delete the /controller znode in zookeeper. On Wed, Mar 22, 2017 at 11:46 AM, Marcos Juarez wrote: > We're seeing the same exact pattern of ISR shrinking/resizing, mostly on > partitions with the largest volume, with thousands of messages per second. > It
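The same failover can be scripted with the ZooKeeper Java client (zkCli.sh with "delete /controller" achieves the same thing). A minimal sketch; the connection string and session timeout are assumptions. Deleting the ephemeral /controller znode forces the brokers to elect a new controller:

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class ControllerFailover {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);
            // Assumes ZooKeeper is reachable at localhost:2181; adjust for your ensemble.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await();
            // -1 means "any version"; removing the ephemeral znode triggers a new election.
            zk.delete("/controller", -1);
            zk.close();
        }
    }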

Re: ISR churn

2017-03-22 Thread Marcos Juarez
We're seeing the same exact pattern of ISR shrinking/resizing, mostly on partitions with the largest volume, with thousands of messages per second. It happens at least a hundred times a day in our production cluster. We do have hundreds of topics in our cluster, most of them with 20 or more partiti

Re: ISR churn

2017-03-22 Thread Manikumar
Any errors related to zookeeper session timeout? We can also check for excessive GC. Sometimes this can be due to multiple controllers forming because of soft failures. You can check ActiveControllerCount on the brokers; only one broker in the cluster should have a value of 1. Also check for network issues/partitions
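A minimal sketch for checking ActiveControllerCount across brokers over JMX; the broker host names and JMX port below are placeholders. The per-broker values should sum to exactly 1, and a sum greater than 1 points to the multiple-controller soft-failure scenario described above:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class ActiveControllerCheck {
        public static void main(String[] args) throws Exception {
            // Placeholder broker JMX endpoints; replace with your brokers' host:jmxPort.
            String[] brokers = {"n1:9999", "n2:9999", "n3:9999", "n4:9999"};
            ObjectName metric = new ObjectName("kafka.controller:type=KafkaController,name=ActiveControllerCount");
            int total = 0;
            for (String broker : brokers) {
                JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://" + broker + "/jmxrmi");
                JMXConnector connector = JMXConnectorFactory.connect(url);
                MBeanServerConnection mbsc = connector.getMBeanServerConnection();
                int count = ((Number) mbsc.getAttribute(metric, "Value")).intValue();
                System.out.println(broker + " ActiveControllerCount=" + count);
                total += count;
                connector.close();
            }
            // Exactly one broker should report 1; anything else suggests controller churn or split brain.
            System.out.println("Cluster total: " + total + (total == 1 ? " (OK)" : " (check for multiple controllers!)"));
        }
    }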