Look at the producer purgatory size.  Anything greater than 10 is bad (in my 
experience); keeping it under 4 seemed to help us.  If a broker is getting 
slammed with writes, use the rebalance tools or add a new broker.  Also check 
network latency and/or adjust the timeout used for ISR checking.  If on AWS, make 
sure to enable “enhanced networking” (aka: networking that doesn’t suck).
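
If you want to keep an eye on that number, here's a rough sketch of reading the 
produce purgatory gauge over JMX (assumes JMX is enabled on the broker; the 
host/port are placeholders, and the MBean name is the one 0.10.x brokers expose - 
older releases name it differently):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ProducePurgatoryCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder JMX endpoint for one broker.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker1.example.com:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // Gauge counting produce requests currently parked in purgatory.
            ObjectName purgatory = new ObjectName(
                    "kafka.server:type=DelayedOperationPurgatory,delayedOperation=Produce,name=PurgatorySize");
            Object size = conn.getAttribute(purgatory, "Value");
            System.out.println("Produce purgatory size: " + size);
        }
    }
}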

On 3/22/17, 3:39 PM, "Jun MA" <mj.saber1...@gmail.com> wrote:

    Let me know if this fixes your issue! I’d really be interested to know what 
information we should use to decide when to expand the cluster - bytes per second 
or the number of partitions on each broker? And what is the limit?
    
    
    > On Mar 22, 2017, at 11:46 AM, Marcos Juarez <mjua...@gmail.com> wrote:
    > 
    > We're seeing the exact same pattern of ISR shrinking/expanding, mostly on 
partitions with the largest volume, with thousands of messages per second.  It 
happens at least a hundred times a day in our production cluster. We do have 
hundreds of topics in our cluster, most of them with 20 or more partitions, but 
most of them see only a few hundred messages per minute.  
    > 
    > We're running Kafka 0.10.0.1, and we thought upgrading to 0.10.1.1 would 
fix the issue, but we've since deployed that version to our staging cluster and 
we're seeing the same problem.  We still haven't tried out the latest 0.10.2.0 
release, but I don't see any evidence pointing to a fix there.
    > 
    > This ticket seems to have some similar details, but there doesn't seem to 
have been any follow-up, and there's no target release for a fix:
    > 
    > https://issues.apache.org/jira/browse/KAFKA-4674
    > 
    > Jun Ma, what exactly did you do to fail over the controller to a new 
broker? If that worked for you, I'd like to try it in our staging clusters.
    > 
    > Thanks,
    > 
    > Marcos Juarez
    > 
    > 
    > 
    > 
    > 
    > On Wed, Mar 22, 2017 at 11:55 AM, Jun MA <mj.saber1...@gmail.com> wrote:
    > I have a similar issue with our cluster. We don’t know the root cause, but 
we have some interesting observations.
    > 
    > 1. We do see a correlation between ISR churn and fetcher connections being 
closed/created.
    > 
    > 
    > 2. We’ve tried adding a broker which doesn’t have any partitions on it, 
dedicated to the controller (rolling-restart the existing brokers and fail the 
controller over to the newly added broker), and that did eliminate the random ISR 
churn. We have a cluster of 6 brokers (besides the dedicated controller) and 
each one has about 300 partitions on it. I suspect a Kafka broker cannot 
handle running the controller plus 300 partitions.
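    > 
    > For the failover itself, one way (without another full restart) is to delete
    > the ephemeral /controller znode in ZooKeeper, which forces a new controller
    > election; repeat (or bounce the winner) until the dedicated broker holds it.
    > A minimal sketch with the ZooKeeper Java client - the connect string is a
    > placeholder for your own ensemble:
    > 
    > import java.util.concurrent.CountDownLatch;
    > import org.apache.zookeeper.Watcher.Event.KeeperState;
    > import org.apache.zookeeper.ZooKeeper;
    > 
    > public class ControllerFailover {
    >     public static void main(String[] args) throws Exception {
    >         CountDownLatch connected = new CountDownLatch(1);
    >         // Placeholder connect string - point at the same ensemble the brokers use.
    >         ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 30000, event -> {
    >             if (event.getState() == KeeperState.SyncConnected) {
    >                 connected.countDown();
    >             }
    >         });
    >         connected.await();
    >         // /controller is an ephemeral znode owned by the current controller;
    >         // deleting it (version -1 = any version) triggers a new election.
    >         zk.delete("/controller", -1);
    >         zk.close();
    >     }
    > }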
    > 
    > Anyway, that’s what I’ve got so far; I’d also like to know how to debug this.
    > We’re running Kafka 0.9.0.1 with an 8 GB heap.
    > 
    > Thanks,
    > Jun
    > 
    >> On Mar 22, 2017, at 7:06 AM, Manikumar <manikumar.re...@gmail.com> wrote:
    >> 
    >> Any errors related to ZooKeeper session timeouts? We can also check for
    >> excessive GC.
    >> Sometimes this may be due to multiple controllers forming after soft
    >> failures.
    >> You can check ActiveControllerCount on the brokers; only one broker in the
    >> cluster should report 1.
    >> Also check for network issues/partitions.
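    >> 
    >> As a rough sketch (broker hosts/ports below are placeholders, and JMX needs
    >> to be enabled on each broker), the count can be read over JMX and summed
    >> across the cluster - the total should be exactly 1:
    >> 
    >> import javax.management.MBeanServerConnection;
    >> import javax.management.ObjectName;
    >> import javax.management.remote.JMXConnector;
    >> import javax.management.remote.JMXConnectorFactory;
    >> import javax.management.remote.JMXServiceURL;
    >> 
    >> public class ActiveControllerCheck {
    >>     public static void main(String[] args) throws Exception {
    >>         // Placeholder JMX endpoints, one per broker in the cluster.
    >>         String[] brokers = {"broker1:9999", "broker2:9999", "broker3:9999"};
    >>         ObjectName bean = new ObjectName(
    >>                 "kafka.controller:type=KafkaController,name=ActiveControllerCount");
    >>         int total = 0;
    >>         for (String broker : brokers) {
    >>             JMXServiceURL url = new JMXServiceURL(
    >>                     "service:jmx:rmi:///jndi/rmi://" + broker + "/jmxrmi");
    >>             try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
    >>                 MBeanServerConnection conn = connector.getMBeanServerConnection();
    >>                 int count = ((Number) conn.getAttribute(bean, "Value")).intValue();
    >>                 System.out.println(broker + " ActiveControllerCount=" + count);
    >>                 total += count;
    >>             }
    >>         }
    >>         System.out.println("Cluster total (should be exactly 1): " + total);
    >>     }
    >> }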
    >> 
    >> On Wed, Mar 22, 2017 at 7:21 PM, Radu Radutiu <rradu...@gmail.com> wrote:
    >> 
    >>> Hello,
    >>> 
    >>> Does anyone know how I can debug high ISR churn on the Kafka leader in a
    >>> cluster without traffic? I have 2 topics on a 4-node cluster (replication
    >>> factor 4 and 3), and both show constant changes in the number of in-sync
    >>> replicas:
    >>> 
    >>> [2017-03-22 15:30:10,945] INFO Partition [__consumer_offsets,0] on broker 2: Expanding ISR for partition __consumer_offsets-0 from 2,4 to 2,4,5 (kafka.cluster.Partition)
    >>> [2017-03-22 15:31:41,193] INFO Partition [__consumer_offsets,0] on broker 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,4,5 to 2,5 (kafka.cluster.Partition)
    >>> [2017-03-22 15:31:41,195] INFO Partition [__consumer_offsets,0] on broker 2: Expanding ISR for partition __consumer_offsets-0 from 2,5 to 2,5,4 (kafka.cluster.Partition)
    >>> [2017-03-22 15:35:03,443] INFO Partition [__consumer_offsets,0] on broker 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,5,4 to 2,5 (kafka.cluster.Partition)
    >>> [2017-03-22 15:35:03,445] INFO Partition [__consumer_offsets,0] on broker 2: Expanding ISR for partition __consumer_offsets-0 from 2,5 to 2,5,4 (kafka.cluster.Partition)
    >>> [2017-03-22 15:37:01,443] INFO Partition [__consumer_offsets,0] on broker 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,5,4 to 2,4 (kafka.cluster.Partition)
    >>> [2017-03-22 15:37:01,445] INFO Partition [__consumer_offsets,0] on broker 2: Expanding ISR for partition __consumer_offsets-0 from 2,4 to 2,4,5 (kafka.cluster.Partition)
    >>> 
    >>> and
    >>> 
    >>> [2017-03-22 15:09:52,646] INFO Partition [topic1,0] on broker 5: Shrinking ISR for partition [topic1,0] from 5,2,4 to 5,4 (kafka.cluster.Partition)
    >>> [2017-03-22 15:09:52,648] INFO Partition [topic1,0] on broker 5: Expanding ISR for partition topic1-0 from 5,4 to 5,4,2 (kafka.cluster.Partition)
    >>> [2017-03-22 15:24:05,646] INFO Partition [topic1,0] on broker 5: Shrinking ISR for partition [topic1,0] from 5,4,2 to 5,4 (kafka.cluster.Partition)
    >>> [2017-03-22 15:24:05,648] INFO Partition [topic1,0] on broker 5: Expanding ISR for partition topic1-0 from 5,4 to 5,4,2 (kafka.cluster.Partition)
    >>> [2017-03-22 15:26:49,599] INFO Partition [topic1,0] on broker 5: Expanding ISR for partition topic1-0 from 5,4,2 to 5,4,2,1 (kafka.cluster.Partition)
    >>> [2017-03-22 15:27:46,396] INFO Partition [topic1,0] on broker 5: Shrinking ISR for partition [topic1,0] from 5,4,2,1 to 5,4,1 (kafka.cluster.Partition)
    >>> [2017-03-22 15:27:46,398] INFO Partition [topic1,0] on broker 5: Expanding ISR for partition topic1-0 from 5,4,1 to 5,4,1,2 (kafka.cluster.Partition)
    >>> [2017-03-22 15:45:47,896] INFO Partition [topic1,0] on broker 5: Shrinking ISR for partition [topic1,0] from 5,4,1,2 to 5,1,2 (kafka.cluster.Partition)
    >>> [2017-03-22 15:45:47,898] INFO Partition [topic1,0] on broker 5: Expanding ISR for partition topic1-0 from 5,1,2 to 5,1,2,4 (kafka.cluster.Partition)
    >>> 
    >>> I have tried increasing num.network.threads (now 8) and
    >>> num.replica.fetchers (now 2), but nothing has changed.
    >>> 
    >>> The Kafka server config is:
    >>> 
    >>> default.replication.factor=4
    >>> log.retention.check.interval.ms=300000
    >>> log.retention.hours=168
    >>> log.roll.hours=24
    >>> log.segment.bytes=104857600
    >>> min.insync.replicas=2
    >>> num.io.threads=8
    >>> num.network.threads=15
    >>> num.partitions=1
    >>> num.recovery.threads.per.data.dir=1
    >>> num.replica.fetchers=2
    >>> offsets.topic.num.partitions=1
    >>> offsets.topic.replication.factor=3
    >>> replica.lag.time.max.ms=500
    >>> socket.receive.buffer.bytes=102400
    >>> socket.request.max.bytes=104857600
    >>> socket.send.buffer.bytes=102400
    >>> unclean.leader.election.enable=false
    >>> zookeeper.connection.timeout.ms=3000
    >>> 
    >>> Best regards,
    >>> Radu
    >>> 
    > 
    > 
    
    
