In addition to "replica.lag.time.max.ms", please also revisit 
"replica.lag.max.messages". Simply increasing the lag time won't prevent ISR 
blips on its own. Because that second check is based on the number of messages 
a follower is behind, a single large batch committed on the leader can still 
throw a follower out of the ISR, regardless of the time-based setting.
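
For reference, a rough sketch of what raising both settings together might look
like in server.properties (the values below are only illustrative, not tested
recommendations; the 0.8.2 defaults are 10000 ms and 4000 messages):

replica.lag.time.max.ms=30000
replica.lag.max.messages=50000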

Aditya

________________________________________
From: Christofer Hedbrandh [christo...@knewton.com]
Sent: Monday, June 29, 2015 11:26 AM
To: users@kafka.apache.org
Subject: Re: Sporadic Shrinking/Expanding ISR Blips

Hi Guozhang,

It does seem to happen more frequently during higher load, but has also
been observed during hours with very little traffic. Sounds like we might
try increasing "replica.lag.time.max.ms".

Any thoughts on why consumers experience a delay when this happens?

Thanks,
Christofer

On Mon, Jun 29, 2015 at 2:08 PM, Guozhang Wang <wangg...@gmail.com> wrote:

> Hi Christofer,
>
> Does this ISR churn usually come with incoming traffic spikes? If yes I
> would suggest you increase both replica.lag.time.max.ms and
> replica.lag.max.messages and see if it helps. The SocketTimeoutException is
> normal when the follower's consumer gets disconnected from the leader.
>
> In the coming 0.8.3 release there is a new feature added (KAFKA-1546
> <https://issues.apache.org/jira/browse/KAFKA-1546>) that handles traffic
> spikes better.
>
> Guozhang
>
> On Mon, Jun 29, 2015 at 10:19 AM, Christofer Hedbrandh <
> christo...@knewton.com> wrote:
>
> > Hi Kafka users,
> >
> > Every ~3-4 days we are seeing a broker logging "Shrinking ISR for
> > partition..." only to log "Expanding ISR for partition..." a few seconds
> > later.
> >
> > - The broker logging "Shrinking ISR..." is the leader of all partitions
> > for which the ISR is shrunk.
> > - The number of affected partitions is always a multiple of the total
> > number of partitions the broker is leading, divided by the number of
> > fetcher threads.
> > - There seems to be no particular pattern to which broker kicks which
> > broker out of the ISR.
> > - After the broker has logged "Shrinking ISR..." for all affected
> > partitions, it logs "Closing socket connection to..." for the broker that
> > was kicked out, and then "Expanding ISR..." begins.
> > - The broker that got kicked out logs "Reconnect due to socket error:
> > java.net.SocketTimeoutException" when the leader broker closes the
> > connection.
> > - Consumers sometimes show latency spikes when this happens.
> >
> >
> > Example leader log:
> > ...Shrinking...
> > [2015-06-28 05:24:51,141] INFO Partition [my-topic,24] on broker 762855627: Shrinking ISR for partition [my-topic,24] from 762855627,729301197,1249886642 to 762855627,729301197 (kafka.cluster.Partition)
> > [2015-06-28 05:24:51,149] INFO Partition [my-topic,16] on broker 762855627: Shrinking ISR for partition [my-topic,16] from 762855627,729301197,1249886642 to 762855627,729301197 (kafka.cluster.Partition)
> > [2015-06-28 05:25:03,295] INFO Closing socket connection to /10.95.197.138. (kafka.network.Processor)
> > [2015-06-28 05:25:03,301] INFO Partition [my-topic,60] on broker 762855627: Expanding ISR for partition [my-topic,60] from 762855627,729301197 to 762855627,729301197,1249886642 (kafka.cluster.Partition)
> > [2015-06-28 05:25:03,339] INFO Partition [my-topic,206] on broker 762855627: Expanding ISR for partition [my-topic,206] from 762855627,729301197 to 762855627,729301197,1249886642 (kafka.cluster.Partition)
> > ...Expanding...
> >
> > Example follower log:
> > [2015-06-28 05:25:03,299] INFO Reconnect due to socket error:
> > java.net.SocketTimeoutException (kafka.consumer.SimpleConsumer)
> >
> >
> > I am tempted to just increase "replica.lag.time.max.ms" from the default
> > 10 seconds that we are currently using, but the SocketTimeoutException is
> > suggesting that there may be something else going on here. It also
> > surprises me that consumers are affected by this, since no broker
> > partition leadership re-election is taking place. Any advice on how to
> > proceed would be much appreciated.
> >
> >
> > The setup:
> > Three Kafka 0.8.2 brokers running on AWS EC2 c3.2xlarge instances.
> > Three ZooKeeper 3.3.6 servers running on AWS EC2 m1.large instances.
> >
> > Broker configs:
> > auto.leader.rebalance.enable=true
> > log.segment.bytes=536870912
> > socket.receive.buffer.bytes=1048576
> > socket.send.buffer.bytes=1048576
> > log.flush.scheduler.interval.ms=3000
> > log.flush.interval.ms=3000
> > controlled.shutdown.enable=true
> > delete.topic.enable=true
> > zookeeper.connection.timeout.ms=1000000
> > replica.fetch.max.bytes=10485760
> > num.replica.fetchers=4
> > log.dirs=XXX
> > log.retention.minutes=5256000
> > message.max.bytes=10485760
> > log.flush.interval.messages=10000
> > auto.create.topics.enable=false
> > port=9092
> > zookeeper.connect=XXX
> > broker.id=XXX
> > log.roll.hours=24
> >
> > Peak load:
> > total messages per second: 80
> > total bytes in per second: 110 KiB
> > total bytes out per second: 440 KiB (220 of which is replication)
> >
> >
> > Thanks,
> > Christofer
> >
>
>
>
> --
> -- Guozhang
>
