Any errors related to ZooKeeper session timeouts? We can also check for excessive GC pauses. Sometimes this can be caused by multiple controllers forming after soft failures. You can check the ActiveControllerCount metric on the brokers; exactly one broker in the cluster should report 1. Also check for network issues/partitions.
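As a rough sketch of those checks (the broker host names, JMX port, and log file paths below are assumptions; adjust them to your deployment):

```shell
# Query ActiveControllerCount on each broker via Kafka's bundled JmxTool;
# exactly one broker should report 1. JMX port 9999 is an assumption --
# it depends on how JMX_PORT was set when the brokers were started.
for host in broker1 broker2 broker3 broker4; do
  echo "== $host =="
  bin/kafka-run-class.sh kafka.tools.JmxTool \
    --jmx-url "service:jmx:rmi:///jndi/rmi://$host:9999/jmxrmi" \
    --object-name kafka.controller:type=KafkaController,name=ActiveControllerCount
done

# Look for ZooKeeper session expirations and long GC pauses in the broker
# and GC logs (paths are assumptions).
grep -i "zookeeper session expired" /var/log/kafka/server.log
grep "Total time for which application threads were stopped" /var/log/kafka/kafkaServer-gc.log
```

If more than one broker reports ActiveControllerCount = 1, or the GC log shows stop-the-world pauses approaching the ZooKeeper session timeout, that would explain the churn.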
On Wed, Mar 22, 2017 at 7:21 PM, Radu Radutiu <rradu...@gmail.com> wrote:
> Hello,
>
> Does anyone know how I can debug high ISR churn on the kafka leader on a
> cluster without traffic? I have 2 topics on a 4 node cluster (replica 4
> and replica 3) and both show constant changes of the number of insync
> replicas:
>
> [2017-03-22 15:30:10,945] INFO Partition [__consumer_offsets,0] on broker 2: Expanding ISR for partition __consumer_offsets-0 from 2,4 to 2,4,5 (kafka.cluster.Partition)
> [2017-03-22 15:31:41,193] INFO Partition [__consumer_offsets,0] on broker 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,4,5 to 2,5 (kafka.cluster.Partition)
> [2017-03-22 15:31:41,195] INFO Partition [__consumer_offsets,0] on broker 2: Expanding ISR for partition __consumer_offsets-0 from 2,5 to 2,5,4 (kafka.cluster.Partition)
> [2017-03-22 15:35:03,443] INFO Partition [__consumer_offsets,0] on broker 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,5,4 to 2,5 (kafka.cluster.Partition)
> [2017-03-22 15:35:03,445] INFO Partition [__consumer_offsets,0] on broker 2: Expanding ISR for partition __consumer_offsets-0 from 2,5 to 2,5,4 (kafka.cluster.Partition)
> [2017-03-22 15:37:01,443] INFO Partition [__consumer_offsets,0] on broker 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,5,4 to 2,4 (kafka.cluster.Partition)
> [2017-03-22 15:37:01,445] INFO Partition [__consumer_offsets,0] on broker 2: Expanding ISR for partition __consumer_offsets-0 from 2,4 to 2,4,5 (kafka.cluster.Partition)
>
> and
>
> [2017-03-22 15:09:52,646] INFO Partition [topic1,0] on broker 5: Shrinking ISR for partition [topic1,0] from 5,2,4 to 5,4 (kafka.cluster.Partition)
> [2017-03-22 15:09:52,648] INFO Partition [topic1,0] on broker 5: Expanding ISR for partition topic1-0 from 5,4 to 5,4,2 (kafka.cluster.Partition)
> [2017-03-22 15:24:05,646] INFO Partition [topic1,0] on broker 5: Shrinking ISR for partition [topic1,0] from 5,4,2 to 5,4 (kafka.cluster.Partition)
> [2017-03-22 15:24:05,648] INFO Partition [topic1,0] on broker 5: Expanding ISR for partition topic1-0 from 5,4 to 5,4,2 (kafka.cluster.Partition)
> [2017-03-22 15:26:49,599] INFO Partition [topic1,0] on broker 5: Expanding ISR for partition topic1-0 from 5,4,2 to 5,4,2,1 (kafka.cluster.Partition)
> [2017-03-22 15:27:46,396] INFO Partition [topic1,0] on broker 5: Shrinking ISR for partition [topic1,0] from 5,4,2,1 to 5,4,1 (kafka.cluster.Partition)
> [2017-03-22 15:27:46,398] INFO Partition [topic1,0] on broker 5: Expanding ISR for partition topic1-0 from 5,4,1 to 5,4,1,2 (kafka.cluster.Partition)
> [2017-03-22 15:45:47,896] INFO Partition [topic1,0] on broker 5: Shrinking ISR for partition [topic1,0] from 5,4,1,2 to 5,1,2 (kafka.cluster.Partition)
> [2017-03-22 15:45:47,898] INFO Partition [topic1,0] on broker 5: Expanding ISR for partition topic1-0 from 5,1,2 to 5,1,2,4 (kafka.cluster.Partition)
>
> I have tried increasing the num.network.threads (now 8) and
> num.replica.fetchers (now 2) but nothing has changed.
>
> The kafka server config is:
>
> default.replication.factor=4
> log.retention.check.interval.ms=300000
> log.retention.hours=168
> log.roll.hours=24
> log.segment.bytes=104857600
> min.insync.replicas=2
> num.io.threads=8
> num.network.threads=15
> num.partitions=1
> num.recovery.threads.per.data.dir=1
> num.replica.fetchers=2
> offsets.topic.num.partitions=1
> offsets.topic.replication.factor=3
> replica.lag.time.max.ms=500
> socket.receive.buffer.bytes=102400
> socket.request.max.bytes=104857600
> socket.send.buffer.bytes=102400
> unclean.leader.election.enable=false
> zookeeper.connection.timeout.ms=3000
>
> Best regards,
> Radu