I've seen this before, and it was caused by long GC pauses, driven in large part by a heap larger than 8 GB.
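If you want to confirm that, enable GC logging and line the pause times up with the shrink/expand timestamps in your logs. As a rough sketch of where I'd start (the G1 options mirror what Kafka's stock kafka-run-class.sh sets by default; the 6g heap size and the log path are placeholders, not tuned recommendations for your workload):

    # Fixed, modest heap; leave the rest of RAM to the OS page cache
    export KAFKA_HEAP_OPTS="-Xmx6g -Xms6g"

    # G1 with a short pause target (Kafka's stock defaults)
    export KAFKA_JVM_PERFORMANCE_OPTS="-server -XX:+UseG1GC \
        -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35"

    # JDK 8-style GC logging, so pauses can be correlated with ISR churn
    export KAFKA_GC_LOG_OPTS="-Xloggc:/var/log/kafka/gc.log -verbose:gc \
        -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
        -XX:+PrintGCApplicationStoppedTime"
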
--John

On Thu, Nov 9, 2017 at 8:17 AM, Json Tu <kafka...@126.com> wrote:
> Hi,
> we have a Kafka cluster made of 6 brokers, with 8 CPUs and 16 GB of
> memory on each broker's machine, and we have about 1600 topics in the
> cluster, with about 1700 partition leaders and 1600 partition replicas
> on each broker.
> When we restart an otherwise normal broker, we find that 500+
> partitions shrink and expand their ISRs repeatedly during the restart;
> there are many logs like the ones below.
>
> [2017-11-09 17:05:51,173] INFO Partition [Yelp,5] on broker 4759726:
> Expanding ISR for partition [Yelp,5] from 4759726 to 4759726,4759750
> (kafka.cluster.Partition)
> [2017-11-09 17:06:22,047] INFO Partition [Yelp,5] on broker 4759726:
> Shrinking ISR for partition [Yelp,5] from 4759726,4759750 to 4759726
> (kafka.cluster.Partition)
> [2017-11-09 17:06:28,634] INFO Partition [Yelp,5] on broker 4759726:
> Expanding ISR for partition [Yelp,5] from 4759726 to 4759726,4759750
> (kafka.cluster.Partition)
> [2017-11-09 17:06:44,658] INFO Partition [Yelp,5] on broker 4759726:
> Shrinking ISR for partition [Yelp,5] from 4759726,4759750 to 4759726
> (kafka.cluster.Partition)
> [2017-11-09 17:06:47,611] INFO Partition [Yelp,5] on broker 4759726:
> Expanding ISR for partition [Yelp,5] from 4759726 to 4759726,4759750
> (kafka.cluster.Partition)
> [2017-11-09 17:07:19,703] INFO Partition [Yelp,5] on broker 4759726:
> Shrinking ISR for partition [Yelp,5] from 4759726,4759750 to 4759726
> (kafka.cluster.Partition)
> [2017-11-09 17:07:26,811] INFO Partition [Yelp,5] on broker 4759726:
> Expanding ISR for partition [Yelp,5] from 4759726 to 4759726,4759750
> (kafka.cluster.Partition)
> …
>
> The shrinking and expanding repeats after 30 minutes, which is the
> default value of leader.imbalance.check.interval.seconds; at that time
> we can also see the controller's auto-rebalance log, which moves some
> partitions' leaders back to the restarted broker.
> We see no shrinking or expanding while the cluster is running
> normally, only when we restart a broker, so replica.fetch.thread.num
> is 1 and that seems to be enough.
> We can reproduce this on every restart. Can someone give us some
> suggestions? Thanks in advance.
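
One more thing worth checking while the restarted broker replays its backlog: whether the follower is simply falling outside the replica.lag.time.max.ms window until it catches up. A quick sketch with the stock tools (zk1:2181 is a placeholder for your ZooKeeper connect string):

    # Partitions whose ISR is currently smaller than the replica set
    bin/kafka-topics.sh --zookeeper zk1:2181 \
        --describe --under-replicated-partitions

    # Broker-side settings that drive the shrink/expand decision:
    #   replica.lag.time.max.ms=10000   # follower dropped from the ISR
    #                                   # after lagging this long (default)
    #   num.replica.fetchers=1          # default; raising it (e.g. to 4)
    #                                   # can help a restarted broker catch up

(num.replica.fetchers is, I believe, the setting you mean by replica.fetch.thread.num.)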