Hello Kafka Dev,

We need help with a lag issue we are seeing in one of our environments, which does not have much load. We run Kafka in multiple environments, and in one of them events take a very long time (sometimes more than a day) to be processed from Kafka. The topic has two partitions and three replicas, and two consumers are running on it (so there is a one-to-one mapping between partitions and consumers).

When I run kafka-consumer-groups.sh to check the stats, I see lag on one of the consumers; after some time the lag moves to the other consumer, and it keeps switching back and forth, increasing the time taken to process events. It looks to me as if rebalancing is happening, but the consumer IDs stay the same, so the consumers are not being restarted in between. We also tried restarting Kafka and ZooKeeper, but the result is the same. Here are the details.

[2018-10-12 03:52:21,676] WARN Removing server circle2-kafka2:909 from bootstrap.servers as DNS resolution failed for circle2-kafka2 (org.apache.kafka.clients.ClientUtils)
group-es
group-rds

[vikas@circle1-kafka1 kafka]$ ./bin/kafka-consumer-groups.sh --bootstrap-server circle1-kafka1:9092,circle2-kafka2:9092, circle1-kafka3 -describe -group group-rds
Note: This will not show information about old Zookeeper-based consumers.
[2018-10-12 03:53:06,226] WARN Removing server circle2-kafka2:9092 from bootstrap.servers as DNS resolution failed for circle2-kafka2 (org.apache.kafka.clients.ClientUtils)
[2018-10-12 03:53:06,436] WARN Removing server circle2-kafka2:9092 from bootstrap.servers as DNS resolution failed for circle2-kafka2 (org.apache.kafka.clients.ClientUtils)

TOPIC         PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG   CONSUMER-ID                                                                                    HOST           CLIENT-ID
topic.events  1          45471           45471           0     data-consumer-i-00404a50d7551ef37-circle1-ecs2-group-rds-dc1cb0e1-48fb-40c5-bd96-0e9980e1083d  /172.27.4.133  data-consumer-i-00404a50d7551ef37-circle1-ecs2-group-rds
topic.events  0          344987          346323          1336  data-consumer-i-00404a50d7551ef37-circle1-ecs2-group-rds-3a13af04-048f-40b4-9b09-b74a9600dfd8  /172.27.4.133  data-consumer-i-00404a50d7551ef37-circle1-ecs2-group-rds

[vikas@circle1-kafka1 kafka]$ ./bin/kafka-consumer-groups.sh --bootstrap-server circle1-kafka1:9092,circle2-kafka2:9092,circle1-kafka3 -describe -group group-rds
Note: This will not show information about old Zookeeper-based consumers.
[2018-10-12 04:04:29,725] WARN Removing server circle2-kafka2:9092 from bootstrap.servers as DNS resolution failed for circle2-kafka2 (org.apache.kafka.clients.ClientUtils)
[2018-10-12 04:04:29,926] WARN Removing server circle2-kafka2:9092 from bootstrap.servers as DNS resolution failed for circle2-kafka2 (org.apache.kafka.clients.ClientUtils)

TOPIC         PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG   CONSUMER-ID                                                                                    HOST           CLIENT-ID
topic.events  1          44873           45471           598   data-consumer-i-00404a50d7551ef37-circle1-ecs2-group-rds-dc1cb0e1-48fb-40c5-bd96-0e9980e1083d  /172.27.4.133  data-consumer-i-00404a50d7551ef37-circle1-ecs2-group-rds
topic.events  0          346324          346324          0     data-consumer-i-00404a50d7551ef37-circle1-ecs2-group-rds-3a13af04-048f-40b4-9b09-b74a9600dfd8  /172.27.4.133  data-consumer-i-00404a50d7551ef37-circle1-ecs2-group-rds

Here is the info on the Kafka environment:

1) Version -> kafka_2.11-1.1.0
2) ZooKeeper settings -> default
3) Kafka settings -> most of the settings are default; here are the few specific changes we have made:

zookeeper.connection.timeout.ms=6000
#Setting the replication for nodes under the default of 3
default.replication.factor=3
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3
config.storage.replication.factor=3
offset.storage.replication.factor=3
status.storage.replication.factor=3
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
log.retention.hours=24

Please let me know if you need more details from my end. Your quick help is much appreciated. If you are not able to help, or if I am in the wrong group, please point me to the right group.

Regards,
Vikas
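
For anyone comparing the two --describe outputs above, here is a minimal sketch of the check, assuming Python 3 is available. The names PartitionState, snapshot_0353, snapshot_0404 and committed_offset_regressions are hypothetical, chosen only for illustration, and the offsets are copied by hand from the output above. Lag is taken as LOG-END-OFFSET minus CURRENT-OFFSET, which matches the LAG column the tool prints.

```python
from typing import Dict, NamedTuple, Tuple


class PartitionState(NamedTuple):
    """One row of the --describe output for a single partition."""
    current_offset: int   # CURRENT-OFFSET (last committed offset of the group)
    log_end_offset: int   # LOG-END-OFFSET (latest offset in the partition)

    @property
    def lag(self) -> int:
        # Same arithmetic as the LAG column: log end minus committed offset.
        return self.log_end_offset - self.current_offset


# Values copied by hand from the 03:53 and 04:04 outputs, keyed by partition.
snapshot_0353: Dict[int, PartitionState] = {
    0: PartitionState(344987, 346323),
    1: PartitionState(45471, 45471),
}
snapshot_0404: Dict[int, PartitionState] = {
    0: PartitionState(346324, 346324),
    1: PartitionState(44873, 45471),
}


def committed_offset_regressions(
    before: Dict[int, PartitionState],
    after: Dict[int, PartitionState],
) -> Dict[int, Tuple[int, int]]:
    """Return {partition: (old, new)} where the committed offset decreased."""
    return {
        partition: (before[partition].current_offset, state.current_offset)
        for partition, state in after.items()
        if partition in before
        and state.current_offset < before[partition].current_offset
    }


if __name__ == "__main__":
    for partition in sorted(snapshot_0404):
        print(
            f"partition {partition}: "
            f"lag {snapshot_0353[partition].lag} -> {snapshot_0404[partition].lag}"
        )
    print("regressions:", committed_offset_regressions(snapshot_0353, snapshot_0404))
```

With these numbers, the committed offset for partition 1 is lower in the second output (44873) than in the first (45471), while its log end offset is unchanged. That would be consistent with the consumer re-committing from an older position, but this is only a guess; the output alone does not confirm the cause.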