Scott Kidder created KAFKA-12231:
------------------------------------

             Summary: Consumer Lag increases linearly until a Consumer-Group Rebalance is initiated
                 Key: KAFKA-12231
                 URL: https://issues.apache.org/jira/browse/KAFKA-12231
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 2.6.0
         Environment: Kubernetes 1.12
            Reporter: Scott Kidder
         Attachments: Consumer Lag by Partition.png, Consumer Lag on a Single Partition.png, Lag drop on rebalance.png, max-consumer-lag.png

I observed a linear increase in consumer lag over multiple hours while multiple consumers read from a single topic (480 partitions). The lag stopped increasing once I initiated a consumer-group rebalance by replacing one of the consumers (this was in Kubernetes, so I deleted a consumer pod and its replacement pod joined the group) at 07:46 UTC on the chart below.

!max-consumer-lag.png!

The lag was observed across all topic partitions, but only briefly on each:

!Consumer Lag by Partition.png!

!Consumer Lag on a Single Partition.png!

For additional context, this was a Golang consumer using v1.27.2 of the Shopify Sarama Kafka client. Consumers used the Sticky Partition Assignor to plan assignments, so even after the consumer-group rebalance, the majority of consumers kept their original assignments.

Nothing about the data being consumed and processed from Kafka could explain these punctuated spikes in consumer lag. There were no errors or significant messages in the Kafka broker logs before or after the rebalance.

The lag dropped within 2 minutes of the consumer-group rebalance (initiated at 07:46 UTC, lag fell at 07:48 UTC):

!Lag drop on rebalance.png!

--
This message was sent by Atlassian Jira
(v8.3.4#803005)