Hi everyone, We've seen some instances of our consumer groups when running normally not process any messages from some partitions for minutes while other partitions are seeing regularly updates in seconds. In some cases when a consumer group had a significant lag (hours of messages), some partitions would not see any processing for up to an hour until the lag came back to normal.
Following the code it looks like the client sends a fetch request per leader with all the partitions that broker is leader of[1] and the leader then responds by reading messages off each partition sequentially until max fetch bytes is reached[2]. From this it seems like a consumer might not read any messages from a partition for some time until the consumer has caught up on these other partitions. Does this seem plausible? We determine this stalled processing from a monitoring tool that reads the offset commit messages and tracks the commit time for each group's partitions, so we effectively have the last time an offset was committed for each partition. We are currently using Kafka 0.9.0.1 and the new kafka consumer API. [1] - https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java#L800 [2] - https://github.com/apache/kafka/blob/ae3eeb3e1c194c60fa62fbc7931566f3d8c87d68/core/src/main/scala/kafka/server/ReplicaManager.scala#L935-L943 Bryan