Hi everyone,

We've seen instances where our consumer groups, while running normally, do
not process any messages from some partitions for minutes, while other
partitions see regular updates within seconds. In some cases, when a
consumer group had significant lag (hours of messages), some partitions saw
no processing for up to an hour until the lag returned to normal.

Following the code, it looks like the client sends one fetch request per
leader, covering all the partitions that broker leads [1], and the leader
then responds by reading messages off each partition sequentially until the
max fetch bytes is reached [2]. From this it seems like a consumer might not
read any messages from a partition for some time, until it has caught up on
the other partitions. Does this seem plausible?
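To illustrate what I mean, here is a rough, self-contained sketch (made-up
numbers and names, not the actual broker code) of how filling a fixed byte
budget partition by partition could starve partitions later in the iteration
order when earlier ones are heavily lagged:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Toy model of a leader answering one fetch request: it walks its
    // partitions in order and copies messages into the response until a byte
    // budget is used up. Numbers are invented; this only illustrates the
    // behaviour I think we are seeing, not the real implementation.
    public class FetchStarvationSketch {
        public static void main(String[] args) {
            int fetchBudgetBytes = 1_000_000;   // pretend overall fetch size
            int avgMessageBytes = 1_000;        // pretend message size

            // Backlog (in messages) per partition, in the order the leader
            // reads them.
            Map<String, Integer> backlog = new LinkedHashMap<>();
            backlog.put("topic-0", 5_000);      // heavily lagged
            backlog.put("topic-1", 5_000);      // heavily lagged
            backlog.put("topic-2", 10);         // nearly caught up

            int remaining = fetchBudgetBytes;
            for (Map.Entry<String, Integer> e : backlog.entrySet()) {
                int wanted = e.getValue() * avgMessageBytes;
                int taken = Math.min(wanted, remaining);
                remaining -= taken;
                System.out.printf("%s: returned %d of %d messages%n",
                        e.getKey(), taken / avgMessageBytes, e.getValue());
            }
            // With these numbers topic-2 gets nothing back until topic-0 and
            // topic-1 have been drained over many successive fetches.
        }
    }

With those numbers, topic-2 would not see a single message until the first
two partitions had worked through their backlog, which matches the hour-long
gaps we observed while the group was lagging.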

We detect this stalled processing with a monitoring tool that reads the
offset commit messages and tracks the commit time for each group's
partitions, so we effectively have the last time an offset was committed
for each partition.
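For reference, the monitoring boils down to something like the following
(a simplified sketch of the tool; class and method names are made up for
illustration):

    import java.time.Duration;
    import java.time.Instant;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Simplified view of what our monitoring does: for every offset commit we
    // observe, remember when it happened, keyed by (group, topic, partition).
    // A partition whose entry goes stale is what we call "stalled" above.
    public class CommitLagTracker {
        private final Map<String, Instant> lastCommitTime = new ConcurrentHashMap<>();

        // Called whenever we see an offset commit for a group's partition.
        public void onOffsetCommit(String group, String topic, int partition) {
            lastCommitTime.put(group + "/" + topic + "-" + partition, Instant.now());
        }

        // Partitions that have not committed within the given threshold.
        public Map<String, Instant> stalled(Duration threshold) {
            Instant cutoff = Instant.now().minus(threshold);
            Map<String, Instant> out = new ConcurrentHashMap<>();
            lastCommitTime.forEach((key, ts) -> {
                if (ts.isBefore(cutoff)) out.put(key, ts);
            });
            return out;
        }
    }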

We are currently using Kafka 0.9.0.1 and the new Kafka consumer API.

[1] -
https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java#L800
[2] -
https://github.com/apache/kafka/blob/ae3eeb3e1c194c60fa62fbc7931566f3d8c87d68/core/src/main/scala/kafka/server/ReplicaManager.scala#L935-L943

Bryan
