I'm using Kafka 0.10.0.

I'm reading messages from a single topic (20 partitions) with 4 consumers
in one group, using the standard Java consumer with default configuration.
The only settings I provide are the key and value deserializers and the
group id.
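
For reference, the consumer configuration boils down to roughly the
following. This is only a sketch: the broker address, the group name and
the choice of StringDeserializer are placeholders, not our real values, and
bootstrap.servers is included only because the client needs it to connect.

```java
import java.util.Properties;

public class ConsumerConfigSketch {

    // Builds the minimal property set described above; every other
    // consumer setting is left at its default.
    static Properties consumerProps(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers); // placeholder address
        props.put("group.id", groupId);                   // placeholder group name
        // Deserializer classes are an assumption; substitute the ones in use.
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        return props;
    }

    public static void main(String[] args) {
        Properties props = consumerProps("localhost:9092", "example-group");
        // Print the settings that would be handed to the consumer.
        props.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```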

We have now hit a serious problem a few times, each time after a large
burst of messages (75000) has been posted to the topic. The consumer lag
(as reported by Kafka's kafka-consumer-groups.sh) immediately becomes
huge, which is expected. The consumers start processing the messages, which
is expected to take them at least 30 minutes. In the meantime, more
messages are posted to the topic, but at a "normal" rate, which the
consumers normally handle easily. The problem is that the reported consumer
lag is not decreasing at all. After some 30 minutes, it has even increased
slightly. This would mean that the consumers are not able to process the
backlog at all, which is extremely unlikely.

After a restart of all consumer applications, something really surprising
happens: the lag immediately drops to nearly 0! It is technically
impossible that the consumers really processed all messages in a matter of
seconds. Manual verification showed that many messages were not processed
at all; they seem to have disappeared somehow. So I think that restarting
the consumers somehow messed up the offsets.

On top of that, I noticed that the reported lag shows seemingly impossible
figures. During the time that the lag was not decreasing, before the
restart of the consumers, the "current offset" that was reported for some
partitions decreased. To my knowledge, that is impossible.
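
To make clear what I mean by lag: as far as I understand, the tool reports,
per partition, the log end offset minus the group's current (committed)
offset. A minimal sketch with made-up numbers (not our real offsets),
showing the kind of change I saw, where the current offset goes down
between two readings:

```java
public class LagSketch {

    // Lag for one partition: newest offset in the log minus the
    // offset the consumer group has committed.
    static long lag(long logEndOffset, long currentOffset) {
        return logEndOffset - currentOffset;
    }

    public static void main(String[] args) {
        // Hypothetical readings {logEndOffset, currentOffset} for one
        // partition, taken some minutes apart.
        long[][] readings = { {5000L, 1200L}, {5100L, 1150L} };
        for (long[] r : readings) {
            System.out.println("log end=" + r[0]
                    + " current=" + r[1]
                    + " lag=" + lag(r[0], r[1]));
        }
    }
}
```

In the second reading the current offset is lower than in the first, so the
lag grows by more than the number of newly posted messages.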

Does anyone have an idea on how this could have happened?
