Hi,

With the default configuration, your consumers run with auto.offset.reset=latest.
So if no valid committed offset is found on restart, the consumers start reading 
at the latest offset (0 minutes ago), not at the offset of 30 minutes ago (or 
whatever the lag was).
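
To make the setting visible instead of relying on the default, the reset policy 
can be spelled out in the consumer properties. This is only a minimal sketch: 
the class and method names are hypothetical, StringDeserializer is an 
assumption (the original post does not say which deserializers are used), and 
the broker address and group id are placeholders supplied by the caller.

```java
import java.util.Properties;

// Hypothetical helper, not part of Kafka: builds consumer properties with
// the reset policy set explicitly rather than left at its default ("latest").
final class ConsumerConfigSketch {
    static Properties consumerProps(String bootstrapServers, String groupId, String resetPolicy) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("group.id", groupId);
        // Assumption: string keys and values; swap in the deserializers actually used.
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // "earliest", "latest" or "none"; only consulted when no usable offset exists.
        props.put("auto.offset.reset", resetPolicy);
        return props;
    }
}
```

The resulting Properties object would then be passed to new KafkaConsumer<>(props). 
Whether "earliest" or "none" is the better choice depends on whether reprocessing 
old messages or failing loudly is preferable for the application.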

https://kafka.apache.org/documentation/#configuration
auto.offset.reset
What to do when there is no initial offset in Kafka or if the current offset 
does not exist anymore on the server (e.g. because that data has been deleted):
    earliest: automatically reset the offset to the earliest offset
    latest: automatically reset the offset to the latest offset
    none: throw exception to the consumer if no previous offset is found for 
      the consumer's group
    anything else: throw exception to the consumer.
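
The rule quoted above can be modelled as a small function. This is a simplified 
sketch of the documented behaviour, not Kafka's actual implementation; the class 
name, method name and parameters are hypothetical. It assumes a committed offset 
is usable only if it lies between the partition's first available offset and its 
log end offset.

```java
// Simplified model of the documented auto.offset.reset rule (not Kafka code).
final class OffsetResetSketch {
    /**
     * @param committed last committed offset for the group, or null if none exists
     * @param logStart  earliest offset still available on the server
     * @param logEnd    offset the next produced message would get
     * @param policy    value of auto.offset.reset
     * @return the offset at which the consumer would start reading
     */
    static long startPosition(Long committed, long logStart, long logEnd, String policy) {
        // A committed offset is honoured only if the data it points to still exists.
        boolean usable = committed != null && committed >= logStart && committed <= logEnd;
        if (usable) {
            return committed;
        }
        switch (policy) {
            case "earliest":
                return logStart; // replay everything still retained
            case "latest":
                return logEnd;   // skip the backlog, read only new messages
            default:
                // "none" and any other value: surface the problem to the caller
                throw new IllegalStateException("No usable offset and policy is '" + policy + "'");
        }
    }
}
```

With policy "latest" and no usable committed offset, the start position equals 
the log end offset, which would show up as a lag of nearly 0 right after the 
restart.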

As for the "current offset" that seems to decrease, I have no idea.

Isabelle Giguère
Computational Linguist and Java Developer
Linguiste informaticienne et développeur Java

_________
Open Text
The Content Experts

-----Original Message-----
From: Tom van den Berge [mailto:tom.vandenbe...@gmail.com] 
Sent: November 29, 2017 17:16
To: users@kafka.apache.org
Subject: [EXTERNAL] - Lost messages and messed up offsets

I'm using Kafka 0.10.0.

I'm reading messages from a single topic (20 partitions), using 4 consumers 
(one group), using a standard Java consumer with default configuration, except 
for the key and value deserializer, and a group id; no other settings.

We've been experiencing a serious problem a few times now, after a large burst 
of messages (75000) has been posted to the topic. The consumer lag (as 
reported by Kafka's kafka-consumer-groups.sh) immediately becomes huge, 
which is expected. The consumers start processing the messages, which is 
expected to take them at least 30 minutes. In the meantime, more messages are 
posted to the topic, but at a "normal" rate, which the consumers normally 
handle easily. The problem is that the reported consumer lag is not decreasing 
at all. After some 30 minutes, it has even increased slightly. This would mean 
that the consumers are not able to process the backlog at all, which is 
extremely unlikely.

After a restart of all consumer applications, something really surprising
happens: the lag immediately drops to nearly 0! It is technically impossible 
that the consumers really processed all messages in a matter of seconds. Manual 
verification showed that many messages were not processed at all; they seem to 
have disappeared somehow. So it seems that restarting the consumers somehow 
messed up the offset (I think).

On top of that, I noticed that the reported lag shows seemingly impossible 
figures. During the time that the lag was not decreasing, before the restart of 
the consumers, the "current offset" that was reported for some partitions 
decreased. To my knowledge, that is impossible.

Does anyone have an idea on how this could have happened?
