This problem was solved by upgrading Kafka from 0.10 to 0.11 (both broker and client).
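
For anyone finding this thread later, here is a rough sketch of the consumer configuration described in the quoted messages below: everything left at its defaults, apart from the deserializers and a group id. The two auto-commit values are the defaults quoted in the thread. The broker address, the group id, and the choice of StringDeserializer are assumptions for illustration only; the thread does not say which were used. The class and method names are hypothetical.

```java
import java.util.Properties;

// Sketch only: mirrors the consumer settings described in this thread.
// ConsumerConfigSketch and buildConsumerProps are hypothetical names.
class ConsumerConfigSketch {

    static Properties buildConsumerProps(String bootstrapServers, String groupId) {
        Properties props = new Properties();

        // Assumed values: the thread does not give the broker address or group id.
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("group.id", groupId);

        // Assumption: String keys and values. The thread only says that a key and
        // a value deserializer were configured, not which ones.
        props.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        // Defaults quoted in the thread, written out explicitly for visibility.
        props.setProperty("enable.auto.commit", "true");
        props.setProperty("auto.commit.interval.ms", "5000");

        return props;
    }

    public static void main(String[] args) {
        // Placeholder broker address and group id.
        Properties props = buildConsumerProps("localhost:9092", "example-group");
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

With the kafka-clients library on the classpath, a Properties object like this can be passed to the KafkaConsumer constructor. The sketch itself uses only the JDK, so it runs without a broker.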

Thanks for your feedback.


On Thu, Nov 30, 2017 at 10:03 AM, Tom van den Berge <tom.vandenbe...@gmail.com> wrote:

> The consumers are using default settings, which means that
> enable.auto.commit=true and auto.commit.interval.ms=5000. I'm not
> committing manually; just consuming messages.
>
> On Thu, Nov 30, 2017 at 1:09 AM, Frank Lyaruu <flya...@gmail.com> wrote:
>
>> Do you commit the received messages? Either by doing it manually or by
>> setting enable.auto.commit and auto.commit.interval.ms?
>>
>> On Wed, Nov 29, 2017 at 11:15 PM, Tom van den Berge <tom.vandenbe...@gmail.com> wrote:
>>
>> > I'm using Kafka 0.10.0.
>> >
>> > I'm reading messages from a single topic (20 partitions), using 4
>> > consumers (one group), using a standard Java consumer with default
>> > configuration, except for the key and value deserializer, and a group
>> > id; no other settings.
>> >
>> > We've experienced a serious problem a few times now, each time after a
>> > large burst of messages (75000) has been posted to the topic. The
>> > consumer lag (as reported by Kafka's kafka-consumer-groups.sh)
>> > immediately becomes huge, which is expected. The consumers start
>> > processing the messages, which is expected to take them at least 30
>> > minutes. In the meantime, more messages are posted to the topic, but at
>> > a "normal" rate, which the consumers normally handle easily. The problem
>> > is that the reported consumer lag is not decreasing at all. After some
>> > 30 minutes, it has even increased slightly. This would mean that the
>> > consumers are not able to process the backlog at all, which is extremely
>> > unlikely.
>> >
>> > After a restart of all consumer applications, something really
>> > surprising happens: the lag immediately drops to nearly 0! It is
>> > technically impossible that the consumers really processed all messages
>> > in a matter of seconds. Manual verification showed that many messages
>> > were not processed at all; they seem to have disappeared somehow. So it
>> > seems that restarting the consumers somehow messed up the offsets (I
>> > think).
>> >
>> > On top of that, I noticed that the reported lag shows seemingly
>> > impossible figures. During the time that the lag was not decreasing,
>> > before the restart of the consumers, the "current offset" that was
>> > reported for some partitions decreased. To my knowledge, that is
>> > impossible.
>> >
>> > Does anyone have an idea on how this could have happened?
>> >
>>
>
>
