This problem was solved by upgrading from 0.10 to 0.11 (broker + client). Thanks for your feedback.

On Thu, Nov 30, 2017 at 10:03 AM, Tom van den Berge <tom.vandenbe...@gmail.com> wrote:

> The consumers are using default settings, which means that
> enable.auto.commit=true and auto.commit.interval.ms=5000. I'm not
> committing manually; just consuming messages.
>
> On Thu, Nov 30, 2017 at 1:09 AM, Frank Lyaruu <flya...@gmail.com> wrote:
>
>> Do you commit the received messages? Either by doing it manually or
>> setting enable.auto.commit and auto.commit.interval.ms?
>>
>> On Wed, Nov 29, 2017 at 11:15 PM, Tom van den Berge
>> <tom.vandenbe...@gmail.com> wrote:
>>
>> > I'm using Kafka 0.10.0.
>> >
>> > I'm reading messages from a single topic (20 partitions), using 4
>> > consumers (one group), using a standard java consumer with default
>> > configuration, except for the key and value deserializer, and a
>> > group id; no other settings.
>> >
>> > We've been experiencing a serious problem a few times now, after a
>> > large burst of messages (75000) have been posted to the topic. The
>> > consumer lag (as reported by Kafka's kafka-consumer-groups.sh)
>> > immediately shows a huge lag, which is expected. The consumers
>> > start processing the messages, which is expected to take them at
>> > least 30 minutes. In the mean time, more messages are posted to the
>> > topic, but at a "normal" rate, which the consumers normally handle
>> > easily. The problem is that the reported consumer lag is not
>> > decreasing at all. After some 30 minutes, it has even increased
>> > slightly. This would mean that the consumers are not able to
>> > process the backlog at all, which is extremely unlikely.
>> >
>> > After a restart of all consumer applications, something really
>> > surprising happens: the lag immediately drops to nearly 0! It is
>> > technically impossible that the consumers really processed all
>> > messages in a matter of seconds. Manual verification showed that
>> > many messages were not processed at all; they seem to have
>> > disappeared somehow. So it seems that restarting the consumers
>> > somehow messed up the offset (I think).
>> >
>> > On top of that, I noticed that the reported lag shows seemingly
>> > impossible figures. During the time that the lag was not
>> > decreasing, before the restart of the consumers, the "current
>> > offset" that was reported for some partitions decreased. To my
>> > knowledge, that is impossible.
>> >
>> > Does anyone have an idea on how this could have happened?
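
For anyone reading the lag figures discussed in this thread: kafka-consumer-groups.sh reports lag per partition as LOG-END-OFFSET minus CURRENT-OFFSET (the group's committed offset). Below is a minimal Java sketch of that arithmetic, using only the standard library. The class and method names (LagSketch, lag, totalLag, movedBackwards) are hypothetical and the numbers in main are purely illustrative; none of them come from the thread or from the Kafka API.

```java
// Hypothetical helper illustrating how consumer lag is derived from offsets.
// It does not talk to Kafka; the offsets are passed in as plain numbers.
final class LagSketch {

    // Lag for one partition: messages appended to the log that are not yet
    // covered by the group's committed ("current") offset.
    static long lag(long logEndOffset, long committedOffset) {
        return logEndOffset - committedOffset;
    }

    // Total lag over all partitions; index i holds the offsets of partition i.
    static long totalLag(long[] logEndOffsets, long[] committedOffsets) {
        if (logEndOffsets.length != committedOffsets.length) {
            throw new IllegalArgumentException("offset arrays must have the same length");
        }
        long total = 0;
        for (int i = 0; i < logEndOffsets.length; i++) {
            total += lag(logEndOffsets[i], committedOffsets[i]);
        }
        return total;
    }

    // True if a later snapshot of a partition's committed offset is lower
    // than an earlier one, i.e. the "current offset" went backwards.
    static boolean movedBackwards(long earlierCommitted, long laterCommitted) {
        return laterCommitted < earlierCommitted;
    }

    public static void main(String[] args) {
        // Illustrative values for two partitions (not taken from the thread).
        long[] logEnd = {1000L, 2000L};
        long[] committed = {400L, 1500L};
        System.out.println("total lag: " + totalLag(logEnd, committed));
        System.out.println("regressed: " + movedBackwards(500L, 450L));
    }
}
```

Since lag depends only on these two offsets, it shrinks whenever the committed offset jumps forward, whether or not the messages in between were actually processed, and it grows if the committed offset moves backwards. Comparing successive snapshots with a check like movedBackwards is one way to spot the decreasing "current offset" described above.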