Can you also check if you have partition leaders flapping or changing rapidly?
Also, look at the following settings on your client configs:

max.partition.fetch.bytes
fetch.max.bytes
receive.buffer.bytes

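In case a concrete starting point helps, here's a minimal sketch of raising 
those settings on a Java consumer config. The values are illustrative only, 
not recommendations - size them to your own traffic:

```java
import java.util.Properties;

public class FetchConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Consumer defaults for reference: max.partition.fetch.bytes=1048576 (1 MB),
        // fetch.max.bytes=52428800 (50 MB, exists from 0.10.1 on),
        // receive.buffer.bytes=65536 (64 KB).
        // Illustrative values below - tune to your largest producer batch.
        props.put("max.partition.fetch.bytes", "10485760"); // 10 MB per partition
        props.put("fetch.max.bytes", "52428800");           // 50 MB per fetch response
        props.put("receive.buffer.bytes", "1048576");       // 1 MB TCP receive buffer
        System.out.println(props.getProperty("max.partition.fetch.bytes"));
    }
}
```

These would be passed to the KafkaConsumer constructor along with your 
deserializers and group id.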
We had a similar situation in our environment when the brokers were flooded 
with data.
The symptom was apparent huge spikes in offset ids - much more than the data 
we were sending.
We traced that to the brokers not being able to keep up with the incoming 
producer + consumer + replication traffic because of NIC bandwidth limits.
(It's a bit of a lengthy story as to why the offset ids appeared high/spiky 
because of the flapping.)

And then the consumers ran into trouble: the producer was using a very large 
buffer and batch size, so the data was arriving in large batches.
The consumer, however, was not configured to accept batches that large, so it 
would give errors and could not get past a certain offset.
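The sizing mismatch can be made concrete. On 0.10.0, a batch bigger than 
max.partition.fetch.bytes cannot be fetched at all (the consumer throws 
RecordTooLargeException and stays stuck at that offset; KIP-74 relaxed this 
in 0.10.1). A rough sketch of the check, with hypothetical numbers:

```java
public class BatchSizeCheck {
    public static void main(String[] args) {
        // Hypothetical values for illustration only.
        int producerBatchSize = 4 * 1024 * 1024;  // producer batch.size: 4 MB
        int maxPartitionFetchBytes = 1024 * 1024; // consumer default: 1 MB
        // If the producer's batch exceeds what the consumer will fetch per
        // partition, the consumer cannot advance past that batch's offset.
        if (producerBatchSize > maxPartitionFetchBytes) {
            System.out.println("consumer stuck: batch of " + producerBatchSize
                + " bytes > max.partition.fetch.bytes of " + maxPartitionFetchBytes);
        }
    }
}
```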


On 11/30/17, 3:03 AM, "Tom van den Berge" <tom.vandenbe...@gmail.com> wrote:

    The consumers are using default settings, which means that
    enable.auto.commit=true and auto.commit.interval.ms=5000. I'm not
    committing manually; just consuming messages.
    
    On Thu, Nov 30, 2017 at 1:09 AM, Frank Lyaruu <flya...@gmail.com> wrote:
    
    > Do you commit the received messages? Either by doing it manually or 
setting
    > enable.auto.commit and auto.commit.interval.ms?
    >
    > On Wed, Nov 29, 2017 at 11:15 PM, Tom van den Berge <
    > tom.vandenbe...@gmail.com> wrote:
    >
    > > I'm using Kafka 0.10.0.
    > >
    > > I'm reading messages from a single topic (20 partitions), using 4
    > consumers
    > > (one group), using a standard java consumer with default configuration,
    > > except for the key and value deserializer, and a group id; no other
    > > settings.
    > >
    > > We've been experiencing a serious problem a few times now, after a large
    > > burst of messages (75000) have been posted to the topic. The consumer 
lag
    > > (as reported by Kafka's kafka-consumer-groups.sh) immediately shows a
    > huge
    > > lag, which is expected. The consumers start processing the messages,
    > which
    > > is expected to take them at least 30 minutes. In the mean time, more
    > > messages are posted to the topic, but at a "normal" rate, which the
    > > consumers normally handle easily. The problem is that the reported
    > consumer
    > > lag is not decreasing at all. After some 30 minutes, it has even
    > increased
    > > slightly. This would mean that the consumers are not able to process the
    > > backlog at all, which is extremely unlikely.
    > >
    > > After a restart of all consumer applications, something really 
surprising
    > > happens: the lag immediately drops to nearly 0! It is technically
    > > impossible that the consumers really processed all messages in a matter
    > of
    > > seconds. Manual verification showed that many messages were not 
processed
    > > at all; they seem to have disappeared somehow. So it seems that
    > restarting
    > > the consumers somehow messed up the offset (I think).
    > >
    > > On top of that, I noticed that the reported lag shows seemingly
    > impossible
    > > figures. During the time that the lag was not decreasing, before the
    > > restart of the consumers, the "current offset" that was reported for 
some
    > > partitions decreased. To my knowledge, that is impossible.
    > >
    > > Does anyone have an idea on how this could have happened?
    > >
    >
    
