We have been chasing a strange behavior for a while now and are asking for your help because we have run out of ideas. Maybe someone has a few good ideas or pointers?

Components we use:

 *   We are using a Kafka cluster (2.2.1, iirc) with a 3-node setup
 *   Topics have 6 partitions each
 *   We are working in Java and using
   org.springframework.kafka:spring-kafka 2.6.3 (also tested with
   the latest, 2.7.7, which did not help)


*The problem:*
In our use case we have a few topics that do not have a continuous consumer. Instead, we fire one up periodically / on demand (only one at a time, consuming all partitions). We use MANUAL AckMode because sometimes we just want to count the messages and not really process them. The consumer is created programmatically from Java code by this snippet:

        @Autowired
        KafkaListenerContainerFactory kafkaListenerContainerFactory;

        MessageListenerContainer listenerContainer = kafkaListenerContainerFactory.createContainer(topicName);
        listenerContainer.setupMessageListener(this);   // our class implements AcknowledgingMessageListener<Object, String>
        listenerContainer.getContainerProperties().setGroupId("MyService");
        listenerContainer.getContainerProperties().setAckMode(ContainerProperties.AckMode.MANUAL);
        listenerContainer.start();

        ... we wait until no more inbound messages (for X seconds, timeout fashion)

        listenerContainer.stop();
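
A minimal sketch of how the waiting step between start() and stop() could be implemented with plain JDK classes. This is an illustration under assumptions only: the class IdleTracker and its methods recordMessage() and awaitIdle(...) are hypothetical names, not part of spring-kafka, and the real wait logic may well look different.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical helper (not a spring-kafka class): remembers when the last
 * record arrived and lets a caller block until the listener has been idle.
 */
final class IdleTracker {
    // Timestamp of the most recent record; starts "now" so the first wait
    // gives the consumer a full idle window to receive something.
    private final AtomicLong lastSeenNanos = new AtomicLong(System.nanoTime());
    private final AtomicLong received = new AtomicLong();

    /** To be called from onMessage(...) for every record received. */
    void recordMessage() {
        received.incrementAndGet();
        lastSeenNanos.set(System.nanoTime());
    }

    /**
     * Blocks until no record has been seen for idleTimeout (or the thread is
     * interrupted) and returns the number of records counted so far.
     */
    long awaitIdle(Duration idleTimeout) {
        long idleNanos = idleTimeout.toNanos();
        while (true) {
            long remaining = idleNanos - (System.nanoTime() - lastSeenNanos.get());
            if (remaining <= 0) {
                return received.get();
            }
            try {
                // Sleep for the rest of the idle window, at least 1 ms.
                Thread.sleep(Math.max(1L, remaining / 1_000_000L));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag
                return received.get();
            }
        }
    }
}
```

With such a helper, the listener would call recordMessage() inside onMessage(...), and the calling code would run awaitIdle(Duration.ofSeconds(X)) after listenerContainer.start() and before listenerContainer.stop().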

The strange thing is that although everything usually works as expected,
*sometimes we somehow get into a "state" where we no longer receive any messages when we start the consumer*. However, we know for sure that there are messages, because our "MyService" consumer group has an offset lag (we have Prometheus + Grafana metrics from Kafka, so we can see it).
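
For clarity, offset lag here usually means, per partition, the log end offset minus the offset committed by the consumer group. A minimal sketch of that arithmetic follows, assuming the two offset maps have already been fetched (for example via KafkaConsumer.endOffsets(...) and AdminClient.listConsumerGroupOffsets(...)). The class LagCalculator and its methods are hypothetical, and treating a partition without a committed offset as committed at 0 is a simplification.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: computes consumer-group lag from two offset maps. */
final class LagCalculator {

    /**
     * Lag per partition = log end offset - committed offset (never negative).
     * Assumption: a partition with no committed offset counts as committed at 0.
     */
    static Map<Integer, Long> lagPerPartition(Map<Integer, Long> endOffsets,
                                              Map<Integer, Long> committedOffsets) {
        Map<Integer, Long> lag = new TreeMap<>();
        for (Map.Entry<Integer, Long> entry : endOffsets.entrySet()) {
            long committed = committedOffsets.getOrDefault(entry.getKey(), 0L);
            lag.put(entry.getKey(), Math.max(0L, entry.getValue() - committed));
        }
        return lag;
    }

    /** Sum of the per-partition lags, i.e. the total backlog of the group. */
    static long totalLag(Map<Integer, Long> endOffsets,
                         Map<Integer, Long> committedOffsets) {
        long total = 0L;
        for (long partitionLag : lagPerPartition(endOffsets, committedOffsets).values()) {
            total += partitionLag;
        }
        return total;
    }
}
```

A total lag greater than zero while a started consumer receives nothing is exactly the "state" described above.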

What makes the above even weirder:

 *   This "state" can self-heal after a while without us doing anything,
   so everything returns to normal. However, how long this state lasts
   varies a lot! Sometimes just 1-2 hours, other times it might take
   days (and sometimes weeks)
 *   We have DEV, STAGE and PROD environments; PROD is more powerful
   on the Kafka side. We experience the above weird behavior in DEV
   and STAGE, but not in PROD (it happened just once in 6 months, for
   2 hours, then it was gone)


What we can also add to the above:

 *   The retention period does not play a role here. This state can
   happen even when the messages on the topics are just a few hours old.
 *   When this weird state happens, restarting the JVM (i.e. the Java
   app) does not help
 *   We can break out of this state with a small trick: while the
   consumer is active, we send a message to the topic. Then Kafka
   somehow recovers from this state and suddenly starts to send
   ALL the messages that were not processed by the "MyService"
   consumer group

 Any idea is appreciated!

thanks

--
Attila Wind

http://www.linkedin.com/in/attilaw
Mobile: +49 176 43556932

