We have been chasing a strange behavior for a while now and are asking for
your help because we have run out of ideas. Maybe someone has a few good
ideas or pointers?
Components we use:
* A Kafka cluster, version 2.2.1 (iirc), in a 3-node setup
* Topics with 6 partitions each
* Java, with org.springframework.kafka:spring-kafka 2.6.3 (also tested
with the latest, 2.7.7 - that did not help)
*The problem:*
In our use case we have a few topics that do not have a continuous
consumer. Instead, we fire one up periodically / on demand (only one at
a time, and it consumes all partitions). We use MANUAL AckMode
because sometimes we just want to count the messages and not really
process them.
The consumer is created programmatically from Java code with this
snippet:
    @Autowired
    KafkaListenerContainerFactory kafkaListenerContainerFactory;

    MessageListenerContainer listenerContainer =
            kafkaListenerContainerFactory.createContainer(topicName);
    // our class implements AcknowledgingMessageListener<Object, String>
    listenerContainer.setupMessageListener(this);
    listenerContainer.getContainerProperties().setGroupId("MyService");
    listenerContainer.getContainerProperties().setAckMode(ContainerProperties.AckMode.MANUAL);
    listenerContainer.start();
    // ... we wait until there are no more inbound messages (for X seconds, timeout fashion)
    listenerContainer.stop();
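The "wait until no more inbound messages" step can be sketched roughly as
below. This is only a minimal illustration, not our production code: the
class IdleTracker and its method names are made up for this post, and the
clock is injected purely so the sketch is easy to test.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical helper: remembers when the last record arrived and reports idleness. */
final class IdleTracker {
    private final LongSupplier clockMillis;      // injectable clock (assumption, eases testing)
    private final long idleTimeoutMillis;        // the "X seconds" from the description
    private final AtomicLong lastActivityMillis;
    private final AtomicLong received = new AtomicLong();

    IdleTracker(long idleTimeoutMillis, LongSupplier clockMillis) {
        this.idleTimeoutMillis = idleTimeoutMillis;
        this.clockMillis = clockMillis;
        // Start the idle window at construction so an empty topic also times out.
        this.lastActivityMillis = new AtomicLong(clockMillis.getAsLong());
    }

    /** Intended to be called once per record from the listener callback. */
    void recordReceived() {
        received.incrementAndGet();
        lastActivityMillis.set(clockMillis.getAsLong());
    }

    /** True once no record has arrived for at least the configured timeout. */
    boolean isIdle() {
        return clockMillis.getAsLong() - lastActivityMillis.get() >= idleTimeoutMillis;
    }

    /** Number of records seen so far (useful when only counting messages). */
    long receivedCount() {
        return received.get();
    }
}
```

The idea is that onMessage(...) calls recordReceived() for every record,
and the thread that called listenerContainer.start() polls isIdle() (for
example once a second) before calling listenerContainer.stop().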
The strange thing is that although everything usually works as expected,
*sometimes we somehow get into a "state" in which we do not receive any
messages anymore when we start the consumer* - even though we know for
sure that there are messages, because our "MyService" consumer group has
an offset lag (we have Prometheus+Grafana metrics from Kafka for that, so
we can see it).
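By "offset lag" we mean the usual per-partition difference between the log
end offset and the offset committed by the group. A minimal sketch of that
arithmetic follows, assuming both offset maps (partition number to offset)
have already been fetched somehow. The class GroupLag and its method names
are hypothetical, and the code is not taken from our metrics pipeline.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: computes consumer-group lag from two offset maps. */
final class GroupLag {
    private GroupLag() {}

    /**
     * Lag per partition = log end offset - committed offset.
     * Partitions without a committed offset are left out instead of guessed,
     * because where such a consumer starts depends on its reset policy.
     */
    static Map<Integer, Long> perPartition(Map<Integer, Long> endOffsets,
                                           Map<Integer, Long> committedOffsets) {
        Map<Integer, Long> lag = new HashMap<>();
        for (Map.Entry<Integer, Long> e : endOffsets.entrySet()) {
            Long committed = committedOffsets.get(e.getKey());
            if (committed != null) {
                // Clamp at zero in case the two maps were read at slightly different times.
                lag.put(e.getKey(), Math.max(0L, e.getValue() - committed));
            }
        }
        return lag;
    }

    /** Sum of the per-partition lags. */
    static long total(Map<Integer, Long> perPartitionLag) {
        long sum = 0L;
        for (long v : perPartitionLag.values()) {
            sum += v;
        }
        return sum;
    }
}
```

A total greater than zero while the running consumer receives nothing is
exactly the situation described above.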
What makes the above even weirder:
* This "state" can self-heal after a while without us doing anything,
and everything returns to normal. However, how long the state lasts
varies a lot: sometimes just 1-2 hours, other times days
(and sometimes weeks).
* We have DEV, STAGE and PROD environments; PROD is more powerful
on the Kafka side. We experience the behavior above in DEV
and STAGE but not in PROD (there it happened just once in 6 months,
for 2 hours, then it was gone).
What we can also add to the above:
* The retention period does not play a role here. This state can happen
even when the messages on the topics are just a few hours old.
* When this weird state happens, restarting the JVM (i.e. the Java app)
does not help.
* We can break out of this state with a small trick: while the
consumer is active, we send a message to the topic. Kafka then
somehow recovers from this state and suddenly starts to deliver
ALL the messages that were not yet processed by the "MyService"
consumer group.
Any idea is appreciated!
thanks
--
Attila Wind
http://www.linkedin.com/in/attilaw
Mobile: +49 176 43556932