[ https://issues.apache.org/jira/browse/KAFKA-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raman Gupta resolved KAFKA-10229.
---------------------------------
    Resolution: Invalid

Not an issue with Kafka -- the code run by the stream was blocked.

> Kafka stream dies for no apparent reason, no errors logged on client or server
> ------------------------------------------------------------------------------
>
>                 Key: KAFKA-10229
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10229
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 2.4.1
>            Reporter: Raman Gupta
>            Priority: Major
>
> My broker and clients are 2.4.1. I'm currently running a single broker. I
> have a Kafka stream with exactly once processing turned on. I also have an
> uncaught exception handler defined on the client. I have a stream which I
> noticed was lagging. Upon investigation, I see that the consumer group was
> empty.
> On restarting the consumers, the consumer group re-established itself, but
> after about 8 minutes, the group became empty again. There is nothing logged
> on the client side about any stream errors, despite the existence of an
> uncaught exception handler.
> In the broker logs, I see that about 8 minutes after the clients restart /
> the stream goes to RUNNING state:
> ```
> [2020-07-02 17:34:47,033] INFO [GroupCoordinator 0]: Member cis-d7fb64c95-kl9wl-1-630af77f-138e-49d1-b76a-6034801ee359 in group produs-cisFileIndexer-stream has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
> [2020-07-02 17:34:47,033] INFO [GroupCoordinator 0]: Preparing to rebalance group produs-cisFileIndexer-stream in state PreparingRebalance with old generation 228 (__consumer_offsets-3) (reason: removing member cis-d7fb64c95-kl9wl-1-630af77f-138e-49d1-b76a-6034801ee359 on heartbeat expiration) (kafka.coordinator.group.GroupCoordinator)
> ```
> so according to this the consumer heartbeat has expired.
> I don't know why this would be, logging shows that the stream was running
> and processing messages normally and then just stopped processing anything
> about 4 minutes before it dies, with no apparent errors or issues or
> anything logged via the uncaught exception handler.
> It doesn't appear to be related to any specific poison pill type messages:
> restarting the stream causes it to reprocess a bunch more messages from the
> backlog, and then die again approximately 8 minutes later. At the time of the
> last message consumed by the stream, there are no `INFO`-level or above logs
> either in the client or the broker, or any errors whatsoever. The stream
> consumption simply stops.
> There are two consumers -- even if I limit consumption to only a single
> consumer, the same thing happens.
> The runtime environment is Kubernetes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
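
The resolution says the code run by the stream was blocked. A blocked thread throws nothing, which would be consistent with the uncaught exception handler staying silent. The following is a minimal sketch of that effect using only the Java standard library. It is not Kafka code: the class `BlockedThreadDemo`, the thread name, and the latch standing in for a blocking call are all hypothetical. It starts a worker with an uncaught exception handler, lets it block, and then captures the worker's state and stack trace, which is roughly what a thread dump would show.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class BlockedThreadDemo {

    /** What we observed about the worker while it was blocked. */
    public static final class Observation {
        public final boolean handlerInvoked;
        public final Thread.State state;
        public final String stackTrace;

        Observation(boolean handlerInvoked, Thread.State state, String stackTrace) {
            this.handlerInvoked = handlerInvoked;
            this.state = state;
            this.stackTrace = stackTrace;
        }
    }

    public static Observation observeBlockedWorker() {
        AtomicBoolean handlerInvoked = new AtomicBoolean(false);
        // Assumption: this latch stands in for whatever blocking call the
        // processing code made; it is never counted down.
        CountDownLatch neverReleased = new CountDownLatch(1);

        Thread worker = new Thread(() -> {
            try {
                neverReleased.await(); // blocks; no exception is thrown
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "hypothetical-stream-thread");
        worker.setDaemon(true);
        worker.setUncaughtExceptionHandler((t, e) -> handlerInvoked.set(true));
        worker.start();

        try {
            // Poll (for at most 5 seconds) until the worker has parked on the latch.
            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
            while (worker.getState() != Thread.State.WAITING
                    && System.nanoTime() < deadline) {
                Thread.sleep(10);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while observing worker", e);
        }

        // Capture what a thread dump would report for this thread.
        Thread.State state = worker.getState();
        StringBuilder dump = new StringBuilder();
        for (StackTraceElement frame : worker.getStackTrace()) {
            dump.append("    at ").append(frame).append('\n');
        }
        boolean invoked = handlerInvoked.get();

        worker.interrupt(); // release the worker so the demo can finish
        return new Observation(invoked, state, dump.toString());
    }

    public static void main(String[] args) {
        Observation o = observeBlockedWorker();
        System.out.println("handler invoked: " + o.handlerInvoked);
        System.out.println("thread state:    " + o.state);
        System.out.print(o.stackTrace);
    }
}
```

For a live JVM, `jstack <pid>` gives the same kind of information for every thread, and that is likely a quicker way to spot a stream thread stuck inside user code than waiting for an error that never arrives.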