kafka stream zombie state
Hi Guys, I'm having an issue with a Kafka Streams app: at some point I get a consumer LeaveGroup message. It's exactly the same issue described here: https://stackoverflow.com/questions/61245480/how-to-detect-a-kafka-streams-app-in-zombie-state

The problem is that the stream state keeps reporting RUNNING, but nothing is being consumed and the stream never rejoins the consumer group, so my application (which has only one replica) stops consuming. I have a Kubernetes health check that exposes the stream state so the pod can be restarted, but since the Kafka Streams state stays RUNNING after the consumer leaves the group, the app still looks healthy while in this zombie state, and I have to go restart the pod manually.

Is this a bug? Or is there a way to check the state of the stream's underlying consumer, so I can expose that as a health check for my application? The issue happens randomly, usually on Mondays. I'm using Kafka 2.8.1 and my app is written in Kotlin.

This is the message I get right before the zombie state; after it there are no exceptions, errors or anything else until I restart the pod manually:

Sending LeaveGroup request to coordinator b-3.c4.kafka.us-east-1.amazonaws.com:9098 (id: 2147483644 rack: null) due to consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.

Thanks for the help.
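One thing I'm considering as a supplementary health check (a rough sketch, assuming I keep an AdminClient around, and relying on the fact that the Streams application.id is also the consumer group id by default; the helper name is made up) is to ask the broker directly whether the group still has live members, instead of trusting KafkaStreams.state():

```kotlin
import org.apache.kafka.clients.admin.Admin
import org.apache.kafka.common.ConsumerGroupState

// Rough sketch: report unhealthy when the broker says the Streams consumer group
// has no live members, even though KafkaStreams.state() still reports RUNNING.
// The group id equals the Streams application.id by default.
fun isStillInGroup(admin: Admin, applicationId: String): Boolean {
    val group = admin.describeConsumerGroups(listOf(applicationId))
        .describedGroups()
        .getValue(applicationId)
        .get()
    return group.state() == ConsumerGroupState.STABLE && group.members().isNotEmpty()
}
```

The idea would be to combine this with the existing state check, so the pod is restarted when the instance claims RUNNING but the group is empty.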
RE: Re: kafka stream zombie state
Hi Sophie, thanks for your reply, I'm a big fan of your videos.

The log line comes from org.apache.kafka.clients.consumer.internals.AbstractCoordinator. After that message I don't see any other log or exception; the stream just stops consuming messages. As you mention, it does look like the heartbeat thread is stuck.

I already debugged my topology and was not able to reproduce the issue locally, but it happens frequently in production. Our application runs multiple streams in the same app with different StreamsConfig.APPLICATION_ID_CONFIG values, like:

myapp-stream1
myapp-stream2
myapp-stream3

The issue happens randomly: sometimes to one stream, sometimes to another. For example, when it happens to myapp-stream2 the other streams keep working fine; in some cases only one stream is affected, other times two or more.

Our topology contains multiple branches where, depending on some conditions, we delegate to different handlers. We do joins between the stream and a GlobalKTable that has a global store backed by RocksDB, as you mention. We don't have anything like an infinite loop or synchronized functions; basically our topologies consume an event, create a new message using that event as a source joined with a GlobalKTable, and send the new message out to another topic (I've put a rough sketch of that shape below, after the quoted message).

What I also see frequently are core dumps from RocksDB:

# JRE version: OpenJDK Runtime Environment Zulu11.58+15-CA (11.0.16+8) (build 11.0.16+8-LTS)
# Java VM: OpenJDK 64-Bit Server VM Zulu11.58+15-CA (11.0.16+8-LTS, mixed mode, tiered, compressed oops, serial gc, linux-amd64)
# Problematic frame:
# C  [librocksdbjni13283553706008007881.so+0x44f8b1]  rocksdb::ThreadPoolImpl::Impl::Submit(std::function&&, std::function&&, void*)+0x1d1

I was able to reproduce the "Sending LeaveGroup request..." message locally by setting MAX_POLL_INTERVAL_MS_CONFIG to 1 ms, but in that case I see the thread rebalance and recover fine, so I still haven't been able to reproduce the actual issue locally. We already tried increasing MAX_POLL_INTERVAL_MS_CONFIG to 10 minutes in production, but the issue still happens.

We are using Kafka 2.8.1 and I see there is a bug report https://issues.apache.org/jira/browse/KAFKA-13310 affecting 2.8.1 and fixed in 3.2.0, but I'm not sure whether my problem could be related.

We were able to put together a hack to make our application unhealthy when this happens, so Kubernetes restarts the pod: we capture the logger from org.apache.kafka.clients.consumer.internals.AbstractCoordinator and, when a message contains "Sending LeaveGroup request to coordinator", we mark the app unhealthy. Otherwise the app stays in zombie mode without consuming and we don't notice (a sketch of that appender is also below, after the quoted message).

We are going to try enabling Kafka debug/trace logging in production to see if we can get better logs.

Thanks a lot Sophie.

On 2022/08/19 09:18:21 Sophie Blee-Goldman wrote:
> Well it sounds like your app is getting stuck somewhere in the poll loop so
> it's unable to call poll again within the session timeout, as the error message
> indicates -- it's a bit misleading as it says "Sending LeaveGroup request to
> coordinator" which implies it's *currently* sending the LeaveGroup, but IIRC
> this error actually comes from the heartbeat thread -- just a long way of
> clarifying that the reason you don't see the state go into REBALANCING is
> because the StreamThread is stuck and can't rejoin the group by calling #poll
>
> So...what now?
> I know your question was how to detect this, but I would recommend first trying
> to take a look into your application topology to figure out where, and *why*,
> it's getting stuck (sorry for the "q: how do I do X? answ. don't do X, do Y"
> StackOverflow-type response -- happy to help with that if we really can't
> resolve the underlying issue, I'll give it some thought since I can't think of
> any easy way to detect this off the top of my head)
>
> What does your topology look like? Can you figure out what point it's hanging?
> You may need to turn on DEBUG logging, or even TRACE, but given how
> infrequent/random this is I'm guessing that's off the table -- still, DEBUG
> logging at least would help.
>
> Do you have any custom processors? Or anything in your topology that could
> possibly fall into an infinite loop? If not, I would suspect it's related to
> rocksdb -- but let's start with the other stuff before we go digging into that
>
> Hope this helps,
> Sophie
>
> On Tue, Aug 16, 2022 at 1:06 PM Samuel Azcona wrote:
> > Hi Guys, I'm having an issue with a kafka stream app, at some point I get a
> > consumer leave group message. Exac
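For reference, the shape of the topology I described above is roughly this (a simplified sketch with made-up topic names and join logic, not our real code):

```kotlin
import org.apache.kafka.streams.StreamsBuilder
import org.apache.kafka.streams.kstream.GlobalKTable

// Simplified sketch of the topology shape described above: consume an event, join it
// against a GlobalKTable (RocksDB-backed global store), and produce a new message.
// Topic names and the join/mapping logic are illustrative only.
fun buildTopology(builder: StreamsBuilder): StreamsBuilder {
    val reference: GlobalKTable<String, String> = builder.globalTable("reference-data")

    builder.stream<String, String>("input-events")
        .join(
            reference,
            { eventKey, _ -> eventKey },      // key used to look up the global table
            { event, ref -> "$event|$ref" }   // build the outgoing message from both sides
        )
        .to("output-events")

    return builder
}
```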
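And this is roughly the health-check hack I mentioned, assuming Logback as the logging backend (the flag name and wiring are illustrative; our liveness endpoint just reads the flag):

```kotlin
import ch.qos.logback.classic.LoggerContext
import ch.qos.logback.classic.spi.ILoggingEvent
import ch.qos.logback.core.AppenderBase
import org.slf4j.LoggerFactory
import java.util.concurrent.atomic.AtomicBoolean

// Flag read by the Kubernetes liveness endpoint (illustrative; wire it to your own check).
val leftConsumerGroup = AtomicBoolean(false)

// Appender attached to the coordinator logger; flips the flag when the LeaveGroup line appears.
class LeaveGroupDetector : AppenderBase<ILoggingEvent>() {
    override fun append(event: ILoggingEvent) {
        if (event.formattedMessage.contains("Sending LeaveGroup request to coordinator")) {
            leftConsumerGroup.set(true)
        }
    }
}

fun registerLeaveGroupDetector() {
    val loggerContext = LoggerFactory.getILoggerFactory() as LoggerContext
    val detector = LeaveGroupDetector()
    detector.context = loggerContext
    detector.start()
    loggerContext
        .getLogger("org.apache.kafka.clients.consumer.internals.AbstractCoordinator")
        .addAppender(detector)
}
```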
Kafka stream moving from running to shutdown without any visible exception or error
Hi guys, I have a Kotlin app that runs 5 streams with different application.id values, like:

my-app-stream1
my-app-stream2

Some days ago I started experiencing a very weird problem: all my streams are in RUNNING state and then, within 3-5 minutes, they transition to shutdown. I enabled the debug/trace logs, but I can't see any exception or error. I have setUncaughtExceptionHandler set with a log, but it's never executed, so there are no exceptions inside the streams.

I already tried disabling all my streams and keeping just one trivial one with a foreach that only logs, and the same thing happens. I also reverted my code to a previous state and the same thing happens. I'm not able to reproduce this in my local environment; it happens in production.

All the streams just change state from RUNNING to PENDING_SHUTDOWN. I only have some small logs like:

State transition from RUNNING to PENDING_SHUTDOWN
State transition from PENDING_SHUTDOWN to NOT_RUNNING
Waiting for the I/O thread to exit. Hard shutdown in 3153600 ms.
...
Flushing all global globalStores registered in the state manager
...
Global thread has died. The streams application or client will now close to ERROR
Kafka consumer has been closed
Exiting AdminClientRunnable

Any ideas? Thanks guys!!
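For context, this is roughly how the exception handler is registered, plus a state listener we could add to detect the ERROR/NOT_RUNNING transition from our own code (a simplified sketch; the function name and logger are illustrative, and the real app does this for each of the 5 KafkaStreams instances):

```kotlin
import org.apache.kafka.streams.KafkaStreams
import org.apache.kafka.streams.errors.StreamsUncaughtExceptionHandler
import org.apache.kafka.streams.errors.StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse
import org.slf4j.LoggerFactory

private val log = LoggerFactory.getLogger("streams-lifecycle")

// Illustrative wiring: log any uncaught exception before the client shuts down, and
// observe state transitions from our own code instead of relying only on internal logs.
fun instrument(streams: KafkaStreams) {
    // In our case this handler never fires, so no exception seems to reach the stream threads.
    streams.setUncaughtExceptionHandler(StreamsUncaughtExceptionHandler { exception ->
        log.error("Uncaught exception in a stream thread", exception)
        StreamThreadExceptionResponse.SHUTDOWN_CLIENT
    })
    // Lets the app notice ERROR / PENDING_SHUTDOWN and, for example, flip a health flag.
    streams.setStateListener { newState, oldState ->
        log.warn("State transition from {} to {}", oldState, newState)
    }
}
```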