kafka stream zombie state
Hi Guys, I'm having an issue with a Kafka Streams app: at some point I get a consumer LeaveGroup message. It's exactly the same issue described here: https://stackoverflow.com/questions/61245480/how-to-detect-a-kafka-streams-app-in-zombie-state

The problem is that the stream state keeps reporting RUNNING, but nothing is being consumed and the stream never rejoins the consumer group, so my application (which has only one replica) stops consuming. I have a Kubernetes health check that exposes the stream state so the pod can be restarted, but since the Kafka Streams state stays RUNNING after the consumer leaves the group, the app still looks healthy while in this zombie state, and I have to go restart the pod manually.

Is this a bug? Or is there a way to check the state of the stream's underlying consumer, so I can expose that as a health check for my application? The issue happens randomly, usually on Mondays. I'm using Kafka 2.8.1 and my app is written in Kotlin.

This is the message I get right before the zombie state; after it there are no exceptions, errors or anything else until I restart the pod manually:

Sending LeaveGroup request to coordinator b-3.c4.kafka.us-east-1.amazonaws.com:9098 (id: 2147483644 rack: null) due to consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.

Thanks for the help.
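One thing I'm considering as a supplementary health check (a rough sketch, assuming I keep an AdminClient around, and relying on the fact that the Streams application.id is also the consumer group id by default; the helper name is made up) is to ask the broker directly whether the group still has live members, instead of trusting KafkaStreams.state():

```kotlin
import org.apache.kafka.clients.admin.Admin
import org.apache.kafka.common.ConsumerGroupState

// Rough sketch: report unhealthy when the broker says the Streams consumer group
// has no live members, even though KafkaStreams.state() still reports RUNNING.
// The group id equals the Streams application.id by default.
fun isStillInGroup(admin: Admin, applicationId: String): Boolean {
    val group = admin.describeConsumerGroups(listOf(applicationId))
        .describedGroups()
        .getValue(applicationId)
        .get()
    return group.state() == ConsumerGroupState.STABLE && group.members().isNotEmpty()
}
```

The idea would be to combine this with the existing state check, so the pod is restarted when the instance claims RUNNING but the group is empty.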
RE: Re: kafka stream zombie state
Hi Sophie, thanks for your reply, I'm a big fan of your videos.

The log line comes from org.apache.kafka.clients.consumer.internals.AbstractCoordinator. After that message I don't see any other log or exception; the stream just stops consuming messages. As you mention, it does look like the heartbeat thread is stuck.

I already debugged my topology and was not able to reproduce the issue locally, but it happens frequently in production. Our application runs multiple streams in the same app with different StreamsConfig.APPLICATION_ID_CONFIG values, like:

myapp-stream1
myapp-stream2
myapp-stream3

The issue happens randomly: sometimes to one stream, sometimes to another. For example, when it happens to myapp-stream2 the other streams keep working fine; in some cases only one stream is affected, other times two or more.

Our topology contains multiple branches where, depending on some conditions, we delegate to different handlers. We do joins between the stream and a GlobalKTable that has a global store backed by RocksDB, as you mention. We don't have anything like an infinite loop or synchronized functions; basically our topologies consume an event, create a new message using that event as a source joined with a GlobalKTable, and send the new message out to another topic (I've put a rough sketch of that shape below, after the quoted message).

What I also see frequently are core dumps from RocksDB:

# JRE version: OpenJDK Runtime Environment Zulu11.58+15-CA (11.0.16+8) (build 11.0.16+8-LTS)
# Java VM: OpenJDK 64-Bit Server VM Zulu11.58+15-CA (11.0.16+8-LTS, mixed mode, tiered, compressed oops, serial gc, linux-amd64)
# Problematic frame:
# C  [librocksdbjni13283553706008007881.so+0x44f8b1]  rocksdb::ThreadPoolImpl::Impl::Submit(std::function&&, std::function&&, void*)+0x1d1

I was able to reproduce the "Sending LeaveGroup request..." message locally by setting MAX_POLL_INTERVAL_MS_CONFIG to 1 ms, but in that case I see the thread rebalance and recover fine, so I still haven't been able to reproduce the actual issue locally. We already tried increasing MAX_POLL_INTERVAL_MS_CONFIG to 10 minutes in production, but the issue still happens.

We are using Kafka 2.8.1 and I see there is a bug report https://issues.apache.org/jira/browse/KAFKA-13310 affecting 2.8.1 and fixed in 3.2.0, but I'm not sure whether my problem could be related.

We were able to put together a hack to make our application unhealthy when this happens, so Kubernetes restarts the pod: we capture the logger from org.apache.kafka.clients.consumer.internals.AbstractCoordinator and, when a message contains "Sending LeaveGroup request to coordinator", we mark the app unhealthy. Otherwise the app stays in zombie mode without consuming and we don't notice (a sketch of that appender is also below, after the quoted message).

We are going to try enabling Kafka debug/trace logging in production to see if we can get better logs.

Thanks a lot Sophie.

On 2022/08/19 09:18:21 Sophie Blee-Goldman wrote:
> Well it sounds like your app is getting stuck somewhere in the poll loop so
> it's unable to call poll again within the session timeout, as the error message
> indicates -- it's a bit misleading as it says "Sending LeaveGroup request to
> coordinator" which implies it's *currently* sending the LeaveGroup, but IIRC
> this error actually comes from the heartbeat thread -- just a long way of
> clarifying that the reason you don't see the state go into REBALANCING is
> because the StreamThread is stuck and can't rejoin the group by calling #poll
>
> So...what now?
> I know your question was how to detect this, but I would recommend first trying
> to take a look into your application topology to figure out where, and *why*,
> it's getting stuck (sorry for the "q: how do I do X? answ. don't do X, do Y"
> StackOverflow-type response -- happy to help with that if we really can't
> resolve the underlying issue, I'll give it some thought since I can't think of
> any easy way to detect this off the top of my head)
>
> What does your topology look like? Can you figure out what point it's hanging?
> You may need to turn on DEBUG logging, or even TRACE, but given how
> infrequent/random this is I'm guessing that's off the table -- still, DEBUG
> logging at least would help.
>
> Do you have any custom processors? Or anything in your topology that could
> possibly fall into an infinite loop? If not, I would suspect it's related to
> rocksdb -- but let's start with the other stuff before we go digging into that
>
> Hope this helps,
> Sophie
>
> On Tue, Aug 16, 2022 at 1:06 PM Samuel Azcona wrote:
> > Hi Guys, I'm having an issue with a kafka stream app, at some point I get a
> > consumer leave group message. Exac
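For reference, the shape of the topology I described above is roughly this (a simplified sketch with made-up topic names and join logic, not our real code):

```kotlin
import org.apache.kafka.streams.StreamsBuilder
import org.apache.kafka.streams.kstream.GlobalKTable

// Simplified sketch of the topology shape described above: consume an event, join it
// against a GlobalKTable (RocksDB-backed global store), and produce a new message.
// Topic names and the join/mapping logic are illustrative only.
fun buildTopology(builder: StreamsBuilder): StreamsBuilder {
    val reference: GlobalKTable<String, String> = builder.globalTable("reference-data")

    builder.stream<String, String>("input-events")
        .join(
            reference,
            { eventKey, _ -> eventKey },      // key used to look up the global table
            { event, ref -> "$event|$ref" }   // build the outgoing message from both sides
        )
        .to("output-events")

    return builder
}
```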
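And this is roughly the health-check hack I mentioned, assuming Logback as the logging backend (the flag name and wiring are illustrative; our liveness endpoint just reads the flag):

```kotlin
import ch.qos.logback.classic.LoggerContext
import ch.qos.logback.classic.spi.ILoggingEvent
import ch.qos.logback.core.AppenderBase
import org.slf4j.LoggerFactory
import java.util.concurrent.atomic.AtomicBoolean

// Flag read by the Kubernetes liveness endpoint (illustrative; wire it to your own check).
val leftConsumerGroup = AtomicBoolean(false)

// Appender attached to the coordinator logger; flips the flag when the LeaveGroup line appears.
class LeaveGroupDetector : AppenderBase<ILoggingEvent>() {
    override fun append(event: ILoggingEvent) {
        if (event.formattedMessage.contains("Sending LeaveGroup request to coordinator")) {
            leftConsumerGroup.set(true)
        }
    }
}

fun registerLeaveGroupDetector() {
    val loggerContext = LoggerFactory.getILoggerFactory() as LoggerContext
    val detector = LeaveGroupDetector()
    detector.context = loggerContext
    detector.start()
    loggerContext
        .getLogger("org.apache.kafka.clients.consumer.internals.AbstractCoordinator")
        .addAppender(detector)
}
```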
Kafka stream moving from running to shutdown without any visible exception or error
Hi guys, I have a Kotlin app that runs 5 streams with different application.id values, like:

my-app-stream1
my-app-stream2

Some days ago I started experiencing a very weird problem: all my streams are in RUNNING state and then, within 3-5 minutes, they transition to shutdown. I enabled the debug/trace logs, but I can't see any exception or error. I have setUncaughtExceptionHandler set with a log, but it's never executed, so there are no exceptions inside the streams.

I already tried disabling all my streams and keeping just one trivial one with a foreach that only logs, and the same thing happens. I also reverted my code to a previous state and the same thing happens. I'm not able to reproduce this in my local environment; it happens in production.

All the streams just change state from RUNNING to PENDING_SHUTDOWN. I only have some small logs like:

State transition from RUNNING to PENDING_SHUTDOWN
State transition from PENDING_SHUTDOWN to NOT_RUNNING
Waiting for the I/O thread to exit. Hard shutdown in 3153600 ms.
...
Flushing all global globalStores registered in the state manager
...
Global thread has died. The streams application or client will now close to ERROR
Kafka consumer has been closed
Exiting AdminClientRunnable

Any ideas? Thanks guys!!
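For context, this is roughly how the exception handler is registered, plus a state listener we could add to detect the ERROR/NOT_RUNNING transition from our own code (a simplified sketch; the function name and logger are illustrative, and the real app does this for each of the 5 KafkaStreams instances):

```kotlin
import org.apache.kafka.streams.KafkaStreams
import org.apache.kafka.streams.errors.StreamsUncaughtExceptionHandler
import org.apache.kafka.streams.errors.StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse
import org.slf4j.LoggerFactory

private val log = LoggerFactory.getLogger("streams-lifecycle")

// Illustrative wiring: log any uncaught exception before the client shuts down, and
// observe state transitions from our own code instead of relying only on internal logs.
fun instrument(streams: KafkaStreams) {
    // In our case this handler never fires, so no exception seems to reach the stream threads.
    streams.setUncaughtExceptionHandler(StreamsUncaughtExceptionHandler { exception ->
        log.error("Uncaught exception in a stream thread", exception)
        StreamThreadExceptionResponse.SHUTDOWN_CLIENT
    })
    // Lets the app notice ERROR / PENDING_SHUTDOWN and, for example, flip a health flag.
    streams.setStateListener { newState, oldState ->
        log.warn("State transition from {} to {}", oldState, newState)
    }
}
```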