Hi Chen,

The lock exception can happen occasionally during rebalances. Since it
is happening on a different thread, it's likely unrelated to your
problem.

Normally, a poll timeout expires because a record takes too long to
process. This can happen, for example, because the processing logic is
slow or because it makes blocking calls into other systems.
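
For context, the two knobs named in the WARN message are plain consumer
configs that can be forwarded through the Streams properties. A minimal
sketch, with placeholder app id, broker address and values rather than
recommendations:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public final class PollConfigSketch {
    static Properties streamsProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");          // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");  // placeholder
        // more headroom between poll() calls on the main consumer
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG),
                600_000);
        // fewer records per poll(), so each loop iteration finishes sooner
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_RECORDS_CONFIG),
                100);
        return props;
    }
}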

It could be worth checking the thread metrics to see the latency of the
individual phases (process-latency-max, commit-latency-max, etc.) and
whether there are any outliers:

https://kafka.apache.org/documentation/#kafka_streams_thread_monitoring
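
If pulling them programmatically is easier than wiring up JMX, here is a
rough sketch (assuming streams is your already-running KafkaStreams
instance; the "stream-thread-metrics" group and "-latency-max" suffix
follow the thread-level metric names in the docs linked above):

import java.util.Map;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.streams.KafkaStreams;

public final class ThreadLatencyDump {
    // Print every per-thread *-latency-max metric of a running instance.
    public static void print(KafkaStreams streams) {
        Map<MetricName, ? extends Metric> metrics = streams.metrics();
        for (Map.Entry<MetricName, ? extends Metric> e : metrics.entrySet()) {
            MetricName n = e.getKey();
            if ("stream-thread-metrics".equals(n.group())
                    && n.name().endsWith("-latency-max")) {
                System.out.printf("%s %s = %s%n",
                        n.tags(), n.name(), e.getValue().metricValue());
            }
        }
    }
}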

Hope that helps,
Lucas

On Tue, Nov 18, 2025 at 9:52 AM 李晨 <[email protected]> wrote:
>
> Hi team,
>
>
> We have a Kafka Streams application running with Kafka clients 3.8.1. We ran
> into a strange issue and have no clue about the root cause at this moment,
> please help.
>
>
> We noticed the issue because the lag of one partition was increasing; we then
> checked the stream state and found one node stuck in the REBALANCING state.
>
>
> Then we checked the logs. Only 2 log lines were found:
>
>
> 2025-11-12T01:38:52.315+0800|WARN|kafka-coordinator-heartbeat-thread|Stream-xxxx|o.a.k.c.c.i.ConsumerCoordinator.handlerPollTimeoutExpiry[AbstractCoordinator.java:1147]|[Consumer
>  clientId=Stream-xxxx-StreamThread-11-consumer, groupId=Stream-xxxx] consumer 
> poll timeout has expired. This means the time between subsequent calls to 
> poll() was longer than the configured max.poll.interval.ms or by reducing the 
> maximum size of batches returned in poll() with max.poll.records.
>
>
> 2025-11-12T01:39:01.382+0800|ERROR|Stream-xxxx-StreamThread-2|o.a.k.s.p.internals.StreamTask.closeStateManager[StateManagerUtil.java:149]|stream-thread
>  [Stream-xxxx-StreamThread-2] task [1_11] Failed to acquire lock while 
> closing the state store for Active task 1_11
>
>
>
> I'm not sure if the above error logs are related to the issue, but
> (1) the log time is almost the same as the time when we saw the partition
> lag start increasing
> (2) the lagging partition is 11, the same partition as task 1_11 in the log
>
>
> I have also tried digging through existing JIRA issues to see if this is a
> known issue; it looks a lot like
> (1) KAFKA-16025: but this one should already be fixed in 3.8.1?
> (2) KAFKA-18355: but that bug says the new thread keeps throwing the lock
> exception, while I only have one error log line related to the lock.
>
>
> It seems like the client hit an issue and tried to change its state from
> active to rebalancing, but failed before reaching the part where it requests
> to leave the consumer group. As a result, no rebalance happened, and no
> consumer is actually processing the partition data...
>
>
> The issue has already happened on 3 different setups, but unfortunately all
> of them are production environments, so there is not much debug information
> I can get for now :(
>
>
> Looking forward to your reply.
> Thanks
> - Chen
>
>
>
