Hi Chen,

the lock exception can happen occasionally during rebalances. Since it is thrown on a different thread, it is likely unrelated to your problem.
Normally, a poll timeout expires because a record takes too long to process, for example because the processing logic itself is slow or because it makes blocking calls into external systems. It could be worth checking the thread metrics to see the latency of the individual phases (process-latency-max, commit-latency-max, etc.) and whether any thread is an outlier:

https://kafka.apache.org/documentation/#kafka_streams_thread_monitoring

I've also pasted two rough snippets below your quoted message: one that dumps those per-thread latencies, and one showing the two consumer settings the warning refers to.

Hope that helps,
Lucas

On Tue, Nov 18, 2025 at 9:52 AM 李晨 <[email protected]> wrote:
>
> Hi team,
>
> We have a Kafka Streams application running with Kafka clients 3.8.1. We hit
> a strange issue and have no clue about the root cause at the moment; please
> help.
>
> We noticed the issue because the lag on one partition kept increasing. When
> we checked the streams state, we found one node stuck in REBALANCING.
>
> We then checked the logs and found only two entries:
>
> 2025-11-12T01:38:52.315+0800|WARN|kafka-coordinator-heartbeat-thread|Stream-xxxx|o.a.k.c.c.i.ConsumerCoordinator.handlerPollTimeoutExpiry[AbstractCoordinator.java:1147]|[Consumer
> clientId=Stream-xxxx-StreamThread-11-consumer, groupId=Stream-xxxx] consumer
> poll timeout has expired. This means the time between subsequent calls to
> poll() was longer than the configured max.poll.interval.ms, which typically
> implies that the poll loop is spending too much time processing messages.
> You can address this either by increasing max.poll.interval.ms or by
> reducing the maximum size of batches returned in poll() with
> max.poll.records.
>
> 2025-11-12T01:39:01.382+0800|ERROR|Stream-xxxx-StreamThread-2|o.a.k.s.p.internals.StreamTask.closeStateManager[StateManagerUtil.java:149]|stream-thread
> [Stream-xxxx-StreamThread-2] task [1_11] Failed to acquire lock while
> closing the state store for Active task 1_11
>
> I'm not sure whether the logs above are related to the issue, but
> (1) the log timestamps roughly match the time when we saw the partition lag
> start increasing
> (2) the lagging partition is 11, the same as task 1_11 mentioned in the log
>
> I have also dug through existing JIRA issues to see whether this is a known
> issue. It looks a lot like
> (1) KAFKA-16025: but that one should already be fixed in 3.8.1?
> (2) KAFKA-18355: but that bug describes the new thread repeatedly throwing
> the lock exception, while I only have a single error log line related to
> the lock.
>
> It seems as if the client hit a problem and tried to change state from
> active to rebalancing, but failed before it reached the point of sending
> the leave-group request. As a result, no rebalance happened, and no
> consumer is actually processing the partition's data...
>
> The issue has already happened on three different setups, but unfortunately
> all of them are production environments, so there is not much debug
> information I can get for now :(
>
> Looking forward to your reply.
> Thanks
> - Chen
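P.S. Here is a rough, untested sketch of how you could dump the per-thread latency maxima from inside the application. It assumes you have a reference to your running KafkaStreams instance (called "streams" below):

    import java.util.Map;
    import org.apache.kafka.common.Metric;
    import org.apache.kafka.common.MetricName;
    import org.apache.kafka.streams.KafkaStreams;

    // Print the max process/commit/poll latency recorded by each stream thread.
    static void dumpThreadLatencies(KafkaStreams streams) {
        for (Map.Entry<MetricName, ? extends Metric> e : streams.metrics().entrySet()) {
            MetricName m = e.getKey();
            boolean latencyMax = m.name().equals("process-latency-max")
                    || m.name().equals("commit-latency-max")
                    || m.name().equals("poll-latency-max");
            if ("stream-thread-metrics".equals(m.group()) && latencyMax) {
                System.out.printf("%s  %s = %s%n",
                        m.tags().get("thread-id"), m.name(), e.getValue().metricValue());
            }
        }
    }

The same numbers are also exposed via JMX under the kafka.streams domain, so if you already have JMX wired up you don't need any extra code.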
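And in case you do need to give the threads more headroom, the two settings the warning mentions are plain consumer configs that you can pass through StreamsConfig. A minimal sketch; the values are placeholders, not recommendations:

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.streams.StreamsConfig;

    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "Stream-xxxx");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
    // Allow more time between poll() calls (the consumer default is 300000 ms) ...
    props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), 600_000);
    // ... and/or hand fewer records to the thread per poll().
    props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_RECORDS_CONFIG), 100);

That only buys time for a slow processor, of course; the latency metrics above should tell you where the time actually goes.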
