[ 
https://issues.apache.org/jira/browse/KAFKA-13615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Costa updated KAFKA-13615:
------------------------------
    Description: 
We are running a Kafka Streams application with largely default settings. 
Occasionally one of the consumers in the group takes too long between polls, 
and Streams leaves the consumer group, but the state of the application remains 
`RUNNING`. We are using the default `max.poll.interval.ms` of 300000 (5 minutes).

The process stays alive and no exception bubbles up to our code, so when this 
occurs the app just sits there idle until we restart it manually.
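
For context, these are the hooks we would otherwise rely on to react to a dying 
stream thread. This is only a minimal sketch with placeholder topic names, 
application id, and broker address, not our real application, but it illustrates 
why nothing reaches our code: neither callback is invoked, since no exception is 
thrown and the state never leaves `RUNNING`.
{code:java}
import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.errors.StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse;

public class StreamsHooksSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "fanout-logs-sketch");   // placeholder id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9096");       // placeholder broker

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic");                       // placeholder topology

        KafkaStreams streams = new KafkaStreams(builder.build(), props);

        // Invoked only on an actual state transition; never fires in our case,
        // because the instance stays in RUNNING after the LeaveGroup.
        streams.setStateListener((newState, oldState) ->
                System.out.println("State changed " + oldState + " -> " + newState));

        // Invoked only when a StreamThread throws (2.8+ handler); also never fires,
        // because the poll-timeout LeaveGroup does not surface as an exception.
        streams.setUncaughtExceptionHandler(exception -> {
            exception.printStackTrace();
            return StreamThreadExceptionResponse.SHUTDOWN_APPLICATION;
        });

        streams.start();
    }
}
{code}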

Here are the logs from around the time of the problem:
{code:json}
{"timestamp":"2022-01-24 19:56:44.404","level":"INFO","thread":"kubepodname-StreamThread-1","logger":"org.apache.kafka.streams.processor.internals.StreamThread","message":"stream-thread [kubepodname-StreamThread-1] Processed 65296 total records, ran 0 punctuators, and committed 400 total tasks since the last update","context":"default"}

{"timestamp":"2022-01-24 19:58:44.478","level":"INFO","thread":"kubepodname-StreamThread-1","logger":"org.apache.kafka.streams.processor.internals.StreamThread","message":"stream-thread [kubepodname-StreamThread-1] Processed 65284 total records, ran 0 punctuators, and committed 400 total tasks since the last update","context":"default"}

{"timestamp":"2022-01-24 20:03:50.383","level":"INFO","thread":"kafka-coordinator-heartbeat-thread | stage-us-1-fanout-logs-2c99","logger":"org.apache.kafka.clients.consumer.internals.AbstractCoordinator","message":"[Consumer clientId=kubepodname-StreamThread-1-consumer, groupId=stage-us-1-fanout-logs-2c99] Member kubepodname-StreamThread-1-consumer-283f0e0d-defa-4edf-88b2-39703f845db5 sending LeaveGroup request to coordinator b-2.***.kafka.us-east-1.amazonaws.com:9096 (id: 2147483645 rack: null) due to consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.","context":"default"}
{code}
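
For what it's worth, applying the log's suggested mitigation would look roughly 
like the sketch below. The application id, broker address, and the concrete 
values are illustrative only; we have not deployed this, and it would only mask 
the underlying state-transition issue.
{code:java}
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class PollTuningSketch {
    // Illustrative values only; the returned Properties would be passed to new KafkaStreams(...).
    static Properties streamsProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "fanout-logs-sketch");   // placeholder id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9096");       // placeholder broker

        // Give the poll loop more headroom than the 5-minute default ...
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), 600000);
        // ... and/or hand back smaller batches per poll() (consumer default is 500 records).
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_RECORDS_CONFIG), 100);
        return props;
    }
}
{code}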
In any case, at this point the application stops processing data entirely. We 
initiated a shutdown by deleting the Kubernetes pod, and the first line Kafka 
printed after the Spring Boot shutdown-initiation logs is the following:
{code:json}
{"timestamp":"2022-01-24 20:05:27.368","level":"INFO","thread":"kafka-streams-close-thread","logger":"org.apache.kafka.streams.processor.internals.StreamThread","message":"stream-thread [kubepodname-StreamThread-1] State transition from RUNNING to PENDING_SHUTDOWN","context":"default"}
{code}
For over a minute the application sat in limbo: it had left the group, yet it 
was still marked as `RUNNING`, so we had no way to detect that it had entered a 
bad state and kill it automatically. While the logs above are from an instance 
we shut down manually within a minute or two, we have seen this bad state 
persist for up to an hour.
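
The only workaround we can think of is a watchdog that ignores the Streams state 
entirely and instead inspects the embedded consumer's poll-age metric. This is a 
rough sketch: it assumes the `last-poll-seconds-ago` metric in the 
`consumer-metrics` group that the consumer client has exposed since 2.4, and the 
threshold handling is ours, not anything Streams provides.
{code:java}
import java.util.Map;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.streams.KafkaStreams;

public class PollWatchdog {
    // Returns true if any embedded consumer has not polled for longer than the threshold,
    // even though streams.state() may still report RUNNING.
    static boolean looksStuck(KafkaStreams streams, double maxPollAgeSeconds) {
        for (Map.Entry<MetricName, ? extends Metric> e : streams.metrics().entrySet()) {
            MetricName name = e.getKey();
            // Assumed metric: "last-poll-seconds-ago" in the "consumer-metrics" group (KIP-517).
            if ("consumer-metrics".equals(name.group()) && "last-poll-seconds-ago".equals(name.name())) {
                Object value = e.getValue().metricValue();
                if (value instanceof Number && ((Number) value).doubleValue() > maxPollAgeSeconds) {
                    return true;
                }
            }
        }
        return false;
    }
}
{code}
If that check tripped we could fail the Kubernetes liveness probe and let the 
pod restart, which is essentially what we do by hand today.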

It feels like a bug to me that the Streams consumer can leave the consumer 
group while the application never leaves the `RUNNING` state. I searched for 
similar reports but couldn't find any. Any ideas on how to detect this, or 
thoughts on whether this is actually a bug?



> Kafka Streams does not transition state on LeaveGroup due to poll interval 
> being exceeded
> -----------------------------------------------------------------------------------------
>
>                 Key: KAFKA-13615
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13615
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 2.8.1
>            Reporter: Tim Costa
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
