[ https://issues.apache.org/jira/browse/KAFKA-9270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989401#comment-16989401 ]
John Roesler commented on KAFKA-9270:
-------------------------------------

Hey Rohan,

I see where you’re coming from, but I definitely wouldn’t set the config super high. You don’t want the Streams thread to just pause indefinitely waiting for a single commit, holding up progress on the rest of its tasks. Better to pause a moderate amount of time and then let the thread die if the call is still hung. At least then it would give up its tasks, and some other thread can take over.

Note also that you have to pay attention to the max poll interval config. If the commit call is stuck waiting, then the thread isn’t calling poll either, so it could drop out of the consumer group.

Personally, I’d probably set them to something like 10 minutes, which would be a pretty long “glitch” for the broker, but also might be tolerable from a liveness perspective. Then, I’d add monitoring for thread deaths. Maybe even an external sidecar process to bounce Streams if some of the threads die.

Not ideal, I know, but hopefully enough to help you bide your time until we work through the proposal that Boyang has started to actually fix this problem for you (see the linked ticket).

> KafkaStream crash on offset commit failure
> ------------------------------------------
>
>                 Key: KAFKA-9270
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9270
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 2.0.1
>            Reporter: Rohan Kulkarni
>            Priority: Critical
>
> On our production server we intermittently observe Kafka Streams crash
> with a TimeoutException while committing offsets. The only workaround seems
> to be restarting the application, which is not a desirable solution for a
> production environment.
>
> We have already implemented a ProductionExceptionHandler, but it does not
> seem to address this.
>
> Please provide a fix for this or a viable workaround.
>
> +Application side logs:+
> 2019-11-17 08:28:48.055 +0000 [AggregateJob-614fe688-c9a4-4dad-a881-71488030918b-StreamThread-1] [ERROR] - org.apache.kafka.streams.processor.internals.AssignedStreamsTasks [org.apache.kafka.streams.processor.internals.AssignedTasks:applyToRunningTasks:373] - stream-thread [AggregateJob-614fe688-c9a4-4dad-a881-71488030918b-StreamThread-1] *Failed to commit stream task 0_1 due to the following error:*
> *org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before successfully committing offsets* \{AggregateJob-1=OffsetAndMetadata{offset=176729402, metadata=''}}
>
> 2019-11-17 08:29:00.891 +0000 [AggregateJob-614fe688-c9a4-4dad-a881-71488030918b-StreamThread-1] [ERROR] - [:lambda$init$2:130] - Stream crashed!!! StreamsThread threadId: AggregateJob-614fe688-c9a4-4dad-a881-71488030918b-StreamThread-1
> 2019-11-17 08:29:00.891 +0000 [AggregateJob-614fe688-c9a4-4dad-a881-71488030918b-StreamThread-1] [ERROR] - [:lambda$init$2:130] - Stream crashed!!! StreamsThread threadId: AggregateJob-614fe688-c9a4-4dad-a881-71488030918b-StreamThread-1
> TaskManager
>   MetadataState:
>     GlobalMetadata: []
>     GlobalStores: []
>     My HostInfo: HostInfo\{host='unknown', port=-1}
>     Cluster(id = null, nodes = [], partitions = [], controller = null)
>   Active tasks:
>     Running:
>     Suspended:
>     Restoring:
>     New:
>   Standby tasks:
>     Running:
>     Suspended:
>     Restoring:
>     New:
> org.apache.kafka.common.errors.*TimeoutException: Timeout of 60000ms expired before successfully committing offsets* \{AggregateJob-0=OffsetAndMetadata{offset=189808059, metadata=''}}
>
> +Kafka broker logs:+
> [2019-11-17 13:53:22,774] WARN *Client session timed out, have not heard from server in 6669ms for sessionid 0x10068e4a2944c2f* (org.apache.zookeeper.ClientCnxn)
> [2019-11-17 13:53:22,809] INFO Client session timed out, have not heard from server in 6669ms for sessionid 0x10068e4a2944c2f, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
>
> Regards,
> Rohan

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
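
For reference, the suggestion in John's comment (a moderate commit timeout, matched by the max poll interval, at roughly 10 minutes) might look something like the sketch below. This is only an illustration under assumptions: the comment does not name "the config", so it is assumed here to be the consumer's `default.api.timeout.ms`, whose 60000 ms default matches the "Timeout of 60000ms expired" message in the logs. The `consumer.` prefix is how Streams routes settings to its embedded consumer. The class and method names are hypothetical.

```java
import java.time.Duration;
import java.util.Properties;

// Hypothetical helper; not part of Kafka.
final class CommitTimeoutConfigSketch {

    // Assumption: "the config" in the comment is the consumer's default.api.timeout.ms.
    static final String API_TIMEOUT = "consumer.default.api.timeout.ms";
    // The max poll interval the comment says must be considered together with it.
    static final String MAX_POLL_INTERVAL = "consumer.max.poll.interval.ms";

    // Returns a copy of base with both settings set to the same bounded value,
    // so a hung commit fails at about the time the consumer would be dropped
    // from the group for not calling poll.
    static Properties withCommitTimeout(Properties base, Duration timeout) {
        Properties props = new Properties();
        props.putAll(base);
        String millis = Long.toString(timeout.toMillis());
        props.setProperty(API_TIMEOUT, millis);
        props.setProperty(MAX_POLL_INTERVAL, millis);
        return props;
    }

    public static void main(String[] args) {
        Properties props = withCommitTimeout(new Properties(), Duration.ofMinutes(10));
        props.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

The resulting Properties would be merged with the application's existing Streams configuration before constructing KafkaStreams. The thread-death monitoring John mentions is not covered by this sketch.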