[ https://issues.apache.org/jira/browse/KAFKA-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298761#comment-16298761 ]
Jason Gustafson commented on KAFKA-6366: ---------------------------------------- [~joerg.heinicke] Thanks for sharing the logs. One thing that immediately stands out is the large number of async offset commit failures. I counted 13,359 instances. Considering the "Marking coordinator dead" messages, there are about 10,862 instances. This is just a guess, but do you have any retry logic implemented for when async offset commits fail? That would explain the large number of "Marking coordinator dead" messages as well as the stack overflow. > StackOverflowError in kafka-coordinator-heartbeat-thread > -------------------------------------------------------- > > Key: KAFKA-6366 > URL: https://issues.apache.org/jira/browse/KAFKA-6366 > Project: Kafka > Issue Type: Bug > Components: consumer > Affects Versions: 1.0.0 > Reporter: Joerg Heinicke > Attachments: 6366.v1.txt, ConverterProcessor.zip > > > With Kafka 1.0 our consumer groups fall into a permanent cycle of rebalancing > once a StackOverflowError in the heartbeat thread occurred due to > connectivity issues of the consumers to the coordinating broker: > Immediately before the exception there are hundreds, if not thousands of log > entries of following type: > 2017-12-12 16:23:12.361 [kafka-coordinator-heartbeat-thread | > my-consumer-group] INFO - [Consumer clientId=consumer-4, > groupId=my-consumer-group] Marking the coordinator <IP>:<Port> (id: > 2147483645 rack: null) dead > The exceptions always happen somewhere in the DateFormat code, even > though at different lines. > 2017-12-12 16:23:12.363 [kafka-coordinator-heartbeat-thread | > my-consumer-group] ERROR - Uncaught exception in thread > 'kafka-coordinator-heartbeat-thread | my-consumer-group': > java.lang.StackOverflowError > at > java.text.DateFormatSymbols.getProviderInstance(DateFormatSymbols.java:362) > at > java.text.DateFormatSymbols.getInstance(DateFormatSymbols.java:340) > at java.util.Calendar.getDisplayName(Calendar.java:2110) > at java.text.SimpleDateFormat.subFormat(SimpleDateFormat.java:1125) > at java.text.SimpleDateFormat.format(SimpleDateFormat.java:966) > at java.text.SimpleDateFormat.format(SimpleDateFormat.java:936) > at java.text.DateFormat.format(DateFormat.java:345) > at > org.apache.log4j.helpers.PatternParser$DatePatternConverter.convert(PatternParser.java:443) > at > org.apache.log4j.helpers.PatternConverter.format(PatternConverter.java:65) > at org.apache.log4j.PatternLayout.format(PatternLayout.java:506) > at > org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:310) > at org.apache.log4j.WriterAppender.append(WriterAppender.java:162) > at > org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251) > at > org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66) > at org.apache.log4j.Category.callAppenders(Category.java:206) > at org.apache.log4j.Category.forcedLog(Category.java:391) > at org.apache.log4j.Category.log(Category.java:856) > at > org.slf4j.impl.Log4jLoggerAdapter.info(Log4jLoggerAdapter.java:324) > at > org.apache.kafka.common.utils.LogContext$KafkaLogger.info(LogContext.java:341) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.coordinatorDead(AbstractCoordinator.java:649) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onFailure(AbstractCoordinator.java:797) > at > org.apache.kafka.clients.consumer.internals.RequestFuture$1.onFailure(RequestFuture.java:209) > at > org.apache.kafka.clients.consumer.internals.RequestFuture.fireFailure(RequestFuture.java:177) > at > org.apache.kafka.clients.consumer.internals.RequestFuture.raise(RequestFuture.java:147) > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:496) > ... > the following 9 lines are repeated around hundred times. > ... > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:496) > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:353) > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.failUnsentRequests(ConsumerNetworkClient.java:416) > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.disconnect(ConsumerNetworkClient.java:388) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.coordinatorDead(AbstractCoordinator.java:653) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onFailure(AbstractCoordinator.java:797) > at > org.apache.kafka.clients.consumer.internals.RequestFuture$1.onFailure(RequestFuture.java:209) > at > org.apache.kafka.clients.consumer.internals.RequestFuture.fireFailure(RequestFuture.java:177) > at > org.apache.kafka.clients.consumer.internals.RequestFuture.raise(RequestFuture.java:147) > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:496) -- This message was sent by Atlassian JIRA (v6.4.14#64029)