[ https://issues.apache.org/jira/browse/KAFKA-8654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
nick allen updated KAFKA-8654:
------------------------------
    Summary: Cant restart heartbeatThread if encountered unexpected exception in heartbeatloop.  (was: Cant restart heartbeatThread if encountered unexpected exception in heartbeatloop。)

> Cant restart heartbeatThread if encountered unexpected exception in heartbeatloop.
> ----------------------------------------------------------------------------------
>
>                 Key: KAFKA-8654
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8654
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer
>    Affects Versions: 2.1.0
>            Reporter: nick allen
>            Priority: Major
>
> There is a consumer in our cluster which has had relatively high CPU usage
> for several days, caused by the Kafka poll thread. We dug in and found that
> this was because
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator#timeToNextHeartbeat
> returned zero, leading to a non-blocking select, which in turn led to
> pollForFetches returning immediately. But the actual poll timeout is set to
> 1s, so pollForFetches was called thousands of times per poll, i.e. per second.
> We used a tool to inspect the in-memory variables, which showed the
> corresponding heartbeatTimer's attributes:
> @Timer[
>     time=@SystemTime[org.apache.kafka.common.utils.SystemTime@4d806627],
>     startMs=@Long[1562075783801], // Jul 02 2019 13:56:23
>     currentTimeMs=@Long[1562823681506], // Thu Jul 11 2019 05:41:21
>     deadlineMs=@Long[1562075793801], // Tue Jul 02 2019 13:56:33
> ]
> That shows that no heartbeat has happened for about 9 days. And jstack
> shows that the corresponding heartbeatThread is dead. Unfortunately we don't
> keep logs for that long, so I can't figure out what happened back then.
> IMO heartbeatThread is too important to be left dead; there should be at
> least some way to revive it, but it seems that startHeartbeatThreadIfNeeded
> can only be triggered by restarting or by the heartbeat itself.
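To make the arithmetic behind the busy loop concrete, here is a minimal Java sketch that plugs the values from the Timer dump above into a simplified model. It is not Kafka's actual code: the class HeartbeatTimeoutSketch and both helper methods are hypothetical, and the model assumes that the effective select timeout is the smaller of the caller's poll timeout (1s here) and the time remaining until the next heartbeat is due.

```java
// Simplified model of the timeout arithmetic described in the report.
// This is NOT Kafka's actual code; the class and method names are hypothetical.
final class HeartbeatTimeoutSketch {

    // Time left until the deadline, clamped at zero once the deadline has passed.
    static long remainingMs(long deadlineMs, long currentTimeMs) {
        return Math.max(0L, deadlineMs - currentTimeMs);
    }

    // Assumption: the effective select timeout is the smaller of the caller's
    // poll timeout and the time until the next heartbeat is due.
    static long pollTimeoutMs(long userTimeoutMs, long timeToNextHeartbeatMs) {
        return Math.min(userTimeoutMs, timeToNextHeartbeatMs);
    }

    public static void main(String[] args) {
        long deadlineMs = 1562075793801L;    // deadlineMs from the Timer dump
        long currentTimeMs = 1562823681506L; // currentTimeMs from the Timer dump

        long toNextHeartbeat = remainingMs(deadlineMs, currentTimeMs);
        long timeout = pollTimeoutMs(1000L, toNextHeartbeat);

        System.out.println("ms past heartbeat deadline: " + (currentTimeMs - deadlineMs));
        System.out.println("time to next heartbeat:     " + toNextHeartbeat);
        System.out.println("effective poll timeout:     " + timeout);
    }
}
```

With the dumped values the deadline is long past, so the clamped remaining time is zero and the effective timeout collapses to zero no matter how large the caller's poll timeout is. If, as the report suggests, only the heartbeat thread ever moves the deadline forward, that zero would persist for as long as the thread stays dead.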
> It's also confusing that almost everything in
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.HeartbeatThread#run
> is async, so it seems impossible for any exception to happen. So why are
> there so many catch clauses?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)