[ 
https://issues.apache.org/jira/browse/KAFKA-8654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]
nick allen updated KAFKA-8654:
------------------------------
    Summary: Cant restart heartbeatThread if encountered unexpected exception 
in heartbeatloop.  (was: Cant restart heartbeatThread if encountered unexpected 
exception in heartbeatloop。)

> Cant restart heartbeatThread if encountered unexpected exception in 
> heartbeatloop.
> ----------------------------------------------------------------------------------
>
>                 Key: KAFKA-8654
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8654
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer
>    Affects Versions: 2.1.0
>            Reporter: nick allen
>            Priority: Major
>
> A consumer in our cluster has shown relatively high CPU usage for several 
> days, caused by the Kafka poll thread. We dug in and found the cause: 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator#timeToNextHeartbeat
>  returned zero, which led to a non-blocking select, which in turn made 
> pollForFetches return immediately. The actual poll timeout is set to 1s, so 
> pollForFetches was called thousands of times per poll, i.e. per second.
> We used a tool to inspect in-memory variables, which showed the following 
> attributes of the corresponding heartbeatTimer:
> @Timer[
>  time=@SystemTime[org.apache.kafka.common.utils.SystemTime@4d806627],
>  startMs=@Long[1562075783801], // Tue Jul 02 2019 13:56:23
>  currentTimeMs=@Long[1562823681506], // Thu Jul 11 2019 05:41:21
>  deadlineMs=@Long[1562075793801], // Tue Jul 02 2019 13:56:33
>  ]
> This shows that no heartbeat had been sent for almost 9 days. jstack also 
> shows that the corresponding heartbeatThread is dead. Unfortunately we don't 
> keep logs for that long, so I can't figure out what happened at the time. 
> IMO the heartbeatThread is too important to be left dead; there should be at 
> least some way to revive it, but it seems that startHeartbeatThreadIfNeeded 
> can only be triggered by restarting or by the heartbeat itself.
> It's also confusing that almost everything in 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.HeartbeatThread#run
>  is async, so it seems impossible for any exception to occur there. Why, then, 
> are there so many catch clauses?
>  
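
To make the timer dump in the description concrete, here is a minimal sketch of the arithmetic behind it. It is not Kafka's Timer class: ExpiredHeartbeatTimer and remainingMs are hypothetical names, and clamping the remaining time at zero is an assumption that matches the reported symptom (timeToNextHeartbeat returning zero). Only the three timestamps come from the report.

```java
import java.time.Duration;

// Hypothetical stand-in for the timer arithmetic; not Kafka's actual Timer class.
final class ExpiredHeartbeatTimer {

    // Assumption: time left until the deadline is clamped at zero once it has passed.
    static long remainingMs(long nowMs, long deadlineMs) {
        return Math.max(0L, deadlineMs - nowMs);
    }

    public static void main(String[] args) {
        // Values copied from the heartbeatTimer dump in the report.
        long startMs = 1562075783801L;
        long currentTimeMs = 1562823681506L;
        long deadlineMs = 1562075793801L;

        // Distance between start and deadline, i.e. the interval the timer was armed with.
        long armedForMs = deadlineMs - startMs;
        // How long ago the deadline passed without the timer being reset.
        Duration overdue = Duration.ofMillis(currentTimeMs - deadlineMs);

        System.out.println("timer armed for (ms): " + armedForMs);
        System.out.println("deadline overdue by (hours): " + overdue.toHours());
        System.out.println("remaining (ms): " + remainingMs(currentTimeMs, deadlineMs));
    }
}
```

With a deadline that is never reset, a calculation like this keeps returning zero, so a caller that uses the result as its blocking timeout never blocks. That is consistent with the busy loop described in the report.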


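The request to revive a dead heartbeat thread could look roughly like the sketch below. This is an illustration of the general idea only, not Kafka's code and not a proposed patch: HeartbeatWatchdog, startIfNeeded and awaitExit are hypothetical names, and the "heartbeat loop" is a plain Runnable that fails once with an unchecked exception to simulate the unexpected failure.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical watchdog; all names here are illustrative and not part of Kafka.
final class HeartbeatWatchdog {
    private final Runnable heartbeatLoop;
    private Thread heartbeatThread;

    HeartbeatWatchdog(Runnable heartbeatLoop) {
        this.heartbeatLoop = heartbeatLoop;
    }

    // Starts a fresh thread if none exists or the previous one has died.
    // Returns true when a new thread was started.
    synchronized boolean startIfNeeded() {
        if (heartbeatThread != null && heartbeatThread.isAlive()) {
            return false;
        }
        Thread t = new Thread(heartbeatLoop, "hypothetical-heartbeat");
        t.setDaemon(true);
        // Report the reason for the death instead of losing it silently.
        t.setUncaughtExceptionHandler((thread, e) ->
                System.err.println(thread.getName() + " died: " + e));
        t.start();
        heartbeatThread = t;
        return true;
    }

    // Waits for the current thread to exit; only needed by the demo below.
    void awaitExit() {
        Thread t;
        synchronized (this) {
            t = heartbeatThread;
        }
        if (t == null) {
            return;
        }
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        AtomicInteger runs = new AtomicInteger();
        HeartbeatWatchdog watchdog = new HeartbeatWatchdog(() -> {
            // The first run fails with an unchecked exception, killing the thread.
            if (runs.incrementAndGet() == 1) {
                throw new IllegalStateException("simulated unexpected failure");
            }
        });

        watchdog.startIfNeeded();   // first thread starts and then dies
        watchdog.awaitExit();
        boolean revived = watchdog.startIfNeeded();   // caller notices and restarts
        watchdog.awaitExit();

        System.out.println("revived=" + revived + ", runs=" + runs.get());
    }
}
```

The sketch also touches on the question about catch clauses: an unchecked exception needs no throws declaration, so it can escape a run method and end the thread even when the calls inside look unable to fail. Whether a check like this belongs in the poll path is a design question for the maintainers.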

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
