nick allen created KAFKA-8654:
---------------------------------
Summary: Can't restart heartbeatThread if it encountered an unexpected
exception in the heartbeat loop.
Key: KAFKA-8654
URL: https://issues.apache.org/jira/browse/KAFKA-8654
Project: Kafka
Issue Type: Bug
Components: consumer
Affects Versions: 2.1.0
Reporter: nick allen
There is a consumer in our cluster that has had relatively high CPU usage for
several days, caused by the Kafka poll thread. We dug in and found that this
was because
org.apache.kafka.clients.consumer.internals.AbstractCoordinator#timeToNextHeartbeat
returned zero, leading to a non-blocking select, which in turn led to
pollForFetches returning immediately. The actual poll timeout is set to 1s,
so pollForFetches was called thousands of times per poll, i.e. per second.
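The effect described above can be modelled in a few lines. This is only a toy sketch, not Kafka code: the class and method names are hypothetical, and it assumes that the blocking wait is capped by the time until the next heartbeat and that this time is clamped at zero once the deadline has passed.

```java
/** Toy model (not Kafka code) of how an expired heartbeat deadline removes the blocking wait. */
final class PollTimeoutSketch {

    /** Time left until the deadline, clamped at zero once the deadline has passed. */
    static long remainingMs(long deadlineMs, long nowMs) {
        return Math.max(0L, deadlineMs - nowMs);
    }

    /** Assumption: the blocking wait is capped by the time until the next heartbeat. */
    static long effectiveTimeoutMs(long pollTimeoutMs, long timeToNextHeartbeatMs) {
        return Math.min(pollTimeoutMs, timeToNextHeartbeatMs);
    }

    public static void main(String[] args) {
        long pollTimeoutMs = 1_000L;
        long nowMs = 100_000L;

        // Healthy case: the next heartbeat is due 3s from now, so the poll timeout wins.
        long healthy = effectiveTimeoutMs(pollTimeoutMs, remainingMs(nowMs + 3_000L, nowMs));

        // Stale case: the deadline passed long ago and nobody resets it, so the wait is zero.
        long stale = effectiveTimeoutMs(pollTimeoutMs, remainingMs(nowMs - 5_000L, nowMs));

        System.out.println("healthy wait (ms): " + healthy);
        System.out.println("stale wait (ms):   " + stale);
    }
}
```

With a zero wait, every iteration returns immediately, so the loop spins until the outer 1s poll timeout expires.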
We used a tool to inspect in-memory variables, which showed the corresponding
heartbeatTimer's attributes:
@Timer[
time=@SystemTime[org.apache.kafka.common.utils.SystemTime@4d806627],
startMs=@Long[1562075783801], // Tue Jul 02 2019 13:56:23
currentTimeMs=@Long[1562823681506], // Thu Jul 11 2019 05:41:21
deadlineMs=@Long[1562075793801], // Tue Jul 02 2019 13:56:33
]
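The values in this dump can be checked with a little arithmetic. The sketch below uses a hypothetical class name and the three numbers copied from the dump; the zero clamp on the remaining time is an assumption about how the timer reports an expired deadline.

```java
import java.time.Duration;
import java.time.Instant;

/** Arithmetic on the timer values from the dump (hypothetical helper, not Kafka code). */
final class HeartbeatTimerDump {
    static final long START_MS = 1562075783801L;
    static final long CURRENT_MS = 1562823681506L;
    static final long DEADLINE_MS = 1562075793801L;

    /** Length of the window the timer was last armed with. */
    static long intervalMs() {
        return DEADLINE_MS - START_MS;
    }

    /** Remaining time, assumed to be clamped at zero after the deadline. */
    static long remainingMs() {
        return Math.max(0L, DEADLINE_MS - CURRENT_MS);
    }

    /** How long ago the deadline passed at the moment the dump was taken. */
    static Duration overdue() {
        return Duration.ofMillis(CURRENT_MS - DEADLINE_MS);
    }

    public static void main(String[] args) {
        // Instant prints in UTC, which matches the dates annotated in the dump.
        System.out.println("start     = " + Instant.ofEpochMilli(START_MS));
        System.out.println("deadline  = " + Instant.ofEpochMilli(DEADLINE_MS));
        System.out.println("current   = " + Instant.ofEpochMilli(CURRENT_MS));
        System.out.println("interval  = " + intervalMs() + " ms");
        System.out.println("remaining = " + remainingMs() + " ms");
        System.out.println("overdue   = " + overdue().toDays() + " full days ("
                + overdue().toHours() + " hours)");
    }
}
```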
That shows that the heartbeat hasn't happened for about nine days, and jstack
shows that the corresponding heartbeatThread is dead. Unfortunately we don't
keep logs for that long, so I can't figure out what happened at the time.
IMO heartbeatThread is too important to be left dead; there should at least be
some way to revive it, but it seems that startHeartbeatThreadIfNeeded can only
be triggered by a restart or by the heartbeat itself.
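One possible shape for such a revive mechanism is sketched below. It is purely illustrative: RevivableWorker and its methods are hypothetical names, not part of Kafka, and the sketch ignores coordinator state entirely. It relies only on the fact that a terminated Java Thread cannot be started again, so a fresh instance has to be created.

```java
import java.util.function.Supplier;

/** Hypothetical helper (not part of Kafka): replaces a background thread once it has died. */
final class RevivableWorker {
    // Must return a NEW Thread on every call; a terminated Thread cannot be restarted.
    private final Supplier<Thread> threadFactory;
    private Thread worker; // guarded by "this"

    RevivableWorker(Supplier<Thread> threadFactory) {
        this.threadFactory = threadFactory;
    }

    /** Starts a fresh thread if none is running; returns true if a thread was started. */
    synchronized boolean startIfNeeded() {
        if (worker != null && worker.isAlive()) {
            return false;
        }
        worker = threadFactory.get();
        worker.setDaemon(true);
        worker.start();
        return true;
    }

    /** Waits up to timeoutMs for the current thread to end; returns true if it is not alive. */
    boolean awaitDeath(long timeoutMs) {
        Thread t;
        synchronized (this) {
            t = worker;
        }
        if (t == null) {
            return true;
        }
        try {
            t.join(timeoutMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !t.isAlive();
    }

    public static void main(String[] args) {
        // The body returns immediately, standing in for a thread that died unexpectedly.
        RevivableWorker w = new RevivableWorker(() -> new Thread(() -> { }));
        System.out.println("first start: " + w.startIfNeeded());
        System.out.println("thread dead: " + w.awaitDeath(1_000L));
        System.out.println("revived:     " + w.startIfNeeded());
    }
}
```

Calling startIfNeeded from the poll path on every iteration would be cheap in this sketch, since it only checks isAlive in the common case.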
It's also confusing that almost everything in
org.apache.kafka.clients.consumer.internals.AbstractCoordinator.HeartbeatThread#run
is async, so it seems impossible for any exception to happen there. So why are
there so many catch clauses?
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)