Hello,
We are running Kafka 0.9.0.1 and using the simple consumer to consume
from a topic with 8 partitions. The consumer JVM infrequently runs into
large GC pauses (60s to 90s, stop-the-world). These GCs are unrelated
to Kafka. We usually consume at about 5k messages/sec on the topic.
Right after garbage collection finishes, consumption does not resume at
its original rate for about 9 to 10 minutes; the recovery is gradual,
slowly climbing back to the steady rate.
I've enabled DEBUG logs on the Kafka client and noticed the following
happening right after the GC pause finished:
2017-02-10T03:21:30.932Z DEBUG [o.a.kafka9.clients.NetworkClient ] {}:
Disconnecting from node 2 due to request timeout.
2017-02-10T03:21:30.943Z DEBUG [o.a.kafka9.clients.NetworkClient ] {}:
Disconnecting from node 3 due to request timeout.
2017-02-10T03:21:30.943Z DEBUG [o.a.kafka9.clients.NetworkClient ] {}:
Disconnecting from node 4 due to request timeout.
2017-02-10T03:21:30.944Z DEBUG [o.a.kafka9.clients.NetworkClient ] {}:
Disconnecting from node 5 due to request timeout.
There were also logs about fetch failures due to the disconnects, as
well as frequent "Marking the coordinator 2147483639 dead" messages,
which I think are a red herring since we are not using the high-level
consumer/consumer groups.
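As an aside, if I read the client code right, that large coordinator id is synthetic: the Java client labels the coordinator connection with Integer.MAX_VALUE minus the broker id, so the number maps back to an ordinary broker. A minimal sketch of that arithmetic (the formula is from the client source; the mapping below is just illustration):

```java
// The client's coordinator connection uses a synthetic node id of
// Integer.MAX_VALUE - brokerId, so the id from the log can be mapped
// back to the real broker that acts as coordinator.
public class CoordinatorId {
    static int brokerIdFor(int coordinatorNodeId) {
        return Integer.MAX_VALUE - coordinatorNodeId;
    }

    public static void main(String[] args) {
        // Id seen in the "Marking the coordinator ... dead" log line.
        System.out.println(brokerIdFor(2147483639)); // prints 8
    }
}
```

So the log line points at broker 8, not at some corrupted id.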
After looking at the client code, I found that the client disconnects
from a node if any of its requests lingers for more than
requestTimeoutMs (the request.timeout.ms config), which defaults to
40s. Since the underlying socket connections may still be healthy even
after a 90s GC pause, I raised that timeout to 2 minutes, hoping it
would keep the nodes from being disconnected (which I thought was the
root cause of the slow recovery). However, that did not help either:
with the new timeout I no longer see the disconnects in the logs after
the GC pause, yet recovery still follows the same slow pattern, which
is puzzling.
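For reference, this is roughly how the timeout was raised. A minimal sketch of the consumer properties (the config key is the real one; the bootstrap address is a placeholder, and the rest of the consumer setup is omitted):

```java
import java.util.Properties;

public class ConsumerTimeoutConfig {
    // Builds the consumer config with the raised request timeout.
    static Properties consumerProps() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker1:9092"); // placeholder
        // Default is 40000 ms; a 90s GC pause exceeds it, so every
        // in-flight request times out and the node is disconnected.
        props.setProperty("request.timeout.ms", "120000"); // 2 minutes
        return props;
    }

    public static void main(String[] args) {
        System.out.println(consumerProps().getProperty("request.timeout.ms"));
    }
}
```

These properties are then passed to the consumer constructor as usual.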
Has anyone encountered something like this? Any help is much appreciated.
Thanks.
--
Mahdi.