Hello,
We are running Kafka 0.9.0.1 and using the simple consumer to consume
from a topic with 8 partitions. The consumer JVM infrequently runs into
large GC pauses (60s to 90s, stop-the-world). These GCs are unrelated
to Kafka. We usually consume at about 5k messages/sec on the topic.
Right after garbage collection finishes, consumption does not resume at
its original rate for about 9 to 10 minutes; the recovery is gradual,
slowly climbing back to the steady rate.
I've enabled DEBUG logs on the Kafka client and noticed the following
happening right after the GC pause finished:
2017-02-10T03:21:30.932Z DEBUG [o.a.kafka9.clients.NetworkClient ] {}:
Disconnecting from node 2 due to request timeout.
2017-02-10T03:21:30.943Z DEBUG [o.a.kafka9.clients.NetworkClient ] {}:
Disconnecting from node 3 due to request timeout.
2017-02-10T03:21:30.943Z DEBUG [o.a.kafka9.clients.NetworkClient ] {}:
Disconnecting from node 4 due to request timeout.
2017-02-10T03:21:30.944Z DEBUG [o.a.kafka9.clients.NetworkClient ] {}:
Disconnecting from node 5 due to request timeout.
There were also logs about fetch failures due to the disconnects, as
well as frequent "Marking the coordinator 2147483639 dead" messages,
which I think are a red herring since we are not using the high-level
consumer/consumer groups.
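As an aside, if I read the client code right, that large coordinator id is synthetic: the Java client labels the coordinator connection with Integer.MAX_VALUE minus the broker id, so the number maps back to an ordinary broker. A minimal sketch of that arithmetic (the formula is from the client source; the mapping below is just illustration):

```java
// The client's coordinator connection uses a synthetic node id of
// Integer.MAX_VALUE - brokerId, so the id from the log can be mapped
// back to the real broker that acts as coordinator.
public class CoordinatorId {
    static int brokerIdFor(int coordinatorNodeId) {
        return Integer.MAX_VALUE - coordinatorNodeId;
    }

    public static void main(String[] args) {
        // Id seen in the "Marking the coordinator ... dead" log line.
        System.out.println(brokerIdFor(2147483639)); // prints 8
    }
}
```

So the log line points at broker 8, not at some corrupted id.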
After looking at the client code, I found that the client disconnects
from a node if any of its requests lingers for more than
requestTimeoutMs (the request.timeout.ms config), which defaults to
40s. Since the underlying socket connections may still be healthy even
after a 90s GC pause, I raised that timeout to 2 minutes, hoping it
would keep the nodes from being disconnected (which I thought was the
root cause of the slow recovery). However, that did not help either:
with the new timeout I no longer see the disconnects in the logs after
the GC pause, yet recovery still follows the same slow pattern, which
is puzzling.
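For reference, this is roughly how the timeout was raised. A minimal sketch of the consumer properties (the config key is the real one; the bootstrap address is a placeholder, and the rest of the consumer setup is omitted):

```java
import java.util.Properties;

public class ConsumerTimeoutConfig {
    // Builds the consumer config with the raised request timeout.
    static Properties consumerProps() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker1:9092"); // placeholder
        // Default is 40000 ms; a 90s GC pause exceeds it, so every
        // in-flight request times out and the node is disconnected.
        props.setProperty("request.timeout.ms", "120000"); // 2 minutes
        return props;
    }

    public static void main(String[] args) {
        System.out.println(consumerProps().getProperty("request.timeout.ms"));
    }
}
```

These properties are then passed to the consumer constructor as usual.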
Has anyone encountered something like this? Any help is much appreciated.
Thanks.
--
Mahdi.