We have been investigating unreasonably high CPU usage of the Kafka
process when there's no _real_ activity going on between the consumers
and the broker. We had this issue back in the 0.8.x days, and it is
exactly what's being tracked in this JIRA
https://issues.apache.org/jira/browse/KAFKA-493. We now use 0.9.0.1
(both the client libraries, with the new consumer API, and the broker).
However, we still see CPU usage that looks a bit on the higher side when
there's no real message production or consumption going on. Just
connecting around 10-20 consumers on different topics of a single-broker
Kafka instance is enough to reproduce the issue.
All our debugging so far points to the Processor threads on the broker
side, which show high CPU usage. There are N such Processor threads,
which are always in the RUNNABLE state doing this:
"kafka-network-thread-0-PLAINTEXT-0" #21 prio=5 os_prio=0
tid=0x00007f1858c4a800 nid=0xc81 runnable [0x00007f18106cb000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0x00000006c0046128> (a sun.nio.ch.Util$2)
- locked <0x00000006c0046118> (a java.util.Collections$UnmodifiableSet)
- locked <0x00000006c0046068> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at org.apache.kafka.common.network.Selector.select(Selector.java:425)
at org.apache.kafka.common.network.Selector.poll(Selector.java:254)
at kafka.network.Processor.run(SocketServer.scala:413)
at java.lang.Thread.run(Thread.java:745)
From what we have narrowed down so far, this thread in itself isn't the
"culprit", since the CPU isn't high when there are no consumers
connected. However, when consumers connect and just wait for messages,
these threads start playing a role in the high CPU usage. Our debugging
shows that each of the X consumers connected to the broker keeps doing 2
things while "idle":
1) A delayed operation every Y seconds which does the auto commit of
offsets.
2) Sending a heartbeat to the broker every 3 seconds (the consumer
settings behind both activities are sketched below).
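For reference, here's a rough sketch, in Java with the new consumer API,
of the config keys behind those two idle-time activities. The values are
just the 0.9 defaults as I understand them from the docs (an assumption,
not something we measured), and the group id is a placeholder:

import java.util.Properties;

// The new-consumer settings driving the two idle-time activities above.
// Values shown are the documented 0.9 defaults, not anything we tuned.
Properties defaults = new Properties();
defaults.put("group.id", "some-group");           // placeholder group id
defaults.put("enable.auto.commit", "true");       // drives #1: periodic auto commit of offsets
defaults.put("auto.commit.interval.ms", "5000");  // default auto commit interval, if I read the docs right
defaults.put("heartbeat.interval.ms", "3000");    // drives #2: a heartbeat request every 3 seconds
defaults.put("session.timeout.ms", "30000");      // default session timeout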
We disabled auto commit of offsets since that's the semantics we wanted
anyway, so #1 isn't really an issue. However, #2 is. It looks like the
default heartbeat interval is 3 seconds, which is too low, IMO. It
translates to a network socket operation every 3 seconds which then has
to be processed by the broker-side Processor thread. If there's just a
single consumer, this doesn't make much of a difference. As soon as you
add more consumers, the Processor on the broker side has to process each
of these incoming heartbeats, which become too frequent.
Even though the interval is 3 seconds, the heartbeats arriving at the
broker can be much more frequent when more consumers are involved, since
the 3 second interval is per consumer. For example, 20 consumers each
heartbeating every 3 seconds send the broker roughly 6-7 heartbeat
requests per second on average. So in practice a heartbeat can arrive
every second, or every few hundred milliseconds, from the X consumers,
which contributes to this high CPU usage when the system is practically
idle.
So coming to the real question: why is the default heartbeat interval so
low, at 3 seconds? We increased it to 29 seconds (just 1 second less
than the session timeout) per consumer (via consumer configs), and
together with disabling auto commit, these changes have noticeably
improved the CPU usage.
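For concreteness, this is roughly what our per-consumer setup looks like
now. It's a sketch rather than our exact code; the broker address, group
id and topic name are placeholders:

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class IdleConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-host:9092"); // placeholder broker address
        props.put("group.id", "our-app-group");             // placeholder group id
        props.put("enable.auto.commit", "false");           // we commit offsets ourselves
        props.put("session.timeout.ms", "30000");           // left at the default
        props.put("heartbeat.interval.ms", "29000");        // 1 second below the session timeout

        KafkaConsumer<String, String> consumer =
                new KafkaConsumer<>(props, new StringDeserializer(), new StringDeserializer());
        consumer.subscribe(Arrays.asList("some-topic"));     // placeholder topic

        // ... the usual poll() loop and manual offset commits go here ...

        consumer.close();
    }
}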
Ideally, what would be a better value for the heartbeat interval, one
that doesn't unnecessarily flood the broker with these messages and
force it to process them continuously?
-Jaikiran