[ https://issues.apache.org/jira/browse/KAFKA-16054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Oleksandr Shulgin resolved KAFKA-16054.
---------------------------------------
    Resolution: Not A Bug

> Sudden 100% CPU on a broker
> ---------------------------
>
>                 Key: KAFKA-16054
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16054
>             Project: Kafka
>          Issue Type: Bug
>          Components: network
>    Affects Versions: 3.3.2, 3.6.1
>         Environment: Amazon AWS, c6g.4xlarge (arm64, 16 vCPUs, 30 GB RAM), Amazon Linux
>            Reporter: Oleksandr Shulgin
>            Priority: Critical
>              Labels: linux
>
> We have now observed, for the third time in production, an issue where a Kafka 
> broker suddenly jumps to 100% CPU usage and does not recover on its own; it has 
> to be restarted manually.
> After a deeper investigation, we now believe that this is an instance of the 
> infamous epoll bug. See:
> [https://github.com/netty/netty/issues/327]
> [https://github.com/netty/netty/pull/565] (original workaround)
> [https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/nio/NioEventLoop.java#L624-L632]
>  (same workaround in the current Netty code)
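> For context, the gist of that Netty workaround, as a minimal standalone sketch (a 
> hypothetical helper class, not Netty's or Kafka's actual code): count select() calls 
> that return early with nothing ready, and once a threshold of consecutive premature 
> wakeups is exceeded, assume the selector is spinning and rebuild it by re-registering 
> every channel on a fresh Selector.
>
> import java.io.IOException;
> import java.nio.channels.SelectableChannel;
> import java.nio.channels.SelectionKey;
> import java.nio.channels.Selector;
> import java.util.ArrayList;
>
> public class SpinGuardSelector {
>     // Netty's default auto-rebuild threshold is 512 consecutive premature wakeups.
>     private static final int SPIN_THRESHOLD = 512;
>
>     private Selector selector;
>     private int prematureWakeups;
>
>     public SpinGuardSelector() throws IOException {
>         this.selector = Selector.open();
>     }
>
>     public Selector selector() {
>         return selector;
>     }
>
>     // Like Selector.select(timeout), but rebuilds the selector if it keeps spinning.
>     public int select(long timeoutMs) throws IOException {
>         long startNs = System.nanoTime();
>         int ready = selector.select(timeoutMs);
>         long elapsedMs = (System.nanoTime() - startNs) / 1_000_000;
>
>         // Crude heuristic: woke up well before the timeout with nothing to do.
>         if (ready == 0 && elapsedMs < timeoutMs / 2) {
>             if (++prematureWakeups >= SPIN_THRESHOLD) {
>                 rebuildSelector();
>                 prematureWakeups = 0;
>             }
>         } else {
>             prematureWakeups = 0;
>         }
>         return ready;
>     }
>
>     // Move every registered channel (with its interest ops and attachment) to a new Selector.
>     private void rebuildSelector() throws IOException {
>         Selector newSelector = Selector.open();
>         for (SelectionKey key : new ArrayList<>(selector.keys())) {
>             if (!key.isValid()) {
>                 continue;
>             }
>             SelectableChannel channel = key.channel();
>             int interestOps = key.interestOps();
>             Object attachment = key.attachment();
>             key.cancel();
>             channel.register(newSelector, interestOps, attachment);
>         }
>         selector.close();
>         selector = newSelector;
>     }
> }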
> The first occurrence in our production environment was on 2023-08-26, and the 
> other two were on 2023-12-10 and 2023-12-20.
> Each time, the high CPU usage also results in another issue (misplaced messages) 
> that I asked about on the users mailing list in September, but which unfortunately 
> has not received a single reply to date: 
> [https://lists.apache.org/thread/x1thr4r0vbzjzq5sokqgrxqpsbnnd3yy]
> We still do not know how this other issue happens.
> When the high CPU usage occurs, top(1) reports a number of "data-plane-kafka..." 
> threads consuming ~60% user and ~40% system CPU, and the thread dump contains 
> many stack traces like the following one:
> "data-plane-kafka-network-thread-67111914-ListenerName(PLAINTEXT)-PLAINTEXT-10"
>  #76 prio=5 os_prio=0 cpu=346710.78ms elapsed=243315.54s 
> tid=0x0000ffffa12d7690 nid=0x20c runnable [0x0000fffed87fe000]
> java.lang.Thread.State: RUNNABLE
> #011at sun.nio.ch.EPoll.wait(java.base@17.0.9/Native Method)
> #011at 
> sun.nio.ch.EPollSelectorImpl.doSelect(java.base@17.0.9/EPollSelectorImpl.java:118)
> #011at 
> sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@17.0.9/SelectorImpl.java:129)
> #011- locked <0x00000006c1246410> (a sun.nio.ch.Util$2)
> #011- locked <0x00000006c1246318> (a sun.nio.ch.EPollSelectorImpl)
> #011at sun.nio.ch.SelectorImpl.select(java.base@17.0.9/SelectorImpl.java:141)
> #011at org.apache.kafka.common.network.Selector.select(Selector.java:874)
> #011at org.apache.kafka.common.network.Selector.poll(Selector.java:465)
> #011at kafka.network.Processor.poll(SocketServer.scala:1107)
> #011at kafka.network.Processor.run(SocketServer.scala:1011)
> #011at java.lang.Thread.run(java.base@17.0.9/Thread.java:840)
> At the same time the Linux kernel repeatedly reports "TCP: out of memory -- 
> consider tuning tcp_mem".
> We are running relatively big machines in production (c6g.4xlarge with 30 GB of 
> RAM), and the auto-configured setting is "net.ipv4.tcp_mem = 376608 502145 
> 753216", which corresponds to ~3 GB for the "high" parameter, assuming 4 KB 
> memory pages.
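> To spell out the arithmetic behind that ~3 GB figure: the tcp_mem values are counted 
> in pages, so the "high" threshold is 753216 pages * 4096 bytes/page = 3,085,172,736 
> bytes, i.e. roughly 2.9 GiB (about 3.1 GB).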
> We were able to reproduce the issue in our test environment (which uses 4x 
> smaller machines) simply by tuning tcp_mem down by a factor of 10: 
> "sudo sysctl -w net.ipv4.tcp_mem='9234 12313 18469'". An strace of one of 
> the busy Kafka threads shows the following syscalls repeating constantly:
> epoll_pwait(15558, [{events=EPOLLOUT, data={u32=12286, u64=4681111381628432382}}], 1024, 300, NULL, 8) = 1
> fstat(12019, {st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0
> fstat(12019, {st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0
> sendfile(12286, 12019, [174899834], 947517) = -1 EAGAIN (Resource temporarily unavailable)
> Resetting the "tcp_mem" parameters back to the auto-configured values in the 
> test environment relieves the pressure, and the broker continues operating 
> normally without a restart.
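> To make the spin easier to picture, here is a minimal standalone sketch (illustrative 
> only, not Kafka's actual Processor/Selector code) of the non-blocking 
> sendfile-over-selector pattern behind the strace above. FileChannel.transferTo() maps 
> to sendfile(2) here, and an EAGAIN from sendfile generally surfaces in Java as a 
> 0-byte transfer. Normally the loop parks in select() once the peer stops draining 
> data; the failure mode we observe is that epoll keeps reporting EPOLLOUT while 
> sendfile keeps failing with EAGAIN because of the global tcp_mem limit, so select() 
> returns immediately every time, no bytes move, and the thread burns 100% CPU.
>
> import java.io.IOException;
> import java.net.InetSocketAddress;
> import java.nio.channels.FileChannel;
> import java.nio.channels.SelectionKey;
> import java.nio.channels.Selector;
> import java.nio.channels.ServerSocketChannel;
> import java.nio.channels.SocketChannel;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.StandardOpenOption;
>
> public class SendfileLoopSketch {
>     public static void main(String[] args) throws IOException {
>         // A local peer that accepts the connection but never reads from it.
>         ServerSocketChannel server = ServerSocketChannel.open()
>                 .bind(new InetSocketAddress("127.0.0.1", 0));
>         SocketChannel sender = SocketChannel.open(server.getLocalAddress());
>         SocketChannel peer = server.accept(); // kept open, never read
>         sender.configureBlocking(false);
>
>         // A file to serve, standing in for a log segment.
>         Path file = Files.createTempFile("segment", ".log");
>         Files.write(file, new byte[8 * 1024 * 1024]);
>         FileChannel segment = FileChannel.open(file, StandardOpenOption.READ);
>
>         Selector selector = Selector.open();
>         sender.register(selector, SelectionKey.OP_WRITE);
>
>         long position = 0;
>         while (position < segment.size()) {
>             // Normally this parks once the send buffer is full, because epoll
>             // stops reporting EPOLLOUT. In the failure mode above, epoll keeps
>             // reporting EPOLLOUT while sendfile() fails with EAGAIN (a 0-byte
>             // transferTo() result), so the loop spins without making progress.
>             int ready = selector.select(300);
>             if (ready == 0) {
>                 continue; // timed out with nothing writable
>             }
>             selector.selectedKeys().clear();
>
>             long sent = segment.transferTo(position, 1024 * 1024, sender); // sendfile(2)
>             position += sent;
>             System.out.printf("ready=%d sent=%d position=%d%n", ready, sent, position);
>         }
>         // With a peer that never reads, this demo never finishes; it only
>         // illustrates the loop structure. Stop it with Ctrl-C.
>     }
> }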
> We have found a bug report that suggests the issue may be partially due to a 
> kernel bug: 
> [https://bugs.launchpad.net/ubuntu/+source/linux-meta-aws-6.2/+bug/2037335] 
> (they are running kernel version 5.15)
> We have updated our kernel from 6.1.29 to 6.1.66, which made the issue harder 
> to reproduce, but we can still trigger it by reducing all of the "tcp_mem" 
> parameters by a factor of 1,000. The JVM behavior is the same under these 
> conditions.
> A similar issue is reported here, affecting Kafka Connect:
> https://issues.apache.org/jira/browse/KAFKA-4739
> Our production Kafka is running version 3.3.2 and our test environment is 
> running 3.6.1. The issue is present on both systems.
> The issue is also reproducible on JDK 11 (as you can see from the stack 
> trace, we are using 17).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
