[ https://issues.apache.org/jira/browse/KAFKA-16054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Oleksandr Shulgin resolved KAFKA-16054.
---------------------------------------
    Resolution: Not A Bug

> Sudden 100% CPU on a broker
> ---------------------------
>
>                 Key: KAFKA-16054
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16054
>             Project: Kafka
>          Issue Type: Bug
>          Components: network
>    Affects Versions: 3.3.2, 3.6.1
>        Environment: Amazon AWS, c6g.4xlarge arm64 16 vCPUs + 30 GB, Amazon Linux
>           Reporter: Oleksandr Shulgin
>           Priority: Critical
>             Labels: linux
>
> We have now observed, for the third time in production, an issue where a Kafka
> broker suddenly jumps to 100% CPU usage and does not recover on its own until
> it is manually restarted.
> After a deeper investigation, we now believe that this is an instance of the
> infamous epoll bug. See:
> https://github.com/netty/netty/issues/327
> https://github.com/netty/netty/pull/565 (original workaround)
> https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/nio/NioEventLoop.java#L624-L632
> (same workaround in the current Netty code)
> The first occurrence in our production environment was on 2023-08-26, and the
> other two were on 2023-12-10 and 2023-12-20.
> Each time, the high CPU issue also results in this other issue (misplaced
> messages) that I asked about on the users mailing list in September but,
> unfortunately, have not received a single reply to as of today:
> https://lists.apache.org/thread/x1thr4r0vbzjzq5sokqgrxqpsbnnd3yy
> We still do not know how this other issue happens.
> When the high CPU happens, top(1) reports a number of "data-plane-kafka..."
> threads consuming ~60% user and ~40% system CPU, and the thread dump contains
> a lot of stack traces like the following one:
> "data-plane-kafka-network-thread-67111914-ListenerName(PLAINTEXT)-PLAINTEXT-10" #76 prio=5 os_prio=0 cpu=346710.78ms elapsed=243315.54s tid=0x0000ffffa12d7690 nid=0x20c runnable [0x0000fffed87fe000]
>    java.lang.Thread.State: RUNNABLE
>         at sun.nio.ch.EPoll.wait(java.base@17.0.9/Native Method)
>         at sun.nio.ch.EPollSelectorImpl.doSelect(java.base@17.0.9/EPollSelectorImpl.java:118)
>         at sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@17.0.9/SelectorImpl.java:129)
>         - locked <0x00000006c1246410> (a sun.nio.ch.Util$2)
>         - locked <0x00000006c1246318> (a sun.nio.ch.EPollSelectorImpl)
>         at sun.nio.ch.SelectorImpl.select(java.base@17.0.9/SelectorImpl.java:141)
>         at org.apache.kafka.common.network.Selector.select(Selector.java:874)
>         at org.apache.kafka.common.network.Selector.poll(Selector.java:465)
>         at kafka.network.Processor.poll(SocketServer.scala:1107)
>         at kafka.network.Processor.run(SocketServer.scala:1011)
>         at java.lang.Thread.run(java.base@17.0.9/Thread.java:840)
> At the same time, the Linux kernel repeatedly reports "TCP: out of memory --
> consider tuning tcp_mem".
> We are running relatively big machines in production (c6g.4xlarge with 30 GB
> RAM), and the auto-configured setting is "net.ipv4.tcp_mem = 376608 502145
> 753216", which corresponds to ~3 GB for the "high" parameter, assuming 4 KB
> memory pages.
> We were able to reproduce the issue in our test environment (which uses 4x
> smaller machines) simply by tuning tcp_mem down by a factor of 10:
> "sudo sysctl -w net.ipv4.tcp_mem='9234 12313 18469'".
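For a quick way to check whether a broker is in the state that triggers those "TCP: out of memory" messages, the sketch below compares the kernel's current TCP page usage (from /proc/net/sockstat) with the limits in /proc/sys/net/ipv4/tcp_mem. This is not part of Kafka; the class name is invented for illustration, and the parsing assumes a standard Linux /proc layout.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    // Illustrative diagnostic, not Kafka code: report how close TCP memory
    // usage is to the tcp_mem limits discussed in the report above.
    public class TcpMemPressureCheck {

        public static void main(String[] args) throws IOException {
            long[] limits = readTcpMemLimits();   // pages: low, pressure, high
            long used = readTcpPagesInUse();      // pages currently charged to TCP

            System.out.printf("tcp_mem low=%d pressure=%d high=%d, TCP mem in use=%d pages%n",
                    limits[0], limits[1], limits[2], used);
            if (used >= limits[2]) {
                System.out.println("TCP is at its hard memory limit");
            } else if (used >= limits[1]) {
                System.out.println("TCP is under memory pressure");
            }
        }

        // /proc/sys/net/ipv4/tcp_mem holds three page counts: low, pressure, high.
        private static long[] readTcpMemLimits() throws IOException {
            String[] parts = Files.readString(Path.of("/proc/sys/net/ipv4/tcp_mem"))
                    .trim().split("\\s+");
            return new long[] {
                Long.parseLong(parts[0]), Long.parseLong(parts[1]), Long.parseLong(parts[2])
            };
        }

        // /proc/net/sockstat contains a line like
        // "TCP: inuse 12 orphan 0 tw 1 alloc 15 mem 3";
        // the value after "mem" is the number of pages charged to TCP sockets.
        private static long readTcpPagesInUse() throws IOException {
            List<String> lines = Files.readAllLines(Path.of("/proc/net/sockstat"));
            for (String line : lines) {
                if (line.startsWith("TCP:")) {
                    String[] tokens = line.trim().split("\\s+");
                    for (int i = 0; i < tokens.length - 1; i++) {
                        if ("mem".equals(tokens[i])) {
                            return Long.parseLong(tokens[i + 1]);
                        }
                    }
                }
            }
            throw new IOException("no TCP mem counter found in /proc/net/sockstat");
        }
    }

On the systems in this report a page is 4 KB, so the auto-configured "high" value of 753216 pages is the ~3 GB mentioned above. Once the "mem" counter sits at that limit, non-blocking writes can keep failing with EAGAIN even though epoll still reports the socket writable, which matches the strace quoted next.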
> The strace of one of the busy Kafka threads shows the following syscalls
> repeating constantly:
> epoll_pwait(15558, [{events=EPOLLOUT, data={u32=12286, u64=4681111381628432382}}], 1024, 300, NULL, 8) = 1
> fstat(12019, {st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0
> fstat(12019, {st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0
> sendfile(12286, 12019, [174899834], 947517) = -1 EAGAIN (Resource temporarily unavailable)
> Resetting the "tcp_mem" parameters back to the auto-configured values in the
> test environment removes the pressure from the broker, and it can continue
> normally without a restart.
> We have found a bug report which suggests that the issue may be partly due to
> a kernel bug:
> https://bugs.launchpad.net/ubuntu/+source/linux-meta-aws-6.2/+bug/2037335
> (they are using version 5.15)
> We have updated our kernel from 6.1.29 to 6.1.66, which made the issue harder
> to reproduce, but we can still trigger it by reducing all of the "tcp_mem"
> parameters by a factor of 1,000. The JVM behavior is the same under these
> conditions.
> A similar issue is reported here, affecting Kafka Connect:
> https://issues.apache.org/jira/browse/KAFKA-4739
> Our production Kafka is running version 3.3.2 and our test environment is
> running 3.6.1. The issue is present on both systems.
> The issue is also reproducible on JDK 11 (as you can see from the stack trace,
> we are using 17).
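For reference, the workaround in the Netty links quoted above amounts to counting select() calls that return with nothing ready and rebuilding the Selector once a threshold is exceeded. Below is a minimal sketch of that idea in plain Java NIO; it is not Kafka or Netty code, the SPIN_THRESHOLD name and value are invented, and Netty's real check additionally measures elapsed time to distinguish a premature return from an ordinary timeout.

    import java.io.IOException;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;

    // Illustrative only: a select loop that rebuilds its Selector when it
    // appears to be busy-spinning, in the spirit of the Netty workaround.
    public class RebuildingSelectLoop {

        // Hypothetical limit on consecutive "nothing ready" selects before
        // assuming the epoll busy-spin and rebuilding the selector.
        private static final int SPIN_THRESHOLD = 512;

        private Selector selector;

        public RebuildingSelectLoop() throws IOException {
            this.selector = Selector.open();
        }

        public void run(long timeoutMs) throws IOException {
            int emptySelects = 0;
            while (!Thread.currentThread().isInterrupted()) {
                int ready = selector.select(timeoutMs);
                if (ready > 0) {
                    emptySelects = 0;
                    handleReadyKeys();
                } else if (++emptySelects >= SPIN_THRESHOLD) {
                    // select() keeps returning without ready keys: move all
                    // registrations to a fresh selector and carry on.
                    rebuildSelector();
                    emptySelects = 0;
                }
            }
        }

        private void rebuildSelector() throws IOException {
            Selector newSelector = Selector.open();
            for (SelectionKey key : selector.keys()) {
                if (key.isValid()) {
                    // Re-register each channel with the same interest set and attachment.
                    key.channel().register(newSelector, key.interestOps(), key.attachment());
                }
            }
            selector.close(); // cancels all keys on the old selector
            selector = newSelector;
        }

        private void handleReadyKeys() {
            // A real loop would process each ready key here; the sketch just clears them.
            selector.selectedKeys().clear();
        }
    }

This only illustrates the approach taken in the linked Netty code; it does not claim to reflect how Kafka's own org.apache.kafka.common.network.Selector is implemented.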