Rajdeep Mukherjee created KAFKA-8714:
----------------------------------------
             Summary: CLOSE_WAIT connections piling up on the broker
                 Key: KAFKA-8714
                 URL: https://issues.apache.org/jira/browse/KAFKA-8714
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 2.3.0, 0.10.1.0
         Environment: Linux 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
            Reporter: Rajdeep Mukherjee
         Attachments: Screenshot from 2019-07-25 11-53-24.png, consumer_multiprocessing.py, producer_multiprocessing.py

We are experiencing an issue where `CLOSE_WAIT` connections pile up on the brokers, eventually leading to a `Too many open files` error and finally to a crash of the corresponding broker. After some digging, we realized that this happens when multiple clients (producers or consumers) close their connections within a brief interval of time, i.e. when the frequency of client connection closes spikes.

The actual error that we encountered was:

{code:java}
[2019-07-18 00:03:27,861] ERROR Error while accepting connection (kafka.network.Acceptor)
java.io.IOException: Too many open files
        at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
        at kafka.network.Acceptor.accept(SocketServer.scala:326)
        at kafka.network.Acceptor.run(SocketServer.scala:269)
        at java.lang.Thread.run(Thread.java:745)
{code}

When the error was encountered, the number of CLOSE_WAIT connections on the broker was about 200,000 and the number of ESTABLISHED connections was approximately 15,000. The attached screenshot shows the issue; the sharp dip in the graph is the point where the broker was restarted.

We have encountered this problem on both Kafka 0.10.1 and 2.3.0.

The client versions we were using to reproduce the issue were:

{code:java}
confluent-kafka==1.1.0
librdkafka v1.1.0
{code}

Steps to reproduce:

I have attached the scripts we used to reproduce the issue. In our QA environment we were able to reproduce it as follows:
* We spun up a 5-node Kafka v2.3.0 cluster.
* We prepared a Python script that spins up on the order of 500+ producer processes and the same number of consumer processes, with logic to randomly close the producer and consumer connections at a high frequency, on the order of 10 closes per second, for 5 minutes (a minimal sketch of this pattern is shown after the commands below).
* On the broker side, we watched for CLOSE_WAIT connections using `lsof` and saw sustained CLOSE_WAIT connections that persisted until we restarted Kafka on the corresponding broker (a small monitoring sketch is also included below).

The commands to run the producer and consumer scripts are:

{code:java}
python producer_multiprocessing.py <time in seconds> <number of processes> <sleep in seconds between produce> true true
python consumer_multiprocessing.py <time in seconds> <number of processes> 0 true
{code}
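For reference, the reproduction pattern is roughly the one sketched below. This is only a hypothetical, minimal approximation of the producer side; the attached producer_multiprocessing.py is the authoritative script, and the bootstrap address, topic name, process count, and lifetimes here are placeholders rather than the values from our environment.

{code:python}
#!/usr/bin/env python
# Hypothetical sketch of the reproduction pattern: many short-lived producer
# processes that connect, produce briefly, and then exit without a graceful
# shutdown, so their TCP connections to the brokers are dropped at a high rate.
import random
import time
from multiprocessing import Process

from confluent_kafka import Producer

BOOTSTRAP = 'broker1:9092'    # placeholder: the QA cluster bootstrap address
TOPIC = 'close-wait-test'     # placeholder: any existing topic
NUM_PROCESSES = 500           # placeholder: order of magnitude used in the test


def produce_then_die(lifetime_s):
    p = Producer({'bootstrap.servers': BOOTSTRAP})
    deadline = time.time() + lifetime_s
    while time.time() < deadline:
        p.produce(TOPIC, value=b'x')
        p.poll(0)             # serve delivery callbacks
        time.sleep(0.01)
    # No flush() or orderly shutdown: the process simply exits, closing its
    # client-side sockets abruptly.


if __name__ == '__main__':
    procs = [Process(target=produce_then_die, args=(random.uniform(1, 30),))
             for _ in range(NUM_PROCESSES)]
    for pr in procs:
        pr.start()
    for pr in procs:
        pr.join()
{code}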
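The `lsof`-based check on the broker side can also be approximated with a small script that reads /proc/net/tcp directly (TCP state 08 is CLOSE_WAIT). This is a sketch only; the port below is an assumption and should match the listener port of the broker being watched.

{code:python}
#!/usr/bin/env python
# Count CLOSE_WAIT sockets held by the broker's listener port by scanning
# /proc/net/tcp and /proc/net/tcp6. State '08' corresponds to CLOSE_WAIT.
BROKER_PORT = 9092  # assumption: default plaintext listener port


def count_close_wait(port=BROKER_PORT):
    total = 0
    for path in ('/proc/net/tcp', '/proc/net/tcp6'):
        try:
            with open(path) as f:
                next(f)  # skip header line
                for line in f:
                    fields = line.split()
                    local_addr, state = fields[1], fields[3]
                    local_port = int(local_addr.rsplit(':', 1)[1], 16)
                    if state == '08' and local_port == port:
                        total += 1
        except FileNotFoundError:
            pass
    return total


if __name__ == '__main__':
    print('CLOSE_WAIT on port %d: %d' % (BROKER_PORT, count_close_wait()))
{code}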