Rajdeep Mukherjee created KAFKA-8714:
----------------------------------------

             Summary: CLOSE_WAIT connections piling up on the broker
                 Key: KAFKA-8714
                 URL: https://issues.apache.org/jira/browse/KAFKA-8714
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 2.3.0, 0.10.1.0
         Environment: Linux 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 
10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
            Reporter: Rajdeep Mukherjee
         Attachments: Screenshot from 2019-07-25 11-53-24.png, 
consumer_multiprocessing.py, producer_multiprocessing.py

We are experiencing an issue where `CLOSE_WAIT` connections pile up on the brokers, eventually leading to a `Too many open files` error and finally a crash of the affected broker. Since a socket in `CLOSE_WAIT` means the peer has closed its end but the local process has not yet closed its file descriptor, the pile-up suggests the broker is not closing these sockets. After some digging, we realized that this happens when multiple clients (producers or consumers) close their connections within a brief interval of time, i.e. when the frequency of client connection closes spikes.

The actual error we encountered was:
{code:java}
[2019-07-18 00:03:27,861] ERROR Error while accepting connection (kafka.network.Acceptor)
java.io.IOException: Too many open files
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422) 
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250) 
at kafka.network.Acceptor.accept(SocketServer.scala:326)
at kafka.network.Acceptor.run(SocketServer.scala:269)
at java.lang.Thread.run(Thread.java:745)
{code}
When the error was encountered, the number of CLOSE_WAIT connections on the 
broker was 200,000 and the number of ESTABLISHED connections was approximately 
15,000.
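For reference, a minimal sketch of how connection states can be counted on a broker host (assuming `ss` from iproute2 is available; filtering `lsof` output works equally well):
{code:python}
# Count TCP sockets per state on the broker host.
# Note: `ss` reports the CLOSE_WAIT state as "CLOSE-WAIT" and ESTABLISHED as "ESTAB".
import collections
import subprocess

out = subprocess.run(["ss", "-tan"], capture_output=True, text=True, check=True).stdout
states = collections.Counter(
    line.split()[0] for line in out.splitlines()[1:] if line.strip()
)
print("CLOSE_WAIT:", states.get("CLOSE-WAIT", 0))
print("ESTABLISHED:", states.get("ESTAB", 0))
{code}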

The attached screenshot shows the issue; the sharp dip in the graph is the point where 
the broker was restarted.

We encountered this problem on both Kafka 0.10.1.0 and 2.3.0.

The client versions we used to reproduce the issue were:
{code:java}
confluent-kafka==1.1.0
librdkafka v1.1.0
{code}

Steps to reproduce:

I have attached the scripts we used to reproduce the issue.

In our QA environment we were able to reproduce the issue in the following way:
 * we spun up a 5-node Kafka v2.3.0 cluster
 * we prepared Python scripts that spin up on the order of 500+ producer processes and the same number of consumer processes, with logic that randomly closes the producer and consumer connections at a high frequency, on the order of 10 closes per second, for 5 minutes (a minimal sketch of this close logic is shown after this list)
 * on the broker side, we watched for CLOSE_WAIT connections using `lsof`; the CLOSE_WAIT connections kept accumulating and persisted until we restarted Kafka on the corresponding broker.
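The attached scripts contain the full logic; the following is only a minimal sketch of the consumer-side close pattern, assuming the confluent-kafka Python client, a placeholder broker address `broker1:9092`, and a hypothetical topic `test-topic`:
{code:python}
# Minimal sketch (not the attached script): each worker connects a consumer,
# polls for a short random interval, then closes, so that many clients close
# their connections to the brokers within a brief window.
import random
import time
from multiprocessing import Process

from confluent_kafka import Consumer

def worker(idx):
    consumer = Consumer({
        "bootstrap.servers": "broker1:9092",   # placeholder broker address
        "group.id": "close-wait-repro-%d" % idx,
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["test-topic"])         # hypothetical topic name
    deadline = time.time() + random.uniform(0.5, 5.0)
    while time.time() < deadline:
        consumer.poll(timeout=0.1)
    consumer.close()                           # rapid closes like this trigger the pile-up

if __name__ == "__main__":
    procs = [Process(target=worker, args=(i,)) for i in range(500)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
{code}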

The commands to run the producer and consumer scripts are:
{code:java}
python producer_multiprocessing.py <time in seconds> <number of processes> <sleep in seconds between produce> true true

python consumer_multiprocessing.py <time in seconds> <number of processes> 0 true
{code}
{code}
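For example, an invocation matching the test described above (5 minutes = 300 seconds, 500 processes) would look like the following; the 0.1-second sleep between produces is an illustrative value, not necessarily the one we used:
{code:java}
python producer_multiprocessing.py 300 500 0.1 true true
python consumer_multiprocessing.py 300 500 0 true
{code}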
 


