[ https://issues.apache.org/jira/browse/KAFKA-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14483849#comment-14483849 ]
Allen Wang commented on KAFKA-2096:
-----------------------------------

[~junrao], yes, I would like to submit a patch. One thing to consider is whether we want to make this configurable. My understanding is that TCP keepalive should not affect clients. The only side effect is an increase in network traffic due to the probes. On the other hand, making it configurable is less intrusive.

> Enable keepalive socket option for broker to prevent socket leak
> ----------------------------------------------------------------
>
>                 Key: KAFKA-2096
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2096
>             Project: Kafka
>          Issue Type: Improvement
>          Components: network
>    Affects Versions: 0.8.2.1
>            Reporter: Allen Wang
>            Assignee: Jun Rao
>            Priority: Critical
>
> We run a Kafka 0.8.2.1 cluster in AWS with a large number of producers
> (> 10000). Also, the number of producer instances scales up and down
> significantly on a daily basis.
> The issue we found is that after 10 days, the open file descriptor count will
> approach the limit of 32K. An investigation of these open file descriptors
> shows that a significant portion of them are from client instances that were
> terminated during scaling down. Somehow they still show as "ESTABLISHED" in
> netstat. We suspect that the AWS firewall between the client and broker
> causes this issue.
> We attempted to use the "keepalive" socket option to reduce this socket leak
> on the broker, and it appears to be working. Specifically, we added this line
> to kafka.network.Acceptor.accept():
>     socketChannel.socket().setKeepAlive(true)
> During our experiment with this change, we confirmed that netstat entries
> whose client instance had been terminated were probed as configured in the
> operating system. After the configured number of probes, the OS determined
> that the peer was no longer alive and the entry was removed, possibly after
> Kafka hit an error reading from the channel and closed it.
> Also, our experiment shows that after a few days, the instance was able to
> keep a stable low point of open file descriptor count, compared with other
> instances where the low point keeps increasing day to day.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
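
For reference, the configurable variant raised in the comment could be sketched roughly as below. This is only an illustration under stated assumptions: the class name KeepAliveAcceptorSketch and the keepAliveEnabled flag are hypothetical and do not correspond to existing Kafka code or broker configs. Only the setKeepAlive(true) call is taken from the issue description.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class KeepAliveAcceptorSketch {

    // Accepts one connection and optionally enables SO_KEEPALIVE on it.
    // keepAliveEnabled is a hypothetical switch standing in for a broker
    // config; no such Kafka setting is implied to exist.
    static SocketChannel accept(ServerSocketChannel server, boolean keepAliveEnabled)
            throws IOException {
        SocketChannel channel = server.accept();
        if (keepAliveEnabled) {
            // Same call as the one-line change described in the issue.
            channel.socket().setKeepAlive(true);
        }
        return channel;
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            // Bind to an ephemeral loopback port for the demonstration.
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            InetSocketAddress address = (InetSocketAddress) server.getLocalAddress();
            try (SocketChannel client = SocketChannel.open(address);
                 SocketChannel accepted = accept(server, true)) {
                System.out.println("SO_KEEPALIVE=" + accepted.socket().getKeepAlive());
            }
        }
    }
}
```

Note that the flag only turns probing on or off. The probe idle time, interval, and count still come from the operating system settings, as described in the issue.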