[ https://issues.apache.org/jira/browse/KAFKA-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486355#comment-14486355 ]
Allen Wang commented on KAFKA-2096:
-----------------------------------

To verify the fix, the broker's socket connections listed by netstat -o should show "keepalive" at the end of the line:

tcp6 0 0 xyz-:7101 ip-10-81-144-131.:48779 ESTABLISHED keepalive (7111.94/0/0)

> Enable keepalive socket option for broker to prevent socket leak
> ----------------------------------------------------------------
>
>                 Key: KAFKA-2096
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2096
>             Project: Kafka
>          Issue Type: Improvement
>          Components: network
>    Affects Versions: 0.8.2.1
>            Reporter: Allen Wang
>            Assignee: Jun Rao
>            Priority: Critical
>         Attachments: patch.diff
>
>
> We run a Kafka 0.8.2.1 cluster in AWS with a large number of producers
> (> 10000). The number of producer instances also scales up and down
> significantly on a daily basis.
> The issue we found is that after 10 days, the open file descriptor count
> approaches the limit of 32K. An investigation of these open file descriptors
> shows that a significant portion of them belong to client instances that
> were terminated during scale-down. Somehow they still show as "ESTABLISHED"
> in netstat. We suspect that the AWS firewall between the client and the
> broker causes this issue.
> We attempted to use the "keepalive" socket option to reduce this socket leak
> on the broker, and it appears to be working. Specifically, we added this
> line to kafka.network.Acceptor.accept():
> socketChannel.socket().setKeepAlive(true)
> During our experiment with this change, we confirmed that netstat entries
> for terminated client instances were probed as configured in the operating
> system. After the configured number of probes, the OS determined that the
> peer was no longer alive and the entry was removed, possibly after Kafka hit
> an error reading from the channel and closed it.
> Our experiment also shows that after a few days, the patched instance held a
> stable low point in its open file descriptor count, whereas on other
> instances the low point keeps increasing from day to day.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
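
For readers who want to try the keepalive change from the issue description in isolation, here is a minimal Java sketch. It is not the attached patch.diff and not Kafka's Acceptor code. The class name KeepAliveAcceptSketch and the helper acceptWithKeepAlive are hypothetical names made up for illustration. The only call taken from the issue is socketChannel.socket().setKeepAlive(true). The sketch accepts one loopback connection, enables SO_KEEPALIVE on it, and reads the option back.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class KeepAliveAcceptSketch {

    // Hypothetical helper (not Kafka code): accept one connection and
    // enable SO_KEEPALIVE on it, mirroring the one-line change in the issue.
    public static SocketChannel acceptWithKeepAlive(ServerSocketChannel server)
            throws IOException {
        SocketChannel socketChannel = server.accept();
        socketChannel.socket().setKeepAlive(true);
        return socketChannel;
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            // Bind to an ephemeral loopback port so the sketch is self-contained.
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));

            // Connect a local client, then accept it on the server side.
            try (SocketChannel client = SocketChannel.open(server.getLocalAddress());
                 SocketChannel accepted = acceptWithKeepAlive(server)) {
                // Read the option back to confirm it was applied to the accepted socket.
                System.out.println("keepalive enabled: "
                        + accepted.socket().getKeepAlive());
            }
        }
    }
}
```

Setting the option only asks the operating system to probe idle connections. When the first probe is sent, and how many failed probes it takes to declare the peer dead, are decided by OS settings (on Linux, the net.ipv4.tcp_keepalive_* sysctls), which is consistent with the "probed as configured in operating system" observation above.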