I can't speak to the exact details of why fds would be kept open longer in that specific case, but are you aware that the recommended open file descriptor limit for production clusters is much higher? 100,000 has been suggested as a starting point for quite a while: http://kafka.apache.org/documentation.html#os
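
If it would help to see how close the broker is getting to the limit over time, the counts are exposed through the JVM's standard OperatingSystem MXBean, which you can poll remotely once JMX is enabled on the broker (e.g. by exporting JMX_PORT=9999 before starting it). Here's a minimal sketch in Java; "broker-host", the port, and the class name are placeholders, and it assumes a HotSpot/OpenJDK broker on Linux, where that MXBean implements com.sun.management.UnixOperatingSystemMXBean:

import javax.management.JMX;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

import com.sun.management.UnixOperatingSystemMXBean;

public class BrokerFdWatcher {
    public static void main(String[] args) throws Exception {
        // Placeholder address: point this at the broker's JMX port.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // Standard JVM MXBean; on Linux it reports the fd usage of the
            // broker process itself.
            UnixOperatingSystemMXBean os = JMX.newMXBeanProxy(
                    conn,
                    new ObjectName("java.lang:type=OperatingSystem"),
                    UnixOperatingSystemMXBean.class);
            while (true) {
                long open = os.getOpenFileDescriptorCount();
                long max = os.getMaxFileDescriptorCount();
                System.out.printf("open fds: %d / limit: %d%n", open, max);
                Thread.sleep(10_000L); // poll every 10 seconds
            }
        }
    }
}

Logging that ratio, or alerting when it climbs past say 80%, should give you warning well before the Acceptor starts throwing "Too many open files". For the CLOSE_WAIT pile-up you describe, see the second sketch after the quoted message below.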
-Ewen

On Mon, Jan 9, 2017 at 12:45 PM, Stephen Powis <spo...@salesforce.com> wrote:

> Hey!
>
> I've run into something concerning in our production cluster.... I believe
> I've posted this question to the mailing list previously (
> http://mail-archives.apache.org/mod_mbox/kafka-users/201609.mbox/browser)
> but the problem has become considerably more serious.
>
> We've been fighting issues where Kafka 0.10.0.1 hits its max file
> descriptor limit. Our limit is set to ~16k, and under normal operation it
> holds steady around 4k open files.
>
> But occasionally Kafka will roll a new log segment, which typically takes
> on the order of a few milliseconds. However... sometimes it will take a
> considerable amount of time, anywhere from 40 seconds up to over a minute.
> When this happens, it seems like connections are not released by Kafka,
> and we end up with thousands of client connections stuck in CLOSE_WAIT,
> which pile up and exceed our max file descriptor limit. This all happens
> in the span of about a minute.
>
> Our logs look like this:
>
> > [2017-01-08 01:10:17,117] INFO Rolled new log segment for 'MyTopic-8' in
> > 41122 ms. (kafka.log.Log)
> > [2017-01-08 01:10:32,550] INFO Rolled new log segment for 'MyTopic-4' in
> > 1 ms. (kafka.log.Log)
> > [2017-01-08 01:11:10,039] INFO [Group Metadata Manager on Broker 4]:
> > Removed 0 expired offsets in 0 milliseconds.
> > (kafka.coordinator.GroupMetadataManager)
> > [2017-01-08 01:19:02,877] ERROR Error while accepting connection
> > (kafka.network.Acceptor)
> > java.io.IOException: Too many open files
> >         at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
> >         at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
> >         at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
> >         at kafka.network.Acceptor.accept(SocketServer.scala:323)
> >         at kafka.network.Acceptor.run(SocketServer.scala:268)
> >         at java.lang.Thread.run(Thread.java:745)
> > [2017-01-08 01:19:02,877] ERROR Error while accepting connection
> > (kafka.network.Acceptor)
> > java.io.IOException: Too many open files
> >         at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
> >         at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
> >         at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
> >         at kafka.network.Acceptor.accept(SocketServer.scala:323)
> >         at kafka.network.Acceptor.run(SocketServer.scala:268)
> >         at java.lang.Thread.run(Thread.java:745)
> > .....
>
> And then Kafka crashes.
>
> Has anyone seen this behavior of slow log segments being rolled? Any
> ideas of how to track down what could be causing this?
>
> Thanks!
> Stephen
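
For watching the CLOSE_WAIT pile-up itself, one option on the broker host is to read /proc/net/tcp directly instead of shelling out to netstat. This is only a rough illustration: it is Linux-only, covers IPv4 only (/proc/net/tcp, not /proc/net/tcp6), and relies on the kernel's hex state code 08 meaning CLOSE_WAIT:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class CloseWaitCount {
    public static void main(String[] args) throws Exception {
        // /proc/net/tcp: the first line is a header; the 4th field ("st")
        // is the socket state in hex, and 08 is CLOSE_WAIT.
        try (Stream<String> lines = Files.lines(Paths.get("/proc/net/tcp"))) {
            long closeWait = lines.skip(1) // skip the header row
                    .map(line -> line.trim().split("\\s+"))
                    .filter(fields -> fields.length > 3 && fields[3].equals("08"))
                    .count();
            System.out.println("CLOSE_WAIT sockets: " + closeWait);
        }
    }
}

Graphing that count next to the fd count above should make it obvious whether the slow segment rolls and the connection pile-up line up in time.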