Hi Jun,

That was the problem. It was actually the Ubuntu upstart job overwriting the limit. Thank you very much for your help.
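In case anyone else hits this: an upstart job does not pick up the nofile limit from /etc/security/limits.conf, so the limit has to be raised in the job definition itself. A minimal sketch of the stanza (the job file name and the values here are examples, not our actual config):

```
# /etc/init/kafka.conf  (hypothetical job file name)
# Raise the soft and hard open-file limits for the job's process.
# Without this, upstart starts the process with the default (often 4096),
# regardless of what limits.conf or the login shell's ulimit says.
limit nofile 100000 100000
```

After editing the job file, the service has to be stopped and started (not just restarted) for the new limit to apply.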
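To confirm which limit the kernel is actually enforcing for a running process (as opposed to what `ulimit -n` reports in your own shell), /proc can be read directly. A sketch — it uses the current shell (`$$`) for illustration; in practice substitute the broker's PID, e.g. from pgrep:

```shell
# The PID to inspect; replace $$ with the Kafka broker's PID in practice.
PID=$$

# The open-file limit the kernel enforces for that specific process:
grep "Max open files" /proc/${PID}/limits

# The number of file descriptors the process currently holds:
ls /proc/${PID}/fd | wc -l
```

Comparing these two numbers for the broker process is the quickest way to see whether the process was started with a lower limit than expected.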
Paul Lung

On 7/9/14, 1:58 PM, "Jun Rao" <jun...@gmail.com> wrote:

>Is it possible your container wrapper somehow overrides the file handle
>limit?
>
>Thanks,
>
>Jun
>
>On Wed, Jul 9, 2014 at 9:59 AM, Lung, Paul <pl...@ebay.com> wrote:
>
>> Yup. In fact, I just ran the test program again while the Kafka broker
>> is still running, using the same user of course. I was able to get up
>> to 10K connections with the test program. The test program uses the
>> same Java NIO library that the broker does, so the machine is capable
>> of handling that many connections. The only issue I saw was that the
>> NIO ServerSocketChannel is a bit slow at accepting connections when
>> the total connection count gets to around 4K, but this could be due
>> to the fact that I put the ServerSocketChannel in the same Selector
>> as the 4K SocketChannels. So sometimes on the client side, I see:
>>
>> java.io.IOException: Connection reset by peer
>>         at sun.nio.ch.FileDispatcher.write0(Native Method)
>>         at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
>>         at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:122)
>>         at sun.nio.ch.IOUtil.write(IOUtil.java:93)
>>         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:352)
>>         at FdTest$ClientThread.run(FdTest.java:108)
>>
>> But all I have to do is sleep for a bit on the client and then retry.
>> However, 4K does seem like a magic number, since that seems to be the
>> number of connections the Kafka broker machine can handle before it
>> gives me the "Too Many Open Files" error and eventually crashes.
>>
>> Paul Lung
>>
>> On 7/8/14, 9:29 PM, "Jun Rao" <jun...@gmail.com> wrote:
>>
>> >Does your test program run as the same user as the Kafka broker?
>> >
>> >Thanks,
>> >
>> >Jun
>> >
>> >On Tue, Jul 8, 2014 at 1:42 PM, Lung, Paul <pl...@ebay.com> wrote:
>> >
>> >> Hi Guys,
>> >>
>> >> I'm seeing the following errors from the 0.8.1.1 broker. This
>> >> occurs most often on the Controller machine.
>> >> Then the controller process crashes, and the controller bounces
>> >> to other machines, which causes those machines to crash. Looking
>> >> at the file descriptors being held by the process, it's only
>> >> around 4000 or so (looking at . There aren't a whole lot of
>> >> connections in TIME_WAIT states, and I've increased the ephemeral
>> >> port range to "16000 64000" via
>> >> "/proc/sys/net/ipv4/ip_local_port_range". I've written a Java test
>> >> program to see how many sockets and files I can open. The socket
>> >> count is definitely limited by the ephemeral port range, which was
>> >> around 22K at the time. But I can open tons of files, since the
>> >> open file limit of the user is set to 100K.
>> >>
>> >> So given that I can theoretically open 48K sockets and probably
>> >> 90K files, and I only see around 4K total for the Kafka broker,
>> >> I'm really confused as to why I'm seeing this error. Is there some
>> >> internal Kafka limit that I don't know about?
>> >>
>> >> Paul Lung
>> >>
>> >> java.io.IOException: Too many open files
>> >>         at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
>> >>         at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:163)
>> >>         at kafka.network.Acceptor.accept(SocketServer.scala:200)
>> >>         at kafka.network.Acceptor.run(SocketServer.scala:154)
>> >>         at java.lang.Thread.run(Thread.java:679)
>> >>
>> >> [2014-07-08 13:07:21,534] ERROR Error in acceptor (kafka.network.Acceptor)
>> >> java.io.IOException: Too many open files
>> >>         at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
>> >>         at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:163)
>> >>         at kafka.network.Acceptor.accept(SocketServer.scala:200)
>> >>         at kafka.network.Acceptor.run(SocketServer.scala:154)
>> >>         at java.lang.Thread.run(Thread.java:679)
>> >>
>> >> [2014-07-08 13:07:21,563] ERROR [ReplicaFetcherThread-3-2124488], Error
>> >> for partition [bom__021____active_80__32__mini____activeitem_lvs_qn,0]
>> >> to broker 2124488:class kafka.common.NotLeaderForPartitionException
>> >> (kafka.server.ReplicaFetcherThread)
>> >>
>> >> [2014-07-08 13:07:21,558] FATAL [Replica Manager on Broker 2140112]: Error
>> >> writing to highwatermark file: (kafka.server.ReplicaManager)
>> >> java.io.FileNotFoundException:
>> >> /ebay/cronus/software/cronusapp_home/kafka/kafka-logs/replication-offset-checkpoint.tmp
>> >> (Too many open files)
>> >>         at java.io.FileOutputStream.open(Native Method)
>> >>         at java.io.FileOutputStream.<init>(FileOutputStream.java:209)
>> >>         at java.io.FileOutputStream.<init>(FileOutputStream.java:160)
>> >>         at java.io.FileWriter.<init>(FileWriter.java:90)
>> >>         at kafka.server.OffsetCheckpoint.write(OffsetCheckpoint.scala:37)
>> >>         at kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$2.apply(ReplicaManager.scala:447)
>> >>         at kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$2.apply(ReplicaManager.scala:444)
>> >>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>> >>         at scala.collection.immutable.Map$Map1.foreach(Map.scala:109)
>> >>         at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>> >>         at kafka.server.ReplicaManager.checkpointHighWatermarks(ReplicaManager.scala:444)
>> >>         at kafka.server.ReplicaManager$$anonfun$1.apply$mcV$sp(ReplicaManager.scala:94)
>> >>         at kafka.utils.KafkaScheduler$$anon$1.run(KafkaScheduler.scala:100)
>> >>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>> >>         at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
>> >>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
>> >>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
>> >>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
>> >>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>> >>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>> >>         at java.lang.Thread.run(Thread.java:679)