We are running 3 Kafka nodes, which serve 4 partitions. We have been experiencing weird behavior during network outages.
It has happened twice in the last couple of days. The previous outage took down the whole cluster; this time 2 of the 3 brokers survived, but one node became the leader of all partitions and the other node stayed in the ISR of only 1 partition (out of 4).

My best guess is that while the network is down, a broker cannot connect to the other brokers to replicate and keeps opening new sockets without ever closing them, but I'm not entirely sure about this. Is there any way to mitigate the problem, or is there a configuration option that would stop this from happening again?

The java/kafka process opens too many socket file descriptors. Running `lsof -a -p 11818` yields thousands of lines like these:

    ...
    java    11818    kafka    3059u    sock    0,7    0t0    615637305    can't identify protocol
    java    11818    kafka    3060u    sock    0,7    0t0    615637306    can't identify protocol
    java    11818    kafka    3061u    sock    0,7    0t0    615637307    can't identify protocol
    java    11818    kafka    3062u    sock    0,7    0t0    615637308    can't identify protocol
    java    11818    kafka    3063u    sock    0,7    0t0    615637309    can't identify protocol
    java    11818    kafka    3064u    sock    0,7    0t0    615637310    can't identify protocol
    java    11818    kafka    3065u    sock    0,7    0t0    615637311    can't identify protocol
    ...

I verified that these sockets are never closed: repeating the command after 2 minutes showed the same descriptors still open.

Meanwhile, the Kafka log on the broken node fills up with errors like these (the acceptor error just repeats):

    [2014-01-21 04:21:48,819] 64573925 [kafka-acceptor] ERROR kafka.network.Acceptor  - Error in acceptor
    java.io.IOException: Too many open files
            at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
            at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:165)
            at kafka.network.Acceptor.accept(SocketServer.scala:200)
            at kafka.network.Acceptor.run(SocketServer.scala:154)
            at java.lang.Thread.run(Thread.java:701)
    [2014-01-21 04:21:48,811] 64573917 [ReplicaFetcherThread-0-1] INFO kafka.consumer.SimpleConsumer  - Reconnect due to socket error: null
    [2014-01-21 04:21:48,819] 64573925 [ReplicaFetcherThread-0-1] WARN kafka.server.ReplicaFetcherThread  - [ReplicaFetcherThread-0-1], Error in fetch Name: FetchRequest; Version: 0; CorrelationId: 74930218; ClientId: ReplicaFetcherThread-0-1; ReplicaId: 2; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: [some-topic,0] -> PartitionFetchInfo(959825,1048576),[some-topic,3] -> PartitionFetchInfo(551546,1048576)
    java.net.SocketException: Too many open files
            at sun.nio.ch.Net.socket0(Native Method)
            at sun.nio.ch.Net.socket(Net.java:156)
            at sun.nio.ch.SocketChannelImpl.<init>(SocketChannelImpl.java:102)
            at sun.nio.ch.SelectorProviderImpl.openSocketChannel(SelectorProviderImpl.java:55)
            at java.nio.channels.SocketChannel.open(SocketChannel.java:122)
            at kafka.network.BlockingChannel.connect(BlockingChannel.scala:48)
            at kafka.consumer.SimpleConsumer.connect(SimpleConsumer.scala:44)
            at kafka.consumer.SimpleConsumer.reconnect(SimpleConsumer.scala:57)
            at kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:79)
            at kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:71)
            at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SimpleConsumer.scala:110)
            at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:110)
            at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:110)
            at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
            at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:109)
            at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:109)
            at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:109)
            at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
            at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:108)
            at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:94)
            at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:86)
            at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)
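In case it is useful for comparing numbers, a quick way to keep an eye on the leak is something like this (just a sketch; 11818 is the broker PID from the lsof output above):

    # Total descriptors held by the broker process
    ls /proc/11818/fd | wc -l

    # Only the orphaned sockets that lsof cannot identify
    lsof -a -p 11818 | grep -c "can't identify protocol"

    # Re-check every 2 minutes to see whether the count ever drops
    watch -n 120 'lsof -a -p 11818 | grep -c "can't identify protocol"'

As far as I can tell, lsof prints "can't identify protocol" when it cannot match the socket's inode to anything in /proc/net, which would fit the theory that these are dead sockets that were never properly closed.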
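The only stopgap I have come up with so far is raising the file-descriptor limit for the kafka user, along these lines in /etc/security/limits.conf (a sketch; the 100000 value is an arbitrary guess, and this only postpones the crash rather than fixing the leak):

    # /etc/security/limits.conf -- raise the open-file limit for the kafka user
    kafka    soft    nofile    100000
    kafka    hard    nofile    100000

    # then log in again (or restart the broker) and verify as the kafka user
    ulimit -n

A broker setting that forces the replica fetcher to close its old socket before reconnecting would be the real fix, if such a thing exists.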
--
Ahmy Yulrizka
http://ahmy.yulrizka.com
@yulrizka