Hi,

We are also facing the same issue since we upgraded our cluster to 0.11.0.1, and it happened in our production cluster. Can anyone please help us out?
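In case it helps anyone narrow down where the descriptors are going, here is a minimal sketch of how the open file descriptor count of a broker JVM can be watched through the standard java.lang.management API (the FdMonitor class name and the 10-second interval are just illustrative, not anything shipped with Kafka). The same OpenFileDescriptorCount attribute is also exposed over JMX on the java.lang:type=OperatingSystem MBean, so it can be read remotely from a running broker instead of running extra code inside it.

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

// Minimal sketch: print the process's open vs. maximum file descriptor
// counts every 10 seconds. On Unix HotSpot/OpenJDK JVMs the platform
// OperatingSystemMXBean is a UnixOperatingSystemMXBean, which exposes
// these counters; FdMonitor and the interval are illustrative only.
public class FdMonitor {
    public static void main(String[] args) throws InterruptedException {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (!(os instanceof UnixOperatingSystemMXBean)) {
            System.err.println("File descriptor counts are not exposed on this JVM/OS");
            return;
        }
        UnixOperatingSystemMXBean unixOs = (UnixOperatingSystemMXBean) os;
        while (true) {
            System.out.printf("open fds: %d / limit: %d%n",
                    unixOs.getOpenFileDescriptorCount(),
                    unixOs.getMaxFileDescriptorCount());
            Thread.sleep(10_000);
        }
    }
}

Correlating that counter with lsof/netstat output around the time the replica fetcher errors start should show whether the growth comes from replica fetcher connections or from client connections, which is the open question in the original mail below.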
Thanks,
Madhukar

On Fri, 15 Sep 2017 at 9:07 PM, Lukas Lalinsky <lukas.lalin...@exponea.com> wrote:
> Hello,
>
> I'm dealing with a strange issue in production and I'm running out of
> options for what to do about it.
>
> It's a 3-node cluster running Kafka 0.11.0.1, with most topics having a
> replication factor of 2. At some point, the broker that is about to
> die shrinks the ISR for a few partitions down to just itself:
>
> [2017-09-15 11:25:29,104] INFO Partition [...,12] on broker 3:
> Shrinking ISR from 3,2 to 3 (kafka.cluster.Partition)
> [2017-09-15 11:25:29,107] INFO Partition [...,8] on broker 3:
> Shrinking ISR from 3,1 to 3 (kafka.cluster.Partition)
> [2017-09-15 11:25:29,108] INFO Partition [...,38] on broker 3:
> Shrinking ISR from 3,2 to 3 (kafka.cluster.Partition)
>
> Then slightly after that, another broker writes errors like this to
> the log file:
>
> [2017-09-15 11:25:45,536] WARN [ReplicaFetcherThread-0-3]: Error in
> fetch to broker 3, request (type=FetchRequest, replicaId=2,
> maxWait=500, minBytes=1, maxBytes=10485760, fetchData={...})
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 3 was disconnected before the
> response was read
>         at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:93)
>         at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:93)
>         at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:207)
>         at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
>         at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:151)
>         at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:112)
>         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:64)
>
> There are many such messages. At that point, I see the number of
> open file descriptors on the other broker growing, and eventually it
> crashes with thousands of messages like this:
>
> [2017-09-15 11:31:23,273] ERROR Error while accepting connection
> (kafka.network.Acceptor)
> java.io.IOException: Too many open files
>         at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
>         at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
>         at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
>         at kafka.network.Acceptor.accept(SocketServer.scala:337)
>         at kafka.network.Acceptor.run(SocketServer.scala:280)
>         at java.lang.Thread.run(Thread.java:745)
>
> The file descriptor limit is set to 128k, and the number of open file
> descriptors during normal operation is about 8k, so there is a lot of
> headroom.
>
> I'm not sure if it's the other brokers trying to replicate that kills
> it, or whether it's clients trying to publish messages.
>
> Has anyone seen behavior like this? I'd appreciate any pointers.
>
> Thanks,
>
> Lukas

--
Thanks and Regards,
Madhukar Bharti
Mob: 7845755539