Rong Tang created KAFKA-6375:
--------------------------------

             Summary: Follower replicas can never catch up to be ISR due to creating ReplicaFetcherThread failed.
                 Key: KAFKA-6375
                 URL: https://issues.apache.org/jira/browse/KAFKA-6375
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 0.10.2.0
         Environment: Windows, 23 brokers KafkaCluster
            Reporter: Rong Tang

Hi,

I hit a case where the out-of-sync replicas on one broker never catch up.

When the broker starts up, it receives LeaderAndIsr requests from the controller, which call createFetcherThread. The thread creation failed with the exceptions below. After that, there is no fetcher for these follower replicas, so they stay out of sync forever, unless the broker later receives a LeaderAndIsr request with a higher leader epoch. Restarting the broker mitigates the issue.

I have two questions.
First, why did creating the new ReplicaFetcherThread fail?
*Second, shouldn't Kafka do something to fail over, instead of leaving the broker in an abnormal state?*

It is a 23-broker Kafka cluster running on Windows. Each broker has 330 replicas.

[2017-12-13 16:29:21,317] ERROR Error on broker 1000 while processing LeaderAndIsr request with correlationId 1 received from controller 427703487 epoch 22 (state.change.logger)
org.apache.kafka.common.KafkaException: java.io.IOException: Unable to establish loopback connection
	at org.apache.kafka.common.network.Selector.<init>(Selector.java:124)
	at kafka.server.ReplicaFetcherThread.<init>(ReplicaFetcherThread.scala:87)
	at kafka.server.ReplicaFetcherManager.createFetcherThread(ReplicaFetcherManager.scala:35)
	at kafka.server.AbstractFetcherManager$$anonfun$addFetcherForPartitions$2.apply(AbstractFetcherManager.scala:83)
	at kafka.server.AbstractFetcherManager$$anonfun$addFetcherForPartitions$2.apply(AbstractFetcherManager.scala:78)
	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
	at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
	at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
	at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
	at kafka.server.AbstractFetcherManager.addFetcherForPartitions(AbstractFetcherManager.scala:78)
	at kafka.server.ReplicaManager.makeFollowers(ReplicaManager.scala:869)
	at kafka.server.ReplicaManager.becomeLeaderOrFollower(ReplicaManager.scala:689)
	at kafka.server.KafkaApis.handleLeaderAndIsrRequest(KafkaApis.scala:149)
	at kafka.server.KafkaApis.handle(KafkaApis.scala:83)
	at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:60)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Unable to establish loopback connection
	at sun.nio.ch.PipeImpl$Initializer.run(PipeImpl.java:94)
	at sun.nio.ch.PipeImpl$Initializer.run(PipeImpl.java:61)
	at java.security.AccessController.doPrivileged(Native Method)
	at sun.nio.ch.PipeImpl.<init>(PipeImpl.java:171)
	at sun.nio.ch.SelectorProviderImpl.openPipe(SelectorProviderImpl.java:50)
	at java.nio.channels.Pipe.open(Pipe.java:155)
	at sun.nio.ch.WindowsSelectorImpl.<init>(WindowsSelectorImpl.java:127)
	at sun.nio.ch.WindowsSelectorProvider.openSelector(WindowsSelectorProvider.java:44)
	at java.nio.channels.Selector.open(Selector.java:227)
	at org.apache.kafka.common.network.Selector.<init>(Selector.java:122)
	... 16 more
Caused by: java.net.ConnectException: Connection timed out: connect
	at sun.nio.ch.Net.connect0(Native Method)
	at sun.nio.ch.Net.connect(Net.java:454)
	at sun.nio.ch.Net.connect(Net.java:446)
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
	at java.nio.channels.SocketChannel.open(SocketChannel.java:189)
	at sun.nio.ch.PipeImpl$Initializer$LoopbackConnector.run(PipeImpl.java:127)
	at sun.nio.ch.PipeImpl$Initializer.run(PipeImpl.java:76)
	... 25 more
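
For anyone trying to reproduce the first failure outside Kafka: the trace shows the exception coming from java.nio.channels.Selector.open(), which on Windows goes through Pipe.open() and a loopback socket connect. A minimal probe along those lines might open and close selectors in a loop and count how many attempts fail. This is only a sketch; the class name SelectorProbe and the method countFailures are hypothetical names for illustration and are not part of Kafka, and it assumes a JDK whose Windows selector still uses the loopback pipe seen in the trace.

```java
import java.io.IOException;
import java.nio.channels.Selector;

// Hypothetical diagnostic, not Kafka code: exercises the same JDK call
// (Selector.open) that fails in the stack trace above.
class SelectorProbe {

    // Opens and closes `attempts` selectors one after another and
    // returns how many of those attempts threw an IOException.
    static int countFailures(int attempts) {
        int failures = 0;
        for (int i = 0; i < attempts; i++) {
            long startNanos = System.nanoTime();
            try (Selector selector = Selector.open()) {
                long elapsedMs = (System.nanoTime() - startNanos) / 1_000_000;
                System.out.println("attempt " + i + ": opened in " + elapsedMs + " ms");
            } catch (IOException e) {
                failures++;
                long elapsedMs = (System.nanoTime() - startNanos) / 1_000_000;
                System.out.println("attempt " + i + ": failed after " + elapsedMs + " ms: " + e);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Default of 10 attempts is an arbitrary choice for illustration.
        int attempts = args.length > 0 ? Integer.parseInt(args[0]) : 10;
        int failures = countFailures(attempts);
        System.out.println(failures + " of " + attempts + " attempts failed");
    }
}
```

Running it on the affected broker host (for example `java SelectorProbe 50`) while the broker is under load would show whether selector creation itself times out on loopback, independent of Kafka. A non-zero failure count would point at the host's loopback networking rather than at the replica fetcher code.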