[ https://issues.apache.org/jira/browse/KAFKA-1460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202488#comment-14202488 ]
Jon Riegel commented on KAFKA-1460:
-----------------------------------

We experienced this issue with a production deployment running version 0.8.1. The cluster is configured with 3 brokers, and zookeeper runs as a 3-node cluster on the same hardware (3 different m3xl EC2 instances). One of the hosts (which had been acting as zookeeper leader as well as kafka controller) experienced a hardware failure. The new controller attempted to initiate preferred replica elections for each of our topics and partitions, and received the above error for all of them. Subsequently the cluster entered an unrecoverable bad state.

By design, the dead broker should no longer have been the leader of any partition, and it should have been removed from the ISR set of all partitions; instead, the ISR shrank for the partitions led by other brokers, but NOT for the partitions that had been led by the failed broker. Our producers are configured with request.required.acks=2, so about half of their messages were received and acknowledged by the remaining two nodes while the failed node was down.

When the 3rd node was brought back up, it was unable to join the cluster. Immediately after startup, it began repeatedly logging these two warnings for each topic-partition that it had led:

{noformat}
[2014-11-04 21:51:47,724] WARN [Replica Manager on Broker 1]: While recording the follower position, the partition [prod.request-performance,3] hasn't been created, skip updating leader HW (kafka.server.ReplicaManager)
[2014-11-04 21:51:59,123] WARN [KafkaApi-1] Fetch request with correlation id 35061758 from client ReplicaFetcherThread-0-1 on partition [prod.action-rule-log,1] failed due to Topic prod.action-rule-log either doesn't exist or is in the process of being deleted (kafka.server.KafkaApis)
{noformat}

Furthermore, when we attempted to restart the kafka process on a different broker, that broker *also* experienced the same problems. At that point, pressed for time, we shut the entire cluster down and restarted it with fresh data directories, and normal operations resumed; in the meantime, some of our data was lost.

I hope this comment draws more attention to this issue; it's a bit disturbing that an issue marked "critical" for months appears to have had no investigation. I can provide more detailed logs or configuration details if desired.
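For reference, a producer configured as described above (request.required.acks=2 against the 0.8 Scala producer API) looks roughly like the following minimal sketch; the broker addresses and topic name are placeholders, not values taken from this deployment:

{noformat}
import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

object AcksTwoProducerSketch extends App {
  val props = new Properties()
  // Placeholder broker list for a 3-broker cluster.
  props.put("metadata.broker.list", "broker1:9092,broker2:9092,broker3:9092")
  props.put("serializer.class", "kafka.serializer.StringEncoder")
  // Each send must be acknowledged by two replicas before it is considered successful.
  props.put("request.required.acks", "2")

  val producer = new Producer[String, String](new ProducerConfig(props))
  producer.send(new KeyedMessage[String, String]("example-topic", "example message"))
  producer.close()
}
{noformat}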
> NoReplicaOnlineException: No replica for partition
> --------------------------------------------------
>
>                 Key: KAFKA-1460
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1460
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.1.1
>            Reporter: Artur Denysenko
>            Priority: Critical
>         Attachments: state-change.log
>
>
> We have a standalone kafka server.
> After several days of running we get:
> {noformat}
> kafka.common.NoReplicaOnlineException: No replica for partition [gk.q.module,1] is alive. Live brokers are: [Set()], Assigned replicas are: [List(0)]
>         at kafka.controller.OfflinePartitionLeaderSelector.selectLeader(PartitionLeaderSelector.scala:61)
>         at kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:336)
>         at kafka.controller.PartitionStateMachine.kafka$controller$PartitionStateMachine$$handleStateChange(PartitionStateMachine.scala:185)
>         at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:99)
>         at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:96)
>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:743)
>         at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
>         at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:772)
>         at scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:157)
>         at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:190)
>         at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:45)
>         at scala.collection.mutable.HashMap.foreach(HashMap.scala:95)
>         at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:742)
>         at kafka.controller.PartitionStateMachine.triggerOnlinePartitionStateChange(PartitionStateMachine.scala:96)
>         at kafka.controller.PartitionStateMachine.startup(PartitionStateMachine.scala:68)
>         at kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:312)
>         at kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:162)
>         at kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:63)
>         at kafka.controller.KafkaController$SessionExpirationListener$$anonfun$handleNewSession$1.apply$mcZ$sp(KafkaController.scala:1068)
>         at kafka.controller.KafkaController$SessionExpirationListener$$anonfun$handleNewSession$1.apply(KafkaController.scala:1066)
>         at kafka.controller.KafkaController$SessionExpirationListener$$anonfun$handleNewSession$1.apply(KafkaController.scala:1066)
>         at kafka.utils.Utils$.inLock(Utils.scala:538)
>         at kafka.controller.KafkaController$SessionExpirationListener.handleNewSession(KafkaController.scala:1066)
>         at org.I0Itec.zkclient.ZkClient$4.run(ZkClient.java:472)
>         at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
> {noformat}
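The trace above is raised by the offline-partition leader selector when none of a partition's assigned replicas appears in the set of live brokers, so no new leader can be chosen. A simplified sketch of that check, using illustrative names rather than the actual OfflinePartitionLeaderSelector source:

{noformat}
// Illustrative sketch only: shows the condition under which the
// "No replica for partition ... is alive" error is produced.
def selectLeader(assignedReplicas: Seq[Int], liveBrokers: Set[Int]): Int = {
  val liveAssignedReplicas = assignedReplicas.filter(liveBrokers.contains)
  liveAssignedReplicas.headOption.getOrElse(
    // With live brokers Set() and assigned replicas List(0), as in the report
    // above, this branch is taken (NoReplicaOnlineException in Kafka itself).
    throw new RuntimeException(
      s"No replica for partition is alive. Live brokers are: [$liveBrokers], " +
        s"Assigned replicas are: [$assignedReplicas]")
  )
}
{noformat}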
> Please see attached [state-change.log]
> You can find all server logs (450mb) here: http://46.4.114.35:9999/deploy/kafka-logs.2014-05-14-16.tgz
> On client we get:
> {noformat}
> 16:28:36,843 [ool-12-thread-2] WARN ZookeeperConsumerConnector - [dev_dev-1400257716132-e7b8240c], no brokers found when trying to rebalance.
> {noformat}
> If we try to send message using 'kafka-console-producer.sh':
> {noformat}
> [root@dev kafka]# /srv/kafka/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
> message
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
> [2014-05-16 19:45:30,950] WARN Fetching topic metadata with correlation id 0 for topics [Set(test)] from broker [id:0,host:localhost,port:9092] failed (kafka.client.ClientUtils$)
> java.net.SocketTimeoutException
>         at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:229)
>         at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
>         at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385)
>         at kafka.utils.Utils$.read(Utils.scala:375)
>         at kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
>         at kafka.network.Receive$class.readCompletely(Transmission.scala:56)
>         at kafka.network.BoundedByteBufferReceive.readCompletely(BoundedByteBufferReceive.scala:29)
>         at kafka.network.BlockingChannel.receive(BlockingChannel.scala:100)
>         at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:74)
>         at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:71)
>         at kafka.producer.SyncProducer.send(SyncProducer.scala:112)
>         at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:53)
>         at kafka.producer.BrokerPartitionInfo.updateInfo(BrokerPartitionInfo.scala:82)
>         at kafka.producer.async.DefaultEventHandler$$anonfun$handle$1.apply$mcV$sp(DefaultEventHandler.scala:67)
>         at kafka.utils.Utils$.swallow(Utils.scala:167)
>         at kafka.utils.Logging$class.swallowError(Logging.scala:106)
>         at kafka.utils.Utils$.swallowError(Utils.scala:46)
>         at kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:67)
>         at kafka.producer.async.ProducerSendThread.tryToHandle(ProducerSendThread.scala:104)
>         at kafka.producer.async.ProducerSendThread$$anonfun$processEvents$3.apply(ProducerSendThread.scala:87)
>         at kafka.producer.async.ProducerSendThread$$anonfun$processEvents$3.apply(ProducerSendThread.scala:67)
>         at scala.collection.immutable.Stream.foreach(Stream.scala:526)
>         at kafka.producer.async.ProducerSendThread.processEvents(ProducerSendThread.scala:66)
>         at kafka.producer.async.ProducerSendThread.run(ProducerSendThread.scala:44)
> {noformat}
> If we try to receive message using 'kafka-console-consumer.sh':
> {noformat}
> [root@dev kafka]# /srv/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
> [2014-05-16 19:46:23,029] WARN [console-consumer-69449_dev-1400262382648-1c9bfcd3], no brokers found when trying to rebalance. (kafka.consumer.ZookeeperConsumerConnector)
> {noformat}
> Port 9092 is open:
> {noformat}
> [root@dev kafka]# telnet localhost 9092
> Trying 127.0.0.1...
> Connected to localhost.
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)