[ https://issues.apache.org/jira/browse/KAFKA-1460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202488#comment-14202488 ]

Jon Riegel commented on KAFKA-1460:
-----------------------------------

We experienced this issue with a production deployment, running version 0.8.1.

The cluster is configured with 3 brokers, and zookeeper is running as a 3-node 
cluster on the same hardware (3 different m3xl EC2 instances).  One of the 
hosts (which had been acting as zookeeper leader as well as kafka controller) 
experienced a hardware failure.  The new controller attempted to initiate 
preferred replica elections for each of our topics and partitions, and received 
the NoReplicaOnlineException reported in this issue for every one of them.
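
For reference, this appears to be the same election that can be kicked off by 
hand with the admin tool shipped in 0.8.x; the zookeeper connect string below 
is only a placeholder for ours:

    bin/kafka-preferred-replica-election.sh --zookeeper zk1:2181,zk2:2181,zk3:2181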

Subsequently the cluster entered an unrecoverable bad state.  By design, the 
dead broker should no longer have been the leader of any partitions, and it 
should have been removed from the ISR set of every partition; instead, the ISR 
shrank for the partitions led by the other brokers, but NOT for the partitions 
that had been led by the failed broker.  Our producers are configured with 
request.required.acks=2, so only about half of their messages (presumably those 
bound for partitions still led by the surviving brokers) were received and 
acknowledged by the remaining two nodes while the failed node was down.  
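
To be concrete about the producer side, the relevant knob on the old 0.8.x 
producer is just the acks property; the broker list below is illustrative, not 
our real one:

    # 0.8.x producer config (illustrative values)
    metadata.broker.list=broker1:9092,broker2:9092,broker3:9092
    request.required.acks=2

With request.required.acks=2 a send is only acknowledged once two replicas have 
the message, which the two surviving brokers could still satisfy for the 
partitions they led.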

When the 3rd node was brought back up, it was unable to join the cluster.  
Immediately after startup, it began repeatedly logging these two warnings for 
each topic-partition that it had led:

[2014-11-04 21:51:47,724] WARN [Replica Manager on Broker 1]: While recording the follower position, the partition [prod.request-performance,3] hasn't been created, skip updating leader HW (kafka.server.ReplicaManager)

[2014-11-04 21:51:59,123] WARN [KafkaApi-1] Fetch request with correlation id 35061758 from client ReplicaFetcherThread-0-1 on partition [prod.action-rule-log,1] failed due to Topic prod.action-rule-log either doesn't exist or is in the process of being deleted (kafka.server.KafkaApis)
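
For anyone trying to diagnose the same situation, the cluster's view of leader, 
replicas and ISR for the partitions named in these warnings can be dumped with 
the describe tool (the zookeeper address is a placeholder, and the topic is just 
one of ours as an example):

    bin/kafka-topics.sh --zookeeper zk1:2181 --describe --topic prod.request-performance

The output lists Leader, Replicas and Isr per partition, which is where the 
un-shrunk ISR described above shows up.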

Furthermore, when we attempted to restart the kafka process on a different 
broker, that broker *also* experienced the same problems.  At that point, 
pressed for time, we shut the entire cluster down and restarted it with fresh 
data directories, after which normal operations resumed; in the process, some 
of our data was lost.
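
To be explicit about what "fresh data directories" meant in practice: with every 
kafka process stopped, we cleared out whatever log.dirs pointed at before 
bringing the cluster back up, roughly along these lines (the path is a 
placeholder for our actual setting):

    # on each broker, with the kafka process stopped
    rm -rf /data/kafka-logs/*

hence the data loss mentioned above.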

I hope this comment draws more attention to this issue; it's a bit disturbing 
that an issue marked "critical" for months appears to have had no 
investigation.  I can provide more detailed logs or configuration details if 
desired.

> NoReplicaOnlineException: No replica for partition
> --------------------------------------------------
>
>                 Key: KAFKA-1460
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1460
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.1.1
>            Reporter: Artur Denysenko
>            Priority: Critical
>         Attachments: state-change.log
>
>
> We have a standalone kafka server.
> After several days of running we get:
> {noformat}
> kafka.common.NoReplicaOnlineException: No replica for partition [gk.q.module,1] is alive. Live brokers are: [Set()], Assigned replicas are: [List(0)]
>       at kafka.controller.OfflinePartitionLeaderSelector.selectLeader(PartitionLeaderSelector.scala:61)
>       at kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:336)
>       at kafka.controller.PartitionStateMachine.kafka$controller$PartitionStateMachine$$handleStateChange(PartitionStateMachine.scala:185)
>       at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:99)
>       at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:96)
>       at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:743)
>       at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
>       at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
>       at scala.collection.Iterator$class.foreach(Iterator.scala:772)
>       at scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:157)
>       at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:190)
>       at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:45)
>       at scala.collection.mutable.HashMap.foreach(HashMap.scala:95)
>       at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:742)
>       at kafka.controller.PartitionStateMachine.triggerOnlinePartitionStateChange(PartitionStateMachine.scala:96)
>       at kafka.controller.PartitionStateMachine.startup(PartitionStateMachine.scala:68)
>       at kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:312)
>       at kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:162)
>       at kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:63)
>       at kafka.controller.KafkaController$SessionExpirationListener$$anonfun$handleNewSession$1.apply$mcZ$sp(KafkaController.scala:1068)
>       at kafka.controller.KafkaController$SessionExpirationListener$$anonfun$handleNewSession$1.apply(KafkaController.scala:1066)
>       at kafka.controller.KafkaController$SessionExpirationListener$$anonfun$handleNewSession$1.apply(KafkaController.scala:1066)
>       at kafka.utils.Utils$.inLock(Utils.scala:538)
>       at kafka.controller.KafkaController$SessionExpirationListener.handleNewSession(KafkaController.scala:1066)
>       at org.I0Itec.zkclient.ZkClient$4.run(ZkClient.java:472)
>       at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
> {noformat}
> Please see attached [state-change.log]
> You can find all server logs (450mb) here: 
> http://46.4.114.35:9999/deploy/kafka-logs.2014-05-14-16.tgz
> On client we get:
> {noformat}
> 16:28:36,843 [ool-12-thread-2] WARN  ZookeeperConsumerConnector - [dev_dev-1400257716132-e7b8240c], no brokers found when trying to rebalance.
> {noformat}
> If we try to send message using 'kafka-console-producer.sh':
> {noformat}
> [root@dev kafka]# /srv/kafka/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
> message
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
> [2014-05-16 19:45:30,950] WARN Fetching topic metadata with correlation id 0 for topics [Set(test)] from broker [id:0,host:localhost,port:9092] failed (kafka.client.ClientUtils$)
> java.net.SocketTimeoutException
>         at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:229)
>         at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
>         at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385)
>         at kafka.utils.Utils$.read(Utils.scala:375)
>         at kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
>         at kafka.network.Receive$class.readCompletely(Transmission.scala:56)
>         at kafka.network.BoundedByteBufferReceive.readCompletely(BoundedByteBufferReceive.scala:29)
>         at kafka.network.BlockingChannel.receive(BlockingChannel.scala:100)
>         at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:74)
>         at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:71)
>         at kafka.producer.SyncProducer.send(SyncProducer.scala:112)
>         at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:53)
>         at kafka.producer.BrokerPartitionInfo.updateInfo(BrokerPartitionInfo.scala:82)
>         at kafka.producer.async.DefaultEventHandler$$anonfun$handle$1.apply$mcV$sp(DefaultEventHandler.scala:67)
>         at kafka.utils.Utils$.swallow(Utils.scala:167)
>         at kafka.utils.Logging$class.swallowError(Logging.scala:106)
>         at kafka.utils.Utils$.swallowError(Utils.scala:46)
>         at kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:67)
>         at kafka.producer.async.ProducerSendThread.tryToHandle(ProducerSendThread.scala:104)
>         at kafka.producer.async.ProducerSendThread$$anonfun$processEvents$3.apply(ProducerSendThread.scala:87)
>         at kafka.producer.async.ProducerSendThread$$anonfun$processEvents$3.apply(ProducerSendThread.scala:67)
>         at scala.collection.immutable.Stream.foreach(Stream.scala:526)
>         at kafka.producer.async.ProducerSendThread.processEvents(ProducerSendThread.scala:66)
>         at kafka.producer.async.ProducerSendThread.run(ProducerSendThread.scala:44)
> {noformat}
> If we try to receive message using 'kafka-console-consumer.sh':
> {noformat}
> [root@dev kafka]# /srv/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
> [2014-05-16 19:46:23,029] WARN [console-consumer-69449_dev-1400262382648-1c9bfcd3], no brokers found when trying to rebalance. (kafka.consumer.ZookeeperConsumerConnector)
> {noformat}
> Port 9092 is open:
> {noformat}
> [root@dev kafka]# telnet localhost 9092
> Trying 127.0.0.1...
> Connected to localhost.
> {noformat}


