[ https://issues.apache.org/jira/browse/KAFKA-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arup Malakar updated KAFKA-1466:
--------------------------------

    Attachment: kafka.log.1

[~jjkoshy] Some more info:

1. The topic we use for actual prod messages is not test_topic and has *two replicas*. We were unable to push messages to that topic, so the cluster was indeed unavailable:
{code}
topic: staging_thrift_streaming partition: 0 leader: 2 replicas: 4,2 isr: 2,4
topic: staging_thrift_streaming partition: 1 leader: 1 replicas: 1,3 isr: 1,3
topic: staging_thrift_streaming partition: 2 leader: 2 replicas: 2,4 isr: 2,4
......
{code}

2. More info:

Java version:
{code}
java -version
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
{code}

OS version:
{code}
~# uname -a
Linux ip-X-X-X-X 3.2.0-51-virtual #77-Ubuntu SMP Wed Jul 24 20:38:32 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 12.04 LTS
Release:        12.04
Codename:       precise
{code}

I couldn't find anything strange in the kernel logs though. I am attaching the kafka logs here.
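For reference, the leader/replica/ISR view shown above can also be pulled straight from a live broker with the 0.8 topic-metadata API, which is handy when the CLI tools are not at hand. Below is a minimal sketch assuming the 0.8.0 javaapi SimpleConsumer; the broker host, client id, and timeout values are placeholders, not values from this cluster:
{code}
import java.util.Collections
import scala.collection.JavaConverters._
import kafka.javaapi.TopicMetadataRequest
import kafka.javaapi.consumer.SimpleConsumer

// Sketch only: "broker-host", the client id and the timeouts are placeholders.
object LeaderCheck extends App {
  val consumer = new SimpleConsumer("broker-host", 9092, 100000, 64 * 1024, "leader-check")
  try {
    val request  = new TopicMetadataRequest(Collections.singletonList("staging_thrift_streaming"))
    val response = consumer.send(request)
    for (tm <- response.topicsMetadata.asScala; pm <- tm.partitionsMetadata.asScala) {
      // leader() returns null when the partition currently has no leader
      // (the state the CLI tools print as "leader: -1").
      val leader = Option(pm.leader).map(_.id.toString).getOrElse("-1")
      println(s"topic: ${tm.topic} partition: ${pm.partitionId} leader: $leader " +
        s"replicas: ${pm.replicas.asScala.map(_.id).mkString(",")} " +
        s"isr: ${pm.isr.asScala.map(_.id).mkString(",")}")
    }
  } finally {
    consumer.close()
  }
}
{code}
A partition whose leader comes back null here corresponds to the "leader: -1" entries described in the original report quoted below.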
> Kafka server is hung after throwing "Attempt to swap the new high watermark file with the old one failed"
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-1466
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1466
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Arup Malakar
>         Attachments: kafka.log.1
>
>
> We have a kafka cluster of four nodes. The cluster was down after one of the nodes threw the following error:
> 2014-05-21 23:19:44 FATAL [highwatermark-checkpoint-thread1]: HighwaterMarkCheckpoint:109 - Attempt to swap the new high watermark file with the old one failed.
> I saw the following message in the log file of the failed node:
> {code}
> 2014-05-21 23:19:44 FATAL [highwatermark-checkpoint-thread1]: HighwaterMarkCheckpoint:109 - Attempt to swap the new high watermark file with the old one failed
> 2014-05-21 23:19:44 INFO [Thread-1]: KafkaServer:67 - [Kafka Server 4], Shutting down
> 2014-05-21 23:19:44 INFO [Thread-1]: KafkaZooKeeper:67 - Closing zookeeper client...
> 2014-05-21 23:19:44 INFO [ZkClient-EventThread-21-zoo-c2n1.us-east-1.ooyala.com,zoo-c2n2.us-east-1.ooyala.com,zoo-c2n3.us-east-1.ooyala.com,zoo-c2n4.us-east-1.ooyala.com,zoo-c2n5.us-east-1.ooyala.com]: ZkEventThread:82 - Terminate ZkClient event thread.
> 2014-05-21 23:19:44 INFO [main-EventThread]: ClientCnxn:521 - EventThread shut down
> 2014-05-21 23:19:44 INFO [Thread-1]: ZooKeeper:544 - Session: 0x1456b562865b172 closed
> 2014-05-21 23:19:44 INFO [kafka-processor-9092-0]: Processor:67 - Closing socket connection to /10.245.173.136.
> 2014-05-21 23:19:44 INFO [Thread-1]: SocketServer:67 - [Socket Server on Broker 4], Shutting down
> 2014-05-21 23:19:44 INFO [Thread-1]: SocketServer:67 - [Socket Server on Broker 4], Shutdown completed
> 2014-05-21 23:19:44 INFO [Thread-1]: KafkaRequestHandlerPool:67 - [Kafka Request Handler on Broker 4], shutting down
> 2014-05-21 23:19:44 INFO [Thread-1]: KafkaRequestHandlerPool:67 - [Kafka Request Handler on Broker 4], shutted down completely
> 2014-05-21 23:19:44 INFO [Thread-1]: KafkaScheduler:67 - Shutdown Kafka scheduler
> 2014-05-21 23:19:45 INFO [Thread-1]: ReplicaManager:67 - [Replica Manager on Broker 4]: Shut down
> 2014-05-21 23:19:45 INFO [Thread-1]: ReplicaFetcherManager:67 - [ReplicaFetcherManager on broker 4] shutting down
> 2014-05-21 23:19:45 INFO [Thread-1]: ReplicaFetcherThread:67 - [ReplicaFetcherThread-0-3], Shutting down
> 2014-05-21 23:19:45 INFO [ReplicaFetcherThread-0-3]: ReplicaFetcherThread:67 - [ReplicaFetcherThread-0-3], Stopped
> 2014-05-21 23:19:45 INFO [Thread-1]: ReplicaFetcherThread:67 - [ReplicaFetcherThread-0-3], Shutdown completed
> 2014-05-21 23:19:45 INFO [Thread-1]: ReplicaFetcherThread:67 - [ReplicaFetcherThread-0-2], Shutting down
> 2014-05-21 23:19:45 INFO [ReplicaFetcherThread-0-2]: ReplicaFetcherThread:67 - [ReplicaFetcherThread-0-2], Stopped
> 2014-05-21 23:19:45 INFO [Thread-1]: ReplicaFetcherThread:67 - [ReplicaFetcherThread-0-2], Shutdown completed
> 2014-05-21 23:19:45 INFO [Thread-1]: ReplicaFetcherManager:67 - [ReplicaFetcherManager on broker 4] shutdown completed
> {code}
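For context, the FATAL line at the top of the log excerpt above comes from the high-watermark checkpoint thread, which periodically persists each partition's high watermark by writing a temporary file and then swapping it over the old checkpoint file. The sketch below only illustrates that write-then-rename pattern; it is not the actual 0.8.0 HighwaterMarkCheckpoint code, and the file name, layout, and error handling are assumed placeholders:
{code}
import java.io.{BufferedWriter, File, FileWriter, IOException}

// Simplified sketch of a write-temp-then-swap checkpoint; the file name and
// format are placeholders, not the exact 0.8.0 implementation.
class HighWatermarkCheckpointSketch(dir: String) {
  private val hwFile  = new File(dir, "highwatermark-checkpoint") // illustrative name
  private val tmpFile = new File(hwFile.getAbsolutePath + ".tmp")

  def write(highWatermarks: Map[String, Long]): Unit = {
    // 1. Write all partition high watermarks to a temporary file and flush.
    val writer = new BufferedWriter(new FileWriter(tmpFile))
    try {
      highWatermarks.foreach { case (topicAndPartition, hw) =>
        writer.write(topicAndPartition + " " + hw)
        writer.newLine()
      }
      writer.flush()
    } finally {
      writer.close()
    }

    // 2. Swap the temporary file into place. If the delete or rename fails
    //    (full or read-only disk, permissions, a flaky virtualized filesystem, ...),
    //    the broker in this report logs the FATAL message quoted above and
    //    begins shutting down.
    if (hwFile.exists() && !hwFile.delete())
      throw new IOException("Could not delete the old high watermark file " + hwFile)
    if (!tmpFile.renameTo(hwFile))
      throw new IOException("Attempt to swap the new high watermark file with the old one failed")
  }
}
{code}
The rename keeps the checkpoint update close to atomic, but it also means a single filesystem hiccup during the swap is treated as fatal, which appears to be what triggered the shutdown sequence above.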
> I notice that since this error was logged there weren't any more logs in the log file, but the process was still alive, so I guess it was hung.
> The other nodes in the cluster were not able to recover from this error. The partitions owned by the failed node had their leader set to -1:
> {code}
> topic: test_topic partition: 8 leader: -1 replicas: 4 isr:
> {code}
> And the other nodes were continuously logging the following errors in the log file:
> {code}
> 2014-05-22 20:03:28 ERROR [kafka-request-handler-7]: KafkaApis:102 - [KafkaApi-3] Error while fetching metadata for partition [test_topic,8]
> kafka.common.LeaderNotAvailableException: Leader not available for partition [test_topic,8]
>         at kafka.server.KafkaApis$$anonfun$17$$anonfun$20.apply(KafkaApis.scala:474)
>         at kafka.server.KafkaApis$$anonfun$17$$anonfun$20.apply(KafkaApis.scala:462)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:206)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:206)
>         at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:61)
>         at scala.collection.immutable.List.foreach(List.scala:45)
>         at scala.collection.TraversableLike$class.map(TraversableLike.scala:206)
>         at scala.collection.immutable.List.map(List.scala:45)
>         at kafka.server.KafkaApis$$anonfun$17.apply(KafkaApis.scala:462)
>         at kafka.server.KafkaApis$$anonfun$17.apply(KafkaApis.scala:458)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:206)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:206)
>         at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:123)
>         at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:322)
>         at scala.collection.TraversableLike$class.map(TraversableLike.scala:206)
>         at scala.collection.immutable.HashSet.map(HashSet.scala:32)
>         at kafka.server.KafkaApis.handleTopicMetadataRequest(KafkaApis.scala:458)
>         at kafka.server.KafkaApis.handle(KafkaApis.scala:68)
>         at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:42)
>         at java.lang.Thread.run(Thread.java:744)
> {code}
> I had to restart the failed kafka node to recover the cluster. We expect the kafka cluster to keep working even if a node is down. Any clue what went wrong here?

--
This message was sent by Atlassian JIRA
(v6.2#6252)