[ https://issues.apache.org/jira/browse/KAFKA-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007651#comment-14007651 ]

Arup Malakar commented on KAFKA-1466:
-------------------------------------

[~fancyrao] I couldn't reproduce the error. It went away the moment I restarted 
the Kafka server, and I haven't seen it since. 

There are two parts to the issue:
- We expected the rest of the Kafka cluster to continue functioning despite the 
failure of one node, but it didn't: we were unable to push any messages to the 
cluster.
- The Kafka node that threw the error should have died after the fatal error, 
so that it would get restarted by upstart/init.d. Instead it just hung, and 
upstart was not aware of it (see the sketch after this list).
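For the second point, here is a minimal sketch of the behavior we expected. 
This is hypothetical code, not Kafka's actual implementation; the method name 
swapHighWatermarkFile is a stand-in for the real swap that failed:

{code}
// Hypothetical sketch, not Kafka's actual code: terminate the JVM on a fatal
// checkpoint error so a supervisor (upstart/init.d) can restart the broker.
object FatalErrorHandling {
  // Stand-in for the real high watermark file swap that failed here.
  def swapHighWatermarkFile(): Unit =
    throw new java.io.IOException("swap failed")

  def main(args: Array[String]): Unit = {
    try swapHighWatermarkFile()
    catch {
      case e: Throwable =>
        System.err.println("FATAL: " + e.getMessage)
        // halt() skips shutdown hooks, so the process cannot get stuck in a
        // deadlocked hook; the non-zero exit code tells upstart to restart it.
        Runtime.getRuntime.halt(1)
    }
  }
}
{code}

Note that a plain System.exit still runs shutdown hooks, which is one place a 
JVM can hang during shutdown; halt() avoids that, at the cost of skipping 
cleanup.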

I would be happy to provide any other details that may be helpful in tracking 
down the issue.

> Kafka server is hung after throwing "Attempt to swap the new high watermark 
> file with the old one failed"
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-1466
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1466
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Arup Malakar
>
> We have a Kafka cluster of four nodes. The cluster went down after one of the 
> nodes threw a FATAL "Attempt to swap the new high watermark file with the old 
> one failed" error. I saw the following messages in the log file of the failed 
> node:
> {code}
> 2014-05-21 23:19:44 FATAL [highwatermark-checkpoint-thread1]: HighwaterMarkCheckpoint:109 - Attempt to swap the new high watermark file with the old one failed
> 2014-05-21 23:19:44 INFO  [Thread-1]: KafkaServer:67 - [Kafka Server 4], Shutting down
> 2014-05-21 23:19:44 INFO  [Thread-1]: KafkaZooKeeper:67 - Closing zookeeper client...
> 2014-05-21 23:19:44 INFO  [ZkClient-EventThread-21-zoo-c2n1.us-east-1.ooyala.com,zoo-c2n2.us-east-1.ooyala.com,zoo-c2n3.us-east-1.ooyala.com,zoo-c2n4.us-east-1.ooyala.com,zoo-c2n5.us-east-1.ooyala.com]: ZkEventThread:82 - Terminate ZkClient event thread.
> 2014-05-21 23:19:44 INFO  [main-EventThread]: ClientCnxn:521 - EventThread shut down
> 2014-05-21 23:19:44 INFO  [Thread-1]: ZooKeeper:544 - Session: 0x1456b562865b172 closed
> 2014-05-21 23:19:44 INFO  [kafka-processor-9092-0]: Processor:67 - Closing socket connection to /10.245.173.136.
> 2014-05-21 23:19:44 INFO  [Thread-1]: SocketServer:67 - [Socket Server on Broker 4], Shutting down
> 2014-05-21 23:19:44 INFO  [Thread-1]: SocketServer:67 - [Socket Server on Broker 4], Shutdown completed
> 2014-05-21 23:19:44 INFO  [Thread-1]: KafkaRequestHandlerPool:67 - [Kafka Request Handler on Broker 4], shutting down
> 2014-05-21 23:19:44 INFO  [Thread-1]: KafkaRequestHandlerPool:67 - [Kafka Request Handler on Broker 4], shutted down completely
> 2014-05-21 23:19:44 INFO  [Thread-1]: KafkaScheduler:67 - Shutdown Kafka scheduler
> 2014-05-21 23:19:45 INFO  [Thread-1]: ReplicaManager:67 - [Replica Manager on Broker 4]: Shut down
> 2014-05-21 23:19:45 INFO  [Thread-1]: ReplicaFetcherManager:67 - [ReplicaFetcherManager on broker 4] shutting down
> 2014-05-21 23:19:45 INFO  [Thread-1]: ReplicaFetcherThread:67 - [ReplicaFetcherThread-0-3], Shutting down
> 2014-05-21 23:19:45 INFO  [ReplicaFetcherThread-0-3]: ReplicaFetcherThread:67 - [ReplicaFetcherThread-0-3], Stopped
> 2014-05-21 23:19:45 INFO  [Thread-1]: ReplicaFetcherThread:67 - [ReplicaFetcherThread-0-3], Shutdown completed
> 2014-05-21 23:19:45 INFO  [Thread-1]: ReplicaFetcherThread:67 - [ReplicaFetcherThread-0-2], Shutting down
> 2014-05-21 23:19:45 INFO  [ReplicaFetcherThread-0-2]: ReplicaFetcherThread:67 - [ReplicaFetcherThread-0-2], Stopped
> 2014-05-21 23:19:45 INFO  [Thread-1]: ReplicaFetcherThread:67 - [ReplicaFetcherThread-0-2], Shutdown completed
> 2014-05-21 23:19:45 INFO  [Thread-1]: ReplicaFetcherManager:67 - [ReplicaFetcherManager on broker 4] shutdown completed
> {code} 
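> For context, the checkpoint that failed here is written with a 
> write-temp-then-rename pattern; a minimal sketch of that pattern follows 
> (the file names are illustrative assumptions, not Kafka's exact layout):
> {code}
> // Minimal sketch of the write-temp-then-rename checkpoint pattern the
> // FATAL message refers to; file names here are illustrative assumptions.
> import java.io.{File, PrintWriter}
>
> object CheckpointSwap {
>   def writeCheckpoint(dir: File, highWatermarks: Map[String, Long]): Unit = {
>     val tmp = new File(dir, "highwatermark-checkpoint.tmp")
>     val target = new File(dir, "highwatermark-checkpoint")
>     val out = new PrintWriter(tmp)
>     try highWatermarks.foreach { case (tp, hw) => out.println(tp + " " + hw) }
>     finally out.close()
>     // The "swap" is a plain rename. If it fails (full disk, permissions,
>     // a reader holding the target open on some platforms), the checkpoint
>     // is lost, and in our case the broker logged FATAL but kept running.
>     if (!tmp.renameTo(target))
>       throw new java.io.IOException("Attempt to swap the new high watermark file with the old one failed")
>   }
> }
> {code}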
> I noticed that after this error was logged there were no further entries in 
> the log file, but the process was still alive, so I assume it was hung.
> The other nodes in the cluster were not able to recover from this error. The 
> partitions owned by the failed node had their leader set to -1:
> {code}
> topic: test_topic     partition: 8    leader: -1      replicas: 4     isr:
> {code}
> And the other nodes were continuously logging the following error:
> {code}
> 2014-05-22 20:03:28 ERROR [kafka-request-handler-7]: KafkaApis:102 - [KafkaApi-3] Error while fetching metadata for partition [test_topic,8]
> kafka.common.LeaderNotAvailableException: Leader not available for partition [test_topic,8]
>       at kafka.server.KafkaApis$$anonfun$17$$anonfun$20.apply(KafkaApis.scala:474)
>       at kafka.server.KafkaApis$$anonfun$17$$anonfun$20.apply(KafkaApis.scala:462)
>       at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:206)
>       at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:206)
>       at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:61)
>       at scala.collection.immutable.List.foreach(List.scala:45)
>       at scala.collection.TraversableLike$class.map(TraversableLike.scala:206)
>       at scala.collection.immutable.List.map(List.scala:45)
>       at kafka.server.KafkaApis$$anonfun$17.apply(KafkaApis.scala:462)
>       at kafka.server.KafkaApis$$anonfun$17.apply(KafkaApis.scala:458)
>       at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:206)
>       at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:206)
>       at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:123)
>       at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:322)
>       at scala.collection.TraversableLike$class.map(TraversableLike.scala:206)
>       at scala.collection.immutable.HashSet.map(HashSet.scala:32)
>       at kafka.server.KafkaApis.handleTopicMetadataRequest(KafkaApis.scala:458)
>       at kafka.server.KafkaApis.handle(KafkaApis.scala:68)
>       at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:42)
>       at java.lang.Thread.run(Thread.java:744)
> {code}
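> From the client side this failure mode looks roughly like the sketch below: 
> every attempt keeps seeing no leader for [test_topic,8], so even retries 
> never succeed until the broker is restarted. The helper names, the local 
> exception class, and the backoff values are all illustrative, not the real 
> producer API:
> {code}
> // Hedged sketch of a client retry loop around the failure above.
> // `send` is a hypothetical stand-in for the real producer call.
> object RetryOnNoLeader {
>   class LeaderNotAvailableException(msg: String) extends RuntimeException(msg)
>
>   def send(msg: String): Unit =
>     throw new LeaderNotAvailableException("Leader not available for partition [test_topic,8]")
>
>   def sendWithRetry(msg: String, attempts: Int = 3): Boolean = {
>     for (i <- 1 to attempts) {
>       try { send(msg); return true }
>       catch {
>         case _: LeaderNotAvailableException =>
>           Thread.sleep(1000L * i) // linear backoff before the next attempt
>       }
>     }
>     false // leader stayed -1, so every attempt failed until the broker restart
>   }
> }
> {code}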
> I had to restart the failed Kafka node to recover the cluster. We expect the 
> Kafka cluster to keep working even when a node is down. Any clue what went 
> wrong here?


