[ https://issues.apache.org/jira/browse/KAFKA-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011750#comment-14011750 ]
Jay Kreps edited comment on KAFKA-1466 at 5/28/14 10:53 PM:
------------------------------------------------------------

Is it possible you were out of disk space? Here is the sequence of operations that occurs (from OffsetCheckpoint.scala):

{noformat}
// swap new offset checkpoint file with previous one
if(!temp.renameTo(file)) {
  // renameTo() fails on Windows if the destination file exists.
  file.delete()
  if(!temp.renameTo(file))
    throw new IOException("File rename from %s to %s failed.".format(temp.getAbsolutePath, file.getAbsolutePath))
}
{noformat}

We first try to simply rename the new checkpoint over the old checkpoint. On Unix this is atomic, but on Windows it will fail if the destination exists. If it fails, we manually delete the current checkpoint and repeat the rename. In this case the second rename also failed, which is what threw the exception. Unfortunately Java gives very little information when renameTo() fails, so it is hard to debug. Is it possible that the process somehow lost permission to write to the checkpoint file, or that this was some kind of transient disk issue?

Regardless, if this occurs the correct behavior is for the IOException to be thrown and for the server to kill itself. That seems to have happened, but somehow the process didn't die. I think we could try to reproduce this by removing permissions on the offset checkpoint file while the server is running. When this happens the expected behavior is that the server shuts itself down and another replica is elected as leader. If we can find any cause of rename failure that we aren't handling correctly, that is one bug; if we can reproduce the server failing to cleanly kill itself and fully exit, that is another bug.
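Below is a minimal standalone sketch of that reproduction idea (this is not Kafka code; the CheckpointSwapRepro object, the checkpoint-demo directory, and the file names are made up for illustration). It mimics the rename/delete/rename sequence quoted above and then revokes write permission. On a POSIX filesystem, rename and delete require write access to the containing directory rather than to the file itself, so making the directory read-only causes both renameTo() calls and the delete() to fail, which exercises the same IOException path described here.

{code}
import java.io.{File, IOException}

object CheckpointSwapRepro {
  // Same rename-then-delete-then-rename sequence as the snippet above.
  def swap(temp: File, file: File): Unit = {
    if (!temp.renameTo(file)) {
      // renameTo() fails on Windows if the destination file exists.
      file.delete()
      if (!temp.renameTo(file))
        throw new IOException("File rename from %s to %s failed.".format(temp.getAbsolutePath, file.getAbsolutePath))
    }
  }

  def main(args: Array[String]): Unit = {
    val dir = new File("checkpoint-demo")       // hypothetical demo directory
    dir.mkdirs()
    val file = new File(dir, "offset-checkpoint")      // stand-in for the real checkpoint file
    val temp = new File(dir, "offset-checkpoint.tmp")  // stand-in for the temp file being swapped in
    file.createNewFile()
    temp.createNewFile()

    // Simulate the broker losing permission while running: with the directory
    // read-only, rename and delete both fail on POSIX systems, so swap() throws
    // the IOException, matching the FATAL "Attempt to swap..." message reported.
    dir.setWritable(false)
    try {
      swap(temp, file)
    } finally {
      dir.setWritable(true) // restore permissions so the demo directory can be cleaned up
    }
  }
}
{code}

As a side note, where Java 7's NIO is available, java.nio.file.Files.move with StandardCopyOption.REPLACE_EXISTING (and ATOMIC_MOVE when the filesystem supports it) would avoid the delete-then-rename window entirely, though that is separate from the question of why both renames failed in this report.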
> Kafka server is hung after throwing "Attempt to swap the new high watermark file with the old one failed"
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-1466
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1466
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Arup Malakar
>         Attachments: kafka.log.1
>
> We have a Kafka cluster of four nodes. The cluster went down after one of the nodes threw the following error:
> 2014-05-21 23:19:44 FATAL [highwatermark-checkpoint-thread1]: HighwaterMarkCheckpoint:109 - Attempt to swap the new high watermark file with the old one failed.
> I saw the following messages in the log file of the failed node:
> {code}
> 2014-05-21 23:19:44 FATAL [highwatermark-checkpoint-thread1]: HighwaterMarkCheckpoint:109 - Attempt to swap the new high watermark file with the old one failed
> 2014-05-21 23:19:44 INFO [Thread-1]: KafkaServer:67 - [Kafka Server 4], Shutting down
> 2014-05-21 23:19:44 INFO [Thread-1]: KafkaZooKeeper:67 - Closing zookeeper client...
> 2014-05-21 23:19:44 INFO [ZkClient-EventThread-21-zoo-c2n1.us-east-1.ooyala.com,zoo-c2n2.us-east-1.ooyala.com,zoo-c2n3.us-east-1.ooyala.com,zoo-c2n4.us-east-1.ooyala.com,zoo-c2n5.us-east-1.ooyala.com]: ZkEventThread:82 - Terminate ZkClient event thread.
> 2014-05-21 23:19:44 INFO [main-EventThread]: ClientCnxn:521 - EventThread shut down
> 2014-05-21 23:19:44 INFO [Thread-1]: ZooKeeper:544 - Session: 0x1456b562865b172 closed
> 2014-05-21 23:19:44 INFO [kafka-processor-9092-0]: Processor:67 - Closing socket connection to /10.245.173.136.
> 2014-05-21 23:19:44 INFO [Thread-1]: SocketServer:67 - [Socket Server on Broker 4], Shutting down
> 2014-05-21 23:19:44 INFO [Thread-1]: SocketServer:67 - [Socket Server on Broker 4], Shutdown completed
> 2014-05-21 23:19:44 INFO [Thread-1]: KafkaRequestHandlerPool:67 - [Kafka Request Handler on Broker 4], shutting down
> 2014-05-21 23:19:44 INFO [Thread-1]: KafkaRequestHandlerPool:67 - [Kafka Request Handler on Broker 4], shutted down completely
> 2014-05-21 23:19:44 INFO [Thread-1]: KafkaScheduler:67 - Shutdown Kafka scheduler
> 2014-05-21 23:19:45 INFO [Thread-1]: ReplicaManager:67 - [Replica Manager on Broker 4]: Shut down
> 2014-05-21 23:19:45 INFO [Thread-1]: ReplicaFetcherManager:67 - [ReplicaFetcherManager on broker 4] shutting down
> 2014-05-21 23:19:45 INFO [Thread-1]: ReplicaFetcherThread:67 - [ReplicaFetcherThread-0-3], Shutting down
> 2014-05-21 23:19:45 INFO [ReplicaFetcherThread-0-3]: ReplicaFetcherThread:67 - [ReplicaFetcherThread-0-3], Stopped
> 2014-05-21 23:19:45 INFO [Thread-1]: ReplicaFetcherThread:67 - [ReplicaFetcherThread-0-3], Shutdown completed
> 2014-05-21 23:19:45 INFO [Thread-1]: ReplicaFetcherThread:67 - [ReplicaFetcherThread-0-2], Shutting down
> 2014-05-21 23:19:45 INFO [ReplicaFetcherThread-0-2]: ReplicaFetcherThread:67 - [ReplicaFetcherThread-0-2], Stopped
> 2014-05-21 23:19:45 INFO [Thread-1]: ReplicaFetcherThread:67 - [ReplicaFetcherThread-0-2], Shutdown completed
> 2014-05-21 23:19:45 INFO [Thread-1]: ReplicaFetcherManager:67 - [ReplicaFetcherManager on broker 4] shutdown completed
> {code}
> I noticed that after this error was logged there were no more entries in the log file, but the process was still alive, so I guess it was hung.
> The other nodes in the cluster were not able to recover from this error.
> The partitions owned by this failed node had their leader set to -1:
> {code}
> topic: test_topic  partition: 8  leader: -1  replicas: 4  isr:
> {code}
> And the other nodes were continuously logging the following errors in their log files:
> {code}
> 2014-05-22 20:03:28 ERROR [kafka-request-handler-7]: KafkaApis:102 - [KafkaApi-3] Error while fetching metadata for partition [test_topic,8]
> kafka.common.LeaderNotAvailableException: Leader not available for partition [test_topic,8]
>         at kafka.server.KafkaApis$$anonfun$17$$anonfun$20.apply(KafkaApis.scala:474)
>         at kafka.server.KafkaApis$$anonfun$17$$anonfun$20.apply(KafkaApis.scala:462)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:206)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:206)
>         at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:61)
>         at scala.collection.immutable.List.foreach(List.scala:45)
>         at scala.collection.TraversableLike$class.map(TraversableLike.scala:206)
>         at scala.collection.immutable.List.map(List.scala:45)
>         at kafka.server.KafkaApis$$anonfun$17.apply(KafkaApis.scala:462)
>         at kafka.server.KafkaApis$$anonfun$17.apply(KafkaApis.scala:458)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:206)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:206)
>         at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:123)
>         at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:322)
>         at scala.collection.TraversableLike$class.map(TraversableLike.scala:206)
>         at scala.collection.immutable.HashSet.map(HashSet.scala:32)
>         at kafka.server.KafkaApis.handleTopicMetadataRequest(KafkaApis.scala:458)
>         at kafka.server.KafkaApis.handle(KafkaApis.scala:68)
>         at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:42)
>         at java.lang.Thread.run(Thread.java:744)
> {code}
> I had to restart the failed Kafka node to recover the cluster. We expect the Kafka cluster to keep working even if a node is down. Any clue what went wrong here?

--
This message was sent by Atlassian JIRA
(v6.2#6252)