[ https://issues.apache.org/jira/browse/KAFKA-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arup Malakar updated KAFKA-1466:
--------------------------------

    Attachment: kafka.log.1

[~jjkoshy] Some more info:

1. The topic we use for actual production messages is not test_topic, and it 
has *two replicas*. We were unable to push messages to that topic, so the 
cluster was indeed unavailable (a minimal sketch of the kind of producer write 
that was failing follows the list below):
{code}
topic: staging_thrift_streaming partition: 0    leader: 2       replicas: 4,2   isr: 2,4
topic: staging_thrift_streaming partition: 1    leader: 1       replicas: 1,3   isr: 1,3
topic: staging_thrift_streaming partition: 2    leader: 2       replicas: 2,4   isr: 2,4
......
{code}

2. Environment details:

Java version:
{code}
java -version
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
{code}

OS Version:
{code}
~# uname -a
Linux ip-X-X-X-X 3.2.0-51-virtual #77-Ubuntu SMP Wed Jul 24 20:38:32 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 12.04 LTS
Release:        12.04
Codename:       precise
{code}
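
Regarding (1): below is a minimal sketch of the kind of producer write that was
failing, written against the 0.8.0 Java producer API. The broker list and
message payload are placeholders for illustration, not our actual client code:
{code}
import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class ProduceCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker list pointing at the four-node cluster.
        props.put("metadata.broker.list",
                "broker1:9092,broker2:9092,broker3:9092,broker4:9092");
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        // Require a leader ack so a missing/hung leader surfaces as an error.
        props.put("request.required.acks", "1");

        Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(props));
        try {
            // Writes like this one were failing during the incident.
            producer.send(new KeyedMessage<String, String>(
                    "staging_thrift_streaming", "ping"));
            System.out.println("send succeeded");
        } finally {
            producer.close();
        }
    }
}
{code}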

I couldn't find anything strange in the kernel logs, though. I am attaching the 
Kafka logs here.
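
For context on the FATAL itself: as I understand it, the high watermark
checkpoint is written to a temporary file which is then swapped over the old
file with a rename, and the "Attempt to swap the new high watermark file with
the old one failed" message is logged when that rename fails. A rough Java
sketch of that write-temp-then-rename pattern (file name, format, and error
handling are simplified assumptions, not the actual
kafka.server.HighwaterMarkCheckpoint code):
{code}
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Map;

// Simplified sketch of a write-temp-then-rename checkpoint swap.
public class HighWatermarkCheckpointSketch {
    private final File checkpointFile;

    public HighWatermarkCheckpointSketch(File dir) {
        // Hypothetical file name, for illustration only.
        this.checkpointFile = new File(dir, "highwatermark-checkpoint");
    }

    public synchronized void write(Map<String, Long> highWatermarks) throws IOException {
        // 1. Write the new high watermarks to a temporary file.
        File tmp = new File(checkpointFile.getAbsolutePath() + ".tmp");
        FileWriter writer = new FileWriter(tmp);
        try {
            for (Map.Entry<String, Long> entry : highWatermarks.entrySet()) {
                writer.write(entry.getKey() + " " + entry.getValue() + "\n");
            }
            writer.flush();
        } finally {
            writer.close();
        }

        // 2. Swap the temporary file in over the old checkpoint. If the rename
        //    fails (disk/filesystem trouble, permissions, etc.), the broker is
        //    at the point where the "Attempt to swap ..." FATAL is logged.
        checkpointFile.delete();
        if (!tmp.renameTo(checkpointFile)) {
            throw new IOException(
                    "Attempt to swap the new high watermark file with the old one failed");
        }
    }
}
{code}
If a plain File.renameTo failure like this is what hit broker 4, it would be
consistent with the broker logging the FATAL and then immediately starting to
shut down, as in the log excerpt quoted below.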

> Kafka server is hung after throwing "Attempt to swap the new high watermark 
> file with the old one failed"
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-1466
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1466
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Arup Malakar
>         Attachments: kafka.log.1
>
>
> We have a Kafka cluster of four nodes. The cluster went down after one of the 
> nodes threw a fatal "Attempt to swap the new high watermark file with the old 
> one failed" error. This is the message I saw in the log file of the failed 
> node:
> {code}
> 2014-05-21 23:19:44 FATAL [highwatermark-checkpoint-thread1]: 
> HighwaterMarkCheckpoint:109 - Attempt to swap the new high watermark file 
> with the old one failed
> 2014-05-21 23:19:44 INFO  [Thread-1]: KafkaServer:67 - [Kafka Server 4], 
> Shutting down
> 2014-05-21 23:19:44 INFO  [Thread-1]: KafkaZooKeeper:67 - Closing zookeeper 
> client...
> 2014-05-21 23:19:44 INFO  [ZkClient-EventThread-21-zoo-c2n1.us-east-1.ooyala.com,zoo-c2n2.us-east-1.ooyala.com,zoo-c2n3.us-east-1.ooyala.com,zoo-c2n4.us-east-1.ooyala.com,zoo-c2n5.us-east-1.ooyala.com]: ZkEventThread:82 - Terminate ZkClient event thread.
> 2014-05-21 23:19:44 INFO  [main-EventThread]: ClientCnxn:521 - EventThread 
> shut down
> 2014-05-21 23:19:44 INFO  [Thread-1]: ZooKeeper:544 - Session: 
> 0x1456b562865b172 closed
> 2014-05-21 23:19:44 INFO  [kafka-processor-9092-0]: Processor:67 - Closing 
> socket connection to /10.245.173.136.
> 2014-05-21 23:19:44 INFO  [Thread-1]: SocketServer:67 - [Socket Server on 
> Broker 4], Shutting down
> 2014-05-21 23:19:44 INFO  [Thread-1]: SocketServer:67 - [Socket Server on 
> Broker 4], Shutdown completed
> 2014-05-21 23:19:44 INFO  [Thread-1]: KafkaRequestHandlerPool:67 - [Kafka 
> Request Handler on Broker 4], shutting down
> 2014-05-21 23:19:44 INFO  [Thread-1]: KafkaRequestHandlerPool:67 - [Kafka 
> Request Handler on Broker 4], shutted down completely
> 2014-05-21 23:19:44 INFO  [Thread-1]: KafkaScheduler:67 - Shutdown Kafka 
> scheduler
> 2014-05-21 23:19:45 INFO  [Thread-1]: ReplicaManager:67 - [Replica Manager on 
> Broker 4]: Shut down
> 2014-05-21 23:19:45 INFO  [Thread-1]: ReplicaFetcherManager:67 - 
> [ReplicaFetcherManager on broker 4] shutting down
> 2014-05-21 23:19:45 INFO  [Thread-1]: ReplicaFetcherThread:67 - 
> [ReplicaFetcherThread-0-3], Shutting down
> 2014-05-21 23:19:45 INFO  [ReplicaFetcherThread-0-3]: ReplicaFetcherThread:67 
> - [ReplicaFetcherThread-0-3], Stopped
> 2014-05-21 23:19:45 INFO  [Thread-1]: ReplicaFetcherThread:67 - 
> [ReplicaFetcherThread-0-3], Shutdown completed
> 2014-05-21 23:19:45 INFO  [Thread-1]: ReplicaFetcherThread:67 - 
> [ReplicaFetcherThread-0-2], Shutting down
> 2014-05-21 23:19:45 INFO  [ReplicaFetcherThread-0-2]: ReplicaFetcherThread:67 
> - [ReplicaFetcherThread-0-2], Stopped
> 2014-05-21 23:19:45 INFO  [Thread-1]: ReplicaFetcherThread:67 - 
> [ReplicaFetcherThread-0-2], Shutdown completed
> 2014-05-21 23:19:45 INFO  [Thread-1]: ReplicaFetcherManager:67 - 
> [ReplicaFetcherManager on broker 4] shutdown completed
> {code} 
> I notice that after this error was logged there were no further entries in 
> the log file, but the process was still alive, so I assume it was hung. The 
> other nodes in the cluster were not able to recover from this error. The 
> partitions owned by the failed node had their leader set to -1:
> {code}
> topic: test_topic     partition: 8    leader: -1      replicas: 4     isr:
> {code}
> And the other nodes were continuously logging the following errors in their 
> log files:
> {code}
> 2014-05-22 20:03:28 ERROR [kafka-request-handler-7]: KafkaApis:102 - 
> [KafkaApi-3] Error while fetching metadata for partition [test_topic,8]
> kafka.common.LeaderNotAvailableException: Leader not available for partition 
> [test_topic,8]
>       at 
> kafka.server.KafkaApis$$anonfun$17$$anonfun$20.apply(KafkaApis.scala:474)
>       at 
> kafka.server.KafkaApis$$anonfun$17$$anonfun$20.apply(KafkaApis.scala:462)
>       at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:206)
>       at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:206)
>       at 
> scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:61)
>       at scala.collection.immutable.List.foreach(List.scala:45)
>       at scala.collection.TraversableLike$class.map(TraversableLike.scala:206)
>       at scala.collection.immutable.List.map(List.scala:45)
>       at kafka.server.KafkaApis$$anonfun$17.apply(KafkaApis.scala:462)
>       at kafka.server.KafkaApis$$anonfun$17.apply(KafkaApis.scala:458)
>       at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:206)
>       at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:206)
>       at 
> scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:123)
>       at 
> scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:322)
>       at scala.collection.TraversableLike$class.map(TraversableLike.scala:206)
>       at scala.collection.immutable.HashSet.map(HashSet.scala:32)
>       at 
> kafka.server.KafkaApis.handleTopicMetadataRequest(KafkaApis.scala:458)
>       at kafka.server.KafkaApis.handle(KafkaApis.scala:68)
>       at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:42)
>       at java.lang.Thread.run(Thread.java:744)
> {code}
> I had to restart the failed Kafka node to recover the cluster. We expect the 
> Kafka cluster to keep working even if a node is down. Any clue as to what 
> went wrong here?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
