Somehow my Kafka instances keep crashing. I started the brokers one by one and they all came up successfully, but later two of the three instances became completely unresponsive: the process is still running, yet connecting over JMX or taking a heap dump is not possible. The last one is only somewhat responsive. I am not sure how the brokers get into this state. Is there anything I can monitor to predict that an instance is about to crash? What are the ways to recover without data loss, and what am I doing wrong to end up in this state? Please advise.
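For reference, this is roughly the probe I point at each broker's JMX port to see whether it still answers and what its metrics look like. It is only a sketch for my setup: broker-host:9999 is a placeholder for whatever JMX_PORT the broker was started with, and filtering MBeans by a "kafka" domain plus reading a "Value" attribute is an assumption about how the metrics are exposed.

    import javax.management.remote.{JMXConnectorFactory, JMXServiceURL}
    import scala.collection.JavaConverters._

    // Rough JMX probe: connect to one broker and dump its Kafka MBeans.
    // broker-host:9999 is a placeholder -- adjust to the broker's JMX_PORT.
    object BrokerJmxProbe extends App {
      val url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi")
      val connector = JMXConnectorFactory.connect(url)
      try {
        val mbsc = connector.getMBeanServerConnection
        // Enumerate every MBean whose domain mentions "kafka" and print its
        // "Value" attribute where it has one (gauge-style metrics).
        val kafkaBeans = mbsc.queryNames(null, null).asScala.filter(_.getDomain.contains("kafka"))
        kafkaBeans.toSeq.sortBy(_.toString).foreach { name =>
          val value =
            try { Some(mbsc.getAttribute(name, "Value")) }
            catch { case _: Exception => None }
          println(name.toString + value.map(" = " + _).getOrElse(""))
        }
      } finally {
        connector.close()
      }
    }

If the connection itself hangs, that already tells me the broker is in the unresponsive state described above.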
I poked around the error logs on the hosts that are not responsive; these are the errors I found (one I have not listed below is LeaderNotFoundException). The most puzzling errors are about ZooKeeper, since it was not redeployed or updated.

    [2013-08-26 12:14:35,357] ERROR [KafkaApi-5] Error while fetching metadata for partition [self_reactivation,0] (kafka.server.KafkaApis)
    kafka.common.ReplicaNotAvailableException
        at kafka.server.KafkaApis$$anonfun$17$$anonfun$20.apply(KafkaApis.scala:471)
        at kafka.server.KafkaApis$$anonfun$17$$anonfun$20.apply(KafkaApis.scala:456)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
        at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
        at scala.collection.immutable.List.foreach(List.scala:76)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)

In server.log:

    [2013-08-26 21:00:51,942] ERROR Conditional update of path /brokers/topics/meetme/partitions/12/state with data { "controller_epoch":6, "isr":[ 5 ], "leader":5, "leader_epoch":1, "version":1 } and expected version 2 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/meetme/partitions/12/state (kafka.utils.ZkUtils$)
    [2013-08-26 21:00:51,943] INFO Partition [meetme,12] on broker 5: Cached zkVersion [2] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
    [2013-08-26 21:00:51,990] INFO Partition [meetme,4] on broker 5: Shrinking ISR for partition [meetme,4] from 5,4 to 5 (kafka.cluster.Partition)
    [2013-08-26 21:00:51,993] ERROR Conditional update of path /brokers/topics/meetme/partitions/4/state with data { "controller_epoch":6, "isr":[ 5 ], "leader":5, "leader_epoch":1, "version":1 } and expected version 2 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/meetme/partitions/4/state (kafka.utils.ZkUtils$)
    [2013-08-26 21:00:51,993] INFO Partition [meetme,4] on broker 5: Cached zkVersion [2] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
    [2013-08-26 21:00:52,103] INFO Partition [meetme,6] on broker 5: Shrinking ISR for partition [meetme,6] from 5,4 to 5 (kafka.cluster.Partition)
    [2013-08-26 21:00:52,107] ERROR Conditional update of path /brokers/topics/meetme/partitions/6/state with data { "controller_epoch":6, "isr":[ 5 ], "leader":5, "leader_epoch":2, "version":1 } and expected version 3 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/meetme/partitions/6/state (kafka.utils.ZkUtils$)
    [2013-08-26 21:00:52,107] INFO Partition [meetme,6] on broker 5: Cached zkVersion [3] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
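In case it helps, this is the kind of quick check I was planning to run against ZooKeeper to see what version it actually holds for one of the partition state paths from the log, so I can compare it with the zkVersion the broker has cached. zk-host:2181 is a placeholder for my ensemble, and the whole thing is just a sketch of the standard ZooKeeper client API, not anything Kafka-specific.

    import org.apache.zookeeper.{WatchedEvent, Watcher, ZooKeeper}
    import org.apache.zookeeper.data.Stat

    // Read one partition state znode and print the version ZooKeeper holds,
    // to compare against the "Cached zkVersion [...] not equal" messages above.
    object PartitionStateCheck extends App {
      val zk = new ZooKeeper("zk-host:2181", 30000, new Watcher {
        override def process(event: WatchedEvent): Unit = ()  // no-op watcher
      })
      try {
        val path = "/brokers/topics/meetme/partitions/12/state"
        val stat = new Stat()
        val data = zk.getData(path, false, stat)
        println("znode version = " + stat.getVersion)
        println("data = " + new String(data, "UTF-8"))
      } finally {
        zk.close()
      }
    }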