Will definitely take a thread dump! So far, it's been running fine.
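
For reference, I plan to grab it with something along these lines (a sketch,
assuming a JDK with jstack on the PATH; <kafka-pid> is a placeholder for the
broker's actual Java PID):

# write a full thread dump, including lock info, to a file
jstack -l <kafka-pid> > /tmp/kafka-thread-dump.txt

# or, without jstack, signal the JVM to print the dump to its stdout log
kill -3 <kafka-pid>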

-Jacob

On Wed, Oct 15, 2014 at 8:40 PM, Jun Rao <jun...@gmail.com> wrote:

> If you see the hanging again, it would be great if you can take a thread
> dump so that we know where it is hanging.
>
> Thanks,
>
> Jun
>
> On Tue, Oct 14, 2014 at 10:35 PM, Abraham Jacob <abe.jac...@gmail.com>
> wrote:
>
> > Hi Jun,
> >
> > Thanks for responding...
> >
> > I am using Kafka 2.9.2-0.8.1.1
> >
> > I looked through the controller logs on a couple of nodes and did not
> > find any exceptions or errors.
> >
> > However, in the state-change log I see a bunch of the following
> > exceptions:
> >
> > [2014-10-13 14:39:12,475] TRACE Controller 3 epoch 116 started leader
> > election for partition [wordcount,1] (state.change.logger)
> > [2014-10-13 14:39:12,479] ERROR Controller 3 epoch 116 initiated state
> > change for partition [wordcount,1] from OfflinePartition to
> > OnlinePartition failed (state.change.logger)
> > kafka.common.NoReplicaOnlineException: No replica for partition
> > [wordcount,1] is alive. Live brokers are: [Set()], Assigned replicas
> > are: [List(8, 7, 1)]
> >         at kafka.controller.OfflinePartitionLeaderSelector.selectLeader(PartitionLeaderSelector.scala:61)
> >         at kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:336)
> >         at kafka.controller.PartitionStateMachine.kafka$controller$PartitionStateMachine$$handleStateChange(PartitionStateMachine.scala:185)
> >         at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:99)
> >         at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:96)
> >         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:743)
> >         at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
> >         at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
> >         at scala.collection.Iterator$class.foreach(Iterator.scala:772)
> >         at scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:157)
> >         at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:190)
> >         at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:45)
> >         at scala.collection.mutable.HashMap.foreach(HashMap.scala:95)
> >         at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:742)
> >         at kafka.controller.PartitionStateMachine.triggerOnlinePartitionStateChange(PartitionStateMachine.scala:96)
> >         at kafka.controller.PartitionStateMachine.startup(PartitionStateMachine.scala:68)
> >         at kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:312)
> >         at kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:162)
> >         at kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:63)
> >         at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:123)
> >         at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:118)
> >         at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:118)
> >         at kafka.utils.Utils$.inLock(Utils.scala:538)
> >         at kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:118)
> >         at org.I0Itec.zkclient.ZkClient$6.run(ZkClient.java:549)
> >         at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
> >
> >
> > Anyway, this morning after sending out the email, I set out to restart
> > all the brokers and found that three of them were in a hung state. I
> > tried the bin/kafka-server-stop.sh script (which is nothing but sending
> > a SIGINT signal), but the Java process running Kafka would not
> > terminate. I then issued a 'kill -SIGTERM xxxxx' for that process, and
> > it still would not terminate. This happened on only 3 nodes (each node
> > runs only one broker); on the other nodes, kafka-server-stop.sh
> > successfully brought down the Java process running Kafka.
> >
> > For the three brokers that were not responding to either SIGINT or
> > SIGTERM, I issued a SIGKILL instead, which did bring down the processes.
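> >
> > For the record, the escalation looked roughly like this (a sketch; the
> > grep pattern and <pid> are placeholders, not the exact commands I typed):
> >
> > # find the Java PID of the broker (the broker's main class is kafka.Kafka)
> > ps ax -o "pid pgid args" | grep kafka.Kafka | grep -v grep
> > # ask the broker to shut down cleanly
> > kill -SIGTERM <pid>
> > # last resort: SIGKILL cannot be caught, so no clean shutdown happens
> > kill -SIGKILL <pid>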
> >
> > I then restarted the brokers on all nodes. After that, I ran the
> > describe-topic command again:
> >
> > bin/kafka-topics.sh --describe --zookeeper tr-pan-hclstr-08.amers1b.ciscloud:2181/kafka/kafka-clstr-01 --topic wordcount
> >
> >
> > Topic:wordcount PartitionCount:8        ReplicationFactor:3     Configs:
> >         Topic: wordcount        Partition: 0    Leader: 7       Replicas: 7,6,8 Isr: 6,7,8
> >         Topic: wordcount        Partition: 1    Leader: 8       Replicas: 8,7,1 Isr: 1,7,8
> >         Topic: wordcount        Partition: 2    Leader: 1       Replicas: 1,8,2 Isr: 1,2,8
> >         Topic: wordcount        Partition: 3    Leader: 2       Replicas: 2,1,3 Isr: 1,2,3
> >         Topic: wordcount        Partition: 4    Leader: 3       Replicas: 3,2,4 Isr: 2,3,4
> >         Topic: wordcount        Partition: 5    Leader: 4       Replicas: 4,3,5 Isr: 3,4,5
> >         Topic: wordcount        Partition: 6    Leader: 5       Replicas: 5,4,6 Isr: 4,5,6
> >         Topic: wordcount        Partition: 7    Leader: 6       Replicas: 6,5,7 Isr: 5,6,7
> >
> > Since then it has been running fine.
> >
> > My conclusion is that, for some reason I don't really understand, three
> > brokers were effectively in a hung state, and that is probably what
> > broke the cluster.
> >
> > Regards,
> > -Jacob
> >
> > On Tue, Oct 14, 2014 at 5:39 PM, Jun Rao <jun...@gmail.com> wrote:
> >
> > > Also, which version of Kafka are you using?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Tue, Oct 14, 2014 at 5:31 PM, Jun Rao <jun...@gmail.com> wrote:
> > >
> > > > The following is a bit weird. It indicates no leader for partition 4,
> > > > which is inconsistent with what describe-topic shows.
> > > >
> > > > 2014-10-13 19:02:32,611 WARN [main] kafka.producer.BrokerPartitionInfo:
> > > > Error while fetching metadata   partition 4     leader: none    replicas:
> > > > 3 (tr-pan-hclstr-13.amers1b.ciscloud:9092),
> > > > 2 (tr-pan-hclstr-12.amers1b.ciscloud:9092),
> > > > 4 (tr-pan-hclstr-14.amers1b.ciscloud:9092)
> > > > isr:    isUnderReplicated: true for topic partition [wordcount,4]:
> > > > [class kafka.common.LeaderNotAvailableException]
> > > >
> > > > Any errors in the controller and state-change logs? Do you see
> > > > broker 3 marked as dead in the controller log? Also, could you check
> > > > whether the broker registration in ZK
> > > > (https://cwiki.apache.org/confluence/display/KAFKA/Kafka+data+structures+in+Zookeeper)
> > > > has the correct host/port?
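> > > >
> > > > For example, something like this should print the registration (a
> > > > sketch, reusing the ZK connect string from your describe command;
> > > > broker id 3 is just the one from the log above):
> > > >
> > > > bin/zookeeper-shell.sh tr-pan-hclstr-08.amers1b.ciscloud:2181/kafka/kafka-clstr-01
> > > > get /brokers/ids/3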
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Mon, Oct 13, 2014 at 5:35 PM, Abraham Jacob <abe.jac...@gmail.com>
> > > > wrote:
> > > >
> > > >> Hi All,
> > > >>
> > > >> I have an 8-node Kafka cluster (broker.id 1..8). On this cluster I
> > > >> have a topic "wordcount", which has 8 partitions with a replication
> > > >> factor of 3.
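> > > >>
> > > >> For context, the topic was created with something along these lines
> > > >> (a sketch; I don't have the exact command handy):
> > > >>
> > > >> bin/kafka-topics.sh --create --zookeeper tr-pan-hclstr-08.amers1b.ciscloud:2181/kafka/kafka-clstr-01 --topic wordcount --partitions 8 --replication-factor 3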
> > > >>
> > > >> So a describe of topic wordcount:
> > > >> # bin/kafka-topics.sh --describe --zookeeper tr-pan-hclstr-08.amers1b.ciscloud:2181/kafka/kafka-clstr-01 --topic wordcount
> > > >>
> > > >>
> > > >> Topic:wordcount PartitionCount:8        ReplicationFactor:3     Configs:
> > > >> Topic: wordcount        Partition: 0    Leader: 6       Replicas: 7,6,8 Isr: 6,7,8
> > > >> Topic: wordcount        Partition: 1    Leader: 7       Replicas: 8,7,1 Isr: 7
> > > >> Topic: wordcount        Partition: 2    Leader: 8       Replicas: 1,8,2 Isr: 8
> > > >> Topic: wordcount        Partition: 3    Leader: 3       Replicas: 2,1,3 Isr: 3
> > > >> Topic: wordcount        Partition: 4    Leader: 3       Replicas: 3,2,4 Isr: 3,2,4
> > > >> Topic: wordcount        Partition: 5    Leader: 3       Replicas: 4,3,5 Isr: 3,5
> > > >> Topic: wordcount        Partition: 6    Leader: 6       Replicas: 5,4,6 Isr: 6,5
> > > >> Topic: wordcount        Partition: 7    Leader: 6       Replicas: 6,5,7 Isr: 6,5,7
> > > >>
> > > >> I wrote a simple producer to write to this topic. However, when
> > > >> running it I get these messages in the logs:
> > > >>
> > > >> 2014-10-13 19:02:32,459 INFO [main] kafka.client.ClientUtils$: Fetching
> > > >> metadata from broker id:0,host:tr-pan-hclstr-11.amers1b.ciscloud,port:9092
> > > >> with correlation id 0 for 1 topic(s) Set(wordcount)
> > > >> 2014-10-13 19:02:32,464 INFO [main] kafka.producer.SyncProducer:
> > > >> Connected to tr-pan-hclstr-11.amers1b.ciscloud:9092 for producing
> > > >> 2014-10-13 19:02:32,551 INFO [main] kafka.producer.SyncProducer:
> > > >> Disconnecting from tr-pan-hclstr-11.amers1b.ciscloud:9092
> > > >> 2014-10-13 19:02:32,611 WARN [main] kafka.producer.BrokerPartitionInfo:
> > > >> Error while fetching metadata   partition 4     leader: none    replicas:
> > > >> 3 (tr-pan-hclstr-13.amers1b.ciscloud:9092),
> > > >> 2 (tr-pan-hclstr-12.amers1b.ciscloud:9092),
> > > >> 4 (tr-pan-hclstr-14.amers1b.ciscloud:9092)
> > > >> isr:    isUnderReplicated: true for topic partition [wordcount,4]:
> > > >> [class kafka.common.LeaderNotAvailableException]
> > > >> 2014-10-13 19:02:33,505 INFO [main] kafka.producer.SyncProducer:
> > > >> Connected to tr-pan-hclstr-15.amers1b.ciscloud:9092 for producing
> > > >> 2014-10-13 19:02:33,543 WARN [main] kafka.producer.async.DefaultEventHandler:
> > > >> Produce request with correlation id 20611 failed due to
> > > >> [wordcount,5]: kafka.common.NotLeaderForPartitionException,
> > > >> [wordcount,6]: kafka.common.NotLeaderForPartitionException,
> > > >> [wordcount,7]: kafka.common.NotLeaderForPartitionException
> > > >> 2014-10-13 19:02:33,694 INFO [main] kafka.producer.SyncProducer:
> > > >> Connected to tr-pan-hclstr-18.amers1b.ciscloud:9092 for producing
> > > >> 2014-10-13 19:02:33,725 WARN [main] kafka.producer.async.DefaultEventHandler:
> > > >> Produce request with correlation id 20612 failed due to
> > > >> [wordcount,0]: kafka.common.NotLeaderForPartitionException
> > > >> 2014-10-13 19:02:33,861 INFO [main] kafka.producer.SyncProducer:
> > > >> Connected to tr-pan-hclstr-11.amers1b.ciscloud:9092 for producing
> > > >> 2014-10-13 19:02:33,983 WARN [main] kafka.producer.async.DefaultEventHandler:
> > > >> Failed to send data since partitions [wordcount,4] don't have a leader
> > > >>
> > > >>
> > > >> Obviously something is terribly wrong... I am quite new to Kafka,
> > > >> so these messages don't make much sense to me, beyond telling me
> > > >> that some of the partitions don't have a leader.
> > > >>
> > > >> Could somebody be kind enough to explain the messages above?
> > > >>
> > > >> A few more questions:
> > > >>
> > > >> (1) How does one get into this state?
> > > >> (2) How can I get out of this state?
> > > >> (3) I have set auto.leader.rebalance.enable=true on all brokers.
> > > >> Shouldn't the partitions be balanced across all the brokers?
> > > >> (4) I can see that the Kafka service is running on all 8 nodes (I
> > > >> used ps ax -o "pid pgid args" and can see the Kafka Java process on
> > > >> each).
> > > >> (5) Is there a way I can force a re-balance? (See the sketch right
> > > >> after this list.)
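> > > >>
> > > >> For (5), I came across bin/kafka-preferred-replica-election.sh;
> > > >> would running it against the same ZK connect string be the right
> > > >> approach? A sketch of what I have in mind:
> > > >>
> > > >> bin/kafka-preferred-replica-election.sh --zookeeper tr-pan-hclstr-08.amers1b.ciscloud:2181/kafka/kafka-clstr-01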
> > > >>
> > > >>
> > > >>
> > > >> Regards,
> > > >> Jacob
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> ~
> > > >>
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > ~
> >
>



-- 
~
