Hi Ravinder,

this does not seem to be the relevant log extract. The log says that the TM is
started on port 49653, whereas the JM log says that the TM on port 42222 is
lost. Would you mind sharing the complete JM and TM logs with us?

Cheers,
Till

On Tue, Mar 15, 2016 at 10:54 AM, Ravinder Kaur <neetu0...@gmail.com> wrote:

> Hello Ufuk,
>
> Yes, the same WordCount program is being run.
>
> Kind Regards,
> Ravinder Kaur
>
> On Tue, Mar 15, 2016 at 10:45 AM, Ufuk Celebi <u...@apache.org> wrote:
>
>> What do you mean by iteration in this context? Are you repeatedly
>> running the same WordCount program for streaming and batch
>> respectively?
>>
>> – Ufuk
>>
>> On Tue, Mar 15, 2016 at 10:22 AM, Till Rohrmann <trohrm...@apache.org>
>> wrote:
>> > Hi Ravinder,
>> >
>> > could you tell us what's written in the taskmanager log of the failing
>> > taskmanager? There should be some kind of failure message explaining why
>> > the taskmanager stopped working.
>> >
>> > Moreover, given that you have 64 GB of main memory, you could easily
>> > give 50 GB as heap memory to each taskmanager.
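>> > For example, that could be a flink-conf.yaml entry along these lines (a
>> > sketch; the value is illustrative and should leave headroom for the OS
>> > and other processes on the machine):

```yaml
# flink-conf.yaml -- illustrative sketch, not taken from the original thread:
# give each TaskManager a 50 GB JVM heap on the 64 GB machines
taskmanager.heap.mb: 51200
```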
>> >
>> > Cheers,
>> > Till
>> >
>> > On Tue, Mar 15, 2016 at 9:48 AM, Ravinder Kaur <neetu0...@gmail.com>
>> wrote:
>> >>
>> >> Hello All,
>> >>
>> >> I'm running a simple word count example using the quickstart package
>> >> from Flink (0.10.1) on an input dataset of 500MB. This dataset is a
>> >> set of randomly generated words of length 8.
>> >>
>> >> Cluster Configuration:
>> >>
>> >> Number of machines: 7
>> >> Total cores : 25
>> >> Memory on each: 64GB
>> >>
>> >> I'm interested in the performance comparison between batch and
>> >> streaming modes, so I'm running the WordCount example with a number of
>> >> iterations (max 10) on datasets of sizes ranging between 100MB and
>> >> 50GB, consisting of random words of length 4 and 8.
>> >>
>> >> When I ran the experiments in batch mode, all iterations ran fine, but
>> >> now I'm stuck in streaming mode at this:
>> >>
>> >> Caused by: java.lang.OutOfMemoryError: Java heap space
>> >>         at java.util.HashMap.resize(HashMap.java:580)
>> >>         at java.util.HashMap.addEntry(HashMap.java:879)
>> >>         at java.util.HashMap.put(HashMap.java:505)
>> >>         at org.apache.flink.runtime.state.AbstractHeapKvState.update(AbstractHeapKvState.java:98)
>> >>         at org.apache.flink.streaming.api.operators.StreamGroupedReduce.processElement(StreamGroupedReduce.java:59)
>> >>         at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:166)
>> >>         at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:63)
>> >>         at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:218)
>> >>         at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
>> >>         at java.lang.Thread.run(Thread.java:745)
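>> >> One way to see why the heap state can grow like this (a sketch, not a
>> >> diagnosis from this thread): StreamGroupedReduce keeps one state entry
>> >> per distinct key in an on-heap map (AbstractHeapKvState), and with
>> >> randomly generated fixed-length words the number of distinct keys can
>> >> be enormous. Assuming a 26-letter lowercase alphabet:

```python
# Rough upper bound on the number of distinct keys (one heap-state entry
# each) for random fixed-length words.
# Assumption: words are drawn from a 26-letter lowercase alphabet; the
# thread does not say which alphabet was used.
def key_space(alphabet_size: int, word_length: int) -> int:
    return alphabet_size ** word_length

print(key_space(26, 4))  # 456976 possible 4-letter words
print(key_space(26, 8))  # 208827064576 possible 8-letter words
```

>> >> So for length-8 words the key space alone is orders of magnitude larger
>> >> than what a 1 GB heap can hold, even if the actual dataset only covers
>> >> part of it.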
>> >>
>> >> I investigated and found 2 possible solutions: (1) increasing
>> >> taskmanager.heap.mb and (2) reducing taskmanager.memory.fraction.
>> >>
>> >> Therefore I set taskmanager.heap.mb: 1024 and
>> >> taskmanager.memory.fraction: 0.5 (default 0.7).
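>> >> For reference, that corresponds to flink-conf.yaml entries along these
>> >> lines (a sketch; note that with a 1024 MB heap and a fraction of 0.5,
>> >> only roughly 512 MB would remain for user code and keyed state):

```yaml
# flink-conf.yaml -- sketch of the settings described above
taskmanager.heap.mb: 1024          # total TaskManager JVM heap
taskmanager.memory.fraction: 0.5   # share reserved for Flink managed memory (default 0.7)
```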
>> >>
>> >> When I ran the example with these settings, I lost taskmanagers one by
>> >> one during job execution, with the following cause:
>> >>
>> >> Caused by: java.lang.Exception: The slot in which the task was executed
>> >> has been released. Probably loss of TaskManager
>> >> 831a72dad6fbb533b193820f45bdc5bc @ vm-10-155-208-138 - 4 slots - URL:
>> >> akka.tcp://flink@10.155.208.138:42222/user/taskmanager
>> >>         at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:153)
>> >>         at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
>> >>         at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
>> >>         at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156)
>> >>         at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215)
>> >>         at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:696)
>> >>         at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>> >>         at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
>> >>         at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>> >>         at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>> >>         at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>> >>         at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>> >>         at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>> >>         at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>> >>         at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)
>> >>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>> >>         at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
>> >>         at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
>> >>         at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
>> >>         at akka.actor.ActorCell.invoke(ActorCell.scala:486)
>> >>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
>> >>         at akka.dispatch.Mailbox.run(Mailbox.scala:221)
>> >>         at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
>> >>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>> >>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>> >>         ... 2 more
>> >>
>> >>
>> >> When I look at the results generated at each taskmanager, they are
>> >> fine. The logs also don't show any cause for the job to get cancelled.
>> >>
>> >>
>> >> Could anyone kindly guide me here?
>> >>
>> >> Kind Regards,
>> >> Ravinder Kaur.
>> >
>> >
>>
>
>
