What do you mean with iteration in this context? Are you repeatedly running the same WordCount program for streaming and batch respectively?
– Ufuk On Tue, Mar 15, 2016 at 10:22 AM, Till Rohrmann <trohrm...@apache.org> wrote: > Hi Ravinder, > > could you tell us what's written in the taskmanager log of the failing > taskmanager? There should be some kind of failure why the taskmanager > stopped working. > > Moreover, given that you have 64 GB of main memory, you could easily give > 50GB as heap memory to each taskmanager. > > Cheers, > Till > > On Tue, Mar 15, 2016 at 9:48 AM, Ravinder Kaur <neetu0...@gmail.com> wrote: >> >> Hello All, >> >> I'm running a simple word count example using the quickstart package from >> the Flink(0.10.1), on an input dataset of 500MB. This dataset is a set of >> randomly generated words of length 8. >> >> Cluster Configuration: >> >> Number of machines: 7 >> Total cores : 25 >> Memory on each: 64GB >> >> I'm interested in the performance measure between Batch and Stream modes >> and so I'm running WordCount example with number of iteration (max 10) on >> datasets of sizes ranging between 100MB and 50GB consisting of random words >> of length 4 and 8. >> >> While I ran the experiments in Batch mode all iterations ran fine, but now >> I'm stuck in the Streaming mode at this >> >> Caused by: java.lang.OutOfMemoryError: Java heap space >> at java.util.HashMap.resize(HashMap.java:580) >> at java.util.HashMap.addEntry(HashMap.java:879) >> at java.util.HashMap.put(HashMap.java:505) >> at >> org.apache.flink.runtime.state.AbstractHeapKvState.update(AbstractHeapKvState.java:98) >> at >> org.apache.flink.streaming.api.operators.StreamGroupedReduce.processElement(StreamGroupedReduce.java:59) >> at >> org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:166) >> at >> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:63) >> at >> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:218) >> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584) >> at java.lang.Thread.run(Thread.java:745) >> >> I investigated found 2 solutions. (1) Increasing the taskmanager.heap.mb >> and (2) Reducing the taskmanager.memory.fraction >> >> Therefore I set taskmanager.heap.mb: 1024 and taskmanager.memory.fraction: >> 0.5 (default 0.7) >> >> When I ran the example with this setting I loose taskmanagers one by one >> during the job execution with the following cause >> >> Caused by: java.lang.Exception: The slot in which the task was executed >> has been released. Probably loss of TaskManager >> 831a72dad6fbb533b193820f45bdc5bc @ vm-10-155-208-138 - 4 slots - URL: >> akka.tcp://flink@10.155.208.138:42222/user/taskmanager >> at >> org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:153) >> at >> org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547) >> at >> org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119) >> at >> org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156) >> at >> org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215) >> at >> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:696) >> at >> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) >> at >> org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44) >> at >> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) >> at >> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33) >> at >> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28) >> at >> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) >> at >> org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28) >> at akka.actor.Actor$class.aroundReceive(Actor.scala:465) >> at >> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100) >> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) >> at >> akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46) >> at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369) >> at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501) >> at akka.actor.ActorCell.invoke(ActorCell.scala:486) >> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) >> at akka.dispatch.Mailbox.run(Mailbox.scala:221) >> at akka.dispatch.Mailbox.exec(Mailbox.scala:231) >> at >> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >> at >> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) >> ... 2 more >> >> >> While I look at the results generated at each taskmanager, they are fine. >> The logs also don't show any causes for the the job to get cancelled. >> >> >> Could anyone kindly guide me here? >> >> Kind Regards, >> Ravinder Kaur. > >