Hi Till, Following is the log file of one of the taskmanagers
09:55:37,071 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils - Trying to select the network interface and address to use by connecting to the leading JobManager. 09:55:37,072 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics 09:55:37,075 INFO org.apache.flink.runtime.net.ConnectionUtils - Retrieved new target address /10.155.208.156:6123. 09:55:37,084 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager will use hostname/address 'vm-10-155-208-138.cloud.mwn.de' (10.155.208.138) for communication. 09:55:37,085 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager in streaming mode STREAMING 09:55:37,085 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor system at 10.155.208.138:0 09:55:37,531 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started 09:55:37,587 INFO Remoting - Starting remoting 09:55:37,774 INFO Remoting - Remoting started; listening on addresses :[akka.tcp:// flink@10.155.208.138:49653] 09:55:37,782 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor 09:55:37,798 INFO org.apache.flink.runtime.io.network.netty.NettyConfig - NettyConfig [server address: vm-10-155-208-138.cloud.mwn.de/10.155.208.138, server port: 32798, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 0 (use Netty's default), number of client threads: 0 (use Netty's default), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)] 09:55:37,803 INFO org.apache.flink.runtime.taskmanager.TaskManager - Messages between TaskManager and JobManager have a max timeout of 100000 milliseconds 09:55:37,811 INFO org.apache.flink.runtime.taskmanager.TaskManager - Temporary file directory '/tmp': total 4 GB, usable 0 GB (0.00% usable) 09:55:37,848 INFO org.apache.flink.runtime.io.network.buffer.NetworkBufferPool - Allocated 64 MB for network buffer pool (number of memory segments: 2048, bytes per segment: 32768). 09:55:37,955 INFO org.apache.flink.runtime.taskmanager.TaskManager - Using 0.5 of the currently free heap space for Flink managed heap memory (455 MB). 09:55:37,978 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager uses directory /tmp/flink-io-3b11098e-a3ea-4a8a-8ea4-c3f1c5b13d6f for spill files. 09:55:37,986 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /tmp/flink-dist-cache-516dd09a-1dfe-46eb-b50b-b6e24b6e9fad 09:55:38,146 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor at akka://flink/user/taskmanager#56985599. 09:55:38,146 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager data connection information: vm-10-155-208-138.cloud.mwn.de (dataPort=32798) 09:55:38,147 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager has 4 task slot(s). 09:55:38,148 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 100/990/990 MB, NON HEAP: 24/37/304 MB (used/committed/max)] 09:55:38,151 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp:// flink@10.155.208.156:6123/user/jobmanager (attempt 1, timeout: 500 milliseconds) 09:55:38,301 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp:// flink@10.155.208.156:6123/user/jobmanager), starting network stack and library cache. 09:55:38,479 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful initialization (took 55 ms). 09:55:38,533 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 54 ms). Listening on SocketAddress / 10.155.208.138:32798. 09:55:38,534 INFO org.apache.flink.runtime.taskmanager.TaskManager - Determined BLOB server address to be /10.155.208.156:59504. Starting BLOB cache. 09:55:38,536 INFO org.apache.flink.runtime.blob.BlobCache - Created BLOB cache storage directory /tmp/blobStore-8e88302d-3303-4c80-8613-f0be13911fb2 09:56:48,371 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager removed spill file directory /tmp/flink-io-3b11098e-a3ea-4a8a-8ea4-c3f1c5b13d6f Kind Regards, Ravinder On Tue, Mar 15, 2016 at 10:22 AM, Till Rohrmann <trohrm...@apache.org> wrote: > Hi Ravinder, > > could you tell us what's written in the taskmanager log of the failing > taskmanager? There should be some kind of failure why the taskmanager > stopped working. > > Moreover, given that you have 64 GB of main memory, you could easily give > 50GB as heap memory to each taskmanager. > > Cheers, > Till > > On Tue, Mar 15, 2016 at 9:48 AM, Ravinder Kaur <neetu0...@gmail.com> > wrote: > >> Hello All, >> >> I'm running a simple word count example using the quickstart package from >> the Flink(0.10.1), on an input dataset of 500MB. This dataset is a set of >> randomly generated words of length 8. >> >> Cluster Configuration: >> >> Number of machines: 7 >> Total cores : 25 >> Memory on each: 64GB >> >> I'm interested in the performance measure between Batch and Stream modes >> and so I'm running WordCount example with number of iteration (max 10) on >> datasets of sizes ranging between 100MB and 50GB consisting of random words >> of length 4 and 8. >> >> While I ran the experiments in Batch mode all iterations ran fine, but >> now I'm stuck in the Streaming mode at this >> >> Caused by: java.lang.OutOfMemoryError: Java heap space >> at java.util.HashMap.resize(HashMap.java:580) >> at java.util.HashMap.addEntry(HashMap.java:879) >> at java.util.HashMap.put(HashMap.java:505) >> at >> org.apache.flink.runtime.state.AbstractHeapKvState.update(AbstractHeapKvState.java:98) >> at >> org.apache.flink.streaming.api.operators.StreamGroupedReduce.processElement(StreamGroupedReduce.java:59) >> at >> org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:166) >> at >> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:63) >> at >> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:218) >> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584) >> at java.lang.Thread.run(Thread.java:745) >> >> I investigated found 2 solutions. (1) Increasing the taskmanager.heap.mb >> and (2) Reducing the taskmanager.memory.fraction >> >> Therefore I set taskmanager.heap.mb: 1024 and >> taskmanager.memory.fraction: 0.5 (default 0.7) >> >> When I ran the example with this setting I loose taskmanagers one by one >> during the job execution with the following cause >> >> Caused by: java.lang.Exception: The slot in which the task was executed >> has been released. Probably loss of TaskManager >> 831a72dad6fbb533b193820f45bdc5bc @ vm-10-155-208-138 - 4 slots - URL: >> akka.tcp://flink@10.155.208.138:42222/user/taskmanager >> at >> org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:153) >> at >> org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547) >> at >> org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119) >> at >> org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156) >> at >> org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215) >> at >> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:696) >> at >> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) >> at >> org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44) >> at >> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) >> at >> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33) >> at >> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28) >> at >> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) >> at >> org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28) >> at akka.actor.Actor$class.aroundReceive(Actor.scala:465) >> at >> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100) >> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) >> at >> akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46) >> at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369) >> at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501) >> at akka.actor.ActorCell.invoke(ActorCell.scala:486) >> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) >> at akka.dispatch.Mailbox.run(Mailbox.scala:221) >> at akka.dispatch.Mailbox.exec(Mailbox.scala:231) >> at >> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >> at >> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) >> ... 2 more >> >> >> While I look at the results generated at each taskmanager, they are fine. >> The logs also don't show any causes for the the job to get cancelled. >> >> >> Could anyone kindly guide me here? >> >> Kind Regards, >> Ravinder Kaur. >> > >