Hi Till,

The following is the log file of one of the taskmanagers:

09:55:37,071 INFO  org.apache.flink.runtime.util.LeaderRetrievalUtils
     - Trying to select the network interface and address to use by
connecting to the leading JobManager.
09:55:37,072 INFO  org.apache.flink.runtime.util.LeaderRetrievalUtils
     - TaskManager will try to connect for 10000 milliseconds before
falling back to heuristics
09:55:37,075 INFO  org.apache.flink.runtime.net.ConnectionUtils
     - Retrieved new target address /10.155.208.156:6123.
09:55:37,084 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - TaskManager will use hostname/address 'vm-10-155-208-138.cloud.mwn.de'
(10.155.208.138) for communication.
09:55:37,085 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Starting TaskManager in streaming mode STREAMING
09:55:37,085 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Starting TaskManager actor system at 10.155.208.138:0
09:55:37,531 INFO  akka.event.slf4j.Slf4jLogger
     - Slf4jLogger started
09:55:37,587 INFO  Remoting
     - Starting remoting
09:55:37,774 INFO  Remoting
     - Remoting started; listening on addresses :[akka.tcp://
flink@10.155.208.138:49653]
09:55:37,782 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Starting TaskManager actor
09:55:37,798 INFO  org.apache.flink.runtime.io.network.netty.NettyConfig
      - NettyConfig [server address:
vm-10-155-208-138.cloud.mwn.de/10.155.208.138, server port: 32798, memory
segment size (bytes): 32768, transport type: NIO, number of server threads:
0 (use Netty's default), number of client threads: 0 (use Netty's default),
server connect backlog: 0 (use Netty's default), client connect timeout
(sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
09:55:37,803 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Messages between TaskManager and JobManager have a max timeout of
100000 milliseconds
09:55:37,811 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Temporary file directory '/tmp': total 4 GB, usable 0 GB (0.00%
usable)
09:55:37,848 INFO
 org.apache.flink.runtime.io.network.buffer.NetworkBufferPool  - Allocated
64 MB for network buffer pool (number of memory segments: 2048, bytes per
segment: 32768).
09:55:37,955 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Using 0.5 of the currently free heap space for Flink managed heap
memory (455 MB).
09:55:37,978 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager
     - I/O manager uses directory
/tmp/flink-io-3b11098e-a3ea-4a8a-8ea4-c3f1c5b13d6f for spill files.
09:55:37,986 INFO  org.apache.flink.runtime.filecache.FileCache
     - User file cache uses directory
/tmp/flink-dist-cache-516dd09a-1dfe-46eb-b50b-b6e24b6e9fad
09:55:38,146 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Starting TaskManager actor at akka://flink/user/taskmanager#56985599.
09:55:38,146 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - TaskManager data connection information:
vm-10-155-208-138.cloud.mwn.de (dataPort=32798)
09:55:38,147 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - TaskManager has 4 task slot(s).
09:55:38,148 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Memory usage stats: [HEAP: 100/990/990 MB, NON HEAP: 24/37/304 MB
(used/committed/max)]
09:55:38,151 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Trying to register at JobManager akka.tcp://
flink@10.155.208.156:6123/user/jobmanager (attempt 1, timeout: 500
milliseconds)
09:55:38,301 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Successful registration at JobManager (akka.tcp://
flink@10.155.208.156:6123/user/jobmanager), starting network stack and
library cache.
09:55:38,479 INFO  org.apache.flink.runtime.io.network.netty.NettyClient
      - Successful initialization (took 55 ms).
09:55:38,533 INFO  org.apache.flink.runtime.io.network.netty.NettyServer
      - Successful initialization (took 54 ms). Listening on SocketAddress /
10.155.208.138:32798.
09:55:38,534 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Determined BLOB server address to be /10.155.208.156:59504. Starting
BLOB cache.
09:55:38,536 INFO  org.apache.flink.runtime.blob.BlobCache
      - Created BLOB cache storage directory
/tmp/blobStore-8e88302d-3303-4c80-8613-f0be13911fb2
09:56:48,371 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager
     - I/O manager removed spill file directory
/tmp/flink-io-3b11098e-a3ea-4a8a-8ea4-c3f1c5b13d6f
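
As a rough sanity check on the memory figures in this log, the arithmetic can be sketched as below. This is only an estimate, not Flink's actual accounting: it assumes that both the network buffers and the managed memory are taken from the JVM heap in this setup, and the function names are made up for illustration. All numbers are copied from the log lines above.

```python
# Rough heap budget estimate from the TaskManager log values.
# Assumption (not confirmed by the log): network buffers and managed
# memory are both carved out of the JVM heap in this configuration.

SEGMENTS = 2048            # "number of memory segments: 2048"
SEGMENT_BYTES = 32768      # "bytes per segment: 32768"
HEAP_MAX_MB = 990          # "HEAP: 100/990/990 MB (used/committed/max)"
MANAGED_MB = 455           # "Flink managed heap memory (455 MB)"
MANAGED_FRACTION = 0.5     # "Using 0.5 of the currently free heap space"


def network_buffer_mb(segments: int, segment_bytes: int) -> float:
    """Size of the network buffer pool in MB (1 MB = 1024 * 1024 bytes)."""
    return segments * segment_bytes / (1024 * 1024)


def free_heap_at_sizing_mb(managed_mb: float, fraction: float) -> float:
    """Free heap implied by the managed memory size and the fraction used."""
    return managed_mb / fraction


def remaining_heap_mb(heap_max_mb: float, network_mb: float, managed_mb: float) -> float:
    """Heap left for everything else, e.g. user objects and heap-based state."""
    return heap_max_mb - network_mb - managed_mb


network_mb = network_buffer_mb(SEGMENTS, SEGMENT_BYTES)
free_mb = free_heap_at_sizing_mb(MANAGED_MB, MANAGED_FRACTION)
remaining_mb = remaining_heap_mb(HEAP_MAX_MB, network_mb, MANAGED_MB)

print(f"network buffers:        {network_mb:.0f} MB")
print(f"free heap at sizing:    {free_mb:.0f} MB")
print(f"left for other objects: {remaining_mb:.0f} MB")
```

Under these assumptions, well under half of the roughly 1 GB heap would be left for the HashMap-based key/value state that appears in the OutOfMemoryError stack trace quoted below.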

Kind Regards,
Ravinder

On Tue, Mar 15, 2016 at 10:22 AM, Till Rohrmann <trohrm...@apache.org>
wrote:

> Hi Ravinder,
>
> could you tell us what's written in the taskmanager log of the failing
> taskmanager? It should contain some kind of failure message explaining why
> the taskmanager stopped working.
>
> Moreover, given that you have 64 GB of main memory, you could easily give
> 50GB as heap memory to each taskmanager.
>
> Cheers,
> Till
>
> On Tue, Mar 15, 2016 at 9:48 AM, Ravinder Kaur <neetu0...@gmail.com>
> wrote:
>
>> Hello All,
>>
>> I'm running a simple word count example using the quickstart package from
>> Flink (0.10.1), on an input dataset of 500MB. This dataset is a set of
>> randomly generated words of length 8.
>>
>> Cluster Configuration:
>>
>> Number of machines: 7
>> Total cores : 25
>> Memory on each: 64GB
>>
>> I'm interested in the performance difference between Batch and Stream
>> modes, so I'm running the WordCount example with a number of iterations
>> (max 10) on datasets of sizes ranging between 100MB and 50GB, consisting
>> of random words of length 4 and 8.
>>
>> When I ran the experiments in Batch mode, all iterations ran fine, but
>> now I'm stuck in Streaming mode at this error:
>>
>> Caused by: java.lang.OutOfMemoryError: Java heap space
>>         at java.util.HashMap.resize(HashMap.java:580)
>>         at java.util.HashMap.addEntry(HashMap.java:879)
>>         at java.util.HashMap.put(HashMap.java:505)
>>         at
>> org.apache.flink.runtime.state.AbstractHeapKvState.update(AbstractHeapKvState.java:98)
>>         at
>> org.apache.flink.streaming.api.operators.StreamGroupedReduce.processElement(StreamGroupedReduce.java:59)
>>         at
>> org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:166)
>>         at
>> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:63)
>>         at
>> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:218)
>>         at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
>>         at java.lang.Thread.run(Thread.java:745)
>>
>> I investigated and found 2 solutions: (1) increasing taskmanager.heap.mb
>> and (2) reducing taskmanager.memory.fraction.
>>
>> Therefore I set taskmanager.heap.mb: 1024 and
>> taskmanager.memory.fraction: 0.5 (default 0.7)
>>
>> When I ran the example with this setting, I lose taskmanagers one by one
>> during the job execution with the following cause:
>>
>> Caused by: java.lang.Exception: The slot in which the task was executed
>> has been released. Probably loss of TaskManager
>> 831a72dad6fbb533b193820f45bdc5bc @ vm-10-155-208-138 - 4 slots - URL:
>> akka.tcp://flink@10.155.208.138:42222/user/taskmanager
>>         at
>> org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:153)
>>         at
>> org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
>>         at
>> org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
>>         at
>> org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156)
>>         at
>> org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215)
>>         at
>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:696)
>>         at
>> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>         at
>> org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
>>         at
>> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>         at
>> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>>         at
>> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>>         at
>> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>>         at
>> org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>>         at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>>         at
>> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)
>>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>         at
>> akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
>>         at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
>>         at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
>>         at akka.actor.ActorCell.invoke(ActorCell.scala:486)
>>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
>>         at akka.dispatch.Mailbox.run(Mailbox.scala:221)
>>         at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
>>         at
>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>         at
>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>         ... 2 more
>>
>>
>> When I look at the results generated at each taskmanager, they are fine.
>> The logs also don't show any cause for the job to get cancelled.
>>
>>
>> Could anyone kindly guide me here?
>>
>> Kind Regards,
>> Ravinder Kaur.
>>
>
>
