Re: OutofMemoryError: Java heap space & Loss of Taskmanager

Till Rohrmann Tue, 15 Mar 2016 03:34:56 -0700

Hi Ravinder,

the log of the TM you've sent is the log of the only TM which has not been
disassociated from the JM. Can it be that you simply stopped the cluster
which results in the disassociation events?


Normally, Flink should kill all processes. If you have some processes
lingering around, then you should kill them first.

The more memory you provide the more data can be kept in memory. Whenever
the managed memory is full, then it will be spilled to disk. That's how you
can also process data which does not fit completely into memory. However,
all elements which are given to a user function will be kept on the heap
space. If it now happens that your elements become too big or you keep too
many elements on the heap, you'll see an OOM exception. Then it helps if
you increase the assigned memory or lower the memory fraction.

Cheers,
Till

On Tue, Mar 15, 2016 at 11:17 AM, Ravinder Kaur <neetu0...@gmail.com> wrote:

> Hi Till,
>
> Log of JobManager
>
> 09:55:31,574 WARN  org.apache.hadoop.util.NativeCodeLoader
>       - Unable to load native-hadoop library for your platform... using
> builtin-java classes where applicable
> 09:55:31,742 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        -
> --------------------------------------------------------------------------------
> 09:55:31,742 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        -  Starting JobManager (Version: 0.10.1, Rev:2e9b231,
> Date:22.11.2015 @ 12:41:12 CET)
> 09:55:31,742 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        -  Current user: flink
> 09:55:31,742 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.7/24.95-b01
> 09:55:31,743 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        -  Maximum heap size: 246 MiBytes
> 09:55:31,743 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        -  JAVA_HOME: /usr/lib/jvm/java-1.7.0-openjdk-amd64
> 09:55:31,745 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        -  Hadoop version: 2.7.0
> 09:55:31,746 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        -  JVM Options:
> 09:55:31,746 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        -     -Xms256m
> 09:55:31,746 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        -     -Xmx256m
> 09:55:31,746 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        -     -XX:MaxPermSize=256m
> 09:55:31,746 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        -
> -Dlog.file=/home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-vm-10-155-208-156.cloud.mwn.de.log
> 09:55:31,746 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        -
> -Dlog4j.configuration=file:/home/flink/flink-0.10.1/conf/log4j.properties
> 09:55:31,746 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        -
> -Dlogback.configurationFile=file:/home/flink/flink-0.10.1/conf/logback.xml
> 09:55:31,746 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        -  Program Arguments:
> 09:55:31,746 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        -     --configDir
> 09:55:31,746 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        -     /home/flink/flink-0.10.1/conf
> 09:55:31,746 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        -     --executionMode
> 09:55:31,746 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        -     cluster
> 09:55:31,746 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        -     --streamingMode
> 09:55:31,746 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        -     streaming
> 09:55:31,746 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        -  Classpath:
> /home/flink/flink-0.10.1/lib/flink-dist_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/flink-python_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/log4j-1.2.17.jar:/home/flink/flink-0.10.1/lib/slf4j-log4j12-1.7.7.jar:::
> 09:55:31,746 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        -
> --------------------------------------------------------------------------------
> 09:55:31,924 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        - Loading configuration from /home/flink/flink-0.10.1/conf
> 09:55:31,941 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        - Staring JobManager without high-availability
> 09:55:31,950 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        - Starting JobManager on 10.155.208.156:6123 with execution mode
> CLUSTER and streaming mode STREAMING
> 09:55:32,039 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        - Security is not enabled. Starting non-authenticated JobManager.
> 09:55:32,039 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        - Starting JobManager
> 09:55:32,040 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        - Starting JobManager actor system at 10.155.208.156:6123
> 09:55:32,483 INFO  akka.event.slf4j.Slf4jLogger
>        - Slf4jLogger started
> 09:55:32,564 INFO  Remoting
>        - Starting remoting
> 09:55:32,730 INFO  Remoting
>        - Remoting started; listening on addresses :[akka.tcp://
> flink@10.155.208.156:6123]
> 09:55:32,731 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        - Starting JobManger web frontend
> 09:55:32,761 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>       - Using directory /tmp/flink-web-6cd96e7e-62be-4301-9376-c98528bd58b8
> for the web interface files
> 09:55:32,762 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>       - Serving job manager log from
> /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-vm-10-155-208-156.cloud.mwn.de.log
> 09:55:32,762 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>       - Serving job manager stdout from
> /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-vm-10-155-208-156.cloud.mwn.de.out
> 09:55:33,040 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>       - Web frontend listening at 0:0:0:0:0:0:0:0:8081
> 09:55:33,041 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        - Starting JobManager actor
> 09:55:33,046 INFO  org.apache.flink.runtime.blob.BlobServer
>        - Created BLOB server storage directory
> /tmp/blobStore-28cbb318-efc7-4a8b-85a0-1ea6539f1ba8
> 09:55:33,047 INFO  org.apache.flink.runtime.blob.BlobServer
>        - Started BLOB server at 0.0.0.0:59504 - max concurrent requests:
> 50 - max backlog: 1000
> 09:55:33,057 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        - Starting JobManager at akka.tcp://
> flink@10.155.208.156:6123/user/jobmanager.
> 09:55:33,060 INFO  org.apache.flink.runtime.jobmanager.JobManager
>        - JobManager akka.tcp://flink@10.155.208.156:6123/user/jobmanager
> was granted leadership with leader session ID None.
> 09:55:33,063 INFO  org.apache.flink.runtime.jobmanager.MemoryArchivist
>       - Started memory archivist akka://flink/user/archive
> 09:55:33,064 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>       - Starting with JobManager akka.tcp://
> flink@10.155.208.156:6123/user/jobmanager on port 8081
> 09:55:33,064 INFO  org.apache.flink.runtime.webmonitor.JobManagerRetriever
>       - New leader reachable under akka.tcp://
> flink@10.155.208.156:6123/user/jobmanager:null.
> 09:55:34,013 INFO  org.apache.flink.runtime.instance.InstanceManager
>       - Registered TaskManager at vm-10-155-208-156 (akka.tcp://
> flink@10.155.208.156:33728/user/taskmanager) as
> 9b665ef16d88314bf37f816b5afdfe79. Current number of registered hosts is 1.
> Current number of alive task slots is 3.
> 09:55:34,735 INFO  org.apache.flink.runtime.instance.InstanceManager
>       - Registered TaskManager at vm-10-155-208-157 (akka.tcp://
> flink@10.155.208.157:33728/user/taskmanager) as
> fc8b661bb0ad5568855c8c2d6a0029f2. Current number of registered hosts is 2.
> Current number of alive task slots is 6.
> 09:55:35,394 INFO  org.apache.flink.runtime.instance.InstanceManager
>       - Registered TaskManager at vm-10-155-208-158 (akka.tcp://
> flink@10.155.208.158:33728/user/taskmanager) as
> c12e0ca481aaa4bdd6fa6be4cfc8335e. Current number of registered hosts is 3.
> Current number of alive task slots is 9.
> 09:55:36,105 INFO  org.apache.flink.runtime.instance.InstanceManager
>       - Registered TaskManager at slave3 (akka.tcp://
> flink@10.155.208.135:37219/user/taskmanager) as
> 6e023a6417743d1e5d67410dc76824b8. Current number of registered hosts is 4.
> Current number of alive task slots is 13.
> 09:55:36,542 INFO  org.apache.flink.runtime.instance.InstanceManager
>       - Registered TaskManager at vm-10-155-208-137 (akka.tcp://
> flink@10.155.208.137:35052/user/taskmanager) as
> c0098ff87ca76c958ca18a4e7d30e24e. Current number of registered hosts is 5.
> Current number of alive task slots is 17.
> 09:55:37,422 INFO  org.apache.flink.runtime.instance.InstanceManager
>       - Registered TaskManager at slave2 (akka.tcp://
> flink@10.155.208.136:50432/user/taskmanager) as
> d62208a25ce877757aeb72fe4b6530fc. Current number of registered hosts is 6.
> Current number of alive task slots is 21.
> 09:55:38,351 INFO  org.apache.flink.runtime.instance.InstanceManager
>       - Registered TaskManager at vm-10-155-208-138 (akka.tcp://
> flink@10.155.208.138:49653/user/taskmanager) as
> b0c2566fcc0e2baf7e2605e96e5a6b9c. Current number of registered hosts is 7.
> Current number of alive task slots is 25.
> 09:56:45,448 WARN  akka.remote.ReliableDeliverySupervisor
>        - Association with remote system [akka.tcp://
> flink@10.155.208.156:33728] has failed, address is now gated for [5000]
> ms. Reason is: [Disassociated].
> 09:56:46,013 WARN  akka.remote.ReliableDeliverySupervisor
>        - Association with remote system [akka.tcp://
> flink@10.155.208.157:33728] has failed, address is now gated for [5000]
> ms. Reason is: [Disassociated].
> 09:56:46,630 WARN  akka.remote.ReliableDeliverySupervisor
>        - Association with remote system [akka.tcp://
> flink@10.155.208.158:33728] has failed, address is now gated for [5000]
> ms. Reason is: [Disassociated].
> 09:56:47,259 WARN  akka.remote.ReliableDeliverySupervisor
>        - Association with remote system [akka.tcp://
> flink@10.155.208.135:37219] has failed, address is now gated for [5000]
> ms. Reason is: [Disassociated].
> 09:56:47,788 WARN  akka.remote.ReliableDeliverySupervisor
>        - Association with remote system [akka.tcp://
> flink@10.155.208.137:35052] has failed, address is now gated for [5000]
> ms. Reason is: [Disassociated].
> 09:56:48,333 WARN  akka.remote.ReliableDeliverySupervisor
>        - Association with remote system [akka.tcp://
> flink@10.155.208.136:50432] has failed, address is now gated for [5000]
> ms. Reason is: [Disassociated].
> 09:56:48,629 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>       - Removing web root dir
> /tmp/flink-web-6cd96e7e-62be-4301-9376-c98528bd58b8
> 09:56:48,635 INFO  org.apache.flink.runtime.blob.BlobServer
>        - Stopped BLOB server at 0.0.0.0:59504
>
> and One of the TaskManagers
>
> 09:55:36,694 WARN  org.apache.hadoop.util.NativeCodeLoader
>       - Unable to load native-hadoop library for your platform... using
> builtin-java classes where applicable
> 09:55:36,918 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        -
> --------------------------------------------------------------------------------
> 09:55:36,918 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        -  Starting TaskManager (Version: 0.10.1, Rev:2e9b231,
> Date:22.11.2015 @ 12:41:12 CET)
> 09:55:36,919 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        -  Current user: flink
> 09:55:36,919 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.7/24.91-b01
> 09:55:36,919 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        -  Maximum heap size: 990 MiBytes
> 09:55:36,919 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        -  JAVA_HOME: /usr/lib/jvm/java-1.7.0-openjdk-amd64
> 09:55:36,922 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        -  Hadoop version: 2.7.0
> 09:55:36,922 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        -  JVM Options:
> 09:55:36,922 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        -     -XX:+UseConcMarkSweepGC
> 09:55:36,922 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        -     -XX:+CMSClassUnloadingEnabled
> 09:55:36,922 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        -     -Xms1024M
> 09:55:36,922 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        -     -Xmx1024M
> 09:55:36,922 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        -     -XX:MaxDirectMemorySize=8388607T
> 09:55:36,922 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        -     -XX:MaxPermSize=256m
> 09:55:36,922 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        -
> -Dlog.file=/home/flink/flink-0.10.1/log/flink-flink-taskmanager-0-vm-10-155-208-138.cloud.mwn.de.log
> 09:55:36,923 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        -
> -Dlog4j.configuration=file:/home/flink/flink-0.10.1/conf/log4j.properties
> 09:55:36,923 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        -
> -Dlogback.configurationFile=file:/home/flink/flink-0.10.1/conf/logback.xml
> 09:55:36,923 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        -  Program Arguments:
> 09:55:36,923 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        -     --configDir
> 09:55:36,923 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        -     /home/flink/flink-0.10.1/conf
> 09:55:36,923 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        -     --streamingMode
> 09:55:36,923 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        -     streaming
> 09:55:36,923 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        -  Classpath:
> /home/flink/flink-0.10.1/lib/flink-dist_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/flink-python_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/log4j-1.2.17.jar:/home/flink/flink-0.10.1/lib/slf4j-log4j12-1.7.7.jar:/usr/lib/jvm/java-1.7.0-openjdk-amd64/lib/tools.jar::
> 09:55:36,923 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        -
> --------------------------------------------------------------------------------
> 09:55:36,928 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        - Maximum number of open file descriptors is 4096
> 09:55:36,955 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        - Loading configuration from /home/flink/flink-0.10.1/conf
> 09:55:37,040 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        - Security is not enabled. Starting non-authenticated TaskManager.
> 09:55:37,071 INFO  org.apache.flink.runtime.util.LeaderRetrievalUtils
>        - Trying to select the network interface and address to use by
> connecting to the leading JobManager.
> 09:55:37,072 INFO  org.apache.flink.runtime.util.LeaderRetrievalUtils
>        - TaskManager will try to connect for 10000 milliseconds before
> falling back to heuristics
> 09:55:37,075 INFO  org.apache.flink.runtime.net.ConnectionUtils
>        - Retrieved new target address /10.155.208.156:6123.
> 09:55:37,084 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        - TaskManager will use hostname/address '
> vm-10-155-208-138.cloud.mwn.de' (10.155.208.138) for communication.
> 09:55:37,085 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        - Starting TaskManager in streaming mode STREAMING
> 09:55:37,085 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        - Starting TaskManager actor system at 10.155.208.138:0
> 09:55:37,531 INFO  akka.event.slf4j.Slf4jLogger
>        - Slf4jLogger started
> 09:55:37,587 INFO  Remoting
>        - Starting remoting
> 09:55:37,774 INFO  Remoting
>        - Remoting started; listening on addresses :[akka.tcp://
> flink@10.155.208.138:49653]
> 09:55:37,782 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        - Starting TaskManager actor
> 09:55:37,798 INFO  org.apache.flink.runtime.io.network.netty.NettyConfig
>       - NettyConfig [server address:
> vm-10-155-208-138.cloud.mwn.de/10.155.208.138, server port: 32798, memory
> segment size (bytes): 32768, transport type: NIO, number of server threads:
> 0 (use Netty's default), number of client threads: 0 (use Netty's default),
> server connect backlog: 0 (use Netty's default), client connect timeout
> (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
> 09:55:37,803 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        - Messages between TaskManager and JobManager have a max timeout of
> 100000 milliseconds
> 09:55:37,811 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        - Temporary file directory '/tmp': total 4 GB, usable 0 GB (0.00%
> usable)
> 09:55:37,848 INFO
>  org.apache.flink.runtime.io.network.buffer.NetworkBufferPool  - Allocated
> 64 MB for network buffer pool (number of memory segments: 2048, bytes per
> segment: 32768).
> 09:55:37,955 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        - Using 0.5 of the currently free heap space for Flink managed heap
> memory (455 MB).
> 09:55:37,978 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager
>        - I/O manager uses directory
> /tmp/flink-io-3b11098e-a3ea-4a8a-8ea4-c3f1c5b13d6f for spill files.
> 09:55:37,986 INFO  org.apache.flink.runtime.filecache.FileCache
>        - User file cache uses directory
> /tmp/flink-dist-cache-516dd09a-1dfe-46eb-b50b-b6e24b6e9fad
> 09:55:38,146 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        - Starting TaskManager actor at
> akka://flink/user/taskmanager#56985599.
> 09:55:38,146 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        - TaskManager data connection information:
> vm-10-155-208-138.cloud.mwn.de (dataPort=32798)
> 09:55:38,147 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        - TaskManager has 4 task slot(s).
> 09:55:38,148 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        - Memory usage stats: [HEAP: 100/990/990 MB, NON HEAP: 24/37/304 MB
> (used/committed/max)]
> 09:55:38,151 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        - Trying to register at JobManager akka.tcp://
> flink@10.155.208.156:6123/user/jobmanager (attempt 1, timeout: 500
> milliseconds)
> 09:55:38,301 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        - Successful registration at JobManager (akka.tcp://
> flink@10.155.208.156:6123/user/jobmanager), starting network stack and
> library cache.
> 09:55:38,479 INFO  org.apache.flink.runtime.io.network.netty.NettyClient
>       - Successful initialization (took 55 ms).
> 09:55:38,533 INFO  org.apache.flink.runtime.io.network.netty.NettyServer
>       - Successful initialization (took 54 ms). Listening on SocketAddress /
> 10.155.208.138:32798.
> 09:55:38,534 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        - Determined BLOB server address to be /10.155.208.156:59504.
> Starting BLOB cache.
> 09:55:38,536 INFO  org.apache.flink.runtime.blob.BlobCache
>       - Created BLOB cache storage directory
> /tmp/blobStore-8e88302d-3303-4c80-8613-f0be13911fb2
> 09:56:48,371 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager
>        - I/O manager removed spill file directory
> /tmp/flink-io-3b11098e-a3ea-4a8a-8ea4-c3f1c5b13d6f
>
> I also found that 4 of the deamons were not stopped after the cluster was
> stopped, though the JM claimed it had stopped these TM deamons.
>
> I have another question from you previous comment as to why is it
> necessary to allocate 50 GB of memory to taskmanager.heap.mb? Is the value
> I provided not sufficient for the job? If this is so, how was I able to run
> the previous examples with input datasets as large as 50GB error-free? It
> would help if you could provide some explanation here?
>
> Thank you.
>
> Kind Regards,
> Ravinder Kaur
>
>
>
>
> On Tue, Mar 15, 2016 at 11:01 AM, Till Rohrmann <trohrm...@apache.org>
> wrote:
>
>> Hi Ravinder,
>>
>> this should not be the relevant log extract. The log says that the TM is
>> started on port 49653 and the JM log says that the TM on port 42222 is
>> lost. Would you mind to share the complete JM and TM logs with us?
>>
>> Cheers,
>> Till
>>
>> On Tue, Mar 15, 2016 at 10:54 AM, Ravinder Kaur <neetu0...@gmail.com>
>> wrote:
>>
>>> Hello Ufuk,
>>>
>>> Yes, the same WordCount program is being run.
>>>
>>> Kind Regards,
>>> Ravinder Kaur
>>>
>>> On Tue, Mar 15, 2016 at 10:45 AM, Ufuk Celebi <u...@apache.org> wrote:
>>>
>>>> What do you mean with iteration in this context? Are you repeatedly
>>>> running the same WordCount program for streaming and batch
>>>> respectively?
>>>>
>>>> – Ufuk
>>>>
>>>> On Tue, Mar 15, 2016 at 10:22 AM, Till Rohrmann <trohrm...@apache.org>
>>>> wrote:
>>>> > Hi Ravinder,
>>>> >
>>>> > could you tell us what's written in the taskmanager log of the failing
>>>> > taskmanager? There should be some kind of failure why the taskmanager
>>>> > stopped working.
>>>> >
>>>> > Moreover, given that you have 64 GB of main memory, you could easily
>>>> give
>>>> > 50GB as heap memory to each taskmanager.
>>>> >
>>>> > Cheers,
>>>> > Till
>>>> >
>>>> > On Tue, Mar 15, 2016 at 9:48 AM, Ravinder Kaur <neetu0...@gmail.com>
>>>> wrote:
>>>> >>
>>>> >> Hello All,
>>>> >>
>>>> >> I'm running a simple word count example using the quickstart package
>>>> from
>>>> >> the Flink(0.10.1), on an input dataset of 500MB. This dataset is a
>>>> set of
>>>> >> randomly generated words of length 8.
>>>> >>
>>>> >> Cluster Configuration:
>>>> >>
>>>> >> Number of machines: 7
>>>> >> Total cores : 25
>>>> >> Memory on each: 64GB
>>>> >>
>>>> >> I'm interested in the performance measure between Batch and Stream
>>>> modes
>>>> >> and so I'm running WordCount example with number of iteration (max
>>>> 10) on
>>>> >> datasets of sizes ranging between 100MB and 50GB consisting of
>>>> random words
>>>> >> of length 4 and 8.
>>>> >>
>>>> >> While I ran the experiments in Batch mode all iterations ran fine,
>>>> but now
>>>> >> I'm stuck in the Streaming mode at this
>>>> >>
>>>> >> Caused by: java.lang.OutOfMemoryError: Java heap space
>>>> >>         at java.util.HashMap.resize(HashMap.java:580)
>>>> >>         at java.util.HashMap.addEntry(HashMap.java:879)
>>>> >>         at java.util.HashMap.put(HashMap.java:505)
>>>> >>         at
>>>> >>
>>>> org.apache.flink.runtime.state.AbstractHeapKvState.update(AbstractHeapKvState.java:98)
>>>> >>         at
>>>> >>
>>>> org.apache.flink.streaming.api.operators.StreamGroupedReduce.processElement(StreamGroupedReduce.java:59)
>>>> >>         at
>>>> >>
>>>> org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:166)
>>>> >>         at
>>>> >>
>>>> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:63)
>>>> >>         at
>>>> >>
>>>> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:218)
>>>> >>         at
>>>> org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
>>>> >>         at java.lang.Thread.run(Thread.java:745)
>>>> >>
>>>> >> I investigated found 2 solutions. (1) Increasing the
>>>> taskmanager.heap.mb
>>>> >> and (2) Reducing the taskmanager.memory.fraction
>>>> >>
>>>> >> Therefore I set taskmanager.heap.mb: 1024 and
>>>> taskmanager.memory.fraction:
>>>> >> 0.5 (default 0.7)
>>>> >>
>>>> >> When I ran the example with this setting I loose taskmanagers one by
>>>> one
>>>> >> during the job execution with the following cause
>>>> >>
>>>> >> Caused by: java.lang.Exception: The slot in which the task was
>>>> executed
>>>> >> has been released. Probably loss of TaskManager
>>>> >> 831a72dad6fbb533b193820f45bdc5bc @ vm-10-155-208-138 - 4 slots - URL:
>>>> >> akka.tcp://flink@10.155.208.138:42222/user/taskmanager
>>>> >>         at
>>>> >>
>>>> org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:153)
>>>> >>         at
>>>> >>
>>>> org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
>>>> >>         at
>>>> >>
>>>> org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
>>>> >>         at
>>>> >>
>>>> org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156)
>>>> >>         at
>>>> >>
>>>> org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215)
>>>> >>         at
>>>> >>
>>>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:696)
>>>> >>         at
>>>> >>
>>>> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>>> >>         at
>>>> >>
>>>> org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
>>>> >>         at
>>>> >>
>>>> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>>> >>         at
>>>> >>
>>>> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>>>> >>         at
>>>> >>
>>>> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>>>> >>         at
>>>> >> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>>>> >>         at
>>>> >>
>>>> org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>>>> >>         at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>>>> >>         at
>>>> >>
>>>> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)
>>>> >>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>>> >>         at
>>>> >>
>>>> akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
>>>> >>         at
>>>> akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
>>>> >>         at
>>>> akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
>>>> >>         at akka.actor.ActorCell.invoke(ActorCell.scala:486)
>>>> >>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
>>>> >>         at akka.dispatch.Mailbox.run(Mailbox.scala:221)
>>>> >>         at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
>>>> >>         at
>>>> >> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>> >>         at
>>>> >>
>>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>> >>         ... 2 more
>>>> >>
>>>> >>
>>>> >> While I look at the results generated at each taskmanager, they are
>>>> fine.
>>>> >> The logs also don't show any causes for the the job to get cancelled.
>>>> >>
>>>> >>
>>>> >> Could anyone kindly guide me here?
>>>> >>
>>>> >> Kind Regards,
>>>> >> Ravinder Kaur.
>>>> >
>>>> >
>>>>
>>>
>>>
>>
>

Re: OutofMemoryError: Java heap space & Loss of Taskmanager

Reply via email to