Hi Ravinder, the log of the TM you've sent is the log of the only TM which has not been disassociated from the JM. Can it be that you simply stopped the cluster which results in the disassociation events?
Normally, Flink should kill all processes. If you have some processes lingering around, then you should kill them first. The more memory you provide the more data can be kept in memory. Whenever the managed memory is full, then it will be spilled to disk. That's how you can also process data which does not fit completely into memory. However, all elements which are given to a user function will be kept on the heap space. If it now happens that your elements become too big or you keep too many elements on the heap, you'll see an OOM exception. Then it helps if you increase the assigned memory or lower the memory fraction. Cheers, Till On Tue, Mar 15, 2016 at 11:17 AM, Ravinder Kaur <neetu0...@gmail.com> wrote: > Hi Till, > > Log of JobManager > > 09:55:31,574 WARN org.apache.hadoop.util.NativeCodeLoader > - Unable to load native-hadoop library for your platform... using > builtin-java classes where applicable > 09:55:31,742 INFO org.apache.flink.runtime.jobmanager.JobManager > - > -------------------------------------------------------------------------------- > 09:55:31,742 INFO org.apache.flink.runtime.jobmanager.JobManager > - Starting JobManager (Version: 0.10.1, Rev:2e9b231, > Date:22.11.2015 @ 12:41:12 CET) > 09:55:31,742 INFO org.apache.flink.runtime.jobmanager.JobManager > - Current user: flink > 09:55:31,742 INFO org.apache.flink.runtime.jobmanager.JobManager > - JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.7/24.95-b01 > 09:55:31,743 INFO org.apache.flink.runtime.jobmanager.JobManager > - Maximum heap size: 246 MiBytes > 09:55:31,743 INFO org.apache.flink.runtime.jobmanager.JobManager > - JAVA_HOME: /usr/lib/jvm/java-1.7.0-openjdk-amd64 > 09:55:31,745 INFO org.apache.flink.runtime.jobmanager.JobManager > - Hadoop version: 2.7.0 > 09:55:31,746 INFO org.apache.flink.runtime.jobmanager.JobManager > - JVM Options: > 09:55:31,746 INFO org.apache.flink.runtime.jobmanager.JobManager > - -Xms256m > 09:55:31,746 INFO org.apache.flink.runtime.jobmanager.JobManager > - -Xmx256m > 09:55:31,746 INFO org.apache.flink.runtime.jobmanager.JobManager > - -XX:MaxPermSize=256m > 09:55:31,746 INFO org.apache.flink.runtime.jobmanager.JobManager > - > -Dlog.file=/home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-vm-10-155-208-156.cloud.mwn.de.log > 09:55:31,746 INFO org.apache.flink.runtime.jobmanager.JobManager > - > -Dlog4j.configuration=file:/home/flink/flink-0.10.1/conf/log4j.properties > 09:55:31,746 INFO org.apache.flink.runtime.jobmanager.JobManager > - > -Dlogback.configurationFile=file:/home/flink/flink-0.10.1/conf/logback.xml > 09:55:31,746 INFO org.apache.flink.runtime.jobmanager.JobManager > - Program Arguments: > 09:55:31,746 INFO org.apache.flink.runtime.jobmanager.JobManager > - --configDir > 09:55:31,746 INFO org.apache.flink.runtime.jobmanager.JobManager > - /home/flink/flink-0.10.1/conf > 09:55:31,746 INFO org.apache.flink.runtime.jobmanager.JobManager > - --executionMode > 09:55:31,746 INFO org.apache.flink.runtime.jobmanager.JobManager > - cluster > 09:55:31,746 INFO org.apache.flink.runtime.jobmanager.JobManager > - --streamingMode > 09:55:31,746 INFO org.apache.flink.runtime.jobmanager.JobManager > - streaming > 09:55:31,746 INFO org.apache.flink.runtime.jobmanager.JobManager > - Classpath: > /home/flink/flink-0.10.1/lib/flink-dist_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/flink-python_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/log4j-1.2.17.jar:/home/flink/flink-0.10.1/lib/slf4j-log4j12-1.7.7.jar::: > 09:55:31,746 INFO org.apache.flink.runtime.jobmanager.JobManager > - > -------------------------------------------------------------------------------- > 09:55:31,924 INFO org.apache.flink.runtime.jobmanager.JobManager > - Loading configuration from /home/flink/flink-0.10.1/conf > 09:55:31,941 INFO org.apache.flink.runtime.jobmanager.JobManager > - Staring JobManager without high-availability > 09:55:31,950 INFO org.apache.flink.runtime.jobmanager.JobManager > - Starting JobManager on 10.155.208.156:6123 with execution mode > CLUSTER and streaming mode STREAMING > 09:55:32,039 INFO org.apache.flink.runtime.jobmanager.JobManager > - Security is not enabled. Starting non-authenticated JobManager. > 09:55:32,039 INFO org.apache.flink.runtime.jobmanager.JobManager > - Starting JobManager > 09:55:32,040 INFO org.apache.flink.runtime.jobmanager.JobManager > - Starting JobManager actor system at 10.155.208.156:6123 > 09:55:32,483 INFO akka.event.slf4j.Slf4jLogger > - Slf4jLogger started > 09:55:32,564 INFO Remoting > - Starting remoting > 09:55:32,730 INFO Remoting > - Remoting started; listening on addresses :[akka.tcp:// > flink@10.155.208.156:6123] > 09:55:32,731 INFO org.apache.flink.runtime.jobmanager.JobManager > - Starting JobManger web frontend > 09:55:32,761 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor > - Using directory /tmp/flink-web-6cd96e7e-62be-4301-9376-c98528bd58b8 > for the web interface files > 09:55:32,762 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor > - Serving job manager log from > /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-vm-10-155-208-156.cloud.mwn.de.log > 09:55:32,762 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor > - Serving job manager stdout from > /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-vm-10-155-208-156.cloud.mwn.de.out > 09:55:33,040 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor > - Web frontend listening at 0:0:0:0:0:0:0:0:8081 > 09:55:33,041 INFO org.apache.flink.runtime.jobmanager.JobManager > - Starting JobManager actor > 09:55:33,046 INFO org.apache.flink.runtime.blob.BlobServer > - Created BLOB server storage directory > /tmp/blobStore-28cbb318-efc7-4a8b-85a0-1ea6539f1ba8 > 09:55:33,047 INFO org.apache.flink.runtime.blob.BlobServer > - Started BLOB server at 0.0.0.0:59504 - max concurrent requests: > 50 - max backlog: 1000 > 09:55:33,057 INFO org.apache.flink.runtime.jobmanager.JobManager > - Starting JobManager at akka.tcp:// > flink@10.155.208.156:6123/user/jobmanager. > 09:55:33,060 INFO org.apache.flink.runtime.jobmanager.JobManager > - JobManager akka.tcp://flink@10.155.208.156:6123/user/jobmanager > was granted leadership with leader session ID None. > 09:55:33,063 INFO org.apache.flink.runtime.jobmanager.MemoryArchivist > - Started memory archivist akka://flink/user/archive > 09:55:33,064 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor > - Starting with JobManager akka.tcp:// > flink@10.155.208.156:6123/user/jobmanager on port 8081 > 09:55:33,064 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever > - New leader reachable under akka.tcp:// > flink@10.155.208.156:6123/user/jobmanager:null. > 09:55:34,013 INFO org.apache.flink.runtime.instance.InstanceManager > - Registered TaskManager at vm-10-155-208-156 (akka.tcp:// > flink@10.155.208.156:33728/user/taskmanager) as > 9b665ef16d88314bf37f816b5afdfe79. Current number of registered hosts is 1. > Current number of alive task slots is 3. > 09:55:34,735 INFO org.apache.flink.runtime.instance.InstanceManager > - Registered TaskManager at vm-10-155-208-157 (akka.tcp:// > flink@10.155.208.157:33728/user/taskmanager) as > fc8b661bb0ad5568855c8c2d6a0029f2. Current number of registered hosts is 2. > Current number of alive task slots is 6. > 09:55:35,394 INFO org.apache.flink.runtime.instance.InstanceManager > - Registered TaskManager at vm-10-155-208-158 (akka.tcp:// > flink@10.155.208.158:33728/user/taskmanager) as > c12e0ca481aaa4bdd6fa6be4cfc8335e. Current number of registered hosts is 3. > Current number of alive task slots is 9. > 09:55:36,105 INFO org.apache.flink.runtime.instance.InstanceManager > - Registered TaskManager at slave3 (akka.tcp:// > flink@10.155.208.135:37219/user/taskmanager) as > 6e023a6417743d1e5d67410dc76824b8. Current number of registered hosts is 4. > Current number of alive task slots is 13. > 09:55:36,542 INFO org.apache.flink.runtime.instance.InstanceManager > - Registered TaskManager at vm-10-155-208-137 (akka.tcp:// > flink@10.155.208.137:35052/user/taskmanager) as > c0098ff87ca76c958ca18a4e7d30e24e. Current number of registered hosts is 5. > Current number of alive task slots is 17. > 09:55:37,422 INFO org.apache.flink.runtime.instance.InstanceManager > - Registered TaskManager at slave2 (akka.tcp:// > flink@10.155.208.136:50432/user/taskmanager) as > d62208a25ce877757aeb72fe4b6530fc. Current number of registered hosts is 6. > Current number of alive task slots is 21. > 09:55:38,351 INFO org.apache.flink.runtime.instance.InstanceManager > - Registered TaskManager at vm-10-155-208-138 (akka.tcp:// > flink@10.155.208.138:49653/user/taskmanager) as > b0c2566fcc0e2baf7e2605e96e5a6b9c. Current number of registered hosts is 7. > Current number of alive task slots is 25. > 09:56:45,448 WARN akka.remote.ReliableDeliverySupervisor > - Association with remote system [akka.tcp:// > flink@10.155.208.156:33728] has failed, address is now gated for [5000] > ms. Reason is: [Disassociated]. > 09:56:46,013 WARN akka.remote.ReliableDeliverySupervisor > - Association with remote system [akka.tcp:// > flink@10.155.208.157:33728] has failed, address is now gated for [5000] > ms. Reason is: [Disassociated]. > 09:56:46,630 WARN akka.remote.ReliableDeliverySupervisor > - Association with remote system [akka.tcp:// > flink@10.155.208.158:33728] has failed, address is now gated for [5000] > ms. Reason is: [Disassociated]. > 09:56:47,259 WARN akka.remote.ReliableDeliverySupervisor > - Association with remote system [akka.tcp:// > flink@10.155.208.135:37219] has failed, address is now gated for [5000] > ms. Reason is: [Disassociated]. > 09:56:47,788 WARN akka.remote.ReliableDeliverySupervisor > - Association with remote system [akka.tcp:// > flink@10.155.208.137:35052] has failed, address is now gated for [5000] > ms. Reason is: [Disassociated]. > 09:56:48,333 WARN akka.remote.ReliableDeliverySupervisor > - Association with remote system [akka.tcp:// > flink@10.155.208.136:50432] has failed, address is now gated for [5000] > ms. Reason is: [Disassociated]. > 09:56:48,629 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor > - Removing web root dir > /tmp/flink-web-6cd96e7e-62be-4301-9376-c98528bd58b8 > 09:56:48,635 INFO org.apache.flink.runtime.blob.BlobServer > - Stopped BLOB server at 0.0.0.0:59504 > > and One of the TaskManagers > > 09:55:36,694 WARN org.apache.hadoop.util.NativeCodeLoader > - Unable to load native-hadoop library for your platform... using > builtin-java classes where applicable > 09:55:36,918 INFO org.apache.flink.runtime.taskmanager.TaskManager > - > -------------------------------------------------------------------------------- > 09:55:36,918 INFO org.apache.flink.runtime.taskmanager.TaskManager > - Starting TaskManager (Version: 0.10.1, Rev:2e9b231, > Date:22.11.2015 @ 12:41:12 CET) > 09:55:36,919 INFO org.apache.flink.runtime.taskmanager.TaskManager > - Current user: flink > 09:55:36,919 INFO org.apache.flink.runtime.taskmanager.TaskManager > - JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.7/24.91-b01 > 09:55:36,919 INFO org.apache.flink.runtime.taskmanager.TaskManager > - Maximum heap size: 990 MiBytes > 09:55:36,919 INFO org.apache.flink.runtime.taskmanager.TaskManager > - JAVA_HOME: /usr/lib/jvm/java-1.7.0-openjdk-amd64 > 09:55:36,922 INFO org.apache.flink.runtime.taskmanager.TaskManager > - Hadoop version: 2.7.0 > 09:55:36,922 INFO org.apache.flink.runtime.taskmanager.TaskManager > - JVM Options: > 09:55:36,922 INFO org.apache.flink.runtime.taskmanager.TaskManager > - -XX:+UseConcMarkSweepGC > 09:55:36,922 INFO org.apache.flink.runtime.taskmanager.TaskManager > - -XX:+CMSClassUnloadingEnabled > 09:55:36,922 INFO org.apache.flink.runtime.taskmanager.TaskManager > - -Xms1024M > 09:55:36,922 INFO org.apache.flink.runtime.taskmanager.TaskManager > - -Xmx1024M > 09:55:36,922 INFO org.apache.flink.runtime.taskmanager.TaskManager > - -XX:MaxDirectMemorySize=8388607T > 09:55:36,922 INFO org.apache.flink.runtime.taskmanager.TaskManager > - -XX:MaxPermSize=256m > 09:55:36,922 INFO org.apache.flink.runtime.taskmanager.TaskManager > - > -Dlog.file=/home/flink/flink-0.10.1/log/flink-flink-taskmanager-0-vm-10-155-208-138.cloud.mwn.de.log > 09:55:36,923 INFO org.apache.flink.runtime.taskmanager.TaskManager > - > -Dlog4j.configuration=file:/home/flink/flink-0.10.1/conf/log4j.properties > 09:55:36,923 INFO org.apache.flink.runtime.taskmanager.TaskManager > - > -Dlogback.configurationFile=file:/home/flink/flink-0.10.1/conf/logback.xml > 09:55:36,923 INFO org.apache.flink.runtime.taskmanager.TaskManager > - Program Arguments: > 09:55:36,923 INFO org.apache.flink.runtime.taskmanager.TaskManager > - --configDir > 09:55:36,923 INFO org.apache.flink.runtime.taskmanager.TaskManager > - /home/flink/flink-0.10.1/conf > 09:55:36,923 INFO org.apache.flink.runtime.taskmanager.TaskManager > - --streamingMode > 09:55:36,923 INFO org.apache.flink.runtime.taskmanager.TaskManager > - streaming > 09:55:36,923 INFO org.apache.flink.runtime.taskmanager.TaskManager > - Classpath: > /home/flink/flink-0.10.1/lib/flink-dist_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/flink-python_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/log4j-1.2.17.jar:/home/flink/flink-0.10.1/lib/slf4j-log4j12-1.7.7.jar:/usr/lib/jvm/java-1.7.0-openjdk-amd64/lib/tools.jar:: > 09:55:36,923 INFO org.apache.flink.runtime.taskmanager.TaskManager > - > -------------------------------------------------------------------------------- > 09:55:36,928 INFO org.apache.flink.runtime.taskmanager.TaskManager > - Maximum number of open file descriptors is 4096 > 09:55:36,955 INFO org.apache.flink.runtime.taskmanager.TaskManager > - Loading configuration from /home/flink/flink-0.10.1/conf > 09:55:37,040 INFO org.apache.flink.runtime.taskmanager.TaskManager > - Security is not enabled. Starting non-authenticated TaskManager. > 09:55:37,071 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils > - Trying to select the network interface and address to use by > connecting to the leading JobManager. > 09:55:37,072 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils > - TaskManager will try to connect for 10000 milliseconds before > falling back to heuristics > 09:55:37,075 INFO org.apache.flink.runtime.net.ConnectionUtils > - Retrieved new target address /10.155.208.156:6123. > 09:55:37,084 INFO org.apache.flink.runtime.taskmanager.TaskManager > - TaskManager will use hostname/address ' > vm-10-155-208-138.cloud.mwn.de' (10.155.208.138) for communication. > 09:55:37,085 INFO org.apache.flink.runtime.taskmanager.TaskManager > - Starting TaskManager in streaming mode STREAMING > 09:55:37,085 INFO org.apache.flink.runtime.taskmanager.TaskManager > - Starting TaskManager actor system at 10.155.208.138:0 > 09:55:37,531 INFO akka.event.slf4j.Slf4jLogger > - Slf4jLogger started > 09:55:37,587 INFO Remoting > - Starting remoting > 09:55:37,774 INFO Remoting > - Remoting started; listening on addresses :[akka.tcp:// > flink@10.155.208.138:49653] > 09:55:37,782 INFO org.apache.flink.runtime.taskmanager.TaskManager > - Starting TaskManager actor > 09:55:37,798 INFO org.apache.flink.runtime.io.network.netty.NettyConfig > - NettyConfig [server address: > vm-10-155-208-138.cloud.mwn.de/10.155.208.138, server port: 32798, memory > segment size (bytes): 32768, transport type: NIO, number of server threads: > 0 (use Netty's default), number of client threads: 0 (use Netty's default), > server connect backlog: 0 (use Netty's default), client connect timeout > (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)] > 09:55:37,803 INFO org.apache.flink.runtime.taskmanager.TaskManager > - Messages between TaskManager and JobManager have a max timeout of > 100000 milliseconds > 09:55:37,811 INFO org.apache.flink.runtime.taskmanager.TaskManager > - Temporary file directory '/tmp': total 4 GB, usable 0 GB (0.00% > usable) > 09:55:37,848 INFO > org.apache.flink.runtime.io.network.buffer.NetworkBufferPool - Allocated > 64 MB for network buffer pool (number of memory segments: 2048, bytes per > segment: 32768). > 09:55:37,955 INFO org.apache.flink.runtime.taskmanager.TaskManager > - Using 0.5 of the currently free heap space for Flink managed heap > memory (455 MB). > 09:55:37,978 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager > - I/O manager uses directory > /tmp/flink-io-3b11098e-a3ea-4a8a-8ea4-c3f1c5b13d6f for spill files. > 09:55:37,986 INFO org.apache.flink.runtime.filecache.FileCache > - User file cache uses directory > /tmp/flink-dist-cache-516dd09a-1dfe-46eb-b50b-b6e24b6e9fad > 09:55:38,146 INFO org.apache.flink.runtime.taskmanager.TaskManager > - Starting TaskManager actor at > akka://flink/user/taskmanager#56985599. > 09:55:38,146 INFO org.apache.flink.runtime.taskmanager.TaskManager > - TaskManager data connection information: > vm-10-155-208-138.cloud.mwn.de (dataPort=32798) > 09:55:38,147 INFO org.apache.flink.runtime.taskmanager.TaskManager > - TaskManager has 4 task slot(s). > 09:55:38,148 INFO org.apache.flink.runtime.taskmanager.TaskManager > - Memory usage stats: [HEAP: 100/990/990 MB, NON HEAP: 24/37/304 MB > (used/committed/max)] > 09:55:38,151 INFO org.apache.flink.runtime.taskmanager.TaskManager > - Trying to register at JobManager akka.tcp:// > flink@10.155.208.156:6123/user/jobmanager (attempt 1, timeout: 500 > milliseconds) > 09:55:38,301 INFO org.apache.flink.runtime.taskmanager.TaskManager > - Successful registration at JobManager (akka.tcp:// > flink@10.155.208.156:6123/user/jobmanager), starting network stack and > library cache. > 09:55:38,479 INFO org.apache.flink.runtime.io.network.netty.NettyClient > - Successful initialization (took 55 ms). > 09:55:38,533 INFO org.apache.flink.runtime.io.network.netty.NettyServer > - Successful initialization (took 54 ms). Listening on SocketAddress / > 10.155.208.138:32798. > 09:55:38,534 INFO org.apache.flink.runtime.taskmanager.TaskManager > - Determined BLOB server address to be /10.155.208.156:59504. > Starting BLOB cache. > 09:55:38,536 INFO org.apache.flink.runtime.blob.BlobCache > - Created BLOB cache storage directory > /tmp/blobStore-8e88302d-3303-4c80-8613-f0be13911fb2 > 09:56:48,371 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager > - I/O manager removed spill file directory > /tmp/flink-io-3b11098e-a3ea-4a8a-8ea4-c3f1c5b13d6f > > I also found that 4 of the deamons were not stopped after the cluster was > stopped, though the JM claimed it had stopped these TM deamons. > > I have another question from you previous comment as to why is it > necessary to allocate 50 GB of memory to taskmanager.heap.mb? Is the value > I provided not sufficient for the job? If this is so, how was I able to run > the previous examples with input datasets as large as 50GB error-free? It > would help if you could provide some explanation here? > > Thank you. > > Kind Regards, > Ravinder Kaur > > > > > On Tue, Mar 15, 2016 at 11:01 AM, Till Rohrmann <trohrm...@apache.org> > wrote: > >> Hi Ravinder, >> >> this should not be the relevant log extract. The log says that the TM is >> started on port 49653 and the JM log says that the TM on port 42222 is >> lost. Would you mind to share the complete JM and TM logs with us? >> >> Cheers, >> Till >> >> On Tue, Mar 15, 2016 at 10:54 AM, Ravinder Kaur <neetu0...@gmail.com> >> wrote: >> >>> Hello Ufuk, >>> >>> Yes, the same WordCount program is being run. >>> >>> Kind Regards, >>> Ravinder Kaur >>> >>> On Tue, Mar 15, 2016 at 10:45 AM, Ufuk Celebi <u...@apache.org> wrote: >>> >>>> What do you mean with iteration in this context? Are you repeatedly >>>> running the same WordCount program for streaming and batch >>>> respectively? >>>> >>>> – Ufuk >>>> >>>> On Tue, Mar 15, 2016 at 10:22 AM, Till Rohrmann <trohrm...@apache.org> >>>> wrote: >>>> > Hi Ravinder, >>>> > >>>> > could you tell us what's written in the taskmanager log of the failing >>>> > taskmanager? There should be some kind of failure why the taskmanager >>>> > stopped working. >>>> > >>>> > Moreover, given that you have 64 GB of main memory, you could easily >>>> give >>>> > 50GB as heap memory to each taskmanager. >>>> > >>>> > Cheers, >>>> > Till >>>> > >>>> > On Tue, Mar 15, 2016 at 9:48 AM, Ravinder Kaur <neetu0...@gmail.com> >>>> wrote: >>>> >> >>>> >> Hello All, >>>> >> >>>> >> I'm running a simple word count example using the quickstart package >>>> from >>>> >> the Flink(0.10.1), on an input dataset of 500MB. This dataset is a >>>> set of >>>> >> randomly generated words of length 8. >>>> >> >>>> >> Cluster Configuration: >>>> >> >>>> >> Number of machines: 7 >>>> >> Total cores : 25 >>>> >> Memory on each: 64GB >>>> >> >>>> >> I'm interested in the performance measure between Batch and Stream >>>> modes >>>> >> and so I'm running WordCount example with number of iteration (max >>>> 10) on >>>> >> datasets of sizes ranging between 100MB and 50GB consisting of >>>> random words >>>> >> of length 4 and 8. >>>> >> >>>> >> While I ran the experiments in Batch mode all iterations ran fine, >>>> but now >>>> >> I'm stuck in the Streaming mode at this >>>> >> >>>> >> Caused by: java.lang.OutOfMemoryError: Java heap space >>>> >> at java.util.HashMap.resize(HashMap.java:580) >>>> >> at java.util.HashMap.addEntry(HashMap.java:879) >>>> >> at java.util.HashMap.put(HashMap.java:505) >>>> >> at >>>> >> >>>> org.apache.flink.runtime.state.AbstractHeapKvState.update(AbstractHeapKvState.java:98) >>>> >> at >>>> >> >>>> org.apache.flink.streaming.api.operators.StreamGroupedReduce.processElement(StreamGroupedReduce.java:59) >>>> >> at >>>> >> >>>> org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:166) >>>> >> at >>>> >> >>>> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:63) >>>> >> at >>>> >> >>>> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:218) >>>> >> at >>>> org.apache.flink.runtime.taskmanager.Task.run(Task.java:584) >>>> >> at java.lang.Thread.run(Thread.java:745) >>>> >> >>>> >> I investigated found 2 solutions. (1) Increasing the >>>> taskmanager.heap.mb >>>> >> and (2) Reducing the taskmanager.memory.fraction >>>> >> >>>> >> Therefore I set taskmanager.heap.mb: 1024 and >>>> taskmanager.memory.fraction: >>>> >> 0.5 (default 0.7) >>>> >> >>>> >> When I ran the example with this setting I loose taskmanagers one by >>>> one >>>> >> during the job execution with the following cause >>>> >> >>>> >> Caused by: java.lang.Exception: The slot in which the task was >>>> executed >>>> >> has been released. Probably loss of TaskManager >>>> >> 831a72dad6fbb533b193820f45bdc5bc @ vm-10-155-208-138 - 4 slots - URL: >>>> >> akka.tcp://flink@10.155.208.138:42222/user/taskmanager >>>> >> at >>>> >> >>>> org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:153) >>>> >> at >>>> >> >>>> org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547) >>>> >> at >>>> >> >>>> org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119) >>>> >> at >>>> >> >>>> org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156) >>>> >> at >>>> >> >>>> org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215) >>>> >> at >>>> >> >>>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:696) >>>> >> at >>>> >> >>>> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) >>>> >> at >>>> >> >>>> org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44) >>>> >> at >>>> >> >>>> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) >>>> >> at >>>> >> >>>> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33) >>>> >> at >>>> >> >>>> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28) >>>> >> at >>>> >> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) >>>> >> at >>>> >> >>>> org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28) >>>> >> at akka.actor.Actor$class.aroundReceive(Actor.scala:465) >>>> >> at >>>> >> >>>> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100) >>>> >> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) >>>> >> at >>>> >> >>>> akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46) >>>> >> at >>>> akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369) >>>> >> at >>>> akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501) >>>> >> at akka.actor.ActorCell.invoke(ActorCell.scala:486) >>>> >> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) >>>> >> at akka.dispatch.Mailbox.run(Mailbox.scala:221) >>>> >> at akka.dispatch.Mailbox.exec(Mailbox.scala:231) >>>> >> at >>>> >> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >>>> >> at >>>> >> >>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) >>>> >> ... 2 more >>>> >> >>>> >> >>>> >> While I look at the results generated at each taskmanager, they are >>>> fine. >>>> >> The logs also don't show any causes for the the job to get cancelled. >>>> >> >>>> >> >>>> >> Could anyone kindly guide me here? >>>> >> >>>> >> Kind Regards, >>>> >> Ravinder Kaur. >>>> > >>>> > >>>> >>> >>> >> >