[ https://issues.apache.org/jira/browse/FLINK-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611939#comment-14611939 ]
Andra Lungu edited comment on FLINK-2299 at 7/2/15 1:14 PM:
------------------------------------------------------------

The JM seems to have died at this time:

00:46:06,185 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@130.149.249.12:36710] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].

However, the TMs were all trying to register with it at

00:34:52,404 INFO  org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@130.149.249.11:6123/user/jobmanager (attempt 1, timeout: 500 milliseconds)

and could not... And in the JM log, they appear to be registered:

00:34:49,034 INFO  org.apache.flink.runtime.jobmanager.web.WebInfoServer - Started web info server for JobManager on 0.0.0.0:8081
00:34:51,807 INFO  org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at wally003 (akka.tcp://flink@130.149.249.13:41371/user/taskmanager) as 432442efcde05962b9fd8703399b692e. Current number of registered hosts is 1.
00:34:51,876 INFO  org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at wally002 (akka.tcp://flink@130.149.249.12:36710/user/taskmanager) as d958195edfa9d98fb0b3f83da41af5aa. Current number of registered hosts is 2.
...

At 00:46 the TMs were already doing this:

00:46:06,387 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager removed spill file directory /data/andra.lungu/flink_tmp/flink-io-27b23260-80ec-4f8a-9163-5cad90b12be7

I will not be able to run experiments until Sunday morning, but I still have the logs. I may not be looking in the right place?! As soon as I get my nodes back, I will increase the heartbeat interval; hopefully that will do the trick :) [Nevertheless, I would like to hear your opinion on the above log snippet.] Thanks!
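For reference, "increasing the heartbeat interval" would come down to raising the Akka death-watch settings in flink-conf.yaml, roughly as sketched below. The keys are the ones listed in the 0.9 configuration docs; the concrete values are only illustrative guesses for a heavily loaded cluster, not something verified in this thread:

    # flink-conf.yaml (illustrative values, tune to the cluster)
    akka.ask.timeout: 100 s
    akka.watch.heartbeat.interval: 30 s
    akka.watch.heartbeat.pause: 120 s

With a larger heartbeat pause, a TaskManager that is briefly unresponsive (for example while spilling or under GC pressure) should not be declared dead and have its slots released right away.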
> The slot on which the task manager was scheduled was killed
> -----------------------------------------------------------
>
>                 Key: FLINK-2299
>                 URL: https://issues.apache.org/jira/browse/FLINK-2299
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 0.9, 0.10
>            Reporter: Andra Lungu
>            Priority: Critical
>             Fix For: 0.9.1
>
> The following code:
> https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSATriangleCount.java
> run on the twitter follower graph:
> http://twitter.mpi-sws.org/data-icwsm2010.html
> with a similar configuration to the one in FLINK-2293,
> fails with the following exception:
>
> java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager 57c67d938c9144bec5ba798bb8ebe636 @ wally025 - 8 slots - URL: akka.tcp://flink@130.149.249.35:56135/user/taskmanager
> 	at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)
> 	at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
> 	at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
> 	at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:154)
> 	at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:182)
> 	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:421)
> 	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
> 	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
> 	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
> 	at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:36)
> 	at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:29)
> 	at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
> 	at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:29)
> 	at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
> 	at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:92)
> 	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
> 	at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
> 	at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
> 	at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
> 	at akka.actor.ActorCell.invoke(ActorCell.scala:486)
> 	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
> 	at akka.dispatch.Mailbox.run(Mailbox.scala:221)
> 	at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
> 	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> 	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> 06/29/2015 10:33:46	Job execution switched to status FAILING.
>
> The logs are here:
> https://drive.google.com/file/d/0BwnaKJcSLc43M1BhNUt5NWdINHc/view?usp=sharing

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)