[ https://issues.apache.org/jira/browse/FLINK-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611939#comment-14611939 ]
Andra Lungu edited comment on FLINK-2299 at 7/2/15 1:14 PM:
------------------------------------------------------------

The JM seems to have died at this time:

00:46:06,185 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@130.149.249.12:36710] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].

However, the TMs were all trying to register with it at

00:34:52,404 INFO  org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@130.149.249.11:6123/user/jobmanager (attempt 1, timeout: 500 milliseconds)

and could not... And in the JM log, they appear to be registered:

00:34:49,034 INFO  org.apache.flink.runtime.jobmanager.web.WebInfoServer - Started web info server for JobManager on 0.0.0.0:8081
00:34:51,807 INFO  org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at wally003 (akka.tcp://flink@130.149.249.13:41371/user/taskmanager) as 432442efcde05962b9fd8703399b692e. Current number of registered hosts is 1.
00:34:51,876 INFO  org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at wally002 (akka.tcp://flink@130.149.249.12:36710/user/taskmanager) as d958195edfa9d98fb0b3f83da41af5aa. Current number of registered hosts is 2.
...

At 00:46 the TMs were already doing this:

00:46:06,387 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager removed spill file directory /data/andra.lungu/flink_tmp/flink-io-27b23260-80ec-4f8a-9163-5cad90b12be7

I will not be able to run experiments until Sunday morning, but I still have the logs. I may not be looking in the right place?! As soon as I get my nodes back, I will increase the heartbeat interval; hopefully that will do the trick :) [Nevertheless, I would like to hear your opinion on the above log snippet.] Thanks!
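For reference, "increasing the heartbeat interval" would come down to raising the Akka death-watch settings in flink-conf.yaml, roughly as sketched below. The keys are the ones listed in the 0.9 configuration docs; the concrete values are only illustrative guesses for a heavily loaded cluster, not something verified in this thread:

    # flink-conf.yaml (illustrative values, tune to the cluster)
    akka.ask.timeout: 100 s
    akka.watch.heartbeat.interval: 30 s
    akka.watch.heartbeat.pause: 120 s

With a larger heartbeat pause, a TaskManager that is briefly unresponsive (for example while spilling or under GC pressure) should not be declared dead and have its slots released right away.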
> The slot on which the task manager was scheduled was killed
> -----------------------------------------------------------
>
>                 Key: FLINK-2299
>                 URL: https://issues.apache.org/jira/browse/FLINK-2299
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 0.9, 0.10
>            Reporter: Andra Lungu
>            Priority: Critical
>             Fix For: 0.9.1
>
> The following code:
> https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSATriangleCount.java
> run on the twitter follower graph:
> http://twitter.mpi-sws.org/data-icwsm2010.html
> with a similar configuration to the one in FLINK-2293,
> fails with the following exception:
>
> java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager 57c67d938c9144bec5ba798bb8ebe636 @ wally025 - 8 slots - URL: akka.tcp://flink@130.149.249.35:56135/user/taskmanager
> 	at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)
> 	at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
> 	at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
> 	at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:154)
> 	at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:182)
> 	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:421)
> 	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
> 	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
> 	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
> 	at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:36)
> 	at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:29)
> 	at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
> 	at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:29)
> 	at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
> 	at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:92)
> 	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
> 	at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
> 	at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
> 	at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
> 	at akka.actor.ActorCell.invoke(ActorCell.scala:486)
> 	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
> 	at akka.dispatch.Mailbox.run(Mailbox.scala:221)
> 	at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
> 	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> 	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> 06/29/2015 10:33:46	Job execution switched to status FAILING.
>
> The logs are here:
> https://drive.google.com/file/d/0BwnaKJcSLc43M1BhNUt5NWdINHc/view?usp=sharing

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)