Hey Till, I managed to reproduce the bug; The logs are in the corresponding JIRA [hopefully I got the right ones :)]: FLINK-2299 <https://issues.apache.org/jira/browse/FLINK-2299>
As a side line. Guys, these two issues (FLINK-2299 <https://issues.apache.org/jira/browse/FLINK-2299> and FLINK-2293 <https://issues.apache.org/jira/browse/FLINK-2293>) are pretty much show stoppers for my work; this means that I can offer support to get them fixed: i.e. modify parameter x on the cluster; try running it now etc. And I would appreciate it if someone can look into them and pass me a status update. Thanks a lot! Sorry for the inconvenience! Andra On Tue, Jun 30, 2015 at 10:51 AM, Till Rohrmann <trohrm...@apache.org> wrote: > Do you have the JobManager and TaskManager logs of the corresponding TM, > by any chance? > > On Mon, Jun 29, 2015 at 8:12 PM, Andra Lungu <lungu.an...@gmail.com> > wrote: > >> Something similar in flink-0.10-SNAPSHOT: >> >> 06/29/2015 10:33:46 CHAIN Join(Join at main(TriangleCount.java:79)) >> -> Combine (Reduce at main(TriangleCount.java:79))(222/224) switched to >> FAILED >> java.lang.Exception: The slot in which the task was executed has been >> released. Probably loss of TaskManager 57c67d938c9144bec5ba798bb8ebe636 @ >> wally025 - 8 slots - URL: akka.tcp:// >> flink@130.149.249.35:56135/user/taskmanager >> at >> org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151) >> at >> org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547) >> at >> org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119) >> at >> org.apache.flink.runtime.instance.Instance.markDead(Instance.java:154) >> at >> org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:182) >> at >> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:421) >> at >> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) >> at >> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) >> at >> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) >> at >> org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:36) >> at >> org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:29) >> at >> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) >> at >> org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:29) >> at akka.actor.Actor$class.aroundReceive(Actor.scala:465) >> at >> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:92) >> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) >> at >> akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46) >> at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369) >> at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501) >> at akka.actor.ActorCell.invoke(ActorCell.scala:486) >> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) >> at akka.dispatch.Mailbox.run(Mailbox.scala:221) >> at akka.dispatch.Mailbox.exec(Mailbox.scala:231) >> at >> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >> at >> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) >> at >> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) >> at >> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) >> >> 06/29/2015 10:33:46 Job execution switched to status FAILING. >> >> >> On Mon, Jun 29, 2015 at 1:08 PM, Alexander Alexandrov < >> alexander.s.alexand...@gmail.com> wrote: >> >>> I witnessed a similar issue yesterday on a simple job (single task >>> chain, no shuffles) with a release-0.9 based fork. >>> >>> 2015-04-15 14:59 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>: >>> >>>> Yes , sorry for that..I found it somewhere in the logs..the problem was >>>> that the program didn't die immediately but was somehow hanging and I >>>> discovered the source of the problem only running the program on a subset >>>> of the data. >>>> >>>> Thnks for the support, >>>> Flavio >>>> >>>> On Wed, Apr 15, 2015 at 2:56 PM, Stephan Ewen <se...@apache.org> wrote: >>>> >>>>> This means that the TaskManager was lost. The JobManager can no longer >>>>> reach the TaskManager and consists all tasks executing ob the TaskManager >>>>> as failed. >>>>> >>>>> Have a look at the TaskManager log, it should describe why the >>>>> TaskManager failed. >>>>> Am 15.04.2015 14:45 schrieb "Flavio Pompermaier" <pomperma...@okkam.it >>>>> >: >>>>> >>>>>> Hi to all, >>>>>> >>>>>> I have this strange error in my job and I don't know what's going on. >>>>>> What can I do? >>>>>> >>>>>> The full exception is: >>>>>> >>>>>> The slot in which the task was scheduled has been killed (probably >>>>>> loss of TaskManager). >>>>>> at >>>>>> org.apache.flink.runtime.instance.SimpleSlot.cancel(SimpleSlot.java:98) >>>>>> at >>>>>> org.apache.flink.runtime.jobmanager.scheduler.SlotSharingGroupAssignment.releaseSimpleSlot(SlotSharingGroupAssignment.java:335) >>>>>> at >>>>>> org.apache.flink.runtime.jobmanager.scheduler.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:319) >>>>>> at >>>>>> org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:106) >>>>>> at >>>>>> org.apache.flink.runtime.instance.Instance.markDead(Instance.java:151) >>>>>> at >>>>>> org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:182) >>>>>> at >>>>>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:435) >>>>>> at >>>>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) >>>>>> at >>>>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) >>>>>> at >>>>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) >>>>>> at >>>>>> org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:37) >>>>>> at >>>>>> org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:30) >>>>>> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) >>>>>> at >>>>>> org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:30) >>>>>> at akka.actor.Actor$class.aroundReceive(Actor.scala:465) >>>>>> at >>>>>> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:94) >>>>>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) >>>>>> at akka.actor.ActorCell.invoke(ActorCell.scala:487) >>>>>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) >>>>>> at akka.dispatch.Mailbox.run(Mailbox.scala:221) >>>>>> at akka.dispatch.Mailbox.exec(Mailbox.scala:231) >>>>>> at >>>>>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >>>>>> at >>>>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) >>>>>> at >>>>>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) >>>>>> at >>>>>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) >>>>>> >>>>>> >>>> >>> >> >