I witnessed a similar issue yesterday on a simple job (single task chain, no shuffles) with a release-0.9 based fork.
2015-04-15 14:59 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>: > Yes , sorry for that..I found it somewhere in the logs..the problem was > that the program didn't die immediately but was somehow hanging and I > discovered the source of the problem only running the program on a subset > of the data. > > Thnks for the support, > Flavio > > On Wed, Apr 15, 2015 at 2:56 PM, Stephan Ewen <se...@apache.org> wrote: > >> This means that the TaskManager was lost. The JobManager can no longer >> reach the TaskManager and consists all tasks executing ob the TaskManager >> as failed. >> >> Have a look at the TaskManager log, it should describe why the >> TaskManager failed. >> Am 15.04.2015 14:45 schrieb "Flavio Pompermaier" <pomperma...@okkam.it>: >> >>> Hi to all, >>> >>> I have this strange error in my job and I don't know what's going on. >>> What can I do? >>> >>> The full exception is: >>> >>> The slot in which the task was scheduled has been killed (probably loss >>> of TaskManager). >>> at >>> org.apache.flink.runtime.instance.SimpleSlot.cancel(SimpleSlot.java:98) >>> at >>> org.apache.flink.runtime.jobmanager.scheduler.SlotSharingGroupAssignment.releaseSimpleSlot(SlotSharingGroupAssignment.java:335) >>> at >>> org.apache.flink.runtime.jobmanager.scheduler.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:319) >>> at >>> org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:106) >>> at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:151) >>> at >>> org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:182) >>> at >>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:435) >>> at >>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) >>> at >>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) >>> at >>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) >>> at >>> org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:37) >>> at >>> org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:30) >>> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) >>> at >>> org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:30) >>> at akka.actor.Actor$class.aroundReceive(Actor.scala:465) >>> at >>> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:94) >>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) >>> at akka.actor.ActorCell.invoke(ActorCell.scala:487) >>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) >>> at akka.dispatch.Mailbox.run(Mailbox.scala:221) >>> at akka.dispatch.Mailbox.exec(Mailbox.scala:231) >>> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >>> at >>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) >>> at >>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) >>> at >>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) >>> >>> >