Dear Flink Community,

I am trying to fit a support vector machine classifier using the CoCoA implementation provided in flink/ml/classification/ on a data set of moderate size (400k data points, 2000 features, approx. 12 GB) on a cluster of 25 nodes with 28 GB of memory each, where each task manager is assigned the full 28 GB via taskmanager.heap.mb.
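For reference, the training code is essentially the standard FlinkML SVM usage, roughly as follows (the input path and parameter values are placeholders rather than my exact settings):

import org.apache.flink.api.scala._
import org.apache.flink.ml.MLUtils
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.classification.SVM

val env = ExecutionEnvironment.getExecutionEnvironment

// read the training data from a LibSVM-formatted file (placeholder path)
val trainingData: DataSet[LabeledVector] =
  MLUtils.readLibSVM(env, "hdfs:///path/to/training-data.libsvm")

// configure the CoCoA-based SVM learner (placeholder parameter values)
val svm = SVM()
  .setBlocks(env.getParallelism)  // number of data blocks used by CoCoA
  .setIterations(100)             // outer CoCoA iterations
  .setLocalIterations(10)         // SDCA iterations per block and round
  .setRegularization(0.001)
  .setStepsize(0.1)

// trigger the iterative training job that eventually fails
svm.fit(trainingData)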
With the standard configuration I constantly run into different variants of JVM heap space OutOfMemoryErrors, e.g.:

com.esotericsoftware.kryo.KryoException: java.io.IOException: Failed to serialize element. Serialized size (> 276647402 bytes) exceeds JVM heap space
Serialization trace: data (org.apache.flink.ml.math.SparseVector) ...

As changing the degree of parallelism (DOP) did not change anything, I significantly reduced taskmanager.memory.fraction. With this I now (reproducibly) run into the following problem: after running for a while, the job fails with this error:

java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager @ host slots - URL: akka.tcp://flink@url 2/user/taskmanager
    at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:153)
    at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
    at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
    at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
    at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
    at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:154)
    at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:182)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:421)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
    at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:36)
    at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:29)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
    at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:29)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
    at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:92)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
    at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
    at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
    at akka.actor.ActorCell.invoke(ActorCell.scala:486)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
    at akka.dispatch.Mailbox.run(Mailbox.scala:221)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

However, the log of the task manager in question does not show any error or exception. Its last entry is:

2016-02-25 09:38:12,543 INFO org.apache.flink.runtime.iterative.task.IterationIntermediatePactTask - finishing iteration [2]: Combine (Reduce at org.apache.flink.ml.classification.SVM$$anon$25$$anonfun$6.apply(SVM.scala:392)) (3/91)

I am somewhat puzzled as to what could be the cause of this. Any help, or pointers to the appropriate documentation, would be greatly appreciated.
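For completeness, the configuration keys involved are roughly the following (the values are illustrative, not my exact settings):

# flink-conf.yaml (relevant excerpt, values illustrative)
taskmanager.heap.mb: 28672            # the full 28 GB per task manager
taskmanager.memory.fraction: 0.3      # reduced well below the 0.7 default
# heartbeat / timeout settings that could be raised (example values):
akka.watch.heartbeat.interval: 10 s
akka.watch.heartbeat.pause: 100 s
akka.ask.timeout: 100 s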
I'll try increasing the heartbeat intervals next, but would still like to understand what goes wrong here.

Best regards,
Christoph