[ https://issues.apache.org/jira/browse/FLINK-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569062#comment-14569062 ]

Ufuk Celebi commented on FLINK-2133:
------------------------------------

I've looked at the ExecutionGraph and this seems to be a simple deadlock due to the ordering of lock acquisitions. Two tasks of the same JobVertex acquire the locks in the following order:

- T1 (ForkJoinPool-1-worker-3): ExecutionGraph#restart() acquires ExecutionGraph#progressLock => ExecutionJobVertex#reset() acquires ExecutionJobVertex#stateMonitor
- T2 (flink-akka.actor.default-dispatcher-4): ExecutionJobVertex#subtaskInFinalState() acquires ExecutionJobVertex#stateMonitor to cancel the task => ExecutionGraph#jobVertexInFinalState() acquires ExecutionGraph#progressLock

I think that both messages have to be triggered by the same task, because both actions should only happen for the final vertex. I think the two messages are cancel (transition to canceling) and canceling complete (transition to canceled).


> Possible deadlock in ExecutionGraph
> -----------------------------------
>
>                 Key: FLINK-2133
>                 URL: https://issues.apache.org/jira/browse/FLINK-2133
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Aljoscha Krettek
>
> I had the following output on Travis:
> {code}
> Found one Java-level deadlock:
> =============================
> "ForkJoinPool-1-worker-3":
>   waiting to lock monitor 0x00007f1c54af7eb8 (object 0x00000000d77fa8c0, a org.apache.flink.runtime.util.SerializableObject),
>   which is held by "flink-akka.actor.default-dispatcher-4"
> "flink-akka.actor.default-dispatcher-4":
>   waiting to lock monitor 0x00007f1c5486aca0 (object 0x00000000d77fa218, a org.apache.flink.runtime.util.SerializableObject),
>   which is held by "ForkJoinPool-1-worker-3"
> Java stack information for the threads listed above:
> ===================================================
> "ForkJoinPool-1-worker-3":
> 	at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.resetForNewExecution(ExecutionJobVertex.java:338)
> 	- waiting to lock <0x00000000d77fa8c0> (a org.apache.flink.runtime.util.SerializableObject)
> 	at org.apache.flink.runtime.executiongraph.ExecutionGraph.restart(ExecutionGraph.java:595)
> 	- locked <0x00000000d77fa218> (a org.apache.flink.runtime.util.SerializableObject)
> 	at org.apache.flink.runtime.executiongraph.ExecutionGraph$3.call(ExecutionGraph.java:733)
> 	at akka.dispatch.Futures$$anonfun$future$1.apply(Future.scala:94)
> 	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
> 	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
> 	at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
> 	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> 	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> "flink-akka.actor.default-dispatcher-4":
> 	at org.apache.flink.runtime.executiongraph.ExecutionGraph.jobVertexInFinalState(ExecutionGraph.java:683)
> 	- waiting to lock <0x00000000d77fa218> (a org.apache.flink.runtime.util.SerializableObject)
> 	at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.subtaskInFinalState(ExecutionJobVertex.java:454)
> 	- locked <0x00000000d77fa8c0> (a org.apache.flink.runtime.util.SerializableObject)
> 	at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.vertexCancelled(ExecutionJobVertex.java:426)
> 	at org.apache.flink.runtime.executiongraph.ExecutionVertex.executionCanceled(ExecutionVertex.java:565)
> 	at org.apache.flink.runtime.executiongraph.Execution.cancelingComplete(Execution.java:653)
> 	at org.apache.flink.runtime.executiongraph.ExecutionGraph.updateState(ExecutionGraph.java:784)
> 	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1$$anonfun$applyOrElse$2.apply$mcV$sp(JobManager.scala:220)
> 	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1$$anonfun$applyOrElse$2.apply(JobManager.scala:219)
> 	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1$$anonfun$applyOrElse$2.apply(JobManager.scala:219)
> 	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
> 	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
> 	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
> 	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
> 	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
> 	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
> 	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Found 1 deadlock.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
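
For illustration, the lock-order inversion described in the comment can be reduced to two monitors. The sketch below is not Flink code and not necessarily the fix applied for FLINK-2133. The class `LockOrderSketch`, its counters, and `runConcurrently` are hypothetical; only the names `progressLock` and `stateMonitor` are borrowed from the comment. It keeps T1's order (graph lock, then vertex lock) and changes the T2 path so that the vertex lock is released before the graph lock is taken, which removes the cycle.

```java
/** Hypothetical stand-in for the two monitors named in the comment; not Flink code. */
class LockOrderSketch {
    // Plays the role of ExecutionGraph#progressLock.
    private final Object progressLock = new Object();
    // Plays the role of ExecutionJobVertex#stateMonitor.
    private final Object stateMonitor = new Object();

    private int restarts = 0;                 // guarded by progressLock
    private int finalStateNotifications = 0;  // guarded by progressLock
    private int finishedSubtasks = 0;         // guarded by stateMonitor

    /** T1 path: graph lock first, then vertex lock (same order as in the dump). */
    void restart() {
        synchronized (progressLock) {
            restarts++;
            synchronized (stateMonitor) {
                finishedSubtasks = 0;
            }
        }
    }

    /**
     * T2 path, reordered: update vertex state under stateMonitor, release it,
     * and only then take progressLock. Holding stateMonitor while taking
     * progressLock (as in the dump) is what closes the cycle with restart().
     */
    void subtaskInFinalState() {
        synchronized (stateMonitor) {
            finishedSubtasks++;
        }
        synchronized (progressLock) {
            finalStateNotifications++;
        }
    }

    /** Runs both paths concurrently; returns true if both threads finish in time. */
    static boolean runConcurrently(int iterations) {
        LockOrderSketch graph = new LockOrderSketch();
        Thread t1 = new Thread(() -> {
            for (int i = 0; i < iterations; i++) graph.restart();
        }, "T1");
        Thread t2 = new Thread(() -> {
            for (int i = 0; i < iterations; i++) graph.subtaskInFinalState();
        }, "T2");
        // Daemon threads so a deadlock in a modified version cannot hang the JVM.
        t1.setDaemon(true);
        t2.setDaemon(true);
        t1.start();
        t2.start();
        try {
            t1.join(10_000L);
            t2.join(10_000L);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !t1.isAlive() && !t2.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("finished without deadlock: " + runConcurrently(100_000));
    }
}
```

The trade-off in this sketch is that the vertex state can change between releasing `stateMonitor` and acquiring `progressLock`, so the graph-level step has to tolerate stale information. The alternative is to acquire `progressLock` before `stateMonitor` on every path.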