[ 
https://issues.apache.org/jira/browse/FLINK-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569062#comment-14569062
 ] 

Ufuk Celebi commented on FLINK-2133:
------------------------------------

I've looked at the ExecutionGraph and this seems to be a simple deadlock due to 
the ordering of lock acquisitions.

Two threads operating on the same JobVertex acquire the locks in opposite 
orders:
- T1 (ForkJoinPool-1-worker-3): ExecutionGraph#restart() acquires 
ExecutionGraph#progressLock => ExecutionJobVertex#resetForNewExecution() 
acquires ExecutionJobVertex#stateMonitor
- T2 (flink-akka.actor.default-dispatcher-4): 
ExecutionJobVertex#subtaskInFinalState() acquires 
ExecutionJobVertex#stateMonitor to cancel the task => 
ExecutionGraph#jobVertexInFinalState() acquires ExecutionGraph#progressLock
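
To illustrate the usual remedy for this kind of lock-order inversion, here is a 
minimal sketch in which both paths take the two monitors in the same order 
(progressLock first, then stateMonitor). This is not the actual Flink code or 
necessarily the fix that will be applied: the class LockOrderingSketch, the 
helper methods runBoth and loop, and the counter are hypothetical, and the two 
Object fields only stand in for ExecutionGraph#progressLock and 
ExecutionJobVertex#stateMonitor.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class LockOrderingSketch {
    // Hypothetical stand-ins for ExecutionGraph#progressLock and
    // ExecutionJobVertex#stateMonitor; not the real Flink fields.
    private final Object progressLock = new Object();
    private final Object stateMonitor = new Object();
    private final AtomicInteger completed = new AtomicInteger();

    // Restart path (T1): progressLock first, then stateMonitor.
    void restart() {
        synchronized (progressLock) {
            synchronized (stateMonitor) {
                completed.incrementAndGet();
            }
        }
    }

    // Final-state path (T2), reordered for this sketch so that it also takes
    // progressLock before stateMonitor. With one global order, no cycle of
    // waiting threads can form.
    void subtaskInFinalState() {
        synchronized (progressLock) {
            synchronized (stateMonitor) {
                completed.incrementAndGet();
            }
        }
    }

    // Runs both paths concurrently and returns how many calls completed.
    public static int runBoth(int iterations) {
        LockOrderingSketch sketch = new LockOrderingSketch();
        CountDownLatch start = new CountDownLatch(1);
        Thread t1 = new Thread(() -> loop(start, iterations, sketch::restart));
        Thread t2 = new Thread(() -> loop(start, iterations, sketch::subtaskInFinalState));
        t1.start();
        t2.start();
        start.countDown(); // release both threads at the same time
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sketch.completed.get();
    }

    // Waits for the start signal, then runs the action repeatedly.
    private static void loop(CountDownLatch start, int iterations, Runnable action) {
        try {
            start.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
        }
        for (int i = 0; i < iterations; i++) {
            action.run();
        }
    }

    public static void main(String[] args) {
        System.out.println("completed = " + runBoth(1000));
    }
}
```

If subtaskInFinalState() instead took stateMonitor first, as in the T2 trace, 
the two threads could each hold one monitor while waiting for the other. An 
alternative to reordering would be to release stateMonitor before calling into 
the graph, but whether that is safe depends on invariants of the real code 
that this sketch does not model.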

I think both messages must have been triggered by the same task, because both 
actions should only happen for the final vertex: presumably the cancel call 
(transition to canceling) and the canceling-complete message (transition to 
canceled).

> Possible deadlock in ExecutionGraph
> -----------------------------------
>
>                 Key: FLINK-2133
>                 URL: https://issues.apache.org/jira/browse/FLINK-2133
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Aljoscha Krettek
>
> I had the following output on Travis:
> {code}
> Found one Java-level deadlock:
> =============================
> "ForkJoinPool-1-worker-3":
>   waiting to lock monitor 0x00007f1c54af7eb8 (object 0x00000000d77fa8c0, a 
> org.apache.flink.runtime.util.SerializableObject),
>   which is held by "flink-akka.actor.default-dispatcher-4"
> "flink-akka.actor.default-dispatcher-4":
>   waiting to lock monitor 0x00007f1c5486aca0 (object 0x00000000d77fa218, a 
> org.apache.flink.runtime.util.SerializableObject),
>   which is held by "ForkJoinPool-1-worker-3"
> Java stack information for the threads listed above:
> ===================================================
> "ForkJoinPool-1-worker-3":
>       at 
> org.apache.flink.runtime.executiongraph.ExecutionJobVertex.resetForNewExecution(ExecutionJobVertex.java:338)
>       - waiting to lock <0x00000000d77fa8c0> (a 
> org.apache.flink.runtime.util.SerializableObject)
>       at 
> org.apache.flink.runtime.executiongraph.ExecutionGraph.restart(ExecutionGraph.java:595)
>       - locked <0x00000000d77fa218> (a 
> org.apache.flink.runtime.util.SerializableObject)
>       at 
> org.apache.flink.runtime.executiongraph.ExecutionGraph$3.call(ExecutionGraph.java:733)
>       at akka.dispatch.Futures$$anonfun$future$1.apply(Future.scala:94)
>       at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>       at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>       at 
> scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
>       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>       at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> "flink-akka.actor.default-dispatcher-4":
>       at 
> org.apache.flink.runtime.executiongraph.ExecutionGraph.jobVertexInFinalState(ExecutionGraph.java:683)
>       - waiting to lock <0x00000000d77fa218> (a 
> org.apache.flink.runtime.util.SerializableObject)
>       at 
> org.apache.flink.runtime.executiongraph.ExecutionJobVertex.subtaskInFinalState(ExecutionJobVertex.java:454)
>       - locked <0x00000000d77fa8c0> (a 
> org.apache.flink.runtime.util.SerializableObject)
>       at 
> org.apache.flink.runtime.executiongraph.ExecutionJobVertex.vertexCancelled(ExecutionJobVertex.java:426)
>       at 
> org.apache.flink.runtime.executiongraph.ExecutionVertex.executionCanceled(ExecutionVertex.java:565)
>       at 
> org.apache.flink.runtime.executiongraph.Execution.cancelingComplete(Execution.java:653)
>       at 
> org.apache.flink.runtime.executiongraph.ExecutionGraph.updateState(ExecutionGraph.java:784)
>       at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1$$anonfun$applyOrElse$2.apply$mcV$sp(JobManager.scala:220)
>       at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1$$anonfun$applyOrElse$2.apply(JobManager.scala:219)
>       at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1$$anonfun$applyOrElse$2.apply(JobManager.scala:219)
>       at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>       at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>       at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
>       at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
>       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>       at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Found 1 deadlock.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
