[ https://issues.apache.org/jira/browse/FLINK-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575036#comment-14575036 ]
Stephan Ewen commented on FLINK-2133:
-------------------------------------

Okay, I found a plausible scenario for how this can happen (it is a very tight race):

 - During canceling, the {{ExecutionJobVertices}} cancel simultaneously ({{vertex 1}} and {{vertex 2}}).
 - {{Vertex 1}} transitions into its final state.
 - In the {{ExecutionGraph}}, it advances the counter of the next vertex to check/wait for to {{vertex 2}} and checks whether that one is finished already.
 - {{Vertex 2}} is just done canceling its final subtask and has reached the point where it increments the number of terminal subtasks ({{ExecutionJobVertex}}, after line 448 but before line 454).
 - The thread that finished {{vertex 1}} sees that {{vertex 2}} is considered terminal and marks the job as entirely complete. It triggers the restart.
 - {{Vertex 2}} tries to tell the {{ExecutionGraph}} that it reached a terminal state, but can no longer acquire the lock it needs in order to learn that its transition to terminal has already been registered.

==> Deadlock

There is a simple way to fix this, but I am not sure if there is any reasonable way to test it. It seems one would need to provoke an extremely precisely timed race between the threads to reproduce that situation.
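To make the lock ordering in this scenario concrete, here is a minimal Java sketch. It is not Flink's code and not necessarily the fix meant above: the class {{LockOrderSketch}}, its nested {{Graph}} and {{Vertex}} classes, and the helper {{runConcurrently}} are hypothetical stand-ins, and only the method names {{restart}}, {{resetForNewExecution}}, {{subtaskInFinalState}} and {{jobVertexInFinalState}} are borrowed from the stack trace. The restart path takes the graph lock and then the vertex lock. In the trace, the cancel path takes them in the opposite order. One common way to break such a cycle, assumed here, is to decide under the vertex lock but notify the graph only after releasing it.

```java
// Hypothetical sketch: names mirror the stack trace, but this is NOT Flink's code or its actual fix.
class LockOrderSketch {

    /** Stand-in for the ExecutionGraph side (assumed structure). */
    static final class Graph {
        private final Object graphLock = new Object();
        private int verticesInFinalState;

        // Called by a vertex once all of its subtasks are terminal.
        void jobVertexInFinalState() {
            synchronized (graphLock) {
                verticesInFinalState++;
            }
        }

        // Holds the graph lock while resetting vertices: graph lock, then vertex lock.
        void restart(Vertex... vertices) {
            synchronized (graphLock) {
                for (Vertex v : vertices) {
                    v.resetForNewExecution();
                }
                verticesInFinalState = 0;
            }
        }

        int finalCount() {
            synchronized (graphLock) {
                return verticesInFinalState;
            }
        }
    }

    /** Stand-in for the ExecutionJobVertex side (assumed structure). */
    static final class Vertex {
        private final Object stateMonitor = new Object();
        private final Graph graph;
        private final int parallelism;
        private int terminalSubtasks;

        Vertex(Graph graph, int parallelism) {
            this.graph = graph;
            this.parallelism = parallelism;
        }

        void subtaskInFinalState() {
            boolean allTerminal;
            synchronized (stateMonitor) {
                terminalSubtasks++;
                allTerminal = (terminalSubtasks == parallelism);
            }
            // The vertex lock is released before the graph lock is requested,
            // so this thread never holds both locks at the same time.
            if (allTerminal) {
                graph.jobVertexInFinalState();
            }
        }

        void resetForNewExecution() {
            synchronized (stateMonitor) {
                terminalSubtasks = 0;
            }
        }

        int terminalSubtasks() {
            synchronized (stateMonitor) {
                return terminalSubtasks;
            }
        }
    }

    /** Runs the cancel path and the restart path against each other; true if both finish. */
    static boolean runConcurrently(int rounds) {
        Graph graph = new Graph();
        Vertex vertex = new Vertex(graph, 1);
        Thread canceller = new Thread(() -> {
            for (int i = 0; i < rounds; i++) {
                vertex.subtaskInFinalState();
            }
        });
        Thread restarter = new Thread(() -> {
            for (int i = 0; i < rounds; i++) {
                graph.restart(vertex);
            }
        });
        canceller.setDaemon(true);
        restarter.setDaemon(true);
        canceller.start();
        restarter.start();
        try {
            canceller.join(10_000);
            restarter.join(10_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !canceller.isAlive() && !restarter.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("finished without deadlock: " + runConcurrently(100_000));
    }
}
```

The trade-off in this sketch is that the notification can now reach the graph after a concurrent restart has already reset the vertex, so a real fix would also have to tolerate such stale notifications. Moving the {{graph.jobVertexInFinalState()}} call back inside the {{synchronized (stateMonitor)}} block restores the cyclic lock order shown in the trace.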

> Possible deadlock in ExecutionGraph
> -----------------------------------
>
>                 Key: FLINK-2133
>                 URL: https://issues.apache.org/jira/browse/FLINK-2133
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Aljoscha Krettek
>
> I had the following output on Travis:
> {code}
> Found one Java-level deadlock:
> =============================
> "ForkJoinPool-1-worker-3":
>   waiting to lock monitor 0x00007f1c54af7eb8 (object 0x00000000d77fa8c0, a org.apache.flink.runtime.util.SerializableObject),
>   which is held by "flink-akka.actor.default-dispatcher-4"
> "flink-akka.actor.default-dispatcher-4":
>   waiting to lock monitor 0x00007f1c5486aca0 (object 0x00000000d77fa218, a org.apache.flink.runtime.util.SerializableObject),
>   which is held by "ForkJoinPool-1-worker-3"
> Java stack information for the threads listed above:
> ===================================================
> "ForkJoinPool-1-worker-3":
> 	at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.resetForNewExecution(ExecutionJobVertex.java:338)
> 	- waiting to lock <0x00000000d77fa8c0> (a org.apache.flink.runtime.util.SerializableObject)
> 	at org.apache.flink.runtime.executiongraph.ExecutionGraph.restart(ExecutionGraph.java:595)
> 	- locked <0x00000000d77fa218> (a org.apache.flink.runtime.util.SerializableObject)
> 	at org.apache.flink.runtime.executiongraph.ExecutionGraph$3.call(ExecutionGraph.java:733)
> 	at akka.dispatch.Futures$$anonfun$future$1.apply(Future.scala:94)
> 	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
> 	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
> 	at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
> 	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> 	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> "flink-akka.actor.default-dispatcher-4":
> 	at org.apache.flink.runtime.executiongraph.ExecutionGraph.jobVertexInFinalState(ExecutionGraph.java:683)
> 	- waiting to lock <0x00000000d77fa218> (a org.apache.flink.runtime.util.SerializableObject)
> 	at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.subtaskInFinalState(ExecutionJobVertex.java:454)
> 	- locked <0x00000000d77fa8c0> (a org.apache.flink.runtime.util.SerializableObject)
> 	at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.vertexCancelled(ExecutionJobVertex.java:426)
> 	at org.apache.flink.runtime.executiongraph.ExecutionVertex.executionCanceled(ExecutionVertex.java:565)
> 	at org.apache.flink.runtime.executiongraph.Execution.cancelingComplete(Execution.java:653)
> 	at org.apache.flink.runtime.executiongraph.ExecutionGraph.updateState(ExecutionGraph.java:784)
> 	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1$$anonfun$applyOrElse$2.apply$mcV$sp(JobManager.scala:220)
> 	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1$$anonfun$applyOrElse$2.apply(JobManager.scala:219)
> 	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1$$anonfun$applyOrElse$2.apply(JobManager.scala:219)
> 	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
> 	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
> 	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
> 	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
> 	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
> 	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
> 	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Found 1 deadlock.
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)