[jira] [Commented] (FLINK-2133) Possible deadlock in ExecutionGraph

Stephan Ewen (JIRA) Fri, 05 Jun 2015 12:06:49 -0700

    [ 
https://issues.apache.org/jira/browse/FLINK-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575036#comment-14575036
 ]


Stephan Ewen commented on FLINK-2133:
-------------------------------------

Okay, I found a plausible scenario for how this can happen (it is a very tight 
race):

  - During canceling, the {{ExecutionJobVertices}} cancel simultaneously 
({{vertex 1}} and {{vertex 2}})
  - {{Vertex 1}} transitions into its final state
  - In the ExecutionGraph, that thread advances the counter for the next vertex 
to check/wait for to {{vertex 2}} and checks whether that one is already finished
  - {{Vertex 2}} has just finished canceling its final subtask and has reached 
the point where it has incremented the number of terminal subtasks 
(ExecutionJobVertex, after line 448 but before line 454)
  - The thread that finished {{vertex 1}} therefore considers {{vertex 2}} 
terminal and marks the entire job as complete. It triggers the restart.
  - {{Vertex 2}} tries to tell the ExecutionGraph that it reached a terminal 
state, but can no longer acquire the lock it needs in order to learn that its 
transition to terminal has already been registered.

==> Deadlock
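
The lock pattern from the thread dump can be reproduced in isolation. The 
following is a minimal sketch, not Flink code: {{LockOrderInversionSketch}}, 
{{graphLock}}, {{vertexLock}} and {{reproduceInversion}} are hypothetical 
stand-ins, and {{tryLock}} with a timeout replaces the real {{synchronized}} 
blocks so that the demonstration reports the inversion instead of hanging. 
Latches force the interleaving that the real race only hits by chance.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

final class LockOrderInversionSketch {

    /** Takes 'first', then tries 'second' while the other thread does the opposite. */
    private static Thread worker(ReentrantLock first, ReentrantLock second,
                                 CountDownLatch bothHoldFirst, CountDownLatch bothTried,
                                 boolean[] gotSecond, int slot) {
        return new Thread(() -> {
            first.lock();
            try {
                bothHoldFirst.countDown();
                bothHoldFirst.await();   // wait until both threads hold their first lock
                if (second.tryLock(200, TimeUnit.MILLISECONDS)) {
                    gotSecond[slot] = true;
                    second.unlock();
                }
                bothTried.countDown();
                bothTried.await();       // keep 'first' until both attempts have ended
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                first.unlock();
            }
        });
    }

    /** Returns true if neither thread could acquire the lock held by the other. */
    static boolean reproduceInversion() {
        // Hypothetical stand-ins for the two monitors in the thread dump.
        ReentrantLock graphLock = new ReentrantLock();
        ReentrantLock vertexLock = new ReentrantLock();
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        boolean[] gotSecond = new boolean[2];

        // Like restart(): graph lock first, then the vertex lock.
        Thread restarter = worker(graphLock, vertexLock, bothHoldFirst, bothTried, gotSecond, 0);
        // Like subtaskInFinalState(): vertex lock first, then the graph lock.
        Thread canceller = worker(vertexLock, graphLock, bothHoldFirst, bothTried, gotSecond, 1);

        restarter.start();
        canceller.start();
        try {
            restarter.join();
            canceller.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return !gotSecond[0] && !gotSecond[1];
    }

    public static void main(String[] args) {
        System.out.println("both threads blocked: " + reproduceInversion());
    }
}
```

With plain {{synchronized}} blocks instead of timed {{tryLock}} calls, the same 
interleaving would leave both threads waiting forever, which is what the 
thread dump shows.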

There is a simple way to fix this, but I am not sure whether there is any 
reasonable way to test it. It seems one would need to provoke this extremely 
precisely timed race between the threads to reproduce the situation.
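
For illustration only, here is one generic remedy for this kind of lock-order 
inversion, which is not necessarily the fix meant above: make the decision 
under the vertex lock, but call into the graph only after releasing it, so 
that the two locks are only ever nested in the order graph lock, then vertex 
lock. All names in this sketch ({{OpenCallSketch}}, {{runsToCompletion}}, the 
counters) are hypothetical and only loosely mirror the methods in the stack 
trace.

```java
import java.util.concurrent.locks.ReentrantLock;

final class OpenCallSketch {

    private final ReentrantLock graphLock = new ReentrantLock();
    private final ReentrantLock vertexLock = new ReentrantLock();
    private final int parallelism;
    private int terminalSubtasks;   // guarded by vertexLock
    private int terminalVertices;   // guarded by graphLock

    OpenCallSketch(int parallelism) {
        this.parallelism = parallelism;
    }

    /** Counts a finished subtask, then notifies the graph WITHOUT holding vertexLock. */
    void subtaskInFinalState() {
        boolean vertexDone;
        vertexLock.lock();
        try {
            terminalSubtasks++;
            vertexDone = terminalSubtasks == parallelism;
        } finally {
            vertexLock.unlock();
        }
        if (vertexDone) {
            jobVertexInFinalState();   // open call: no vertex lock held here
        }
    }

    void jobVertexInFinalState() {
        graphLock.lock();
        try {
            terminalVertices++;
        } finally {
            graphLock.unlock();
        }
    }

    /** Nests the locks in the only permitted order: graph lock, then vertex lock. */
    void restart() {
        graphLock.lock();
        try {
            terminalVertices = 0;
            vertexLock.lock();
            try {
                terminalSubtasks = 0;
            } finally {
                vertexLock.unlock();
            }
        } finally {
            graphLock.unlock();
        }
    }

    /** Races the two paths repeatedly; returns false if a thread fails to finish in time. */
    static boolean runsToCompletion(int rounds) {
        for (int i = 0; i < rounds; i++) {
            OpenCallSketch sketch = new OpenCallSketch(1);
            Thread canceller = new Thread(sketch::subtaskInFinalState);
            Thread restarter = new Thread(sketch::restart);
            canceller.start();
            restarter.start();
            try {
                canceller.join(5000);
                restarter.join(5000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            if (canceller.isAlive() || restarter.isAlive()) {
                return false;
            }
        }
        return true;
    }
}
```

The trade-off is that the notification can now arrive after a restart has 
already reset the state, so the graph side would have to tolerate such stale 
calls. A stress loop like {{runsToCompletion}} also does not prove the absence 
of the race. It only shows that this lock ordering cannot block.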

> Possible deadlock in ExecutionGraph
> -----------------------------------
>
>                 Key: FLINK-2133
>                 URL: https://issues.apache.org/jira/browse/FLINK-2133
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Aljoscha Krettek
>
> I had the following output on Travis:
> {code}
> Found one Java-level deadlock:
> =============================
> "ForkJoinPool-1-worker-3":
>   waiting to lock monitor 0x00007f1c54af7eb8 (object 0x00000000d77fa8c0, a 
> org.apache.flink.runtime.util.SerializableObject),
>   which is held by "flink-akka.actor.default-dispatcher-4"
> "flink-akka.actor.default-dispatcher-4":
>   waiting to lock monitor 0x00007f1c5486aca0 (object 0x00000000d77fa218, a 
> org.apache.flink.runtime.util.SerializableObject),
>   which is held by "ForkJoinPool-1-worker-3"
> Java stack information for the threads listed above:
> ===================================================
> "ForkJoinPool-1-worker-3":
>       at 
> org.apache.flink.runtime.executiongraph.ExecutionJobVertex.resetForNewExecution(ExecutionJobVertex.java:338)
>       - waiting to lock <0x00000000d77fa8c0> (a 
> org.apache.flink.runtime.util.SerializableObject)
>       at 
> org.apache.flink.runtime.executiongraph.ExecutionGraph.restart(ExecutionGraph.java:595)
>       - locked <0x00000000d77fa218> (a 
> org.apache.flink.runtime.util.SerializableObject)
>       at 
> org.apache.flink.runtime.executiongraph.ExecutionGraph$3.call(ExecutionGraph.java:733)
>       at akka.dispatch.Futures$$anonfun$future$1.apply(Future.scala:94)
>       at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>       at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>       at 
> scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
>       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>       at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> "flink-akka.actor.default-dispatcher-4":
>       at 
> org.apache.flink.runtime.executiongraph.ExecutionGraph.jobVertexInFinalState(ExecutionGraph.java:683)
>       - waiting to lock <0x00000000d77fa218> (a 
> org.apache.flink.runtime.util.SerializableObject)
>       at 
> org.apache.flink.runtime.executiongraph.ExecutionJobVertex.subtaskInFinalState(ExecutionJobVertex.java:454)
>       - locked <0x00000000d77fa8c0> (a 
> org.apache.flink.runtime.util.SerializableObject)
>       at 
> org.apache.flink.runtime.executiongraph.ExecutionJobVertex.vertexCancelled(ExecutionJobVertex.java:426)
>       at 
> org.apache.flink.runtime.executiongraph.ExecutionVertex.executionCanceled(ExecutionVertex.java:565)
>       at 
> org.apache.flink.runtime.executiongraph.Execution.cancelingComplete(Execution.java:653)
>       at 
> org.apache.flink.runtime.executiongraph.ExecutionGraph.updateState(ExecutionGraph.java:784)
>       at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1$$anonfun$applyOrElse$2.apply$mcV$sp(JobManager.scala:220)
>       at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1$$anonfun$applyOrElse$2.apply(JobManager.scala:219)
>       at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1$$anonfun$applyOrElse$2.apply(JobManager.scala:219)
>       at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>       at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>       at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
>       at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
>       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>       at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Found 1 deadlock.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
