[ 
https://issues.apache.org/jira/browse/FLINK-18167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327793#comment-17327793
 ] 

Flink Jira Bot commented on FLINK-18167:
----------------------------------------

This major issue is unassigned, and neither it nor any of its Sub-Tasks has 
been updated for 30 days. It has therefore been labeled "stale-major". If this 
ticket is indeed "major", please either assign yourself or give an update, and 
then remove the label. In 7 days the issue will be deprioritized.

> Flink Job hangs there when one vertex is failed and another is cancelled. 
> --------------------------------------------------------------------------
>
>                 Key: FLINK-18167
>                 URL: https://issues.apache.org/jira/browse/FLINK-18167
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.10.0
>            Reporter: Jeff Zhang
>            Priority: Major
>              Labels: stale-major
>         Attachments: image-2020-06-06-15-39-35-441.png
>
>
> After I call cancel with savepoint, the cancel operation fails. The 
> following is what I see on the client side. 
> {code:java}
> WARN [2020-06-06 13:45:16,003] ({Thread-1241} JobManager.java[cancelJob]:137) 
> - Fail to cancel job 7e5492f35c1a7f5dad7c805ba943ea52 that is associated with 
> paragraph paragraph_1586733868269_783581378
> java.util.concurrent.ExecutionException: 
> java.util.concurrent.CompletionException: 
> java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint 
> Coordinator is suspending.
>       at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
>       at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
>       at org.apache.zeppelin.flink.JobManager.cancelJob(JobManager.java:129)
>       at 
> org.apache.zeppelin.flink.FlinkScalaInterpreter.cancel(FlinkScalaInterpreter.scala:648)
>       at 
> org.apache.zeppelin.flink.FlinkInterpreter.cancel(FlinkInterpreter.java:101)
>       at 
> org.apache.zeppelin.interpreter.LazyOpenInterpreter.cancel(LazyOpenInterpreter.java:119)
>       at 
> org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer.lambda$cancel$1(RemoteInterpreterServer.java:800)
>       at java.lang.Thread.run(Thread.java:748)
> Caused by: java.util.concurrent.CompletionException: 
> java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint 
> Coordinator is suspending.
>       at 
> org.apache.flink.runtime.scheduler.SchedulerBase.lambda$stopWithSavepoint$9(SchedulerBase.java:873)
>       at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
>       at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
>       at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
>       at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
>       at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
>       at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>       at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
>       at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>       at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>       at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>       at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>       at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>       at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>       at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>       at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>       at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>       at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>       at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>       at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>       at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>       at 
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>       at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>       at 
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint 
> Coordinator is suspending.
>       at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$triggerSynchronousSavepoint$0(CheckpointCoordinator.java:428)
>       at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
>       at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
>       at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>       at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
>       at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$null$1(CheckpointCoordinator.java:457)
>       at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
>       at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
>       at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>       at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
>       at 
> org.apache.flink.runtime.checkpoint.PendingCheckpoint.abort(PendingCheckpoint.java:429)
>       at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.failPendingCheckpoint(CheckpointCoordinator.java:1445)
>       at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.failPendingCheckpoint(CheckpointCoordinator.java:1436)
>       at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoints(CheckpointCoordinator.java:1266)
>       at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.stopCheckpointScheduler(CheckpointCoordinator.java:1253)
>       at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinatorDeActivator.jobStatusChanges(CheckpointCoordinatorDeActivator.java:46)
>       at 
> org.apache.flink.runtime.executiongraph.ExecutionGraph.notifyJobStatusChange(ExecutionGraph.java:1654)
>       at 
> org.apache.flink.runtime.executiongraph.ExecutionGraph.transitionState(ExecutionGraph.java:1236)
>       at 
> org.apache.flink.runtime.executiongraph.ExecutionGraph.transitionState(ExecutionGraph.java:1214)
>       at 
> org.apache.flink.runtime.scheduler.SchedulerBase.transitionExecutionGraphState(SchedulerBase.java:421)
>       at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.addVerticesToRestartPending(DefaultScheduler.java:232)
>       at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.restartTasksWithDelay(DefaultScheduler.java:219)
>       at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeRestartTasks(DefaultScheduler.java:207)
>       at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.handleGlobalFailure(DefaultScheduler.java:202)
>       at 
> org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyGlobalFailure(UpdateSchedulerNgOnInternalFailuresListener.java:58)
>       at 
> org.apache.flink.runtime.executiongraph.ExecutionGraph.failGlobal(ExecutionGraph.java:1035)
>       at 
> org.apache.flink.runtime.executiongraph.ExecutionGraph$1.lambda$failJob$0(ExecutionGraph.java:468)
>       ... 22 more
> Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: 
> Checkpoint Coordinator is suspending.
>       at 
> org.apache.flink.runtime.checkpoint.PendingCheckpoint.abort(PendingCheckpoint.java:428)
>       ... 38 more
> ERROR [2020-06-06 13:45:16,007] ({Thread-1241} 
> RemoteInterpreterServer.java[lambda$cancel$1]:802) - Fail to cancel 
> paragraph: paragraph_1586733868269_783581378
>  WARN [2020-06-06 13:45:16,283] ({pool-1-thread-3} 
> JobManager.java[getJobProgress]:99) - Unable to get job progress for 
> paragraph: paragraph_1586733868269_783581378, because no job is associated 
> with this paragraph
>  INFO [2020-06-06 13:45:16,742] ({pool-6-thread-1} 
> AbstractStreamSqlJob.java[run]:245) - Refresh result of paragraph: 
> paragraph_1586847370895_154139610
>  WARN [2020-06-06 13:45:16,784] ({pool-1-thread-3} 
> JobManager.java[getJobProgress]:99) - Unable to get job progress for 
> paragraph: paragraph_1586733868269_783581378, because no job is associated 
> with this paragraph
>  WARN [2020-06-06 13:45:17,211] ({Thread-1240} 
> JobManager.java[cancelJob]:137) - Fail to cancel job 
> 7e5492f35c1a7f5dad7c805ba943ea52 that is associated with paragraph 
> paragraph_1586733868269_783581378
> java.util.concurrent.ExecutionException: 
> java.util.concurrent.CompletionException: 
> java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint 
> Coordinator is suspending.
>       at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
>       at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
>       at org.apache.zeppelin.flink.JobManager.cancelJob(JobManager.java:129)
>       at 
> org.apache.zeppelin.flink.FlinkScalaInterpreter.cancel(FlinkScalaInterpreter.scala:648)
>       at 
> org.apache.zeppelin.flink.FlinkInterpreter.cancel(FlinkInterpreter.java:101)
>       at 
> org.apache.zeppelin.interpreter.LazyOpenInterpreter.cancel(LazyOpenInterpreter.java:119)
>       at 
> org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer.lambda$cancel$1(RemoteInterpreterServer.java:800)
>       at java.lang.Thread.run(Thread.java:748) {code}
> But in the Flink web UI, I see that one vertex has failed and another has 
> been cancelled. 
> !image-2020-06-06-15-39-35-441.png!
> And when I call the REST API to check the status of this job, I see that the 
> job state is RUNNING. But the job just hangs there; it never recovers or does 
> anything else.
> {code:java}
> {jid: "cc69431798db3e8a3541b4ec4c020e5d",name: "UnnamedTable_select url, 
> count(1) as c from log group by url_0",isStoppable: false,state: 
> "RUNNING",start-time: 1591351246553,end-time: -1,duration: 77611856,now: 
> 1591428858409, {code}
>  
>  
>  
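
The mismatch described above (job state RUNNING while individual vertices are 
failed or cancelled) could be checked from the REST response with something 
like the sketch below. It is only an illustration, not part of the original 
report: the "vertices" and "status" field names are assumed from the usual 
shape of the job details response (the snippet above is cut off before that 
point), and the vertex ids and names in the sample are hypothetical.

```python
import json
import urllib.request

# Vertex states treated here as "no longer making progress". This is an
# assumption for the sketch, not an exhaustive list of execution states.
TERMINAL_VERTEX_STATES = {"FAILED", "CANCELED"}


def fetch_job_details(base_url, job_id):
    """GET <base_url>/jobs/<job_id> and return the parsed JSON body."""
    with urllib.request.urlopen(f"{base_url}/jobs/{job_id}") as resp:
        return json.load(resp)


def find_inconsistent_vertices(details):
    """Return names of failed/cancelled vertices if the job still reports RUNNING."""
    if details.get("state") != "RUNNING":
        return []
    return [
        v.get("name", v.get("id"))
        for v in details.get("vertices", [])
        if v.get("status") in TERMINAL_VERTEX_STATES
    ]


# Hand-written sample shaped like the symptom in the report; the vertex
# entries are hypothetical and were not part of the original output.
sample = {
    "jid": "cc69431798db3e8a3541b4ec4c020e5d",
    "state": "RUNNING",
    "vertices": [
        {"id": "v1", "name": "source", "status": "FAILED"},
        {"id": "v2", "name": "sink", "status": "CANCELED"},
    ],
}
print(find_inconsistent_vertices(sample))
```

Against a live cluster, the result of fetch_job_details(base_url, job_id) would 
be passed to find_inconsistent_vertices instead of the sample; a non-empty 
list while the job reports RUNNING matches the hang described in this issue.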



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
