[ https://issues.apache.org/jira/browse/FLINK-18167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327793#comment-17327793 ]
Flink Jira Bot commented on FLINK-18167:
----------------------------------------

This major issue is unassigned, and neither it nor any of its Sub-Tasks has been updated for 30 days, so it has been labeled "stale-major". If this ticket is indeed "major", please either assign yourself or give an update. Afterwards, please remove the label. In 7 days the issue will be deprioritized.

> Flink Job hangs there when one vertex is failed and another is cancelled.
> --------------------------------------------------------------------------
>
>                 Key: FLINK-18167
>                 URL: https://issues.apache.org/jira/browse/FLINK-18167
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.10.0
>            Reporter: Jeff Zhang
>            Priority: Major
>              Labels: stale-major
>         Attachments: image-2020-06-06-15-39-35-441.png
>
>
> After I call cancel with savepoint, the cancel operation fails. The
> following is what I see on the client side.
> {code:java}
> WARN [2020-06-06 13:45:16,003] ({Thread-1241} JobManager.java[cancelJob]:137) - Fail to cancel job 7e5492f35c1a7f5dad7c805ba943ea52 that is associated with paragraph paragraph_1586733868269_783581378
> java.util.concurrent.ExecutionException: java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint Coordinator is suspending.
>     at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
>     at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
>     at org.apache.zeppelin.flink.JobManager.cancelJob(JobManager.java:129)
>     at org.apache.zeppelin.flink.FlinkScalaInterpreter.cancel(FlinkScalaInterpreter.scala:648)
>     at org.apache.zeppelin.flink.FlinkInterpreter.cancel(FlinkInterpreter.java:101)
>     at org.apache.zeppelin.interpreter.LazyOpenInterpreter.cancel(LazyOpenInterpreter.java:119)
>     at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer.lambda$cancel$1(RemoteInterpreterServer.java:800)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint Coordinator is suspending.
>     at org.apache.flink.runtime.scheduler.SchedulerBase.lambda$stopWithSavepoint$9(SchedulerBase.java:873)
>     at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
>     at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
>     at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
>     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>     at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>     at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>     at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>     at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>     at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>     at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>     at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint Coordinator is suspending.
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$triggerSynchronousSavepoint$0(CheckpointCoordinator.java:428)
>     at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
>     at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$null$1(CheckpointCoordinator.java:457)
>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
>     at org.apache.flink.runtime.checkpoint.PendingCheckpoint.abort(PendingCheckpoint.java:429)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.failPendingCheckpoint(CheckpointCoordinator.java:1445)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.failPendingCheckpoint(CheckpointCoordinator.java:1436)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoints(CheckpointCoordinator.java:1266)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.stopCheckpointScheduler(CheckpointCoordinator.java:1253)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinatorDeActivator.jobStatusChanges(CheckpointCoordinatorDeActivator.java:46)
>     at org.apache.flink.runtime.executiongraph.ExecutionGraph.notifyJobStatusChange(ExecutionGraph.java:1654)
>     at org.apache.flink.runtime.executiongraph.ExecutionGraph.transitionState(ExecutionGraph.java:1236)
>     at org.apache.flink.runtime.executiongraph.ExecutionGraph.transitionState(ExecutionGraph.java:1214)
>     at org.apache.flink.runtime.scheduler.SchedulerBase.transitionExecutionGraphState(SchedulerBase.java:421)
>     at org.apache.flink.runtime.scheduler.DefaultScheduler.addVerticesToRestartPending(DefaultScheduler.java:232)
>     at org.apache.flink.runtime.scheduler.DefaultScheduler.restartTasksWithDelay(DefaultScheduler.java:219)
>     at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeRestartTasks(DefaultScheduler.java:207)
>     at org.apache.flink.runtime.scheduler.DefaultScheduler.handleGlobalFailure(DefaultScheduler.java:202)
>     at org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyGlobalFailure(UpdateSchedulerNgOnInternalFailuresListener.java:58)
>     at org.apache.flink.runtime.executiongraph.ExecutionGraph.failGlobal(ExecutionGraph.java:1035)
>     at org.apache.flink.runtime.executiongraph.ExecutionGraph$1.lambda$failJob$0(ExecutionGraph.java:468)
>     ... 22 more
> Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint Coordinator is suspending.
>     at org.apache.flink.runtime.checkpoint.PendingCheckpoint.abort(PendingCheckpoint.java:428)
>     ... 38 more
> ERROR [2020-06-06 13:45:16,007] ({Thread-1241} RemoteInterpreterServer.java[lambda$cancel$1]:802) - Fail to cancel paragraph: paragraph_1586733868269_783581378
> WARN [2020-06-06 13:45:16,283] ({pool-1-thread-3} JobManager.java[getJobProgress]:99) - Unable to get job progress for paragraph: paragraph_1586733868269_783581378, because no job is associated with this paragraph
> INFO [2020-06-06 13:45:16,742] ({pool-6-thread-1} AbstractStreamSqlJob.java[run]:245) - Refresh result of paragraph: paragraph_1586847370895_154139610
> WARN [2020-06-06 13:45:16,784] ({pool-1-thread-3} JobManager.java[getJobProgress]:99) - Unable to get job progress for paragraph: paragraph_1586733868269_783581378, because no job is associated with this paragraph
> WARN [2020-06-06 13:45:17,211] ({Thread-1240} JobManager.java[cancelJob]:137) - Fail to cancel job 7e5492f35c1a7f5dad7c805ba943ea52 that is associated with paragraph paragraph_1586733868269_783581378
> java.util.concurrent.ExecutionException: java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint Coordinator is suspending.
>     at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
>     at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
>     at org.apache.zeppelin.flink.JobManager.cancelJob(JobManager.java:129)
>     at org.apache.zeppelin.flink.FlinkScalaInterpreter.cancel(FlinkScalaInterpreter.scala:648)
>     at org.apache.zeppelin.flink.FlinkInterpreter.cancel(FlinkInterpreter.java:101)
>     at org.apache.zeppelin.interpreter.LazyOpenInterpreter.cancel(LazyOpenInterpreter.java:119)
>     at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer.lambda$cancel$1(RemoteInterpreterServer.java:800)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
> But in the Flink web UI, I see that one vertex has failed and another has
> been cancelled.
> !image-2020-06-06-15-39-35-441.png!
> And when I call the REST API to check the status of this job, I see that the
> job state is RUNNING. But the job just hangs there and never recovers or does
> anything else.
> {code:java}
> {jid: "cc69431798db3e8a3541b4ec4c020e5d",name: "UnnamedTable_select url, count(1) as c from log group by url_0",isStoppable: false,state: "RUNNING",start-time: 1591351246553,end-time: -1,duration: 77611856,now: 1591428858409,
> {code}
>
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)