I notice that sometimes when I try to cancel a Flink job with savepoint,
the cancel fails with the following error:

org.apache.flink.util.FlinkException: Could not cancel job
3be3d380dca9bb6a5cf0d559d54d7ff8.
        at
org.apache.flink.client.cli.CliFrontend.lambda$cancel$4(CliFrontend.java:581)
        at
org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:955)
        at
org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:573)
        at
org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1029)
        at
org.apache.flink.client.cli.CliFrontend.lambda$main$9(CliFrontend.java:1096)
        at
org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
        at
org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1096)
Caused by: java.util.concurrent.ExecutionException:
java.util.concurrent.CompletionException:
java.util.concurrent.CompletionException:
org.apache.flink.runtime.checkpoint.CheckpointTriggerException: Failed to
trigger savepoint. Decline reason: Not all required tasks are currently
running.
        at
java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
        at
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
        at
org.apache.flink.client.program.rest.RestClusterClient.cancelWithSavepoint(RestClusterClient.java:385)
        at
org.apache.flink.client.cli.CliFrontend.lambda$cancel$4(CliFrontend.java:579)
        ... 6 more
Caused by: java.util.concurrent.CompletionException:
java.util.concurrent.CompletionException:
org.apache.flink.runtime.checkpoint.CheckpointTriggerException: Failed to
trigger savepoint. Decline reason: Not all required tasks are currently
running.
        at
org.apache.flink.runtime.jobmaster.JobMaster.lambda$triggerSavepoint$13(JobMaster.java:959)
        at
java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
        at
java.util.concurrent.CompletableFuture.uniExceptionallyStage(CompletableFuture.java:884)
        at
java.util.concurrent.CompletableFuture.exceptionally(CompletableFuture.java:2196)
        at
org.apache.flink.runtime.jobmaster.JobMaster.triggerSavepoint(JobMaster.java:955)
        at sun.reflect.GeneratedMethodAccessor78.invoke(Unknown Source)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:247)
        at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:162)
        at
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
        at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
        at
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
        at
akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
        at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
        at akka.actor.ActorCell.invoke(ActorCell.scala:495)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
        at akka.dispatch.Mailbox.run(Mailbox.scala:224)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
        at
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.util.concurrent.CompletionException:
org.apache.flink.runtime.checkpoint.CheckpointTriggerException: Failed to
trigger savepoint. Decline reason: Not all required tasks are currently
running.
        at
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
        at
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
        at
java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
        at
java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:614)
        at
java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:1983)
        at
org.apache.flink.runtime.jobmaster.JobMaster.triggerSavepoint(JobMaster.java:947)
        ... 20 more
Caused by: org.apache.flink.runtime.checkpoint.CheckpointTriggerException:
Failed to trigger savepoint. Decline reason: Not all required tasks are
currently running.
        at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.triggerSavepoint(CheckpointCoordinator.java:377)
        at
org.apache.flink.runtime.jobmaster.JobMaster.triggerSavepoint(JobMaster.java:946)
        ... 20 more


Also, I see the following lines in the JobManager logs:
2018-07-11 05:41:13,316 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Checkpoint
triggering task Source: Custom Source -> Flat Map (1/3) of job
e691fa002c682703735afb178ce6ba37 is not being executed at the moment.
Aborting checkpoint.
2018-07-11 05:41:13,517 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Checkpoint
triggering task Source: Custom Source -> Flat Map (1/3) of job
e691fa002c682703735afb178ce6ba37 is not being executed at the moment.
Aborting checkpoint.
2018-07-11 05:41:13,716 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Checkpoint
triggering task Source: Custom Source -> Flat Map (1/3) of job
e691fa002c682703735afb178ce6ba37 is not being executed at the moment.
Aborting checkpoint.

Retrying the cancel at this point doesn't help. It keeps failing with the
same error till the job runs to completion.
Note that this issue happens intermittently, not always.

Do I need to do anything in particular in my application source and sink
checkpointing code? Have I forgotten to take care of something? I am using
flink-1.5.0.

I came across a similar issue here, but I don't see any updates:
http://mail-archives.apache.org/mod_mbox/flink-issues/201706.mbox/%3cjira.13082229.1498251834000.96074.1498260420...@atlassian.jira%3E

Regards,
James

Reply via email to