I am trying to upgrade a job from flink 1.4.2 to 1.6.0

When we do a deploy we cancel the job with a savepoint then deploy the new
version of the job from that savepoint. Because our jobs tend to have a lot
of state it often takes multiple minutes for our savepoints to complete.

On flink 1.4.2 we set *akka.client.timeout* to a high value to make sure
the request did not timeout

However on flink 1.6.0 I get an *AskTimeoutException*  and increasing
*akka.client.timeout* only works if i apply it to the running flink process.
Applying it to just the flink client does nothing.

I am reluctant to configure this on the container itself because afaik it
applies to everything inside of flink's internal actor system not just to
creating savepoints.

What is the correct way to use cancel with savepoint for jobs with lots of
state in flink 1.6.0 ?

I Attached the error.
Cancelling job 9068efa57d6599e7d6a71b7f7eac7d2f with savepoint to default 
savepoint directory.

------------------------------------------------------------
 The program finished with the following exception:

org.apache.flink.util.FlinkException: Could not cancel job 
9068efa57d6599e7d6a71b7f7eac7d2f.
        at 
org.apache.flink.client.cli.CliFrontend.lambda$cancel$6(CliFrontend.java:604)
        at 
org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:979)
        at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:596)
        at 
org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1053)
        at 
org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1120)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
        at 
org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
        at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1120)
Caused by: java.util.concurrent.ExecutionException: 
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask 
timed out on [Actor[akka://flink/user/jobmanager_1#172582095]] after [10000 
ms]. Sender[null] sent message of type 
"org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
        at 
java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
        at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
        at 
org.apache.flink.client.program.rest.RestClusterClient.cancelWithSavepoint(RestClusterClient.java:407)
        at 
org.apache.flink.client.cli.CliFrontend.lambda$cancel$6(CliFrontend.java:602)
        ... 9 more
Caused by: java.util.concurrent.CompletionException: 
akka.pattern.AskTimeoutException: Ask timed out on 
[Actor[akka://flink/user/jobmanager_1#172582095]] after [10000 ms]. 
Sender[null] sent message of type 
"org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
        at 
java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
        at 
java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
        at 
java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
        at 
java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
        at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
        at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
        at 
org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:770)
        at akka.dispatch.OnComplete.internal(Future.scala:258)
        at akka.dispatch.OnComplete.internal(Future.scala:256)
        at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
        at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
        at 
org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
        at 
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
        at 
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
        at 
akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
        at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
        at 
scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
        at 
scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
        at 
scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
        at 
akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
        at 
akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
        at 
akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
        at 
akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
        at java.lang.Thread.run(Thread.java:748)
Caused by: akka.pattern.AskTimeoutException: Ask timed out on 
[Actor[akka://flink/user/jobmanager_1#172582095]] after [10000 ms]. 
Sender[null] sent message of type 
"org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
        at 
akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
        ... 9 more

Reply via email to