Hi all,
    I submitted a Flink job in yarn-cluster mode and cancelled it with the 
savepoint option immediately after the job status changed to deployed. Sometimes 
I get this error:

org.apache.flink.util.FlinkException: Could not cancel job xxxx.
        at org.apache.flink.client.cli.CliFrontend.lambda$cancel$4(CliFrontend.java:585)
        at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:960)
        at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:577)
        at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1034)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.
        at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
        at org.apache.flink.client.program.rest.RestClusterClient.cancelWithSavepoint(RestClusterClient.java:398)
        at org.apache.flink.client.cli.CliFrontend.lambda$cancel$4(CliFrontend.java:583)
        ... 6 more
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.
        at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213)
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
        ... 1 more
Caused by: java.util.concurrent.CompletionException: java.net.ConnectException: Connect refuse: xxx/xxx.xxx.xxx.xxx:xxx
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
        at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943)
        at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
        ... 16 more
Caused by: java.net.ConnectException: Connect refuse: xxx/xxx.xxx.xxx.xxx:xxx
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:281)
        ... 7 more

    I checked the JobManager log and found no errors. The savepoint was saved 
correctly in HDFS. The YARN application state changed to FINISHED and its 
FinalStatus changed to KILLED.
    I think this happens because RestClusterClient can no longer reach the 
JobManager address once the JobManager (AM) has shut down, so its retries end 
with the connection being refused.
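    To make that guess concrete, here is a minimal sketch (plain JDK only, no 
Flink classes) of how a wrapper around the client could tell this case apart 
from other cancel failures: it checks whether the failure is ultimately caused 
by a java.net.ConnectException. The class and method names are hypothetical and 
not part of Flink. It only classifies the exception; it does not prove the 
savepoint completed, so the savepoint directory still has to be checked 
separately.

```java
import java.net.ConnectException;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutionException;

// Hypothetical helper, not part of Flink: classifies a failed cancel call.
public class CancelErrorCheck {

    // Upper bound on the walk so a cyclic cause chain cannot loop forever.
    private static final int MAX_DEPTH = 32;

    /** Returns true if any exception in the cause chain is a ConnectException. */
    public static boolean causedByConnectException(Throwable failure) {
        Throwable current = failure;
        for (int depth = 0; current != null && depth < MAX_DEPTH; depth++) {
            if (current instanceof ConnectException) {
                return true;
            }
            current = current.getCause();
        }
        return false;
    }

    public static void main(String[] args) {
        // Nesting similar to the trace above:
        // ExecutionException -> CompletionException -> ConnectException
        Throwable likeTrace = new ExecutionException(
                new CompletionException(new ConnectException("Connect refuse")));
        System.out.println(causedByConnectException(likeTrace));

        // An unrelated failure should not be classified as a connection problem.
        System.out.println(causedByConnectException(new IllegalStateException("other")));
    }
}
```

    If the check returns true and the savepoint is present in HDFS, the cancel 
most likely succeeded and only the client's final status polling failed.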
    My Flink version is 1.5.3.
    Could anyone help me resolve this issue? Thanks!

Best regards!
