Hi all, I submit a flink job through yarn-cluster mode and cancel job with savepoint option immediately after job status change to deployed. Sometimes i met this error:
org.apache.flink.util.FlinkException: Could not cancel job xxxx. at org.apache.flink.client.cli.CliFrontend.lambda$cancel$4(CliFrontend.java:585) at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:960) at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:577) at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1034) at java.lang.Thread.run(Thread.java:748) Caused by: java.util.concurrent.ExecutionException: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted. at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) at org.apache.flink.client.program.rest.RestClusterClient.cancelWithSavepoint(RestClusterClient.java:398) at org.apache.flink.client.cli.CliFrontend.lambda$cancel$4(CliFrontend.java:583) ... 6 more Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted. at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213) at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) ... 1 more Caused by: java.util.concurrent.CompletionException: java.net.ConnectException: Connect refuse: xxx/xxx.xxx.xxx.xxx:xxx at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943) at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926) ... 16 more Caused by: java.net.ConnectException: Connect refuse: xxx/xxx.xxx.xxx.xxx:xxx at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224) at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:281) ... 7 more I check the jobmanager log, no error found. Savepoint is correct saved in hdfs. Yarn appliction status changed to FINISHED and FinalStatus change to KILLED. I think this issue occur because RestClusterClient cannot find jobmanager addresss after Jobmanager(AM) has shutdown. My flink version is 1.5.3. Anyone could help me to resolve this issue, thanks! Best Regard!