Hi all,
I submit a Flink job in yarn-cluster mode and cancel it with the
savepoint option immediately after the job status changes to deployed.
Sometimes I get this error:
org.apache.flink.util.FlinkException: Could not cancel job xxxx.
    at org.apache.flink.client.cli.CliFrontend.lambda$cancel$4(CliFrontend.java:585)
    at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:960)
    at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:577)
    at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1034)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.
    at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
    at org.apache.flink.client.program.rest.RestClusterClient.cancelWithSavepoint(RestClusterClient.java:398)
    at org.apache.flink.client.cli.CliFrontend.lambda$cancel$4(CliFrontend.java:583)
    ... 6 more
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.
    at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
    ... 1 more
Caused by: java.util.concurrent.CompletionException: java.net.ConnectException: Connect refuse: xxx/xxx.xxx.xxx.xxx:xxx
    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
    at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
    at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943)
    at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
    ... 16 more
Caused by: java.net.ConnectException: Connect refuse: xxx/xxx.xxx.xxx.xxx:xxx
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:281)
    ... 7 more
I checked the JobManager log and found no errors. The savepoint is saved
correctly in HDFS. The YARN application state changed to FINISHED and its
FinalStatus changed to KILLED.
I think this issue occurs because RestClusterClient can no longer reach the
JobManager address after the JobManager (AM) has shut down.
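The suspected failure mode can be sketched in plain Java, without any Flink classes. This is only an illustration of the hypothesis, not Flink's actual code: RetryExhaustedSketch, its nested RetryException, retry and pollSavepointStatus are hypothetical names, and the retry logic is a simplified stand-in (no delay) for what FutureUtils does in the trace. It shows how a client that keeps polling an endpoint that has already shut down ends up with the same cause chain as above: ExecutionException, then RetryException, then CompletionException, then ConnectException.

```java
import java.net.ConnectException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutionException;
import java.util.function.Supplier;

public class RetryExhaustedSketch {

    /** Hypothetical stand-in for the RetryException seen in the trace. */
    public static class RetryException extends Exception {
        public RetryException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Runs the operation and retries on failure, up to the given number of retries. */
    public static <T> CompletableFuture<T> retry(
            Supplier<CompletableFuture<T>> operation, int retries) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attempt(operation, retries, result);
        return result;
    }

    private static <T> void attempt(
            Supplier<CompletableFuture<T>> operation,
            int retriesLeft,
            CompletableFuture<T> result) {
        operation.get().whenComplete((value, error) -> {
            if (error == null) {
                result.complete(value);
            } else if (retriesLeft > 0) {
                // Try again; a real client would wait between attempts.
                attempt(operation, retriesLeft - 1, result);
            } else {
                // Out of retries: wrap the last failure.
                result.completeExceptionally(new RetryException(
                        "Could not complete the operation. "
                                + "Number of retries has been exhausted.",
                        error));
            }
        });
    }

    /** Simulates polling a REST endpoint whose process has already exited. */
    public static CompletableFuture<String> pollSavepointStatus() {
        CompletableFuture<String> response = new CompletableFuture<>();
        response.completeExceptionally(
                new CompletionException(new ConnectException("Connection refused")));
        return response;
    }

    /** Walks the cause chain down to the innermost exception. */
    public static Throwable rootCause(Throwable t) {
        while (t.getCause() != null) {
            t = t.getCause();
        }
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        try {
            retry(RetryExhaustedSketch::pollSavepointStatus, 3).get();
        } catch (ExecutionException e) {
            // Print each level of the cause chain, outermost first.
            for (Throwable t = e; t != null; t = t.getCause()) {
                System.out.println(t.getClass().getName());
            }
        }
    }
}
```

If this is what happens, the savepoint and the cancellation succeed on the cluster side and only the client's final status polling fails, which would match the clean JobManager log and the savepoint being present in HDFS.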
My Flink version is 1.5.3.
Could anyone help me resolve this issue? Thanks!
devin.