Hi everyone, I’ve got the following problem, when I’m trying to submit new job and if cluster has not enough resources, job submission fails with the following exception But in YARN job hangs and wait’s for requested resources. When resources become available, job successfully runs.
What can I do to be sure that job startup is completed successfully or completely failed ? Thx. The program finished with the following exception:\n\njava.lang.RuntimeException: Unable to tell application master to stop once the specified job has been finised\n\tat org.apache.flink.yarn.YarnClusterClient.stopAfterJob(YarnClusterClient.java:177)\n\tat org.apache.flink.yarn.YarnClusterClient.submitJob(YarnClusterClient.java:201)\n\tat org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:442)\n\tat org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:76)\n\tat org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:387)\n\tat org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:838)\n\tat org.apache.flink.client.CliFrontend.run(CliFrontend.java:259)\n\tat org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1086)\n\tat org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1133)\n\tat org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1130)\n\tat org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)\n\tat java.security.AccessController.doPrivileged(Native Method)\n\tat javax.security.auth.Subject.doAs(Subject.java:422)\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)\n\tat org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)\n\tat org.apache.flink.client.CliFrontend.main(CliFrontend.java:1129)\nCaused by: org.apache.flink.util.FlinkException: Could not connect to the leading JobManager. Please check that the JobManager is running.\n\tat org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:789)\n\tat org.apache.flink.yarn.YarnClusterClient.stopAfterJob(YarnClusterClient.java:171)\n\t... 15 more\nCaused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not retrieve the leader gateway.\n\tat org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:79)\n\tat org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:784)\n\t... 16 more\nCaused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]\n\tat scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)\n\tat scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)\n\tat scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)\n\tat scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)\n\tat scala.concurrent.Await$.result(package.scala:190)\n\tat scala.concurrent.Await.result(package.scala)\n\tat org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:77)\n\t... 17 more", "stderr_lines": ["", "------------------------------------------------------------", " The program finished with the following exception:", "", "java.lang.RuntimeException: Unable to tell application master to stop once the specified job has been finised", "\tat org.apache.flink.yarn.YarnClusterClient.stopAfterJob(YarnClusterClient.java:177)", "\tat org.apache.flink.yarn.YarnClusterClient.submitJob(YarnClusterClient.java:201)", "\tat org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:442)", "\tat org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:76)", "\tat org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:387)", "\tat org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:838)", "\tat org.apache.flink.client.CliFrontend.run(CliFrontend.java:259)", "\tat org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1086)", "\tat org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1133)", "\tat org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1130)", "\tat org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)", "\tat java.security.AccessController.doPrivileged(Native Method)", "\tat javax.security.auth.Subject.doAs(Subject.java:422)", "\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)", "\tat org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)", "\tat org.apache.flink.client.CliFrontend.main(CliFrontend.java:1129)", "Caused by: org.apache.flink.util.FlinkException: Could not connect to the leading JobManager. Please check that the JobManager is running.", "\tat org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:789)", "\tat org.apache.flink.yarn.YarnClusterClient.stopAfterJob(YarnClusterClient.java:171)", "\t... 15 more", "Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not retrieve the leader gateway.", "\tat org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:79)", "\tat org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:784)", "\t... 16 more", "Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]", "\tat scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)", "\tat scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)", "\tat scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)", "\tat scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)"