Hi Rinat,

No, Flink does not have a switch to immediately cancel a job if it cannot
allocate enough resources.
Maybe YARN has a configuration parameter to define a timeout after which a
job is canceled if no resource become available.

2017-09-04 13:29 GMT+02:00 Rinat <r.shari...@cleverdata.ru>:

> Hi everyone, I’ve got the following problem, when I’m trying to submit new
> job and if cluster has not enough resources, job submission fails with the
> following exception
> But in *YARN *job hangs and wait’s for requested resources. When
> resources become available, job successfully runs.
>
> What can I do to be sure that job startup is completed successfully or
> completely failed  ?
>
> Thx.
>
> *The program finished with the following
> exception:\n\njava.lang.RuntimeException: Unable to tell application master
> to stop once the specified job has been finised*\n\tat
> org.apache.flink.yarn.YarnClusterClient.stopAfterJob(
> YarnClusterClient.java:177)\n\tat org.apache.flink.yarn.
> YarnClusterClient.submitJob(YarnClusterClient.java:201)\n\tat
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:442)\n\tat
> org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(
> DetachedEnvironment.java:76)\n\tat org.apache.flink.client.
> program.ClusterClient.run(ClusterClient.java:387)\n\tat
> org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:838)\n\tat
> org.apache.flink.client.CliFrontend.run(CliFrontend.java:259)\n\tat
> org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1086)\n\tat
> org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1133)\n\tat
> org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1130)\n\tat
> org.apache.flink.runtime.security.HadoopSecurityContext$1.run(
> HadoopSecurityContext.java:43)\n\tat java.security.
> AccessController.doPrivileged(Native Method)\n\tat
> javax.security.auth.Subject.doAs(Subject.java:422)\n\tat
> org.apache.hadoop.security.UserGroupInformation.doAs(
> UserGroupInformation.java:1656)\n\tat org.apache.flink.runtime.security.
> HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)\n\tat
> org.apache.flink.client.CliFrontend.main(CliFrontend.java:1129)\nCaused
> by: org.apache.flink.util.FlinkException: Could not connect to the
> leading JobManager. Please check that the JobManager is running.\n\tat
> org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:789)\n\tat
> org.apache.flink.yarn.YarnClusterClient.stopAfterJob(
> YarnClusterClient.java:171)\n\t... 15 more\nCaused by:
> org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could
> not retrieve the leader gateway.\n\tat org.apache.flink.runtime.util.
> LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:79)\n\tat
> org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:784)\n\t...
> 16 more\nCaused by: java.util.concurrent.TimeoutException: Futures timed
> out after [10000 milliseconds]\n\tat scala.concurrent.impl.Promise$
> DefaultPromise.ready(Promise.scala:219)\n\tat
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)\n\tat
> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)\n\tat
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)\n\tat
> scala.concurrent.Await$.result(package.scala:190)\n\tat
> scala.concurrent.Await.result(package.scala)\n\tat
> org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(
> LeaderRetrievalUtils.java:77)\n\t... 17 more", "stderr_lines": ["",
> "------------------------------------------------------------", " The
> program finished with the following exception:", "",
> "java.lang.RuntimeException: Unable to tell application master to stop once
> the specified job has been finised", "\tat org.apache.flink.yarn.
> YarnClusterClient.stopAfterJob(YarnClusterClient.java:177)", "\tat
> org.apache.flink.yarn.YarnClusterClient.submitJob(YarnClusterClient.java:201)",
> "\tat 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:442)",
> "\tat 
> org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:76)",
> "\tat 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:387)",
> "\tat 
> org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:838)",
> "\tat org.apache.flink.client.CliFrontend.run(CliFrontend.java:259)",
> "\tat 
> org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1086)",
> "\tat org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1133)",
> "\tat org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1130)",
> "\tat org.apache.flink.runtime.security.HadoopSecurityContext$1.run(
> HadoopSecurityContext.java:43)", "\tat java.security.
> AccessController.doPrivileged(Native Method)", "\tat
> javax.security.auth.Subject.doAs(Subject.java:422)", "\tat
> org.apache.hadoop.security.UserGroupInformation.doAs(
> UserGroupInformation.java:1656)", "\tat org.apache.flink.runtime.security.
> HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)", "\tat
> org.apache.flink.client.CliFrontend.main(CliFrontend.java:1129)", "Caused
> by: org.apache.flink.util.FlinkException: Could not connect to the
> leading JobManager. Please check that the JobManager is running.", "\tat
> org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:789)",
> "\tat 
> org.apache.flink.yarn.YarnClusterClient.stopAfterJob(YarnClusterClient.java:171)",
> "\t... 15 more", "Caused by: 
> org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException:
> Could not retrieve the leader gateway.", "\tat
> org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(
> LeaderRetrievalUtils.java:79)", "\tat org.apache.flink.client.
> program.ClusterClient.getJobManagerGateway(ClusterClient.java:784)",
> "\t... 16 more", "Caused by: java.util.concurrent.TimeoutException:
> Futures timed out after [10000 milliseconds]", "\tat
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)",
> "\tat scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)",
> "\tat scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)",
> "\tat scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(
> BlockContext.scala:53)"
>

Reply via email to