Hi Rinat,

No, Flink does not have a switch to immediately cancel a job if it cannot allocate enough resources. Maybe YARN has a configuration parameter that defines a timeout after which a job is canceled if no resources become available.
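
Since there is no built-in switch, one workaround is to enforce the deadline on the client side. The following is only a minimal sketch, not a Flink or YARN feature: the helper name `submit_with_deadline` and the timeout value are made up for illustration, and the command you pass in would be whatever `flink run ...` invocation you already use.

```python
import subprocess


def submit_with_deadline(cmd, timeout_s):
    """Run a submission command and enforce a client-side deadline.

    Returns the exit code of the command, or None if the command did not
    finish within timeout_s seconds and had to be killed.
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The client is still waiting (for example for YARN resources):
        # kill it and reap the process so it does not linger.
        proc.kill()
        proc.wait()
        return None


# Hypothetical usage; replace the command with your own submission command:
# status = submit_with_deadline(["flink", "run", "..."], timeout_s=300)
# if status is None: treat the startup as failed
```

Note that killing the client process does not necessarily remove the application from YARN. If the application has already been registered, it would probably still have to be removed separately, for example with `yarn application -kill <application id>`.
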

2017-09-04 13:29 GMT+02:00 Rinat <r.shari...@cleverdata.ru>:

> Hi everyone,
>
> I've got the following problem: when I try to submit a new job and the
> cluster does not have enough resources, the job submission fails with the
> following exception. But in *YARN* the job hangs and waits for the
> requested resources. When resources become available, the job runs
> successfully.
>
> What can I do to be sure that job startup either completes successfully
> or fails completely?
>
> Thx.
>
> The program finished with the following exception:
>
> java.lang.RuntimeException: Unable to tell application master to stop once the specified job has been finised
>     at org.apache.flink.yarn.YarnClusterClient.stopAfterJob(YarnClusterClient.java:177)
>     at org.apache.flink.yarn.YarnClusterClient.submitJob(YarnClusterClient.java:201)
>     at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:442)
>     at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:76)
>     at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:387)
>     at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:838)
>     at org.apache.flink.client.CliFrontend.run(CliFrontend.java:259)
>     at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1086)
>     at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1133)
>     at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1130)
>     at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
>     at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
>     at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1129)
> Caused by: org.apache.flink.util.FlinkException: Could not connect to the leading JobManager. Please check that the JobManager is running.
>     at org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:789)
>     at org.apache.flink.yarn.YarnClusterClient.stopAfterJob(YarnClusterClient.java:171)
>     ... 15 more
> Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not retrieve the leader gateway.
>     at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:79)
>     at org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:784)
>     ... 16 more
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
>     at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>     at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>     at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
>     at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>     at scala.concurrent.Await$.result(package.scala:190)
>     at scala.concurrent.Await.result(package.scala)
>     at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:77)
>     ... 17 more