[ https://issues.apache.org/jira/browse/FLINK-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16668448#comment-16668448 ]
wgcn commented on FLINK-5770:
-----------------------------

Your client may have shut down. You can use the -d argument.

> Flink yarn session stop in non-detached model
> ---------------------------------------------
>
> Key: FLINK-5770
> URL: https://issues.apache.org/jira/browse/FLINK-5770
> Project: Flink
> Issue Type: Bug
> Components: Client
> Affects Versions: 1.2.0
> Environment: 1. The cluster contains 4 nodes;
> 2. every node has 380GB memory, and the CPU has 40 cores;
> 3. the OS is CentOS 7.2;
> Reporter: zhangrucong1982
> Priority: Major
>
> 1. I use the recent version of Flink, and use Flink in security mode without HA. The configurations in flink-conf.yaml are:
> security.kerberos.login.keytab: /home/demo/flink/release/flink-1.2.2/keytab/huawei1.keytab
> security.kerberos.login.principal: huawei1
> security.kerberos.login.contexts: Client,KafkaClient
> 2. Then I use the command ./yarn-session.sh -n 2 to start the cluster with two taskmanagers.
> 3. But about 4 hours later, the session shuts down by itself. The error stack is the following:
> 2017-02-07 19:27:30,841 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@9-96-101-251:38650] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
> 2017-02-07 19:27:42,804 WARN org.apache.flink.yarn.cli.FlinkYarnSessionCli - Exception while running the interactive command line interface
> java.lang.RuntimeException: Unable to get ClusterClient status from Application Client
>     at org.apache.flink.yarn.YarnClusterClient.getClusterStatus(YarnClusterClient.java:248)
>     at org.apache.flink.yarn.cli.FlinkYarnSessionCli.runInteractiveCli(FlinkYarnSessionCli.java:410)
>     at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:663)
>     at org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:476)
>     at org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:473)
>     at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>     at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
>     at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:473)
> Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not retrieve the leader gateway
>     at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:142)
>     at org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:691)
>     at org.apache.flink.yarn.YarnClusterClient.getClusterStatus(YarnClusterClient.java:243)
>     ... 10 more
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
>     at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>     at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>     at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
>     at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>     at scala.concurrent.Await$.result(package.scala:190)
>     at scala.concurrent.Await.result(package.scala)
>     at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:140)
>     ... 12 more
> 4. The detailed log can be seen at the following link:
> https://docs.google.com/document/d/1mbxrCy6mHHFxcxPv8f7CCA3BI1QVGPeNiHxUQhuZP0o/edit?usp=sharing

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)