[ https://issues.apache.org/jira/browse/FLINK-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aljoscha Krettek updated FLINK-7540:
------------------------------------
    Component/s: Distributed Coordination

> submit a job on yarn-cluster mode or start a yarn-session failed,in hadoop cluster with capitalized hostname
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-7540
>                 URL: https://issues.apache.org/jira/browse/FLINK-7540
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination, YARN
>    Affects Versions: 1.3.1, 1.4.0, 1.3.2
>            Reporter: Tong Yan Ou
>            Priority: Blocker
>              Labels: patch
>             Fix For: 1.3.3
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Hostnames in my hadoop cluster are like these: “DSJ-RTB-4T-177”, “DSJ-signal-900G-71”
> When using the following command:
> ./bin/flink run -m yarn-cluster -yn 1 -yqu xl_trip -yjm 1024 ~/flink-1.3.1/examples/batch/WordCount.jar --input /user/all_trip_dev/test/testcount.txt --output /user/all_trip_dev/test/result
>
> Or
> ./bin/yarn-session.sh -d -jm 6144 -tm 12288 -qu xl_trip -s 24 -n 5 -nm "flink-YarnSession-jm6144-tm12288-s24-n5-xl_trip"
> There will be some exceptions at Command line interface:
> java.lang.RuntimeException: Unable to get ClusterClient status from Application Client
>     at org.apache.flink.yarn.YarnClusterClient.getClusterStatus(YarnClusterClient.java:243)
> …
> Caused by: org.apache.flink.util.FlinkException: Could not connect to the leading JobManager. Please check that the JobManager is running.
> h4. Then the job fails, starting the yarn-session is the same.
> The exceptions of the application log:
> 2017-08-10 17:36:10,334 WARN org.apache.flink.runtime.webmonitor.JobManagerRetriever - Failed to retrieve leader gateway and port.
> akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://flink@dsj-signal-4t-248:65082/), Path(/user/jobmanager)]
> …
> 2017-08-10 17:36:10,837 ERROR org.apache.flink.yarn.YarnFlinkResourceManager - Resource manager could not register at JobManager
> akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@dsj-signal-4t-248:65082/), Path(/user/jobmanager)]] after [10000 ms]
> And I found some differences in actor System:
> 2017-08-10 17:35:56,791 INFO org.apache.flink.yarn.YarnJobManager - Starting JobManager at akka.tcp://flink@DSJ-signal-4T-248:65082/user/jobmanager.
> 2017-08-10 17:35:56,880 INFO org.apache.flink.yarn.YarnJobManager - JobManager akka.tcp://flink@DSJ-signal-4T-248:65082/user/jobmanager was granted leadership with leader session ID Some(00000000-0000-0000-0000-000000000000).
> 2017-08-10 17:36:00,312 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Web frontend listening at 0:0:0:0:0:0:0:0:54921
> 2017-08-10 17:36:00,312 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Starting with JobManager akka.tcp://flink@DSJ-signal-4T-248:65082/user/jobmanager on port 54921
> 2017-08-10 17:36:00,313 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink@dsj-signal-4t-248:65082/user/jobmanager:00000000-0000-0000-0000-000000000000.
> The JobManager is “akka.tcp://flink@DSJ-signal-4T-248:65082” and the JobManagerRetriever is “akka.tcp://flink@dsj-signal-4t-248:65082”
> The hostname of JobManagerRetriever’s actor is lowercase.
> And I read the source code:
> Class NetUtils, the unresolvedHostToNormalizedString(String host) method at line 127:
> public static String unresolvedHostToNormalizedString(String host) {
>
>     // Return loopback interface address if host is null
>     // This represents the behavior of {@code InetAddress.getByName } and RFC 3330
>     if (host == null) {
>         host = InetAddress.getLoopbackAddress().getHostAddress();
>     } else {
>         host = host.trim().toLowerCase();
>     }
>     ...
> }
> It turns the host name into lowercase.
> Therefore, JobManagerRetriever cannot find the JobManager's ActorSystem.
> Then I removed the call to the toLowerCase() method in the source code.
> Finally, I can submit a job in yarn-cluster mode and start a yarn-session.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
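
The mismatch described in the report can be reproduced in isolation. The following is a minimal sketch, not Flink code: the class `HostCaseDemo` and its helper methods are hypothetical names introduced for illustration, and it assumes (as the logs above suggest) that the actor lookup only succeeds when the host part of the address matches the started ActorSystem's host exactly, including case.

```java
import java.util.Locale;

public class HostCaseDemo {

    // Hypothetical stand-in for the normalization step quoted in the report:
    // trims the host name and lowercases it.
    public static String normalizeLowercase(String host) {
        return host.trim().toLowerCase(Locale.ROOT);
    }

    // Hypothetical variant matching the reporter's workaround:
    // trims the host name but keeps its original case.
    public static String normalizeKeepCase(String host) {
        return host.trim();
    }

    // Builds an address string in the shape seen in the logs.
    public static String jobManagerAddress(String host, int port) {
        return "akka.tcp://flink@" + host + ":" + port + "/user/jobmanager";
    }

    public static void main(String[] args) {
        String yarnHost = "DSJ-signal-4T-248"; // host name as reported by the cluster
        int port = 65082;

        // Address under which the JobManager was started (original case).
        String started = jobManagerAddress(yarnHost, port);

        // Addresses a client would derive after each kind of normalization.
        String lowercased = jobManagerAddress(normalizeLowercase(yarnHost), port);
        String caseKept = jobManagerAddress(normalizeKeepCase(yarnHost), port);

        System.out.println(started.equals(lowercased)); // false
        System.out.println(started.equals(caseKept));   // true
    }
}
```

Under that assumption, any host name containing uppercase characters yields a lookup address that differs from the one the JobManager was started with, which is consistent with the ActorNotFound and AskTimeoutException entries in the application log.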