[jira] [Created] (FLINK-7540) submit a job on yarn-cluster mode or start a yarn-session failed,in hadoop cluster with capitalized hostname

Tong Yan Ou (JIRA) Mon, 28 Aug 2017 03:02:29 -0700

Tong Yan Ou created FLINK-7540:
----------------------------------

             Summary: submit a job on yarn-cluster mode or start a yarn-session 
failed,in hadoop cluster with capitalized hostname
                 Key: FLINK-7540
                 URL: https://issues.apache.org/jira/browse/FLINK-7540
             Project: Flink
          Issue Type: Bug
          Components: YARN
    Affects Versions: 1.3.2, 1.3.1, 1.4.0
            Reporter: Tong Yan Ou
             Fix For: 1.3.3



Hostnames in my Hadoop cluster contain uppercase letters, for example "DSJ-RTB-4T-177" and "DSJ-signal-900G-71".
When using the following command:
./bin/flink run -m yarn-cluster -yn 1 -yqu xl_trip -yjm 1024 ~/flink-1.3.1/examples/batch/WordCount.jar --input /user/all_trip_dev/test/testcount.txt --output /user/all_trip_dev/test/result
Or
./bin/yarn-session.sh -d -jm 6144 -tm 12288 -qu xl_trip -s 24 -n 5 -nm "flink-YarnSession-jm6144-tm12288-s24-n5-xl_trip"
the command line interface reports the following exceptions:

java.lang.RuntimeException: Unable to get ClusterClient status from Application Client
at org.apache.flink.yarn.YarnClusterClient.getClusterStatus(YarnClusterClient.java:243)
…
Caused by: org.apache.flink.util.FlinkException: Could not connect to the leading JobManager. Please check that the JobManager is running.

h4. The job then fails; starting the yarn-session fails in the same way.

The application log shows these exceptions:
2017-08-10 17:36:10,334 WARN  org.apache.flink.runtime.webmonitor.JobManagerRetriever       - Failed to retrieve leader gateway and port.
akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://flink@dsj-signal-4t-248:65082/), Path(/user/jobmanager)]
…
2017-08-10 17:36:10,837 ERROR org.apache.flink.yarn.YarnFlinkResourceManager                - Resource manager could not register at JobManager
akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@dsj-signal-4t-248:65082/), Path(/user/jobmanager)]] after [10000 ms]


I also found a difference in the actor system addresses:
2017-08-10 17:35:56,791 INFO  org.apache.flink.yarn.YarnJobManager                          - Starting JobManager at akka.tcp://flink@DSJ-signal-4T-248:65082/user/jobmanager.
2017-08-10 17:35:56,880 INFO  org.apache.flink.yarn.YarnJobManager                          - JobManager akka.tcp://flink@DSJ-signal-4T-248:65082/user/jobmanager was granted leadership with leader session ID Some(00000000-0000-0000-0000-000000000000).
2017-08-10 17:36:00,312 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Web frontend listening at 0:0:0:0:0:0:0:0:54921
2017-08-10 17:36:00,312 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Starting with JobManager akka.tcp://flink@DSJ-signal-4T-248:65082/user/jobmanager on port 54921
2017-08-10 17:36:00,313 INFO  org.apache.flink.runtime.webmonitor.JobManagerRetriever       - New leader reachable under akka.tcp://flink@dsj-signal-4t-248:65082/user/jobmanager:00000000-0000-0000-0000-000000000000.


The JobManager starts at "akka.tcp://flink@DSJ-signal-4T-248:65082", but the JobManagerRetriever looks for it at "akka.tcp://flink@dsj-signal-4t-248:65082".
The hostname in the JobManagerRetriever's actor address is lowercase.
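
The two addresses from the log can be compared directly. The following is a minimal sketch, not Flink code: the class and method names are hypothetical, and it assumes the lookup treats the address as an exact, case-sensitive string, which is what the ActorNotFound error suggests.

```java
public class AddressCaseCheck {

    // Returns true when the two addresses are different strings
    // but become equal once letter case is ignored.
    static boolean differOnlyByCase(String a, String b) {
        return !a.equals(b) && a.equalsIgnoreCase(b);
    }

    public static void main(String[] args) {
        // Addresses copied from the application log above.
        String jobManager = "akka.tcp://flink@DSJ-signal-4T-248:65082";
        String retriever = "akka.tcp://flink@dsj-signal-4t-248:65082";

        System.out.println("exact match:          " + jobManager.equals(retriever));
        System.out.println("differs only by case: " + differOnlyByCase(jobManager, retriever));
    }
}
```

With the logged values, the exact comparison fails while the case-insensitive one succeeds, which points at hostname case as the only difference.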


I then read the source code. In class NetUtils, the method unresolvedHostToNormalizedString(String host) at line 127 is:

    public static String unresolvedHostToNormalizedString(String host) {
        // Return loopback interface address if host is null
        // This represents the behavior of {@code InetAddress.getByName } and RFC 3330
        if (host == null) {
            host = InetAddress.getLoopbackAddress().getHostAddress();
        } else {
            host = host.trim().toLowerCase();
        }
        ...
    }


It converts the host name to lowercase.
As a result, the JobManagerRetriever cannot find the JobManager's actor system, which is registered under the original mixed-case hostname.
I then removed the call to toLowerCase() in the source code.

After that change, I can submit a job in yarn-cluster mode and start a yarn-session.
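
The effect of that change can be sketched as below. This is only an illustration of the workaround described above, not the actual Flink patch: the class HostNormalizationSketch and both helper methods are hypothetical, and the part of the original method elided by "..." is not reproduced.

```java
import java.net.InetAddress;

public class HostNormalizationSketch {

    // Hypothetical variant of the normalization step: it still trims
    // whitespace and falls back to the loopback address for null,
    // but it no longer lowercases the host name.
    static String normalizeKeepingCase(String host) {
        if (host == null) {
            return InetAddress.getLoopbackAddress().getHostAddress();
        }
        return host.trim();
    }

    // Builds an address in the same shape as the ones in the log.
    static String jobManagerUrl(String host, int port) {
        return "akka.tcp://flink@" + host + ":" + port + "/user/jobmanager";
    }

    public static void main(String[] args) {
        String registered = "akka.tcp://flink@DSJ-signal-4T-248:65082/user/jobmanager";
        String lookedUp = jobManagerUrl(normalizeKeepingCase(" DSJ-signal-4T-248 "), 65082);
        System.out.println(registered.equals(lookedUp));
    }
}
```

Keeping the case means the address used for the lookup is the same string the JobManager was started under. Whether dropping the lowercasing is safe for other callers of the method is not covered by this sketch.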






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
