Thanks for your help. Attached are the logs from the job manager launch on two nodes of the cluster, as well as the logs for the task managers: one that worked and one that could not seem to find an address, which I am assuming is the job manager's address. The logs are from nodes aaa-1 and aaa-2.

Thanks,

Jason

    On Tuesday, May 15, 2018, 9:59:58 a.m. EDT, Timo Walther 
<twal...@apache.org> wrote:  
 
  Can you change the log level to DEBUG and share the logs with us? Maybe Till 
(in CC) has some idea?
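
In case it helps: with the default log4j-based logging, this can be done by editing conf/log4j.properties on each node and restarting the processes, e.g.:

    # conf/log4j.properties (default log4j setup assumed)
    log4j.rootLogger=DEBUG, file

    # or, to keep the noise down, only for the leader election/retrieval code:
    log4j.logger.org.apache.flink.runtime.leaderelection=DEBUG
    log4j.logger.org.apache.flink.runtime.leaderretrieval=DEBUG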
 
 Regards,
 Timo
 
 
 On 15.05.18 at 15:18, Jason Kania wrote:
  
  Hi Timo,
  
 Thanks for the response.
 
 Yes, we are running with a cloud provider: a cloud system provided by our national government for R&D purposes. The strange thing is that we also run Kafka and Cassandra on the same nodes and they have no issues in this environment; it is only Flink in an HA configuration that has problems.
 
 Is there any additional logging available for analysis of these sorts of 
scenarios? The details in the current logs are insufficient to know what is 
happening.
 
 Thanks,
 
 Jason
         
     On Tuesday, May 15, 2018, 7:51:40 a.m. EDT, Timo Walther 
<twal...@apache.org> wrote:  
  
     Hi Jason,
 
 this sounds more like a network connection/firewall issue to me. Can you tell 
us a bit more about your environment? Are you running your Flink cluster on a 
cloud provider?
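
One quick check, from each task manager node, is whether the job manager's RPC port and ZooKeeper are actually reachable (6123 is only the default jobmanager.rpc.port; adjust to your setup):

    nc -zv <jobmanager-host> 6123
    nc -zv <zookeeper-host> 2181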
 
 Regards,
 Timo
 
 
 On 15.05.18 at 05:15, Jason Kania wrote:
   
  Hi,
 
 I am using the 1.4.2 release on Ubuntu and attempting to make use of an HA Job Manager, but unfortunately enabling the HA functionality prevents job submission with the following error:
 
 java.lang.RuntimeException: Failed to retrieve JobManager address
         at org.apache.flink.client.program.ClusterClient.getJobManagerAddress(ClusterClient.java:308)
         at org.apache.flink.client.program.StandaloneClusterClient.getClusterIdentifier(StandaloneClusterClient.java:86)
         at org.apache.flink.client.CliFrontend.createClient(CliFrontend.java:921)
         at org.apache.flink.client.CliFrontend.run(CliFrontend.java:264)
         at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1054)
         at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1101)
         at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1098)
         at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
         at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1098)
 Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not retrieve the leader address and leader session ID.
         at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderConnectionInfo(LeaderRetrievalUtils.java:113)
         at org.apache.flink.client.program.ClusterClient.getJobManagerAddress(ClusterClient.java:302)
         ... 8 more
 Caused by: java.util.concurrent.TimeoutException: Futures timed out after [60000 milliseconds]
         at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
         at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
         at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
         at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
         at scala.concurrent.Await$.result(package.scala:190)
         at scala.concurrent.Await.result(package.scala)
         at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderConnectionInfo(LeaderRetrievalUtils.java:111)
         ... 9 more
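
 For reference, the HA-related configuration follows the documented pattern, roughly like this; the host names and paths below are placeholders rather than our actual values:

    # flink-conf.yaml (illustrative values only)
    high-availability: zookeeper
    high-availability.zookeeper.quorum: zkhost1:2181,zkhost2:2181,zkhost3:2181
    high-availability.zookeeper.path.root: /flink
    high-availability.storageDir: file:///shared/flink/ha/

    # conf/masters
    aaa-1:8081
    aaa-2:8081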
 
 This also seems to be tied to problems getting the TaskManager to register. I have to repeatedly restart the TaskManager until it finally connects to the Job Manager. Most of the time it does not connect and does not complain, which makes determining the root cause more difficult. The cluster is not busy, and I have tried both IP addresses and host names to rule out name resolution issues, but the behaviour is the same in both cases.
 
 I have also noticed that if two job managers are launched on different nodes in the same cluster, both log that they are the leader, so they are not coordinating with each other; the logs do not even indicate that they are attempting to talk to one another.
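
 If it would help with diagnosis, I can also inspect the election state directly in ZooKeeper; assuming the default high-availability.zookeeper.path.root of /flink (the exact sub-paths depend on the configured cluster-id), something along these lines:

    bin/zkCli.sh -server <zookeeper-host>:2181
    ls /flink
    ls /flink/default/leaderlatch
    get /flink/default/leader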
   
 Lastly, the error "Could not retrieve the leader address and leader session ID." is a very poor error because it does not indicate where it is attempting to retrieve the information from.
  
 Any suggestions would be appreciated.
 
   Thanks,
 
 Jason

Attachment: stream-job-manager-aaa-1.log.gz
Description: application/gzip

Attachment: stream-processor-aaa-1.log.gz
Description: application/gzip

Attachment: stream-processor-aaa-2.log.gz
Description: application/gzip

Attachment: stream-job-manager-aaa-2.log.gz
Description: application/gzip
