Can you change the log level to DEBUG and share the logs with us? Maybe
Till (in CC) has some idea?
Regards,
Timo
On 15.05.18 at 15:18, Jason Kania wrote:
Hi Timo,
Thanks for the response.
Yes, we are running with a cloud provider: a cloud system provided by
our national government for R&D purposes. However, we also have Kafka
and Cassandra on the same nodes, and they have no issues in this
environment. Only Flink in an HA configuration has problems, which is
strange.
Is there any additional logging available for analysis of these sorts
of scenarios? The details in the current logs are insufficient to know
what is happening.
Thanks,
Jason
On Tuesday, May 15, 2018, 7:51:40 a.m. EDT, Timo Walther
<twal...@apache.org> wrote:
Hi Jason,
This sounds more like a network connection/firewall issue to me. Can
you tell us a bit more about your environment? Are you running your
Flink cluster on a cloud provider?
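A quick way to rule a firewall in or out is to test plain TCP
reachability between the nodes. The following is only a minimal sketch:
the host names are hypothetical placeholders, and the ports (2181 for
ZooKeeper, 6123 for the JobManager RPC port) are the usual defaults and
an assumption about this setup, so substitute whatever the cluster
actually uses.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def report(endpoints):
    """Print one reachability line per (host, port) pair."""
    for host, port in endpoints:
        status = "reachable" if can_connect(host, port) else "NOT reachable"
        print(f"{host}:{port} is {status}")


# Example call (hypothetical hosts, assumed default ports):
# report([("zookeeper-host", 2181), ("jobmanager-host", 6123)])
```

Running this from each node towards every other node shows whether the
problem is one-directional, which a single check from the client would
not reveal.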
Regards,
Timo
On 15.05.18 at 05:15, Jason Kania wrote:
Hi,
I am using the 1.4.2 release on Ubuntu and attempting to make use of
an HA Job Manager, but unfortunately enabling the HA functionality
prevents job submission, which fails with the following error:
java.lang.RuntimeException: Failed to retrieve JobManager address
    at org.apache.flink.client.program.ClusterClient.getJobManagerAddress(ClusterClient.java:308)
    at org.apache.flink.client.program.StandaloneClusterClient.getClusterIdentifier(StandaloneClusterClient.java:86)
    at org.apache.flink.client.CliFrontend.createClient(CliFrontend.java:921)
    at org.apache.flink.client.CliFrontend.run(CliFrontend.java:264)
    at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1054)
    at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1101)
    at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1098)
    at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
    at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1098)
Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not retrieve the leader address and leader session ID.
    at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderConnectionInfo(LeaderRetrievalUtils.java:113)
    at org.apache.flink.client.program.ClusterClient.getJobManagerAddress(ClusterClient.java:302)
    ... 8 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [60000 milliseconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.result(package.scala:190)
    at scala.concurrent.Await.result(package.scala)
    at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderConnectionInfo(LeaderRetrievalUtils.java:111)
    ... 9 more
This also seems to be tied to problems with TaskManager registration.
I have to repeatedly restart the TaskManager until it finally connects
to the Job Manager. Most times it doesn't connect and doesn't complain,
which makes determining the root cause more difficult. The cluster is
not busy, and I have tried both IP addresses and host names to
determine whether name resolution issues were the cause, but the
behaviour is the same in both cases.
I have also noticed that if 2 job managers are launched on different
nodes in the same cluster, they both log that they are the leader, so
they are not talking to each other effectively. The logging does not
even indicate that they are attempting to talk with one another.
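One thing worth ruling out when two job managers both claim leadership
is that their HA settings differ between nodes. Assuming the HA setup
is ZooKeeper-based, both nodes would need to point at the same quorum
and the same HA paths to see each other's leader election. The sketch
below compares the high-availability entries of two copies of
flink-conf.yaml. The function names are hypothetical helpers, and the
parsing is a simple key/value split rather than a full YAML parser.

```python
def read_ha_settings(path):
    """Return the high-availability* entries from a flink-conf.yaml-style file."""
    settings = {}
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.split("#", 1)[0].strip()  # drop trailing comments
            if ":" not in line:
                continue
            # Split on the first colon only, so values such as
            # "zk1:2181,zk2:2181" stay intact.
            key, value = line.split(":", 1)
            key = key.strip()
            if key.startswith("high-availability"):
                settings[key] = value.strip()
    return settings


def diff_settings(a, b):
    """Return {key: (value_in_a, value_in_b)} for keys whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Example call (hypothetical paths to the config copied from each node):
# print(diff_settings(read_ha_settings("node1/flink-conf.yaml"),
#                     read_ha_settings("node2/flink-conf.yaml")))
```

An empty result means the HA entries match on both nodes, which points
back towards connectivity rather than configuration.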
Lastly, the error "Could not retrieve the leader address and leader
session ID." is a very poor error message because it does not say
where it is attempting to get the information from.
Any suggestions would be appreciated.
Thanks,
Jason