Hi Francisco, have you set the right high-availability configuration options in your client configuration as described here [1]? If not, then Flink is not able to find the correct JobManager because it retrieves the address as well as a fencing token (called leader session id) from the HA store (ZooKeeper).
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/mesos.html#high-availability Cheers, Till On Thu, Jul 27, 2017 at 6:20 PM, Francisco Gonzalez Barea < francisco.gonza...@piksel.com> wrote: > Hello, > > We´re having lot of issues while trying to submit a job remotely using the > Flink CLI command line tool. We have tried different configurations but in > all of them we get errors from AKKA while trying to connect. I will try to > summarise the configurations we´ve tried. > > - Flink 1.3.0 deployed within a docker container on a Mesos cluster (using > Marathon) > - This flink has the property jobmanager.rpc.address as a hostname (i.e. > kind of ip-XXXXXXXXX.eu.west-1.compute.internal) > - Use the same version for Flink Client remotely (e.g. in my laptop). > > When I try to submit the job using the command flink run -m > myHostName:myPort (the same in jobmanager.rpc.address and > jobmanager.rpc.port) after some time waiting I get the trace at the end of > this email. In the flink side we get this error from AKKA: > > Association with remote system [akka.tcp://flink@10.203.23.24:24469] has > failed, address is now gated for [5000] ms. Reason: [Association failed > with [akka.tcp://flink@10.203.23.24:24469]] Caused by: [Connection > refused: /10.203.23.24:24469] > > After reading a bit, it seems there´re some problems related to akka > resolving hostnames to ips, so we decided to startup the same flink but > changing jobmanager.rpc.address to have the direct ip (i.e. kind of > XX.XXX.XX.XX). In this case I´m getting same trace (at the end of the > email) from the client side and this one from the Flink server: > > Discard message > LeaderSessionMessage(00000000-0000-0000-0000-000000000000,SubmitJob(JobGraph(jobId: > b25d5c5ced962632abc5ee9ef867792e),DETACHED)) because the expected leader > session ID b4f53899-5d70-467e-8e9d-e56eeb60b6e3 did not equal the > received leader session ID 00000000-0000-0000-0000-000000000000. > > We have tried some other stuff but without success… any clue that could > help us? > > Thanks in advance! > > org.apache.flink.client.program.ProgramInvocationException: The program > execution failed: JobManager did not respond within 60000 milliseconds > at org.apache.flink.client.program.ClusterClient. > runDetached(ClusterClient.java:454) > at org.apache.flink.client.program.StandaloneClusterClient.submitJob( > StandaloneClusterClient.java:99) > at org.apache.flink.client.program.ClusterClient.run( > ClusterClient.java:400) > at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute( > DetachedEnvironment.java:76) > at org.apache.flink.client.program.ClusterClient.run( > ClusterClient.java:345) > at org.apache.flink.client.CliFrontend.executeProgram( > CliFrontend.java:831) > at org.apache.flink.client.CliFrontend.run(CliFrontend.java:256) > at org.apache.flink.client.CliFrontend.parseParameters( > CliFrontend.java:1073) > at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1120) > at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1117) > at org.apache.flink.runtime.security.HadoopSecurityContext$1.run( > HadoopSecurityContext.java:43) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at org.apache.hadoop.security.UserGroupInformation.doAs( > UserGroupInformation.java:1548) > at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured( > HadoopSecurityContext.java:40) > at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1116) > Caused by: org.apache.flink.runtime.client.JobTimeoutException: > JobManager did not respond within 60000 milliseconds > at org.apache.flink.runtime.client.JobClient.submitJobDetached(JobClient. > java:426) > at org.apache.flink.client.program.ClusterClient. > runDetached(ClusterClient.java:451) > ... 15 more > Caused by: java.util.concurrent.TimeoutException: Futures timed out after > [60000 milliseconds] > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) > at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn( > BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:190) > at scala.concurrent.Await.result(package.scala) > at org.apache.flink.runtime.client.JobClient.submitJobDetached(JobClient. > java:423) > ... 16 more > > > > This message is private and confidential. If you have received this > message in error, please notify the sender or serviced...@piksel.com and > remove it from your system. > > Piksel Inc is a company registered in the United States, 2100 Powers Ferry > Road SE, Suite 400, Atlanta, GA 30339 >