Hey Gary,

Yes, I was still running with the `-m` flag on my dev machine -- it's partially configured like prod, but without the HA stuff. I never thought it could be a problem, since even the web interface can redirect from the secondary back to the primary.
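For reference, this is roughly the HA section I'll be double-checking in the client's flink-conf.yaml -- just a sketch with placeholder hosts and paths, not my actual values:

```yaml
# flink-conf.yaml on the machine running `bin/flink run` (placeholder values)
high-availability: zookeeper
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
high-availability.storageDir: hdfs:///flink/ha/
high-availability.zookeeper.path.root: /flink
# the range I set, which (per my earlier email below) also moves the RPC port:
high-availability.jobmanager.port: 50010-50015
```

And I'll submit without `-m` (e.g. just `bin/flink run -d my-job.jar`) so the client resolves the leader through ZooKeeper.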
Currently I'm still running 1.4.0 (and I plan to upgrade to 1.4.2 as soon as I can fix this). I'll try again with HA/ZooKeeper properly set up on my machine and, if it still balks, I'll send the (updated) logs.

On Thu, May 3, 2018 at 9:36 AM, Gary Yao <g...@data-artisans.com> wrote:

> Hi Julio,
>
> Are you using the -m flag of "bin/flink run" by any chance? In HA mode,
> you cannot manually specify the JobManager address. The client determines
> the leader through ZooKeeper. If you did not configure the ZooKeeper
> quorum in the flink-conf.yaml on the machine from which you are
> submitting, this might explain the error message.
>
> > But that didn't solve my problem. So far, the `flink run` still fails
> > with the same message (I'm adding the full stacktrace of the failure in
> > the end, just in case), but now I'm also seeing this message in the
> > JobManager logs:
>
> Unfortunately, the error message in your previous email is different. If
> the above does not solve your problem, can you attach the logs of the
> client and JobManager?
>
> Lastly, what Flink version are you running?
>
> Best,
> Gary
>
> On Wed, May 2, 2018 at 6:51 PM, Julio Biason <julio.bia...@azion.com>
> wrote:
>
>> Hey guys and gals,
>>
>> So, after a bit more digging, I found out that once HA is enabled,
>> `jobmanager.rpc.port` is also ignored (along with
>> `jobmanager.rpc.address`, but I was expecting that). Because I set
>> `high-availability.jobmanager.port` to `50010-50015`, my RPC port also
>> changed (the docs made me think this would only affect the HA
>> communication, not ALL communication). This can be checked on the
>> Dashboard, under the JobManager configuration option.
>>
>> But that didn't solve my problem.
>> So far, the `flink run` still fails with the same message (I'm adding
>> the full stacktrace of the failure in the end, just in case), but now
>> I'm also seeing this message in the JobManager logs:
>>
>> 2018-05-02 16:44:32,373 WARN org.apache.flink.runtime.jobmanager.JobManager
>> - Discard message LeaderSessionMessage(00000000-0000-0000-0000-000000000000,
>> SubmitJob(JobGraph(jobId: 42a25752ab085117a21c02d3db54777e),DETACHED))
>> because the expected leader session ID c01eba4f-44e2-4c65-85d5-a9a05ceba28e
>> did not equal the received leader session ID
>> 00000000-0000-0000-0000-000000000000.
>>
>> So, I'm still lost on where to go forward.
>>
>> Failure when using `flink run`:
>>
>> org.apache.flink.client.program.ProgramInvocationException: The program
>> execution failed: JobManager did not respond within 60000 ms
>>   at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:524)
>>   at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:103)
>>   at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
>>   at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
>>   at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:402)
>>   at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:802)
>>   at org.apache.flink.client.CliFrontend.run(CliFrontend.java:282)
>>   at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1054)
>>   at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1101)
>>   at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1098)
>>   at java.security.AccessController.doPrivileged(Native Method)
>>   at javax.security.auth.Subject.doAs(Subject.java:422)
>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>>   at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>>   at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1098)
>> Caused by: org.apache.flink.runtime.client.JobTimeoutException:
>> JobManager did not respond within 60000 ms
>>   at org.apache.flink.runtime.client.JobClient.submitJobDetached(JobClient.java:437)
>>   at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:516)
>>   ... 14 more
>> Caused by: java.util.concurrent.TimeoutException
>>   at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
>>   at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
>>   at org.apache.flink.runtime.client.JobClient.submitJobDetached(JobClient.java:435)
>>   ... 15 more
>>
>> On Wed, May 2, 2018 at 9:52 AM, Julio Biason <julio.bia...@azion.com>
>> wrote:
>>
>>> Hello all,
>>>
>>> I'm building a standalone cluster with HA JobManager. So far,
>>> everything seems to work, but when I try to `flink run` my job, it
>>> fails with the following error:
>>>
>>> Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException:
>>> Could not retrieve the leader gateway.
>>>
>>> So far, I have two different machines running the JobManager and,
>>> looking at the logs, I can't see any problem whatsoever to explain why
>>> the flink command is refusing to run the pipeline...
>>>
>>> Any ideas where I should look?

--
*Julio Biason*, Software Engineer
*AZION* | Deliver. Accelerate. Protect.
Office: +55 51 3083 8101 | Mobile: +55 51 *99907 0554*