Hi Julio,

I agree that job submission should work in HA mode even if you manually
specify the JobManager. At a minimum, a proper error message should be
shown. Feel free to open an issue in JIRA.
You already stated that you can maintain multiple configuration
directories as a workaround. You can switch between them by setting the
FLINK_CONF_DIR environment variable, e.g.,

    FLINK_CONF_DIR=/path/to/conf-dir-1 bin/flink run ...
    FLINK_CONF_DIR=/path/to/conf-dir-2 bin/flink run ...

Starting with 1.5, this should be a non-issue because job submission
happens through HTTP and every non-leading master redirects requests to
the leading master.

Best,
Gary

On Thu, May 3, 2018 at 10:23 PM, Julio Biason <julio.bia...@azion.com> wrote:

> Hey Gary (again),
>
> Yup, that worked. Now I can launch apps again.
>
> ... but that's not something actually good.
>
> I mean, I have my own test environment, which doesn't need HA -- after
> all, I don't need to worry about this, this is a framework job, not my
> pipeline job. Which means now I'll need to either keep two different
> configuration files and keep switching between them -- because the
> `flink` command does not accept a configuration file (or, at least,
> it's not listed on `--help`) -- or I'll have to first copy to the
> prod/staging machines and then run there -- which seems a waste, since
> it seems using `flink run -m` already adds the file in the blob server
> and then runs and, doing the copy-to-machine step means there are two
> copies going on.
>
> I mean, if I say "hey flink, run this job _there_", flink should be
> smart enough to read how "there" is running things and adjust itself.
> The environment which started the run may not follow the same rules as
> the target run machine.
>
> ... and, in the end, it seems this is mostly useless discussion, as
> JobManager is changing completely on 1.5 -- but I kinda worry if I will
> have the same issue with the new ResourceManager...
>
> On Thu, May 3, 2018 at 11:00 AM, Julio Biason <julio.bia...@azion.com> wrote:
>
>> Hey Gary,
>>
>> Yes, I was still running with the `-m` flag on my dev machine --
>> partially configured like prod, but without the HA stuff. I never
>> thought it could be a problem, since even the web interface can
>> redirect from the secondary back to primary.
>>
>> Currently I'm still running 1.4.0 (and I plan to upgrade to 1.4.2 as
>> soon as I can fix this).
>>
>> I'll try again with the HA/ZooKeeper properly set up on my machine
>> and, if it still balks, I'll send the (updated) logs.
>>
>> On Thu, May 3, 2018 at 9:36 AM, Gary Yao <g...@data-artisans.com> wrote:
>>
>>> Hi Julio,
>>>
>>> Are you using the -m flag of "bin/flink run" by any chance? In HA
>>> mode, you cannot manually specify the JobManager address. The client
>>> determines the leader through ZooKeeper. If you did not configure the
>>> ZooKeeper quorum in the flink-conf.yaml on the machine from which you
>>> are submitting, this might explain the error message.
>>>
>>> > But that didn't solve my problem. So far, the `flink run` still
>>> > fails with the same message (I'm adding the full stacktrace of the
>>> > failure in the end, just in case), but now I'm also seeing this
>>> > message in the JobManager logs:
>>>
>>> Unfortunately, the error message in your previous email is different.
>>> If the above does not solve your problem, can you attach the logs of
>>> the client and JobManager?
>>>
>>> Lastly, what Flink version are you running?
>>>
>>> Best,
>>> Gary
>>>
>>> On Wed, May 2, 2018 at 6:51 PM, Julio Biason <julio.bia...@azion.com> wrote:
>>>
>>>> Hey guys and gals,
>>>>
>>>> So, after a bit more digging, I found out that once HA is enabled,
>>>> `jobmanager.rpc.port` is also ignored (along with
>>>> `jobmanager.rpc.address`, but I was expecting this).
>>>> Because I set the `high-availability.jobmanager.port` to
>>>> `50010-50015`, my RPC port also changed (the docs made me think this
>>>> would only affect the HA communication, not ALL communications).
>>>> This can be checked on the Dashboard, under the JobManager
>>>> configuration option.
>>>>
>>>> But that didn't solve my problem. So far, the `flink run` still
>>>> fails with the same message (I'm adding the full stacktrace of the
>>>> failure in the end, just in case), but now I'm also seeing this
>>>> message in the JobManager logs:
>>>>
>>>> 2018-05-02 16:44:32,373 WARN org.apache.flink.runtime.jobmanager.JobManager - Discard message LeaderSessionMessage(00000000-0000-0000-0000-000000000000,SubmitJob(JobGraph(jobId: 42a25752ab085117a21c02d3db54777e),DETACHED)) because the expected leader session ID c01eba4f-44e2-4c65-85d5-a9a05ceba28e did not equal the received leader session ID 00000000-0000-0000-0000-000000000000.
>>>>
>>>> So, I'm still lost on where to go forward.
>>>>
>>>> Failure when using `flink run`:
>>>>
>>>> org.apache.flink.client.program.ProgramInvocationException: The program execution failed: JobManager did not respond within 60000 ms
>>>>     at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:524)
>>>>     at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:103)
>>>>     at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
>>>>     at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
>>>>     at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:402)
>>>>     at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:802)
>>>>     at org.apache.flink.client.CliFrontend.run(CliFrontend.java:282)
>>>>     at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1054)
>>>>     at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1101)
>>>>     at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1098)
>>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>>     at javax.security.auth.Subject.doAs(Subject.java:422)
>>>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>>>>     at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>>>>     at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1098)
>>>> Caused by: org.apache.flink.runtime.client.JobTimeoutException: JobManager did not respond within 60000 ms
>>>>     at org.apache.flink.runtime.client.JobClient.submitJobDetached(JobClient.java:437)
>>>>     at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:516)
>>>>     ... 14 more
>>>> Caused by: java.util.concurrent.TimeoutException
>>>>     at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
>>>>     at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
>>>>     at org.apache.flink.runtime.client.JobClient.submitJobDetached(JobClient.java:435)
>>>>     ... 15 more
>>>>
>>>> On Wed, May 2, 2018 at 9:52 AM, Julio Biason <julio.bia...@azion.com> wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>> I'm building a standalone cluster with HA JobManager. So far,
>>>>> everything seems to work, but when I try to `flink run` my job, it
>>>>> fails with the following error:
>>>>>
>>>>> Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not retrieve the leader gateway.
>>>>>
>>>>> So far, I have two different machines running the JobManager and,
>>>>> looking at the logs, I can't see any problem whatsoever to explain
>>>>> why the flink command is refusing to run the pipeline...
>>>>>
>>>>> Any ideas where I should look?
>>>>>
>>>>> --
>>>>> *Julio Biason*, Software Engineer
>>>>> *AZION* | Deliver. Accelerate. Protect.
>>>>> Office: +55 51 3083 8101 | Mobile: +55 51 99907 0554
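
A minimal sketch of the FLINK_CONF_DIR workaround mentioned at the top of this thread, assuming one configuration directory per target cluster. The function names, the target names `dev` and `prod`, and the FLINK_BIN override are hypothetical and only for illustration; the directory paths are the placeholders from the example above, and which directory holds the HA settings is an assumption.

```shell
#!/bin/sh
# Hypothetical mapping from a target name to a Flink configuration directory.
# The paths are placeholders; which one contains the HA/ZooKeeper settings
# is an assumption made for this sketch.
flink_conf_dir_for() {
    case "$1" in
        dev)  echo "/path/to/conf-dir-1" ;;  # assumed: local setup without HA
        prod) echo "/path/to/conf-dir-2" ;;  # assumed: HA setup with ZooKeeper quorum
        *)    echo "unknown target: $1" >&2; return 1 ;;
    esac
}

# Run the Flink client with FLINK_CONF_DIR set for this one invocation only.
# Usage: flink_for <target> <flink arguments...>
# FLINK_BIN is a hypothetical override for the client path (default: bin/flink).
flink_for() {
    target="$1"
    shift
    conf_dir="$(flink_conf_dir_for "$target")" || return 1
    FLINK_CONF_DIR="$conf_dir" "${FLINK_BIN:-bin/flink}" "$@"
}
```

With these definitions, `flink_for prod run ...` would start `bin/flink run ...` against the second configuration directory without exporting FLINK_CONF_DIR in the calling shell, so the next submission can target a different cluster.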