Re: Cannot submit jobs on a HA Standalone JobManager

Gary Yao Sat, 05 May 2018 01:20:03 -0700

Hi Julio,

I agree that job submission should work in HA mode if you manually specify
the JobManager. At a minimum, a proper error message should be shown. Feel
free to open an issue in JIRA.


You already stated that you can maintain multiple configuration directories
as a workaround. It is possible to switch between them by setting the
FLINK_CONF_DIR environment variable, e.g.,

  FLINK_CONF_DIR=/path/to/conf-dir-1 bin/flink run ...
  FLINK_CONF_DIR=/path/to/conf-dir-2 bin/flink run ...
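
If you switch often, the two commands above could be wrapped in a small shell
helper. This is only a sketch under assumptions: the function name
flink_with_conf, the CONF_ROOT and FLINK_BIN variables, and the
conf-<name> directory naming are made up for illustration and are not part
of Flink. Only FLINK_CONF_DIR and bin/flink come from the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: choose a configuration directory by name and run
# the flink CLI with FLINK_CONF_DIR pointing at it.
# CONF_ROOT, FLINK_BIN and the "conf-<name>" layout are assumptions.
CONF_ROOT="${CONF_ROOT:-/path/to}"
FLINK_BIN="${FLINK_BIN:-bin/flink}"

flink_with_conf() {
  local env_name="$1"
  shift
  local conf_dir="${CONF_ROOT}/conf-${env_name}"

  # Refuse to run if the directory has no flink-conf.yaml, so a typo in the
  # name does not silently fall back to some other configuration.
  if [ ! -f "${conf_dir}/flink-conf.yaml" ]; then
    echo "no flink-conf.yaml in ${conf_dir}" >&2
    return 1
  fi

  # The variable is passed to this one invocation only.
  FLINK_CONF_DIR="${conf_dir}" "${FLINK_BIN}" "$@"
}
```

With this in place, something like "flink_with_conf dir-1 run ..." would pick
up /path/to/conf-dir-1, so the HA and non-HA settings stay in separate
directories and the choice is made per command.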

Starting with 1.5, this should be a non-issue because job submission happens
through HTTP and every non-leading master redirects requests to the leading
master.

Best,
Gary

On Thu, May 3, 2018 at 10:23 PM, Julio Biason <julio.bia...@azion.com>
wrote:

> Hey Gary (again),
>
> Yup, that worked. Now I can launch apps again.
>
> ... but that's not something actually good.
>
> I mean, I have my own test environment, which doesn't need HA -- after
> all, I don't need to worry about this, this is a framework job, not my
> pipeline job. Which means now I'll need to either keep two different
> configuration files and keep switching between them -- because the `flink`
> command doesn't accept a configuration file (or, at least, it's not
> listed on `--help`) or I'll have to first copy to the prod/staging machines
> and then run there -- which seems a waste, since it seems using `flink run
> -m` already adds the file in the blob server and then runs and, doing the
> copy-to-machine step means there are two copies going on.
>
> I mean, if I say "hey flink, run this job _there_", flink should be smart
> enough to read how "there" is running things and adjust itself. The
> environment which started the run may not follow the same rules as the
> target run machine.
>
> ... and, in the end, it seems this is mostly useless discussion, as
> JobManager is changing completely on 1.5 -- but I kinda worry if I will
> have the same issue with the new ResourceManager...
>
> On Thu, May 3, 2018 at 11:00 AM, Julio Biason <julio.bia...@azion.com>
> wrote:
>
>> Hey Gary,
>>
>> Yes, I was still running with the `-m` flag on my dev machine --
>> partially configured like prod, but without the HA stuff. I never thought
>> it could be a problem, since even the web interface can redirect from the
>> secondary back to primary.
>>
>> Currently I'm still running 1.4.0 (and I plan to upgrade to 1.4.2 as soon
>> as I can fix this).
>>
>> I'll try again with the HA/ZooKeeper properly set up on my machine and,
>> if it still balks, I'll send the (updated) logs.
>>
>> On Thu, May 3, 2018 at 9:36 AM, Gary Yao <g...@data-artisans.com> wrote:
>>
>>> Hi Julio,
>>>
>>> Are you using the -m flag of "bin/flink run" by any chance? In HA mode,
>>> you cannot manually specify the JobManager address. The client determines
>>> the leader through ZooKeeper. If you did not configure the ZooKeeper
>>> quorum in the flink-conf.yaml on the machine from which you are
>>> submitting, this might explain the error message.
>>>
>>> > But that didn't solve my problem. So far, the `flink run` still fails
>>> > with the same message (I'm adding the full stacktrace of the failure in
>>> > the end, just in case), but now I'm also seeing this message in the
>>> > JobManager logs:
>>>
>>> Unfortunately, the error message in your previous email is different. If
>>> the above does not solve your problem, can you attach the logs of the
>>> client and JobManager?
>>>
>>> Lastly, what Flink version are you running?
>>>
>>> Best,
>>> Gary
>>>
>>> On Wed, May 2, 2018 at 6:51 PM, Julio Biason <julio.bia...@azion.com>
>>> wrote:
>>>
>>>> Hey guys and gals,
>>>>
>>>> So, after a bit more digging, I found out that once HA is enabled,
>>>> `jobmanager.rpc.port` is also ignored (along with `jobmanager.rpc.address`,
>>>> but I was expecting this). Because I set the
>>>> `high-availability.jobmanager.port`
>>>> to `50010-50015`, my RPC port also changed (the docs made me think this
>>>> would only affect the HA communication, not ALL communications). This can
>>>> be checked on the Dashboard, under the JobManager configuration option.
>>>>
>>>> But that didn't solve my problem. So far, the `flink run` still fails
>>>> with the same message (I'm adding the full stacktrace of the failure in the
>>>> end, just in case), but now I'm also seeing this message in the JobManager
>>>> logs:
>>>>
>>>> 2018-05-02 16:44:32,373 WARN  org.apache.flink.runtime.jobmanager.JobManager                - Discard message LeaderSessionMessage(00000000-0000-0000-0000-000000000000,SubmitJob(JobGraph(jobId: 42a25752ab085117a21c02d3db54777e),DETACHED)) because the expected leader session ID c01eba4f-44e2-4c65-85d5-a9a05ceba28e did not equal the received leader session ID 00000000-0000-0000-0000-000000000000.
>>>>
>>>>
>>>> So, I'm still lost on where to go forward.
>>>>
>>>>
>>>> Failure when using `flink run`:
>>>>
>>>> org.apache.flink.client.program.ProgramInvocationException: The program execution failed: JobManager did not respond within 60000 ms
>>>>         at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:524)
>>>>         at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:103)
>>>>         at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
>>>>         at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
>>>>         at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:402)
>>>>         at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:802)
>>>>         at org.apache.flink.client.CliFrontend.run(CliFrontend.java:282)
>>>>         at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1054)
>>>>         at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1101)
>>>>         at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1098)
>>>>         at java.security.AccessController.doPrivileged(Native Method)
>>>>         at javax.security.auth.Subject.doAs(Subject.java:422)
>>>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>>>>         at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>>>>         at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1098)
>>>> Caused by: org.apache.flink.runtime.client.JobTimeoutException: JobManager did not respond within 60000 ms
>>>>         at org.apache.flink.runtime.client.JobClient.submitJobDetached(JobClient.java:437)
>>>>         at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:516)
>>>>         ... 14 more
>>>> Caused by: java.util.concurrent.TimeoutException
>>>>         at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
>>>>         at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
>>>>         at org.apache.flink.runtime.client.JobClient.submitJobDetached(JobClient.java:435)
>>>>         ... 15 more
>>>>
>>>>
>>>> On Wed, May 2, 2018 at 9:52 AM, Julio Biason <julio.bia...@azion.com>
>>>> wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>> I'm building a standalone cluster with HA JobManager. So far,
>>>>> everything seems to work, but when I try to `flink run` my job, it fails
>>>>> with the following error:
>>>>>
>>>>> Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not retrieve the leader gateway.
>>>>>
>>>>> So far, I have two different machines running the JobManager and,
>>>>> looking at the logs, I can't see any problem whatsoever to explain why the
>>>>> flink command is refusing to run the pipeline...
>>>>>
>>>>> Any ideas where I should look?
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>
>
> --
> *Julio Biason*, Software Engineer
> *AZION*  |  Deliver. Accelerate. Protect.
> Office: +55 51 3083 8101 <callto:+555130838101>  |  Mobile: +55 51
> <callto:+5551996209291>*99907 0554*
>
