Hey Gary,

Yes, I was still running with the `-m` flag on my dev machine -- it's partially configured like prod, but without the HA stuff. I never thought it could be a problem, since even the web interface can redirect from the secondary back to the primary.
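For reference, this is roughly the HA section I'll be double-checking in the client's flink-conf.yaml -- just a sketch with placeholder hosts and paths, not my actual values:

```yaml
# flink-conf.yaml on the machine running `bin/flink run` (placeholder values)
high-availability: zookeeper
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
high-availability.storageDir: hdfs:///flink/ha/
high-availability.zookeeper.path.root: /flink
# the range I set, which (per my earlier email below) also moves the RPC port:
high-availability.jobmanager.port: 50010-50015
```

And I'll submit without `-m` (e.g. just `bin/flink run -d my-job.jar`) so the client resolves the leader through ZooKeeper.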
Currently I'm still running 1.4.0 (and I plan to upgrade to 1.4.2 as soon as I can fix this). I'll try again with HA/ZooKeeper properly set up on my machine and, if it still balks, I'll send the (updated) logs.

On Thu, May 3, 2018 at 9:36 AM, Gary Yao <g...@data-artisans.com> wrote:

> Hi Julio,
>
> Are you using the -m flag of "bin/flink run" by any chance? In HA mode,
> you cannot manually specify the JobManager address. The client determines
> the leader through ZooKeeper. If you did not configure the ZooKeeper
> quorum in the flink-conf.yaml on the machine from which you are
> submitting, this might explain the error message.
>
> > But that didn't solve my problem. So far, the `flink run` still fails
> > with the same message (I'm adding the full stacktrace of the failure in
> > the end, just in case), but now I'm also seeing this message in the
> > JobManager logs:
>
> Unfortunately, the error message in your previous email is different. If
> the above does not solve your problem, can you attach the logs of the
> client and JobManager?
>
> Lastly, what Flink version are you running?
>
> Best,
> Gary
>
> On Wed, May 2, 2018 at 6:51 PM, Julio Biason <julio.bia...@azion.com>
> wrote:
>
>> Hey guys and gals,
>>
>> So, after a bit more digging, I found out that once HA is enabled,
>> `jobmanager.rpc.port` is also ignored (along with
>> `jobmanager.rpc.address`, but I was expecting that). Because I set
>> `high-availability.jobmanager.port` to `50010-50015`, my RPC port also
>> changed (the docs made me think this would only affect the HA
>> communication, not ALL communication). This can be checked on the
>> Dashboard, under the JobManager configuration option.
>>
>> But that didn't solve my problem.
>> So far, the `flink run` still fails with the same message (I'm adding
>> the full stacktrace of the failure in the end, just in case), but now
>> I'm also seeing this message in the JobManager logs:
>>
>> 2018-05-02 16:44:32,373 WARN org.apache.flink.runtime.jobmanager.JobManager
>> - Discard message LeaderSessionMessage(00000000-0000-0000-0000-000000000000,
>> SubmitJob(JobGraph(jobId: 42a25752ab085117a21c02d3db54777e),DETACHED))
>> because the expected leader session ID c01eba4f-44e2-4c65-85d5-a9a05ceba28e
>> did not equal the received leader session ID
>> 00000000-0000-0000-0000-000000000000.
>>
>> So, I'm still lost on where to go forward.
>>
>> Failure when using `flink run`:
>>
>> org.apache.flink.client.program.ProgramInvocationException: The program
>> execution failed: JobManager did not respond within 60000 ms
>>   at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:524)
>>   at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:103)
>>   at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
>>   at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
>>   at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:402)
>>   at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:802)
>>   at org.apache.flink.client.CliFrontend.run(CliFrontend.java:282)
>>   at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1054)
>>   at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1101)
>>   at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1098)
>>   at java.security.AccessController.doPrivileged(Native Method)
>>   at javax.security.auth.Subject.doAs(Subject.java:422)
>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>>   at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>>   at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1098)
>> Caused by: org.apache.flink.runtime.client.JobTimeoutException:
>> JobManager did not respond within 60000 ms
>>   at org.apache.flink.runtime.client.JobClient.submitJobDetached(JobClient.java:437)
>>   at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:516)
>>   ... 14 more
>> Caused by: java.util.concurrent.TimeoutException
>>   at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
>>   at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
>>   at org.apache.flink.runtime.client.JobClient.submitJobDetached(JobClient.java:435)
>>   ... 15 more
>>
>> On Wed, May 2, 2018 at 9:52 AM, Julio Biason <julio.bia...@azion.com>
>> wrote:
>>
>>> Hello all,
>>>
>>> I'm building a standalone cluster with HA JobManager. So far,
>>> everything seems to work, but when I try to `flink run` my job, it
>>> fails with the following error:
>>>
>>> Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException:
>>> Could not retrieve the leader gateway.
>>>
>>> So far, I have two different machines running the JobManager and,
>>> looking at the logs, I can't see any problem whatsoever to explain why
>>> the flink command is refusing to run the pipeline...
>>>
>>> Any ideas where I should look?

--
*Julio Biason*, Software Engineer
*AZION* | Deliver. Accelerate. Protect.
Office: +55 51 3083 8101 | Mobile: +55 51 *99907 0554*