Re: Leader Retrieval Timeout with HA Job Manager

2018-05-23 Thread Till Rohrmann
Hi Jason, sorry for the late response. The logs do indeed look strange, because both JMs are granted leadership without the other having its leadership revoked. It would be interesting to take a look at the znode under `/flink/flink-ha/leader//job_manager_lock` in

Re: Leader Retrieval Timeout with HA Job Manager

2018-05-15 Thread Jason Kania
Thanks for your help. The job manager launch on two nodes of the cluster is provided, as well as the logs for the task managers: one that worked and one that could not seem to find an address, which I am assuming is for the job manager. The logs are from nodes aaa-1 and aaa-2. Thanks,

Re: Leader Retrieval Timeout with HA Job Manager

2018-05-15 Thread Till Rohrmann
Hi Jason, the client logs would indeed be very interesting to further debug this problem. What you have to make sure is that the client has the same HA configuration settings as the cluster because the client needs to talk to your ZooKeeper quorum in order to retrieve the leader address. When exe
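
To illustrate the point about matching HA settings, here is a minimal sketch (not part of the original thread) that compares the `high-availability*` entries of two `flink-conf.yaml` files, one from the client machine and one from the cluster. It assumes the flat `key: value` layout of `flink-conf.yaml`; the function names `parse_flink_conf` and `ha_mismatches` are hypothetical helpers, and the ZooKeeper port 2181 in the usage note is an assumption, not a value taken from Jason's setup.

```python
HA_PREFIX = "high-availability"


def parse_flink_conf(text):
    """Parse flat 'key: value' lines, skipping blanks and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        # Drop anything after '#', then surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        # Split on the first ':' only, so host:port values stay intact.
        key, value = line.split(":", 1)
        settings[key.strip()] = value.strip()
    return settings


def ha_mismatches(client_text, cluster_text):
    """Return {key: (client_value, cluster_value)} for differing HA keys."""
    client = parse_flink_conf(client_text)
    cluster = parse_flink_conf(cluster_text)
    keys = {k for k in client.keys() | cluster.keys() if k.startswith(HA_PREFIX)}
    return {
        k: (client.get(k), cluster.get(k))
        for k in sorted(keys)
        if client.get(k) != cluster.get(k)
    }
```

For example, calling `ha_mismatches(open(client_path).read(), open(cluster_path).read())` with two hypothetical file paths returns an empty dict when the HA entries agree. A non-empty result, such as a client whose `high-availability.zookeeper.quorum` lists only `aaa-1:2181` while the cluster lists `aaa-1:2181,aaa-2:2181`, points at a setting worth checking before looking further.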

Re: Leader Retrieval Timeout with HA Job Manager

2018-05-15 Thread Timo Walther
Can you change the log level to DEBUG and share the logs with us? Maybe Till (in CC) has some idea? Regards, Timo On 15.05.18 at 15:18, Jason Kania wrote: Hi Timo, Thanks for the response. Yes, we are running with a cloud provider, a cloud system provided by our national government for R&

Re: Leader Retrieval Timeout with HA Job Manager

2018-05-15 Thread Jason Kania
Hi Timo, Thanks for the response. Yes, we are running with a cloud provider, a cloud system provided by our national government for R&D purposes. The thing is that we also have Kafka and Cassandra on the same nodes and they have no issues in this environment; it is just Flink in an HA configu

Re: Leader Retrieval Timeout with HA Job Manager

2018-05-15 Thread Timo Walther
Hi Jason, this sounds more like a network connection/firewall issue to me. Can you tell us a bit more about your environment? Are you running your Flink cluster on a cloud provider? Regards, Timo On 15.05.18 at 05:15, Jason Kania wrote: Hi, I am using the 1.4.2 release on Ubuntu and atte

Leader Retrieval Timeout with HA Job Manager

2018-05-14 Thread Jason Kania
Hi, I am using the 1.4.2 release on Ubuntu and attempting to make use of an HA Job Manager, but unfortunately using HA functionality prevents job submission with the following error: java.lang.RuntimeException: Failed to retrieve JobManager address at org.apache.flink.client.program.Cl