Alright, try to grab the logs if you see this problem reoccurring. I would
be interested in understanding why this happens.
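
For anyone following along, a minimal HA configuration of the kind Derek describes below would look roughly like this. This is only a sketch: the hostnames, ports, and storage path are placeholders, not values from his cluster.

```yaml
# flink-conf.yaml (the same file deployed on every JM and TM node)
high-availability: zookeeper
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
high-availability.storageDir: hdfs:///flink/ha/

# pin the HA RPC ports to a fixed range so they can be opened on the firewall
high-availability.jobmanager.port: 50010-50025

# with HA enabled, leader election should make this irrelevant, but on 1.3.x
# it is still read at startup, so it must be present and resolvable
jobmanager.rpc.address: jm0
```

The `conf/masters` file then lists one `host:webui-port` entry per jobmanager:

```
jm0:8081
jm1:8081
```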

Cheers,
Till

On Fri, May 18, 2018 at 9:45 PM, Derek VerLee <derekver...@gmail.com> wrote:

> Till,
>
> Thanks for the response.  Sorry for the delayed reply.
>
> The Flink version is 1.3.2, in standalone mode.  We'll probably upgrade
> to 1.4, or directly to 1.5 once it is released in the very near future. I
> intend to migrate to running it on our Kubernetes cluster, and I will
> probably run just one job manager, as that seems to be the most frequent
> recommendation.
>
> I'm not sure I have logs anymore ... we are very actively working against
> our development environment, and debug logs were crashing our log
> aggregation service, so I had to stop forwarding them and turn on an
> aggressive log rotate.  We've been crunched under a deadline for our first
> anomaly detection pipeline.
>
> At the time, nothing much jumped out in the logs to help me, except that
> I do remember seeing some messages that seemed to be looking for an "akka
> leader" at whatever I put into the job manager RPC address.  I have this
> in my search history: "akka.actor.ActorNotFound".
> Sorry I don't have anything more useful.
>
>
> On 5/13/18 3:50 PM, Till Rohrmann wrote:
>
> Hi Derek,
>
> given that you've started the different Flink cluster components all with
> the same HA enabled configuration, the TMs should be able to connect to jm1
> after you've killed jm0. The jobmanager.rpc.address should not be used when
> HA mode is enabled.
>
> In order to get to the bottom of the described problem, it would be
> tremendously helpful to get access to the logs of all components (jm0, jm1
> and the TMs). Additionally, it would be good to know which Flink version
> you're using.
>
> Cheers,
> Till
>
> On Mon, May 7, 2018 at 2:38 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Hi Derek,
>>
>> 1. I've created a JIRA issue to improve the docs as you recommended [1].
>>
>> 2. This discussion goes quite a bit into the internals of the HA setup.
>> Let me pull in Till (in CC) who knows the details of HA.
>>
>> Best, Fabian
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-9309
>>
>> 2018-05-05 15:34 GMT+02:00 Derek VerLee <derekver...@gmail.com>:
>>
>>> Two things:
>>>
>>> 1. It would be beneficial, I think, to drop a line somewhere in the docs
>>> (probably on the production readiness checklist as well as the HA page)
>>> explaining that enabling zookeeper high-availability allows your jobs
>>> to restart automatically after a jobmanager crash or restart.  We had spent
>>> some cycles trying to implement job restarting and watchdogs (poorly) before
>>> I discovered this from a Flink Forward presentation on YouTube.
>>>
>>> 2. I seem to have found some odd behavior with HA, and then found
>>> something that works, but I can't explain why.  The CliffsNotes version is
>>> that I took an existing standalone cluster with a single JM and modified
>>> it for high-availability zookeeper mode.  The same flink-conf.yaml file is
>>> used on all nodes (including the JM). This seemed to work fine: I
>>> restarted the JM (jm0) and the jobs relaunched when it came back.  Easy!
>>> Then I deployed a second JM (jm1).  I modified `masters`, set the HA RPC
>>> port range, and opened those ports on the firewall for both jobmanagers,
>>> but left `jobmanager.rpc.address` at the original value, `jm0`, on all
>>> nodes.  I then observed that jm0 worked fine; taskmanagers connected to
>>> it and jobs ran.  jm1 did not 301 me to jm0, however; it displayed a
>>> dashboard (no jobs, no TMs).  When I stopped jm0, the jobs showed up on
>>> jm1 as RESTARTING, but the taskmanagers never attached to jm1.  In the
>>> logs, all nodes, including jm1, had messages about trying to reach jm0.
>>> From the documentation and various comments I've seen,
>>> `jobmanager.rpc.address` should be ignored.  However, commenting it out
>>> entirely led to the jobmanagers crashing at boot, and setting it to
>>> `localhost` caused all the taskmanagers to log messages about trying to
>>> connect to the jobmanager at localhost.  What finally worked was to set
>>> the value on each node to that node's own hostname, even on the
>>> taskmanagers.
>>>
>>> Does this seem like a bug?
>>>
>>> Just a hunch, but is there something called an "akka leader" that is
>>> different from the jobmanager leader, and could it be somehow defaulting
>>> its value over to jobmanager.rpc.address?
>>>
>>>
>>>
>>
>
>
