Should we add a warning to the release announcements?

Fabian
On Wed, Sep 26, 2018 at 10:22 AM Robert Metzger <rmetz...@apache.org> wrote:

> Hey Jamie,
>
> we've been facing the same issue with dA Platform when running Flink
> 1.6.1. I assume a lot of people will be affected by this.
>
>
> On Tue, Sep 25, 2018 at 11:18 PM Till Rohrmann <trohrm...@apache.org>
> wrote:
>
>> Hi Jamie,
>>
>> thanks for the update on how to fix the problem. This is very helpful
>> for the rest of the community.
>>
>> The removal of the execution mode parameter (FLINK-8696) from the
>> startup scripts was actually released with Flink 1.5.0. As a result, the
>> hostname became the second parameter, so calling the startup scripts
>> with the old syntax causes the execution mode argument to be interpreted
>> as the hostname. This hostname option was, however, not properly
>> evaluated until we fixed that in Flink 1.5.4, which is why the problem
>> is only surfacing now.
>>
>> We definitely need to treat the startup scripts as a stable API as
>> well. So far we don't have good tooling that ensures we don't introduce
>> breaking changes. In the future we need to be more careful!
>>
>> Cheers,
>> Till
>>
>> On Tue, Sep 25, 2018 at 8:54 PM Jamie Grier <jgr...@lyft.com> wrote:
>>
>>> Update on this:
>>>
>>> The issue was the command being used to start the jobmanager:
>>> `jobmanager.sh start-foreground cluster`. This was a leftover command
>>> in our automation that used to be the correct way to start the JM --
>>> however, in Flink 1.5.4 that second parameter, `cluster`, is now
>>> interpreted as the hostname for the jobmanager to bind to.
>>>
>>> The solution was just to remove `cluster` from that command.
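>>>
>>> For reference, a minimal before/after of that fix (the `./bin/` script
>>> location is an assumption here; adjust it to your install):
>>>
>>>     # Old (pre-1.5.0) syntax: the argument after the action was the
>>>     # execution mode, so this used to be a valid way to start the JM.
>>>     ./bin/jobmanager.sh start-foreground cluster
>>>
>>>     # Since 1.5.0 that argument is parsed as the hostname to bind to
>>>     # (and since 1.5.4 it is actually honored), so drop it:
>>>     ./bin/jobmanager.sh start-foreground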
>>>
>>>
>>> On Tue, Sep 25, 2018 at 10:15 AM Jamie Grier <jgr...@lyft.com> wrote:
>>>
>>>> Anybody else seen this and know the solution? We're dead in the water
>>>> with Flink 1.5.4.
>>>>
>>>> On Sun, Sep 23, 2018 at 11:46 PM alex <ek.rei...@gmail.com> wrote:
>>>>
>>>>> We started to see the same errors after upgrading to Flink 1.6.0
>>>>> from 1.4.2. We have one JM and 5 TMs on Kubernetes. The JM is running
>>>>> in HA mode. TaskManagers sometimes lose their connection to the JM
>>>>> and hit the following error, just like you:
>>>>>
>>>>> *2018-09-19 12:36:40,687 INFO
>>>>> org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not
>>>>> resolve ResourceManager address
>>>>> akka.tcp://flink@flink-jobmanager:50002/user/resourcemanager,
>>>>> retrying in 10000 ms: Ask timed out on
>>>>> [ActorSelection[Anchor(akka.tcp://flink@flink-jobmanager:50002/),
>>>>> Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent
>>>>> message of type "akka.actor.Identify".*
>>>>>
>>>>> Once a TM starts reporting "Could not resolve ResourceManager", it
>>>>> cannot recover on its own until I restart the TM pod.
>>>>>
>>>>> *Here is the content of our flink-conf.yaml:*
>>>>> blob.server.port: 6124
>>>>> jobmanager.rpc.address: flink-jobmanager
>>>>> jobmanager.rpc.port: 6123
>>>>> jobmanager.heap.mb: 4096
>>>>> jobmanager.web.history: 20
>>>>> jobmanager.archive.fs.dir: s3://our_path
>>>>> taskmanager.rpc.port: 6121
>>>>> taskmanager.heap.mb: 16384
>>>>> taskmanager.numberOfTaskSlots: 10
>>>>> taskmanager.log.path: /opt/flink/log/output.log
>>>>> web.log.path: /opt/flink/log/output.log
>>>>> state.checkpoints.num-retained: 3
>>>>> metrics.reporters: prom
>>>>> metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
>>>>>
>>>>> high-availability: zookeeper
>>>>> high-availability.jobmanager.port: 50002
>>>>> high-availability.zookeeper.quorum: zookeeper_instance_list
>>>>> high-availability.zookeeper.path.root: /flink
>>>>> high-availability.cluster-id: profileservice
>>>>> high-availability.storageDir: s3://our_path
>>>>>
>>>>> Any help will be greatly appreciated!
>>>>>
>>>>>
>>>>> --
>>>>> Sent from:
>>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/