Hi Jamie,

Thanks for the update on how you fixed the problem. This is very helpful for
the rest of the community.

The change that removed the execution mode parameter (FLINK-8696) from the
startup scripts was actually released with Flink 1.5.0. As a result, the host
name became the second parameter. When the startup scripts are called with
the old syntax, the execution mode argument is interpreted as the host name.
This host name option was, however, not properly evaluated until we fixed it
in Flink 1.5.4, which is why the problem only surfaced now.
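
To illustrate the shift in positional arguments, here is a minimal sketch. It
is not the actual jobmanager.sh code: the function names `parse_old` and
`parse_new` and the `<default>` placeholder are made up for illustration, and
only the argument order follows the description above.

```shell
#!/usr/bin/env bash
# Simplified sketch of positional-argument handling; NOT the real jobmanager.sh.
# parse_old / parse_new are hypothetical helper names used only for illustration.

# Old syntax (before FLINK-8696): <action> <execution-mode> [host]
parse_old() {
    local action=$1 mode=$2 host=${3:-}
    echo "action=$action mode=$mode host=${host:-<default>}"
}

# New syntax (after FLINK-8696): <action> [host]
parse_new() {
    local action=$1 host=${2:-}
    echo "action=$action host=${host:-<default>}"
}

# Old-style call against the old parser: "cluster" is the execution mode.
parse_old start-foreground cluster

# Same old-style call against the new parser: "cluster" is taken as the host name.
parse_new start-foreground cluster

# Corrected call: no second argument, so the host falls back to the default.
parse_new start-foreground
```

With the new argument order, the leftover `cluster` lands in the host slot,
which matches what Jamie observed once the host name was actually evaluated.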

We definitely need to treat the startup scripts as a stable API as well.
So far, we don't have good tooling that ensures we don't introduce
breaking changes. In the future we need to be more careful!

Cheers,
Till

On Tue, Sep 25, 2018 at 8:54 PM Jamie Grier <jgr...@lyft.com> wrote:

> Update on this:
>
> The issue was the command being used to start the jobmanager:
> `jobmanager.sh start-foreground cluster`.  This was a command leftover in
> our automation that used to be the correct way to start the JM -- however
> now, in Flink 1.5.4, that second parameter, `cluster`, is being interpreted
> as the hostname for the jobmanager to bind to.
>
> The solution was just to remove `cluster` from that command.
>
>
>
> On Tue, Sep 25, 2018 at 10:15 AM Jamie Grier <jgr...@lyft.com> wrote:
>
>> Anybody else seen this and know the solution?  We're dead in the water
>> with Flink 1.5.4.
>>
>> On Sun, Sep 23, 2018 at 11:46 PM alex <ek.rei...@gmail.com> wrote:
>>
>>> We started to see the same errors after upgrading to Flink 1.6.0 from 1.4.2.
>>> We have one JM and 5 TMs on Kubernetes. The JM is running in HA mode.
>>> TaskManagers sometimes lose their connection to the JM and report an error
>>> like the one you have.
>>>
>>> 2018-09-19 12:36:40,687 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Could not resolve ResourceManager address akka.tcp://flink@flink-jobmanager:50002/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@flink-jobmanager:50002/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
>>>
>>> Once a TM starts reporting "Could not resolve ResourceManager", it does not
>>> recover until I restart the TM pod.
>>>
>>> *Here is the content of our flink-conf.yaml:*
>>> blob.server.port: 6124
>>> jobmanager.rpc.address: flink-jobmanager
>>> jobmanager.rpc.port: 6123
>>> jobmanager.heap.mb: 4096
>>> jobmanager.web.history: 20
>>> jobmanager.archive.fs.dir: s3://our_path
>>> taskmanager.rpc.port: 6121
>>> taskmanager.heap.mb: 16384
>>> taskmanager.numberOfTaskSlots: 10
>>> taskmanager.log.path: /opt/flink/log/output.log
>>> web.log.path: /opt/flink/log/output.log
>>> state.checkpoints.num-retained: 3
>>> metrics.reporters: prom
>>> metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
>>>
>>> high-availability: zookeeper
>>> high-availability.jobmanager.port: 50002
>>> high-availability.zookeeper.quorum: zookeeper_instance_list
>>> high-availability.zookeeper.path.root: /flink
>>> high-availability.cluster-id: profileservice
>>> high-availability.storageDir: s3://our_path
>>>
>>> Any help will be greatly appreciated!
>>>
>>
