Should we add a warning to the release announcements?

Fabian
On Wed, Sep 26, 2018 at 10:22 AM Robert Metzger <rmetz...@apache.org> wrote:

> Hey Jamie,
>
> we've been facing the same issue with dA Platform when running Flink
> 1.6.1. I assume a lot of people will be affected by this.
>
>
> On Tue, Sep 25, 2018 at 11:18 PM Till Rohrmann <trohrm...@apache.org>
> wrote:
>
>> Hi Jamie,
>>
>> thanks for the update on how to fix the problem. This is very helpful
>> for the rest of the community.
>>
>> The removal of the execution mode parameter (FLINK-8696) from the
>> startup scripts was actually released with Flink 1.5.0. As a result, the
>> hostname became the second parameter, so calling the startup scripts
>> with the old syntax causes the execution mode argument to be interpreted
>> as the hostname. This hostname option was, however, not properly
>> evaluated until we fixed that in Flink 1.5.4, which is why the problem
>> is only surfacing now.
>>
>> We definitely need to treat the startup scripts as a stable API as
>> well. So far we don't have good tooling that ensures we don't introduce
>> breaking changes. In the future we need to be more careful!
>>
>> Cheers,
>> Till
>>
>> On Tue, Sep 25, 2018 at 8:54 PM Jamie Grier <jgr...@lyft.com> wrote:
>>
>>> Update on this:
>>>
>>> The issue was the command being used to start the jobmanager:
>>> `jobmanager.sh start-foreground cluster`. This was a leftover command
>>> in our automation that used to be the correct way to start the JM --
>>> however, in Flink 1.5.4 that second parameter, `cluster`, is now
>>> interpreted as the hostname for the jobmanager to bind to.
>>>
>>> The solution was just to remove `cluster` from that command.
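>>>
>>> For reference, a minimal before/after of that fix (the `./bin/` script
>>> location is an assumption here; adjust it to your install):
>>>
>>>     # Old (pre-1.5.0) syntax: the argument after the action was the
>>>     # execution mode, so this used to be a valid way to start the JM.
>>>     ./bin/jobmanager.sh start-foreground cluster
>>>
>>>     # Since 1.5.0 that argument is parsed as the hostname to bind to
>>>     # (and since 1.5.4 it is actually honored), so drop it:
>>>     ./bin/jobmanager.sh start-foreground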
>>>
>>>
>>> On Tue, Sep 25, 2018 at 10:15 AM Jamie Grier <jgr...@lyft.com> wrote:
>>>
>>>> Anybody else seen this and know the solution? We're dead in the water
>>>> with Flink 1.5.4.
>>>>
>>>> On Sun, Sep 23, 2018 at 11:46 PM alex <ek.rei...@gmail.com> wrote:
>>>>
>>>>> We started to see the same errors after upgrading to Flink 1.6.0
>>>>> from 1.4.2. We have one JM and 5 TMs on Kubernetes. The JM is running
>>>>> in HA mode. TaskManagers sometimes lose their connection to the JM
>>>>> and hit the following error, just like you:
>>>>>
>>>>> *2018-09-19 12:36:40,687 INFO
>>>>> org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not
>>>>> resolve ResourceManager address
>>>>> akka.tcp://flink@flink-jobmanager:50002/user/resourcemanager,
>>>>> retrying in 10000 ms: Ask timed out on
>>>>> [ActorSelection[Anchor(akka.tcp://flink@flink-jobmanager:50002/),
>>>>> Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent
>>>>> message of type "akka.actor.Identify".*
>>>>>
>>>>> Once a TM starts reporting "Could not resolve ResourceManager", it
>>>>> cannot recover on its own until I restart the TM pod.
>>>>>
>>>>> *Here is the content of our flink-conf.yaml:*
>>>>> blob.server.port: 6124
>>>>> jobmanager.rpc.address: flink-jobmanager
>>>>> jobmanager.rpc.port: 6123
>>>>> jobmanager.heap.mb: 4096
>>>>> jobmanager.web.history: 20
>>>>> jobmanager.archive.fs.dir: s3://our_path
>>>>> taskmanager.rpc.port: 6121
>>>>> taskmanager.heap.mb: 16384
>>>>> taskmanager.numberOfTaskSlots: 10
>>>>> taskmanager.log.path: /opt/flink/log/output.log
>>>>> web.log.path: /opt/flink/log/output.log
>>>>> state.checkpoints.num-retained: 3
>>>>> metrics.reporters: prom
>>>>> metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
>>>>>
>>>>> high-availability: zookeeper
>>>>> high-availability.jobmanager.port: 50002
>>>>> high-availability.zookeeper.quorum: zookeeper_instance_list
>>>>> high-availability.zookeeper.path.root: /flink
>>>>> high-availability.cluster-id: profileservice
>>>>> high-availability.storageDir: s3://our_path
>>>>>
>>>>> Any help will be greatly appreciated!
>>>>>
>>>>>
>>>>> --
>>>>> Sent from:
>>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/