Alright, try to grab the logs if you see this problem recurring. I would be interested in understanding why this happens.
Cheers,
Till

On Fri, May 18, 2018 at 9:45 PM, Derek VerLee <derekver...@gmail.com> wrote:

> Till,
>
> Thanks for the response. Sorry for the delayed reply.
>
> The Flink version is 1.3.2, in standalone mode. We'll probably upgrade
> to 1.4, or directly to 1.5 once it is released in the very near future.
> I intend to migrate to running it on our Kubernetes cluster, and I will
> probably run just one job manager, as that seems to be the most frequent
> recommendation.
>
> I'm not sure I have logs anymore ... we are very actively working
> against our development environment and the debug logs were crashing our
> log aggregation service, so I had to stop forwarding them and turn on an
> aggressive log rotation. We've been crunched under a deadline for our
> first anomaly detection pipeline.
>
> At the time, nothing much jumped out in the logs to help me, except that
> I do remember seeing some messages that seemed to be looking for an
> "akka leader" at whatever I put into the job manager RPC address. I have
> "akka.actor.ActorNotFound" in my search history.
>
> Sorry I don't have something more useful.
>
>
> On 5/13/18 3:50 PM, Till Rohrmann wrote:
>
> Hi Derek,
>
> given that you've started the different Flink cluster components all
> with the same HA-enabled configuration, the TMs should be able to
> connect to jm1 after you've killed jm0. The jobmanager.rpc.address
> should not be used when HA mode is enabled.
>
> In order to get to the bottom of the described problem, it would be
> tremendously helpful to get access to the logs of all components (jm0,
> jm1 and the TMs). Additionally, it would be good to know which Flink
> version you're using.
>
> Cheers,
> Till
>
> On Mon, May 7, 2018 at 2:38 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Hi Derek,
>>
>> 1. I've created a JIRA issue to improve the docs as you recommended [1].
>>
>> 2. This discussion goes quite a bit into the internals of the HA setup.
>> Let me pull in Till (in CC), who knows the details of HA.
>>
>> Best,
>> Fabian
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-9309
>>
>> 2018-05-05 15:34 GMT+02:00 Derek VerLee <derekver...@gmail.com>:
>>
>>> Two things:
>>>
>>> 1. It would be beneficial, I think, to drop a line somewhere in the
>>> docs (probably on the production readiness checklist as well as the
>>> HA page) explaining that enabling ZooKeeper high availability allows
>>> your jobs to restart automatically after a job manager crash or
>>> restart. We had spent some cycles trying to implement job restarting
>>> and watchdogs (poorly) when I discovered this from a Flink Forward
>>> presentation on YouTube.
>>>
>>> 2. I seem to have found some odd behavior with HA, and then found
>>> something that works, but I can't explain why. The CliffsNotes
>>> version is that I took an existing standalone cluster with a single
>>> JM and reconfigured it for high-availability ZooKeeper mode. The same
>>> flink-conf.yaml file is used on all nodes (including the JM). This
>>> seemed to work fine: I restarted the JM (jm0) and the jobs relaunched
>>> when it came back. Easy! Then I deployed a second JM (jm1). I
>>> modified `masters`, set the HA RPC port range, and opened those ports
>>> on the firewall for both job managers, but left
>>> `jobmanager.rpc.address` at the original value, `jm0`, on all nodes.
>>> I then observed that jm0 worked fine: taskmanagers connected to it
>>> and jobs ran. jm1 did not 301 me to jm0, however; it displayed a
>>> dashboard (no jobs, no TMs). When I stopped jm0, the jobs showed up
>>> on jm1 as RESTARTING, but the taskmanagers never attached to jm1. In
>>> the logs, all nodes, including jm1, had messages about trying to
>>> reach jm0. From the documentation and various comments I've seen,
>>> `jobmanager.rpc.address` should be ignored.
>>> However, commenting it out entirely led to the job managers crashing
>>> at boot, and setting it to `localhost` caused all the taskmanagers to
>>> log messages about trying to connect to the job manager at localhost.
>>> What finally worked was to set the value individually on each node,
>>> including the taskmanagers, to that node's own hostname.
>>>
>>> Does this seem like a bug?
>>>
>>> Just a hunch, but is there something called an "akka leader" that is
>>> different from the jobmanager leader, and could it be somehow
>>> defaulting its value over to jobmanager.rpc.address?
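For anyone following along: a ZooKeeper HA setup of the kind Derek describes is configured in flink-conf.yaml roughly as below. This is only a sketch, not taken from the thread; the hostnames, ports, and paths are illustrative, and the key names reflect Flink 1.4-era docs, so check the configuration reference for your exact version.

```yaml
# flink-conf.yaml -- shared by all nodes (job managers and taskmanagers)

# Enable ZooKeeper-based high availability
high-availability: zookeeper

# ZooKeeper quorum used for leader election (illustrative hosts)
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181

# Durable storage for job manager metadata (illustrative path)
high-availability.storageDir: hdfs:///flink/ha/

# Fixed port range for HA RPC, so it can be opened on the firewall
high-availability.jobmanager.port: 50000-50025
```

The `masters` file on each node then lists every job manager (e.g. one `host:webui-port` line each for jm0 and jm1), and in HA mode the leader is resolved through ZooKeeper rather than through `jobmanager.rpc.address`, which is why that key is expected to be ignored.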