Hi Derek,

1. I've created a JIRA issue to improve the docs as you recommended [1].
2. This discussion goes quite a bit into the internals of the HA setup. Let me pull in Till (in CC), who knows the details of HA.

Best, Fabian

[1] https://issues.apache.org/jira/browse/FLINK-9309

2018-05-05 15:34 GMT+02:00 Derek VerLee <derekver...@gmail.com>:
> Two things:
>
> 1. It would be beneficial, I think, to drop a line somewhere in the docs
> (probably on the production readiness checklist as well as the HA page)
> explaining that enabling ZooKeeper high-availability mode allows your jobs
> to restart automatically after a jobmanager crash or restart. We had spent
> some cycles trying to implement job restarting and watchdogs (poorly) before
> I discovered this from a Flink Forward presentation on YouTube.
>
> 2. I seem to have found some odd behavior with HA, and then found something
> that works, but I can't explain why. The CliffsNotes version: I took an
> existing standalone cluster with a single JM and switched it to ZooKeeper
> high-availability mode. The same flink-conf.yaml file is used on all nodes
> (including the JM). This seemed to work fine: I restarted the JM (jm0) and
> the jobs relaunched when it came back. Easy! Then I deployed a second JM
> (jm1). I modified `masters`, set the HA RPC port range, and opened those
> ports on the firewall for both jobmanagers, but left
> `jobmanager.rpc.address` at the original value, `jm0`, on all nodes. I then
> observed that jm0 worked fine: taskmanagers connected to it and jobs ran.
> jm1 did not redirect (301) me to jm0, however; it displayed a dashboard
> with no jobs and no taskmanagers. When I stopped jm0, the jobs showed up on
> jm1 as RESTARTING, but the taskmanagers never attached to jm1. In the logs,
> all nodes, including jm1, had messages about trying to reach jm0. From the
> documentation and various comments I've seen, `jobmanager.rpc.address`
> should be ignored in HA mode. However, commenting it out entirely led to
> the jobmanagers crashing at boot, and setting it to `localhost` caused all
> the taskmanagers to log messages about trying to connect to the jobmanager
> at localhost. What finally worked was to set the value on each node,
> individually, to that node's own hostname, even on the taskmanagers.
>
> Does this seem like a bug?
>
> Just a hunch, but is there something called an "Akka leader" that is
> different from the jobmanager leader, and could it be somehow defaulting
> its value over to jobmanager.rpc.address?
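
For readers reconstructing the setup above, here is a minimal sketch of the
configuration under discussion. It assumes a two-JM standalone cluster with
hostnames jm0 and jm1, a three-node ZooKeeper quorum, and durable shared
storage for JM metadata; the hosts zk0/zk1/zk2, the port range, and the
storage path are placeholders, not values from Derek's cluster.

    # flink-conf.yaml -- identical on every node, except that the workaround
    # described above sets jobmanager.rpc.address to each node's own hostname
    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk0:2181,zk1:2181,zk2:2181
    high-availability.zookeeper.path.root: /flink
    high-availability.storageDir: hdfs:///flink/ha/
    # the HA RPC port range that must be open between all nodes
    high-availability.jobmanager.port: 50000-50025
    # documented as ignored in HA mode, but see the observations above
    jobmanager.rpc.address: jm0

The `masters` file lists one host:webui-port pair per line, one per jobmanager:

    # conf/masters
    jm0:8081
    jm1:8081

With this in place, the taskmanagers are supposed to look up the current
leader via ZooKeeper rather than via `jobmanager.rpc.address`, which is why
the behavior Derek describes looks surprising.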