I've used a multi-datacenter Consul cluster to coordinate service discovery. When a service starts up in the primary DC, it registers itself in Consul with a key that has a TTL and must be renewed periodically. If the service shuts down or terminates abruptly, the key expires and is removed from Consul. A standby service in another DC can then be started automatically once it detects that the key is absent from Consul in the primary DC. The standby could then submit a job to its Flink cluster from the most recent savepoint copied over by the offline process you mentioned. It should be pretty easy to automate all of this. I would not recommend setting up a multi-datacenter Zookeeper cluster; in my experience, Consul is much easier to work with.
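To make the idea concrete, here is a rough sketch of the register/renew/watch cycle against Consul's HTTP API, using only the Python standard library. In Consul, a "key with a TTL" is usually built from a session with a TTL and `Behavior: delete`, plus a KV entry acquired by that session. The agent address, the key name, the `dc1` datacenter name, and the `start_standby` callback are all placeholders I made up for illustration; this is not what I run in production.

```python
import json
import time
import urllib.error
import urllib.request

CONSUL = "http://127.0.0.1:8500"        # assumed address of the local Consul agent
KEY = "services/flink-primary/leader"   # hypothetical key name


def call(method, path, body=None):
    """Send one request to the Consul HTTP API; return (status, parsed JSON or None)."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(CONSUL + path, data=data, method=method)
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            raw = resp.read()
            return resp.status, (json.loads(raw) if raw else None)
    except urllib.error.HTTPError as err:
        return err.code, None       # e.g. 404 when the key does not exist
    except urllib.error.URLError:
        return None, None           # the agent itself could not be reached


def register(http=call, ttl="15s"):
    """Primary side: create a TTL session and acquire KEY with it.

    With Behavior "delete", Consul removes the key when the session expires.
    """
    _, session = http("PUT", "/v1/session/create",
                      {"Name": "flink-primary", "TTL": ttl, "Behavior": "delete"})
    _, acquired = http("PUT", f"/v1/kv/{KEY}?acquire={session['ID']}",
                       {"owner": "flink-primary"})
    return session["ID"] if acquired else None


def renew(session_id, http=call):
    """Primary side: call this well within every TTL period to keep the key alive."""
    status, _ = http("PUT", f"/v1/session/renew/{session_id}")
    return status == 200


def primary_key_present(primary_dc="dc1", http=call):
    """Standby side: True if the key exists, False if absent, None on any error."""
    status, _ = http("GET", f"/v1/kv/{KEY}?dc={primary_dc}")
    if status == 200:
        return True
    if status == 404:
        return False
    return None


def watch(start_standby, misses_needed=3, interval=5.0, errors_count=True,
          http=call, sleep=time.sleep):
    """Standby side: call start_standby() after several consecutive misses."""
    misses = 0
    while True:
        present = primary_key_present(http=http)
        if present is True:
            misses = 0
        elif present is False or errors_count:
            misses += 1
        if misses >= misses_needed:
            break
        sleep(interval)
    start_standby()
```

The primary would call `register()` once and then `renew()` on a timer; the standby would run `watch()` with a callback that submits the job from the latest copied savepoint (for example via `flink run -s <savepointPath> ...`). One design choice to think through: if the whole primary DC is down, the cross-DC query will most likely fail with an error rather than return a clean "key not found", so the sketch counts errors as misses by default. That also means a network partition between the DCs can look like a failure, so requiring several consecutive misses (or some extra confirmation) before starting the standby is worth the delay.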
Best,

--
Scott Kidder

On Mon, Jul 9, 2018 at 4:48 AM Sofer, Tovi <tovi.so...@citi.com> wrote:

> Hi all,
>
> We are now examining how to achieve high availability for Flink, and also
> how to support automatic recovery in a disaster scenario, when an entire DC
> goes down.
>
> We have DC1, where we usually want the work to be done, and DC2, which is
> more remote and where we want work to go only when DC1 is down.
>
> We examined a few options and would be glad to hear feedback or a
> suggestion for another way to achieve this.
>
> · Two separate Zookeeper and Flink clusters on the two data centers.
>
>   Only the cluster on DC1 is running, and state is copied to DC2 in an
>   offline process.
>
>   To achieve automatic recovery we need to use some kind of watchdog which
>   will check DC1 availability, and if it is down will start DC2 (and the
>   same later if DC2 is down).
>
>   Is there a recommended tool for this?
>
> · Zookeeper "stretch cluster" across data centers, with 2 nodes on DC1,
>   2 nodes on DC2 and one observer node.
>
>   Also a Flink cluster with jobmanager1 on DC1 and jobmanager2 on DC2.
>
>   This way, when DC1 is down, Zookeeper will notice this automatically and
>   will transfer work to jobmanager2 on DC2.
>
>   However, we would like the Zookeeper leader and the Flink jobmanager
>   leader (the primary one) to be from DC1, unless it is down.
>
>   Is there a way to achieve this?
>
> Thanks and regards,
>
> *Tovi Sofer*
> Software Engineer
> +972 (3) 7405756