Hi Tovi,

We run all services (Flink, Zookeeper, Hadoop HDFS, and Consul) in a Kubernetes cluster in each data center. Kubernetes will automatically restart/reschedule any services that crash or become unhealthy. This is a little outside the scope of Flink, and I'd be happy to discuss it further off-list.
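For illustration, here is a minimal Go sketch of the kind of HTTP health endpoint a Kubernetes liveness probe can poll; once the endpoint stops returning 200, the kubelet restarts the container. The /healthz path, port 8080, and the healthy flag are illustrative assumptions, not details of our actual deployment.

package main

import (
	"net/http"
	"sync/atomic"
)

// healthy would be flipped by the application once it has started up and
// connected to its dependencies (Zookeeper, HDFS, etc.).
var healthy atomic.Bool

func main() {
	healthy.Store(true)

	// Point a Kubernetes liveness probe at /healthz; anything other than a
	// 200 response (or no response at all) makes the kubelet restart the
	// container.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		if healthy.Load() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})

	http.ListenAndServe(":8080", nil)
}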
Best,

--
Scott Kidder

On Mon, Jul 16, 2018 at 5:11 AM Sofer, Tovi <tovi.so...@citi.com> wrote:

> Thank you Scott,
>
> Looks like a very elegant solution.
>
> How did you manage high availability within a single data center?
>
> Thanks,
> Tovi
>
> *From:* Scott Kidder <kidder.sc...@gmail.com>
> *Sent:* Friday, July 13, 2018 01:13
> *To:* Sofer, Tovi [ICG-IT] <ts72...@imceu.eu.ssmb.com>
> *Cc:* user@flink.apache.org
> *Subject:* Re: high availability with automated disaster recovery using zookeeper
>
> I've used a multi-datacenter Consul cluster to coordinate service discovery. When a service starts up in the primary DC, it registers itself in Consul with a key that has a TTL and must be periodically renewed. If the service shuts down or terminates abruptly, the key expires and is removed from Consul. A standby service in another DC can be started automatically after detecting the absence of the key in Consul in the primary DC. This could lead to submitting a job to the standby Flink cluster from the most recent savepoint that was copied by the offline process you mentioned. It should be pretty easy to automate all of this. I would not recommend setting up a multi-datacenter Zookeeper cluster; in my experience, Consul is much easier to work with.
>
> Best,
>
> --
> Scott Kidder
>
> On Mon, Jul 9, 2018 at 4:48 AM Sofer, Tovi <tovi.so...@citi.com> wrote:
>
> Hi all,
>
> We are now examining how to achieve high availability for Flink, and to also support automatic recovery in a disaster scenario, when an entire DC goes down.
>
> We have DC1, where we usually want work to be done, and DC2, which is more remote and where we want work to go only when DC1 is down.
>
> We examined a few options and would be glad to hear feedback or a suggestion for another way to achieve this:
>
> · Two separate Zookeeper and Flink clusters on the two data centers.
>
> Only the cluster on DC1 is running, and state is copied to DC2 in an offline process.
>
> To achieve automatic recovery we need to use some kind of watchdog which will check DC1 availability and, if it is down, will start DC2 (and the same later if DC2 is down).
>
> Is there a recommended tool for this?
>
> · Zookeeper “stretch cluster” across data centers, with 2 nodes on DC1, 2 nodes on DC2, and one observer node.
>
> Also a Flink cluster with jobmanager1 on DC1 and jobmanager2 on DC2.
>
> This way, when DC1 is down, Zookeeper will notice this automatically and will transfer work to jobmanager2 on DC2.
>
> However, we would like the Zookeeper leader and the Flink jobmanager leader (the primary one) to be on DC1, unless it is down.
>
> Is there a way to achieve this?
>
> Thanks and regards,
>
> *Tovi Sofer*
> Software Engineer
> +972 (3) 7405756
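For reference, a minimal Go sketch of the Consul pattern Scott describes above, using the official client (github.com/hashicorp/consul/api): the primary binds a key to a TTL session that it keeps renewing, and a watchdog in the standby DC polls for the key and triggers failover once it disappears (e.g. submitting the job to the standby Flink cluster from the most recent copied savepoint). The key name, TTL, and failover hook are illustrative assumptions.

package main

import (
	"log"
	"time"

	consul "github.com/hashicorp/consul/api"
)

// Illustrative key name; any agreed-upon path works.
const leaderKey = "service/flink-pipeline/primary"

// registerPrimary creates a TTL session, keeps renewing it, and binds the
// leader key to it. If the primary dies and stops renewing, Consul expires
// the session and deletes the key.
func registerPrimary(client *consul.Client, stop <-chan struct{}) error {
	sessionID, _, err := client.Session().Create(&consul.SessionEntry{
		Name:     "flink-primary",
		TTL:      "15s",
		Behavior: consul.SessionBehaviorDelete, // delete the key when the session expires
	}, nil)
	if err != nil {
		return err
	}
	// Renew the session in the background until stop is closed.
	go client.Session().RenewPeriodic("15s", sessionID, nil, stop)

	_, _, err = client.KV().Acquire(&consul.KVPair{
		Key:     leaderKey,
		Value:   []byte("dc1"),
		Session: sessionID,
	}, nil)
	return err
}

// watchPrimary runs in the standby DC and polls for the key; once it is
// gone, the standby can start its own Flink job from the latest savepoint.
func watchPrimary(client *consul.Client, failover func()) {
	for {
		pair, _, err := client.KV().Get(leaderKey, nil)
		if err != nil {
			log.Printf("consul query failed: %v", err)
		} else if pair == nil {
			failover() // e.g. run the job on the standby cluster from the latest savepoint
			return
		}
		time.Sleep(5 * time.Second)
	}
}

func main() {
	// In reality the primary DC runs registerPrimary and the standby DC runs
	// watchPrimary; they are shown together here only to keep the sketch short.
	client, err := consul.NewClient(consul.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	stop := make(chan struct{})
	defer close(stop)

	if err := registerPrimary(client, stop); err != nil {
		log.Fatal(err)
	}
	watchPrimary(client, func() {
		log.Println("primary key gone; submit job on standby cluster from latest savepoint")
	})
}

Consul attaches TTLs to sessions rather than directly to keys, hence the session-plus-Acquire step; the polling loop could equally be replaced by a blocking query on the key.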