FW: high availability with automated disaster recovery using zookeeper

Sofer, Tovi Mon, 16 Jul 2018 05:11:56 -0700

Thank you Scott,
Looks like a very elegant solution.

How did you manage high availability within a single data center?

Thanks,
Tovi

From: Scott Kidder <kidder.sc...@gmail.com>
Sent: Friday, 13 July 2018 01:13
To: Sofer, Tovi [ICG-IT] <ts72...@imceu.eu.ssmb.com>
Cc: user@flink.apache.org
Subject: Re: high availability with automated disaster recovery using zookeeper

I've used a multi-datacenter Consul cluster to coordinate service 
discovery. When a service starts up in the primary DC, it registers itself 
in Consul with a key that has a TTL and must be renewed periodically. If the 
service shuts down or terminates abruptly, the key expires and is removed 
from Consul. A standby service in another DC can be started automatically 
once it detects that the key for the primary DC is absent from Consul. The 
standby could then submit the job to the standby Flink cluster, starting from 
the most recent savepoint copied by the offline process you mentioned. It 
should be pretty easy to automate all of this. I would not recommend setting 
up a multi-datacenter Zookeeper cluster; in my experience, Consul is much 
easier to work with.
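
A minimal sketch of the watchdog logic described above, in Python. It uses an 
in-memory stand-in for the TTL key store rather than the real Consul API; 
TtlRegistry, StandbyWatchdog, the key name, and the start_standby callback are 
all hypothetical names chosen for illustration. Requiring several consecutive 
missed checks before failing over is an added assumption, meant to avoid 
reacting to a single missed renewal.

```python
import time
from typing import Callable, Dict


class TtlRegistry:
    """In-memory stand-in for a store whose keys expire (NOT the Consul API)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._expiry: Dict[str, float] = {}

    def register(self, key: str, ttl_seconds: float) -> None:
        # Registering and renewing are the same operation: push the deadline out.
        self._expiry[key] = self._clock() + ttl_seconds

    renew = register

    def is_alive(self, key: str) -> bool:
        deadline = self._expiry.get(key)
        if deadline is None or self._clock() >= deadline:
            # Expired keys are removed, as in the description above.
            self._expiry.pop(key, None)
            return False
        return True


class StandbyWatchdog:
    """Runs in the standby DC and starts the standby once the primary key is gone."""

    def __init__(
        self,
        registry: TtlRegistry,
        primary_key: str,
        start_standby: Callable[[], None],
        missed_checks_required: int = 3,
    ) -> None:
        self.registry = registry
        self.primary_key = primary_key
        self.start_standby = start_standby
        self.missed_checks_required = missed_checks_required
        self.started = False
        self._missed = 0

    def check_once(self) -> bool:
        """Return True only on the check that triggers the failover."""
        if self.registry.is_alive(self.primary_key):
            self._missed = 0
            return False
        self._missed += 1
        if self._missed >= self.missed_checks_required and not self.started:
            # Hypothetical hook: e.g. submit the job to the standby Flink
            # cluster from the most recently copied savepoint.
            self.start_standby()
            self.started = True
            return True
        return False
```

In a real deployment the primary would call renew on a timer shorter than the 
TTL, and the standby would call check_once in a polling loop.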

Best,

--
Scott Kidder

On Mon, Jul 9, 2018 at 4:48 AM Sofer, Tovi <tovi.so...@citi.com> wrote:
Hi all,

We are now examining how to achieve high availability for Flink, and also how 
to support automatic recovery in a disaster scenario, when an entire DC goes 
down.
We have DC1, where we normally want the work to be done, and DC2, which is 
more remote and should take over only when DC1 is down.

We examined a few options and would be glad to hear feedback, or a suggestion 
for another way to achieve this.

•         Two separate ZooKeeper and Flink clusters, one in each data center.
Only the cluster in DC1 is running, and state is copied to DC2 by an offline 
process.

To achieve automatic recovery we need some kind of watchdog that will check 
DC1 availability and, if it is down, start DC2 (and likewise later if DC2 goes 
down).

Is there a recommended tool for this?

•         ZooKeeper “stretch cluster” across data centers, with 2 nodes in DC1, 
2 nodes in DC2, and one observer node.

Also a Flink cluster with jobmanager1 in DC1 and jobmanager2 in DC2.

This way, when DC1 is down, ZooKeeper will notice this automatically and will 
transfer work to jobmanager2 in DC2.

However, we would like the ZooKeeper leader and the Flink JobManager leader 
(the primary one) to be in DC1 unless it is down.

Is there a way to achieve this?
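
One way to sanity-check the stretch-cluster layout is to count votes. The 
sketch below assumes that "observer" means a ZooKeeper observer in the strict 
sense (a non-voting member) and that the ensemble uses the default majority 
quorum. The has_quorum helper and the third site "DC3" are hypothetical and 
not part of the setup described above. Under those assumptions, losing DC1 
leaves 2 of 4 voters, which is not a majority.

```python
from typing import Dict, Set


def has_quorum(voters_per_site: Dict[str, int], failed_sites: Set[str]) -> bool:
    """True if the surviving voters form a strict majority of all voters."""
    total = sum(voters_per_site.values())
    surviving = sum(
        count for site, count in voters_per_site.items() if site not in failed_sites
    )
    return surviving > total // 2


# Layout from the message: 2 voters per DC. The observer is left out because,
# under the stated assumption, it does not vote.
stretch = {"DC1": 2, "DC2": 2}

# Hypothetical variant: the fifth node is a voting member at a third site.
with_tiebreaker = {"DC1": 2, "DC2": 2, "DC3": 1}

if __name__ == "__main__":
    print("stretch, DC1 down:", has_quorum(stretch, {"DC1"}))
    print("tiebreaker, DC1 down:", has_quorum(with_tiebreaker, {"DC1"}))
```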

Thanks and regards,
Tovi Sofer
Software Engineer
+972 (3) 7405756
