Hi Tovi,

We run all services (Flink, Zookeeper, Hadoop HDFS, and Consul) in a Kubernetes cluster in each data center. Kubernetes will automatically restart/reschedule any services that crash or become unhealthy. This is a little outside the scope of Flink, and I'd be happy to discuss it further off-list.
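For illustration, here is a minimal Go sketch of the kind of HTTP health endpoint a Kubernetes liveness probe can poll; once the endpoint stops returning 200, the kubelet restarts the container. The /healthz path, port 8080, and the healthy flag are illustrative assumptions, not details of our actual deployment.

package main

import (
	"net/http"
	"sync/atomic"
)

// healthy would be flipped by the application once it has started up and
// connected to its dependencies (Zookeeper, HDFS, etc.).
var healthy atomic.Bool

func main() {
	healthy.Store(true)

	// Point a Kubernetes liveness probe at /healthz; anything other than a
	// 200 response (or no response at all) makes the kubelet restart the
	// container.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		if healthy.Load() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})

	http.ListenAndServe(":8080", nil)
}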
Best,

--
Scott Kidder

On Mon, Jul 16, 2018 at 5:11 AM Sofer, Tovi <tovi.so...@citi.com> wrote:

> Thank you Scott,
>
> Looks like a very elegant solution.
>
> How did you manage high availability within a single data center?
>
> Thanks,
> Tovi
>
> *From:* Scott Kidder <kidder.sc...@gmail.com>
> *Sent:* Friday, July 13, 2018 01:13
> *To:* Sofer, Tovi [ICG-IT] <ts72...@imceu.eu.ssmb.com>
> *Cc:* user@flink.apache.org
> *Subject:* Re: high availability with automated disaster recovery using zookeeper
>
> I've used a multi-datacenter Consul cluster to coordinate service discovery. When a service starts up in the primary DC, it registers itself in Consul with a key that has a TTL and must be periodically renewed. If the service shuts down or terminates abruptly, the key expires and is removed from Consul. A standby service in another DC can be started automatically after detecting the absence of the key in Consul in the primary DC. This could lead to submitting a job to the standby Flink cluster from the most recent savepoint that was copied by the offline process you mentioned. It should be pretty easy to automate all of this. I would not recommend setting up a multi-datacenter Zookeeper cluster; in my experience, Consul is much easier to work with.
>
> Best,
>
> --
> Scott Kidder
>
> On Mon, Jul 9, 2018 at 4:48 AM Sofer, Tovi <tovi.so...@citi.com> wrote:
>
> Hi all,
>
> We are now examining how to achieve high availability for Flink, and to also support automatic recovery in a disaster scenario, when an entire DC goes down.
>
> We have DC1, where we usually want work to be done, and DC2, which is more remote and where we want work to go only when DC1 is down.
>
> We examined a few options and would be glad to hear feedback or a suggestion for another way to achieve this:
>
> · Two separate Zookeeper and Flink clusters on the two data centers.
>
> Only the cluster on DC1 is running, and state is copied to DC2 in an offline process.
>
> To achieve automatic recovery we need to use some kind of watchdog which will check DC1 availability and, if it is down, will start DC2 (and the same later if DC2 is down).
>
> Is there a recommended tool for this?
>
> · Zookeeper “stretch cluster” across data centers, with 2 nodes on DC1, 2 nodes on DC2, and one observer node.
>
> Also a Flink cluster with jobmanager1 on DC1 and jobmanager2 on DC2.
>
> This way, when DC1 is down, Zookeeper will notice this automatically and will transfer work to jobmanager2 on DC2.
>
> However, we would like the Zookeeper leader and the Flink jobmanager leader (the primary one) to be on DC1, unless it is down.
>
> Is there a way to achieve this?
>
> Thanks and regards,
>
> *Tovi Sofer*
> Software Engineer
> +972 (3) 7405756
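For reference, a minimal Go sketch of the Consul pattern Scott describes above, using the official client (github.com/hashicorp/consul/api): the primary binds a key to a TTL session that it keeps renewing, and a watchdog in the standby DC polls for the key and triggers failover once it disappears (e.g. submitting the job to the standby Flink cluster from the most recent copied savepoint). The key name, TTL, and failover hook are illustrative assumptions.

package main

import (
	"log"
	"time"

	consul "github.com/hashicorp/consul/api"
)

// Illustrative key name; any agreed-upon path works.
const leaderKey = "service/flink-pipeline/primary"

// registerPrimary creates a TTL session, keeps renewing it, and binds the
// leader key to it. If the primary dies and stops renewing, Consul expires
// the session and deletes the key.
func registerPrimary(client *consul.Client, stop <-chan struct{}) error {
	sessionID, _, err := client.Session().Create(&consul.SessionEntry{
		Name:     "flink-primary",
		TTL:      "15s",
		Behavior: consul.SessionBehaviorDelete, // delete the key when the session expires
	}, nil)
	if err != nil {
		return err
	}
	// Renew the session in the background until stop is closed.
	go client.Session().RenewPeriodic("15s", sessionID, nil, stop)

	_, _, err = client.KV().Acquire(&consul.KVPair{
		Key:     leaderKey,
		Value:   []byte("dc1"),
		Session: sessionID,
	}, nil)
	return err
}

// watchPrimary runs in the standby DC and polls for the key; once it is
// gone, the standby can start its own Flink job from the latest savepoint.
func watchPrimary(client *consul.Client, failover func()) {
	for {
		pair, _, err := client.KV().Get(leaderKey, nil)
		if err != nil {
			log.Printf("consul query failed: %v", err)
		} else if pair == nil {
			failover() // e.g. run the job on the standby cluster from the latest savepoint
			return
		}
		time.Sleep(5 * time.Second)
	}
}

func main() {
	// In reality the primary DC runs registerPrimary and the standby DC runs
	// watchPrimary; they are shown together here only to keep the sketch short.
	client, err := consul.NewClient(consul.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	stop := make(chan struct{})
	defer close(stop)

	if err := registerPrimary(client, stop); err != nil {
		log.Fatal(err)
	}
	watchPrimary(client, func() {
		log.Println("primary key gone; submit job on standby cluster from latest savepoint")
	})
}

Consul attaches TTLs to sessions rather than directly to keys, hence the session-plus-Acquire step; the polling loop could equally be replaced by a blocking query on the key.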