Hi Tovi,

you can define hard host attribute constraints for the TaskManagers. See the configuration section [1] for more information.
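For the TaskManagers that boils down to a single entry in flink-conf.yaml, along these lines (just a sketch; the agent attribute "datacenter" with value "main" is an assumption about how your Mesos agents are tagged):

    # Accept only offers from Mesos agents whose "datacenter" attribute is "main",
    # so that TaskManagers are placed in the primary data center only.
    mesos.constraints.hard.hostattribute: datacenter:main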
If you want to run the JobManager/cluster entry point on Mesos as well, then I recommend starting it with Marathon [2]. This will also give you HA for the master process. I assume that you can provide Marathon with similar constraints in order to control where the JobManager/cluster entry point task is placed.
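A Marathon app definition for the cluster entry point could then look roughly like this (again only a sketch; the mesos-appmaster.sh path, the resource sizes and the "datacenter" attribute are assumptions about your setup):

    {
      "id": "flink-jobmanager",
      "cmd": "/opt/flink/bin/mesos-appmaster.sh",
      "cpus": 1.0,
      "mem": 1024,
      "instances": 1,
      "constraints": [["datacenter", "CLUSTER", "main"]]
    }

If I remember correctly, the CLUSTER operator restricts the task to agents whose "datacenter" attribute equals "main", which should pin the JobManager to your primary data center.

[1] https://ci.apache.org/projects/flink/flink-docs-master/ops/deployment/mesos.html#configuration-parameters
[2] https://ci.apache.org/projects/flink/flink-docs-master/ops/deployment/mesos.html#high-availability

Cheers,
Till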
On Tue, Jul 10, 2018 at 9:09 PM Sofer, Tovi <tovi.so...@citi.com> wrote:

> To add one thing to the Mesos question:
>
> My assumption that constraints on the JobManager can work is based on the following sentence from the link below:
>
> "When running Flink with Marathon, the whole Flink cluster including the job manager will be run as Mesos tasks in the Mesos cluster."
>
> https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/deployment/mesos.html
>
> [Not sure this is accurate, since it seems to contradict the image in the link below:
> https://mesosphere.com/blog/apache-flink-on-dcos-and-apache-mesos ]
>
> From: Sofer, Tovi [ICG-IT]
> Sent: Tuesday, July 10, 2018 20:04
> To: 'Till Rohrmann' <trohrm...@apache.org>; user <user@flink.apache.org>
> Cc: Gardi, Hila [ICG-IT] <hg11...@imceu.eu.ssmb.com>
> Subject: RE: high availability with automated disaster recovery using zookeeper
>
> Hi Till, group,
>
> Thank you for your response.
>
> After reading further about Mesos: can't Mesos itself fill the requirement of running the job manager in the primary data center?
>
> By using: "constraints": [["datacenter", "CLUSTER", "main"]]
>
> (See http://www.stratio.com/blog/mesos-multi-data-center-architecture-for-disaster-recovery/)
>
> Is this supported by a Flink cluster on Mesos?
>
> Thanks again,
> Tovi
>
> From: Till Rohrmann <trohrm...@apache.org>
> Sent: Tuesday, July 10, 2018 10:11
> To: Sofer, Tovi [ICG-IT] <ts72...@imceu.eu.ssmb.com>
> Cc: user <user@flink.apache.org>
> Subject: Re: high availability with automated disaster recovery using zookeeper
>
> Hi Tovi,
>
> that is an interesting use case you are describing here. I think, however, it depends mainly on the capabilities of ZooKeeper to produce the intended behavior. Flink itself relies on ZooKeeper for leader election in HA mode but does not expose any means to influence the leader election process. To be more precise, ZK is used as a black box which simply tells a JobManager that it is now the leader, independent of any data center preferences. I'm not sure whether it is possible to tell ZooKeeper about these preferences. If not, then an alternative could be to implement one's own high availability services which do that.
>
> Cheers,
> Till
>
> On Mon, Jul 9, 2018 at 1:48 PM Sofer, Tovi <tovi.so...@citi.com> wrote:
>
> Hi all,
>
> We are now examining how to achieve high availability for Flink, and also to support automatic recovery in a disaster scenario, when a whole DC goes down.
>
> We have DC1, where we usually want the work to be done, and DC2, which is more remote and to which we want work to go only when DC1 is down.
>
> We examined a few options and would be glad to hear feedback or a suggestion for another way to achieve this:
>
> · Two separate ZooKeeper and Flink clusters on the two data centers. Only the cluster on DC1 is running, and state is copied to DC2 in an offline process. To achieve automatic recovery we need some kind of watchdog which will check DC1 availability and, if it is down, will start DC2 (and the same later if DC2 is down). Is there a recommended tool for this?
>
> · A ZooKeeper "stretch cluster" across the data centers, with 2 nodes on DC1, 2 nodes on DC2 and one observer node; likewise a Flink cluster with jobmanager1 on DC1 and jobmanager2 on DC2. This way, when DC1 is down, ZooKeeper will notice automatically and will transfer the work to jobmanager2 on DC2. However, we would like the ZooKeeper leader and the Flink JobManager leader (the primary one) to be from DC1, unless it is down. Is there a way to achieve this?
>
> Thanks and regards,
>
> Tovi Sofer
> Software Engineer
> +972 (3) 7405756
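On the watchdog question above: a minimal polling watchdog could look roughly like the following sketch (the hostnames, the standby start command and the thresholds are all assumptions about the concrete setup):

    # Hypothetical failover watchdog (sketch): polls the DC1 JobManager's REST API
    # (8081 is Flink's default REST port) and starts the standby DC2 cluster after
    # repeated failures. Hostnames and the start command are assumptions.
    import subprocess
    import time
    import urllib.request

    JM_DC1 = "http://jobmanager-dc1:8081/overview"  # assumed DC1 JobManager host
    FAILURE_THRESHOLD = 3

    failures = 0
    while True:
        try:
            with urllib.request.urlopen(JM_DC1, timeout=5):
                failures = 0  # DC1 responded, reset the counter
        except OSError:
            failures += 1
        if failures >= FAILURE_THRESHOLD:
            # Assumed command to bring up the standby cluster in DC2.
            subprocess.run(["ssh", "flink@jobmanager-dc2",
                            "/opt/flink/bin/start-cluster.sh"], check=False)
            break
        time.sleep(10)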