Hi Tovi,

you can define hard host attribute constraints for the TaskManagers. See the configuration section [1] for more information.
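For the TaskManagers that boils down to a single entry in flink-conf.yaml, along these lines (just a sketch; the agent attribute "datacenter" with value "main" is an assumption about how your Mesos agents are tagged):

    # Accept only offers from Mesos agents whose "datacenter" attribute is "main",
    # so that TaskManagers are placed in the primary data center only.
    mesos.constraints.hard.hostattribute: datacenter:main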
If you want to run the JobManager/cluster entry point on Mesos as well, then I recommend starting it with Marathon [2]. This will also give you HA for the master process. I assume that you can provide Marathon with similar constraints in order to control where the JobManager/cluster entry point task is placed.
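A Marathon app definition for the cluster entry point could then look roughly like this (again only a sketch; the mesos-appmaster.sh path, the resource sizes and the "datacenter" attribute are assumptions about your setup):

    {
      "id": "flink-jobmanager",
      "cmd": "/opt/flink/bin/mesos-appmaster.sh",
      "cpus": 1.0,
      "mem": 1024,
      "instances": 1,
      "constraints": [["datacenter", "CLUSTER", "main"]]
    }

If I remember correctly, the CLUSTER operator restricts the task to agents whose "datacenter" attribute equals "main", which should pin the JobManager to your primary data center.

[1] https://ci.apache.org/projects/flink/flink-docs-master/ops/deployment/mesos.html#configuration-parameters
[2] https://ci.apache.org/projects/flink/flink-docs-master/ops/deployment/mesos.html#high-availability

Cheers,
Till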
On Tue, Jul 10, 2018 at 9:09 PM Sofer, Tovi <tovi.so...@citi.com> wrote:

> To add one thing to the Mesos question:
>
> My assumption that constraints on the JobManager can work is based on the following sentence from the link below:
>
> "When running Flink with Marathon, the whole Flink cluster including the job manager will be run as Mesos tasks in the Mesos cluster."
>
> https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/deployment/mesos.html
>
> [Not sure this is accurate, since it seems to contradict the image in the link below:
> https://mesosphere.com/blog/apache-flink-on-dcos-and-apache-mesos ]
>
> From: Sofer, Tovi [ICG-IT]
> Sent: Tuesday, July 10, 2018 20:04
> To: 'Till Rohrmann' <trohrm...@apache.org>; user <user@flink.apache.org>
> Cc: Gardi, Hila [ICG-IT] <hg11...@imceu.eu.ssmb.com>
> Subject: RE: high availability with automated disaster recovery using zookeeper
>
> Hi Till, group,
>
> Thank you for your response.
>
> After reading further about Mesos: can't Mesos itself fill the requirement of running the job manager in the primary data center?
>
> By using: "constraints": [["datacenter", "CLUSTER", "main"]]
>
> (See http://www.stratio.com/blog/mesos-multi-data-center-architecture-for-disaster-recovery/)
>
> Is this supported by a Flink cluster on Mesos?
>
> Thanks again,
> Tovi
>
> From: Till Rohrmann <trohrm...@apache.org>
> Sent: Tuesday, July 10, 2018 10:11
> To: Sofer, Tovi [ICG-IT] <ts72...@imceu.eu.ssmb.com>
> Cc: user <user@flink.apache.org>
> Subject: Re: high availability with automated disaster recovery using zookeeper
>
> Hi Tovi,
>
> that is an interesting use case you are describing here. I think, however, it depends mainly on the capabilities of ZooKeeper to produce the intended behavior. Flink itself relies on ZooKeeper for leader election in HA mode but does not expose any means to influence the leader election process. To be more precise, ZK is used as a black box which simply tells a JobManager that it is now the leader, independent of any data center preferences. I'm not sure whether it is possible to tell ZooKeeper about these preferences. If not, then an alternative could be to implement one's own high availability services which do that.
>
> Cheers,
> Till
>
> On Mon, Jul 9, 2018 at 1:48 PM Sofer, Tovi <tovi.so...@citi.com> wrote:
>
> Hi all,
>
> We are now examining how to achieve high availability for Flink, and also to support automatic recovery in a disaster scenario, when a whole DC goes down.
>
> We have DC1, where we usually want the work to be done, and DC2, which is more remote and to which we want work to go only when DC1 is down.
>
> We examined a few options and would be glad to hear feedback or a suggestion for another way to achieve this:
>
> · Two separate ZooKeeper and Flink clusters on the two data centers. Only the cluster on DC1 is running, and state is copied to DC2 in an offline process. To achieve automatic recovery we need some kind of watchdog which will check DC1 availability and, if it is down, will start DC2 (and the same later if DC2 is down). Is there a recommended tool for this?
>
> · A ZooKeeper "stretch cluster" across the data centers, with 2 nodes on DC1, 2 nodes on DC2 and one observer node; likewise a Flink cluster with jobmanager1 on DC1 and jobmanager2 on DC2. This way, when DC1 is down, ZooKeeper will notice automatically and will transfer the work to jobmanager2 on DC2. However, we would like the ZooKeeper leader and the Flink JobManager leader (the primary one) to be from DC1, unless it is down. Is there a way to achieve this?
>
> Thanks and regards,
>
> Tovi Sofer
> Software Engineer
> +972 (3) 7405756
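On the watchdog question above: a minimal polling watchdog could look roughly like the following sketch (the hostnames, the standby start command and the thresholds are all assumptions about the concrete setup):

    # Hypothetical failover watchdog (sketch): polls the DC1 JobManager's REST API
    # (8081 is Flink's default REST port) and starts the standby DC2 cluster after
    # repeated failures. Hostnames and the start command are assumptions.
    import subprocess
    import time
    import urllib.request

    JM_DC1 = "http://jobmanager-dc1:8081/overview"  # assumed DC1 JobManager host
    FAILURE_THRESHOLD = 3

    failures = 0
    while True:
        try:
            with urllib.request.urlopen(JM_DC1, timeout=5):
                failures = 0  # DC1 responded, reset the counter
        except OSError:
            failures += 1
        if failures >= FAILURE_THRESHOLD:
            # Assumed command to bring up the standby cluster in DC2.
            subprocess.run(["ssh", "flink@jobmanager-dc2",
                            "/opt/flink/bin/start-cluster.sh"], check=False)
            break
        time.sleep(10)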