Alright, just came across the first real-life problem with my Consul HA implementation. In Consul KV store there is a limit of 512kB per node and JobGraph of one of my apps exceeded it. In ZK there seems to be similar zNode Limit = 1MB How did you workaround it? Or maybe I serialize the JobGraph wrong?
On Thu, Feb 15, 2018 at 8:47 AM, Krzysztof Białek <krzysiek.bia...@gmail.com > wrote: > I have very little experience with ZK and cannot explain the differences > between ZK and Consul by myself. However there are some comparisions > available: > * https://www.consul.io/intro/vs/zookeeper.html - done by Consul so may > be biased > * https://www.slideshare.net/IvanGlushkov/zookeeper-vs-consul-41882991 > * https://jakon.me/2017/01/consul-deployment-orchestration/ > > Regarding testing - I did basic failover scenarios on my workstation with > 2 JobManagers, 2 TaskManagers and WindowJoin example app with checkpointing > and restarting turned on. > I was running the cluster no longer than for few hours. > > For now I'd like to open Flink for alternative HA backends ( > https://issues.apache.org/jira/browse/FLINK-8660) > > > On Wed, Feb 14, 2018 at 1:47 PM, Chesnay Schepler <ches...@apache.org> > wrote: > >> Hello, >> >> I don't know anything about Consul but the prospect of having other >> options beside Zookeeper is very interesting. It's rather surprising how >> little you had to modify existing classes to get this to work. >> >> It may take a bit until someone provides proper feedback as the community >> is currently prepping 2 releases (1.4.1 and 1.5), please don't be >> discouraged by this. >> >> I saw that your branch was based on the 1.4 version. In 1.5 we reworked >> the distributed architecture of Flink (in an initiative commonly referred >> to as FLIP-6) which may affect your work. >> >> 2 things to note from my side: >> It would also be helpful if you could explain the differences between ZK >> and Consul and how they stack up in terms of guarantees etc. . >> How did you test your solution so far? (Like how long was a cluster >> running, what failure scenarios) >> >> >> On 13.02.2018 21:38, Krzysztof Białek wrote: >> >> I'd like to get your opinion about this idea. I found related JIRA issue >> FLINK-2366, >> but it seems to be dead. To attract your attention I copy my comment here. >> >> As an experiment I've implemented Flink HA on top of Consul. The >> implementation is working fine in the "lab" but is not battle tested yet. >> The source code is available at https://github.com/kbialek/ >> flink/tree/feature/consul (flink-runtime package >> org.apache.flink.runtime.consul) >> >> Why?. Generally I'd like to keep as less moving parts as possible. We do >> not have Zookeeper running, but Consul is already in place. And in the end >> freedom of choice is a good thing. >> >> It would be great to see built-in Consul support in Flink someday, but if >> it is not expected then I suggest a little refactoring to open possibility >> to configure HighAvailabilityServicesFactory. As far as I can see this >> should be enough to inject any HA implementation. >> >> Regards, >> Krzysztof >> >> >> >