Thanks for your email, Aleksandar! Sorry for the late reply.

May I ask: do you configure high-availability.storageDir in your case?
That is, do you persist and retrieve the job graph & checkpoints entirely in
MapDB, or, as the ZooKeeper implementation does, persist them in an external
filesystem and store only a handle in MapDB?

Best,
tison.


Aleksandar Mastilovic <amastilo...@sightmachine.com> wrote on Sat, Aug 24, 2019, at 7:04 AM:

> Hi all,
>
> Since I’m currently working on an implementation of
> HighAvailabilityServicesFactory I thought it would be good to report here
> about my experience so far.
>
> Our use case is cloud based, where we package Flink and our supplementary
> code into a docker image, then run those images through Kubernetes+Helm
> orchestration.
>
> We don’t use Hadoop nor HDFS but rather Google Cloud Storage, and we don’t
> run ZooKeepers. Our Flink setup consists of one JobManager and multiple
> TaskManagers on-demand.
>
> Due to the nature of cloud computing, there’s a possibility our JobManager
> instance might go down, only to be automatically recreated through
> Kubernetes. Since we don’t run ZooKeeper, we needed a way to run a variant
> of a high-availability cluster where we would keep the JobManager
> information on our attached persistent k8s volume instead of in ZooKeeper.
> We found this post on StackOverflow (
> https://stackoverflow.com/questions/52104759/apache-flink-on-kubernetes-resume-job-if-jobmanager-crashes/52112538)
> and decided to give it a try.
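>
> To make that concrete, the relevant part of our JobManager Deployment looks
> roughly like this (names and paths are illustrative, not our exact
> manifest):
>
>     volumes:
>       - name: flink-ha
>         persistentVolumeClaim:
>           claimName: flink-ha-pvc      # the PVC survives JobManager pod restarts
>     containers:
>       - name: jobmanager
>         volumeMounts:
>           - name: flink-ha
>             mountPath: /flink/ha       # MapDB files live under this path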
>
> So far we have a setup that seems to be working on our local deployment;
> we haven’t yet tried it in the actual cloud.
>
> As far as implementation goes, here’s what we did:
>
> We used MapDB (mapdb.org) as our storage format to persist lists of
> objects to disk. We partially relied on StandaloneHaServices for our
> HaServices implementation; otherwise, we looked at ZooKeeperHaServices
> and related classes for inspiration and guidance.
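>
> In case it helps, the MapDB usage boils down to the following (the map
> name, serializers, and placeholder values are illustrative, not our exact
> code):
>
>     import org.mapdb.DB;
>     import org.mapdb.DBMaker;
>     import org.mapdb.Serializer;
>     import java.util.concurrent.ConcurrentMap;
>
>     // Open (or create) a file-backed DB on the persistent volume.
>     DB db = DBMaker.fileDB("/flink/ha/ha-store.db")
>             .transactionEnable()   // makes commit() crash-safe
>             .make();
>
>     // A durable map from job ID to the serialized job graph.
>     ConcurrentMap<String, byte[]> jobGraphs = db
>             .hashMap("jobGraphs", Serializer.STRING, Serializer.BYTE_ARRAY)
>             .createOrOpen();
>
>     String jobId = "a1b2c3";                 // placeholder job ID
>     byte[] serializedJobGraph = new byte[0]; // placeholder payload
>     jobGraphs.put(jobId, serializedJobGraph);
>     db.commit();   // flush so a restarted JobManager can recover the entry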
>
> Here’s a list of new classes (a sketch of the factory follows the list):
>
> FileSystemCheckpointIDCounter implements CheckpointIDCounter
> FileSystemCheckpointRecoveryFactory implements CheckpointRecoveryFactory
> FileSystemCompletedCheckpointStore implements CompletedCheckpointStore
> FileSystemHaServices extends StandaloneHaServices
> FileSystemHaServicesFactory implements HighAvailabilityServicesFactory
> FileSystemSubmittedJobGraphStore implements SubmittedJobGraphStore
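>
> As mentioned above, the factory itself has roughly the following shape
> (the body is a sketch of the wiring, not our exact code):
>
>     import org.apache.flink.configuration.Configuration;
>     import org.apache.flink.runtime.highavailability.HighAvailabilityServices;
>     import org.apache.flink.runtime.highavailability.HighAvailabilityServicesFactory;
>
>     import java.util.concurrent.Executor;
>
>     public class FileSystemHaServicesFactory
>             implements HighAvailabilityServicesFactory {
>
>         @Override
>         public HighAvailabilityServices createHAServices(
>                 Configuration configuration, Executor executor) throws Exception {
>             // FileSystemHaServices extends StandaloneHaServices and plugs
>             // in the MapDB-backed stores listed above.
>             return new FileSystemHaServices(configuration);
>         }
>     }
>
> Flink picks the factory up when the high-availability option in
> flink-conf.yaml is set to its fully qualified class name.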
>
> Testing so far has shown that bringing down a JobManager and bringing it
> back up does indeed restore all the running jobs. Job creation/destruction
> also works.
>
> Hope this helps!
>
> Thanks,
> Aleksandar Mastilovic
>
> On Aug 21, 2019, at 12:32 AM, Zili Chen <wander4...@gmail.com> wrote:
>
> Hi guys,
>
> We want to get an accurate idea of how users actually use
> high-availability services in Flink, especially how you customize
> high-availability services via HighAvailabilityServicesFactory.
>
> Basically, there are the standalone impl., the ZooKeeper impl., the
> embedded impl. used in MiniCluster, a YARN impl. that is not yet
> implemented, and a gate to customized implementations.
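>
> For reference, a customized implementation is selected by pointing the
> high-availability option at the factory class (the class name below is a
> placeholder):
>
>     high-availability: com.example.MyHaServicesFactory
>     # the factory class must be on the cluster's classpath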
>
> Generally, I think the standalone impl. and the ZooKeeper impl. are the
> most widely used implementations. The embedded impl. is used implicitly
> when users run a MiniCluster.
>
> Besides that, it is helpful to know how you customize high-availability
> services using the HighAvailabilityServicesFactory interface, for the
> ongoing FLINK-10333[1] which would evolve high-availability services in
> Flink, as well as whether any user takes interest in the
> not-yet-implemented YARN impl.
>
> Any use case would be helpful. I really appreciate your time and your
> insight.
>
> Best,
> tison.
>
> [1] https://issues.apache.org/jira/browse/FLINK-10333
>