Thanks for your email, Aleksandar! Sorry for the late reply. May I ask: did you configure high-availability.storageDir in your case? That is, do you persist and retrieve the job graph & checkpoints entirely in MapDB, or, as the ZooKeeper implementation does, persist them in an external filesystem and store only a handle in MapDB?
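For reference, this is roughly how the split tison describes is configured for the ZooKeeper implementation in flink-conf.yaml, and how a custom factory can be selected instead (the quorum hosts, bucket path, and factory package below are placeholders, not values from this thread):

```yaml
# ZooKeeper HA: only small handles live in ZooKeeper itself; the actual
# job graphs and completed checkpoints are written to storageDir.
high-availability: zookeeper
high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181
high-availability.storageDir: gs://my-bucket/flink/ha

# Alternatively, a custom HighAvailabilityServicesFactory can be wired in
# by its fully qualified class name (package is a placeholder):
# high-availability: com.example.FileSystemHaServicesFactory
```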
Best,
tison.

Aleksandar Mastilovic <amastilo...@sightmachine.com> wrote on Sat, Aug 24, 2019 at 7:04 AM:

> Hi all,
>
> Since I'm currently working on an implementation of HighAvailabilityServicesFactory, I thought it would be good to report here about my experience so far.
>
> Our use case is cloud-based: we package Flink and our supplementary code into a Docker image, then run those images through Kubernetes+Helm orchestration.
>
> We don't use Hadoop or HDFS but rather Google Cloud Storage, and we don't run ZooKeeper. Our Flink setup consists of one JobManager and multiple TaskManagers on demand.
>
> Due to the nature of cloud computing, there's a possibility our JobManager instance might go down, only to be automatically recreated through Kubernetes. Since we don't run ZooKeeper, we needed a way to run a variant of a high-availability cluster where we keep JobManager information on our attached persistent k8s volume instead of in ZooKeeper. We found this post on StackOverflow (https://stackoverflow.com/questions/52104759/apache-flink-on-kubernetes-resume-job-if-jobmanager-crashes/52112538) and decided to give it a try.
>
> So far we have a setup that seems to be working on our local deployment; we haven't yet tried it in the actual cloud.
>
> As far as implementation goes, here's what we did:
>
> We used MapDB (mapdb.org) as our storage format, to persist lists of objects onto disk. We partially relied on StandaloneHaServices for our HaServices implementation. Otherwise, we looked at ZooKeeperHaServices and related classes for inspiration and guidance.
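The persistence pattern behind such a setup can be pictured with a small stand-alone sketch. The class below mimics what a file-backed checkpoint ID counter needs to do, but uses only java.nio instead of MapDB and does not implement Flink's actual CheckpointIDCounter interface; the class name, method names, and the atomic-rename strategy are illustrative assumptions, not the implementation described above:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/**
 * Illustrative sketch of a file-backed checkpoint ID counter.
 * All names are hypothetical; the real implementation in this thread
 * uses MapDB rather than raw files.
 */
class FileBackedCounter {
    private final Path file;

    FileBackedCounter(Path file) {
        this.file = file;
    }

    /** Read the last persisted value, defaulting to 1 for a fresh store. */
    long get() throws IOException {
        if (!Files.exists(file)) {
            return 1L;
        }
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return Long.parseLong(text.trim());
    }

    /** Return the current value and durably persist the incremented one. */
    long getAndIncrement() throws IOException {
        long current = get();
        persist(current + 1);
        return current;
    }

    private void persist(long value) throws IOException {
        // Write to a temp file and atomically rename it into place, so a
        // crashed JobManager never observes a half-written counter.
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        Files.write(tmp, Long.toString(value).getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING,
                StandardCopyOption.ATOMIC_MOVE);
    }
}
```

A new instance constructed over the same file recovers the last persisted value, which is the property a recovering JobManager would rely on.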
>
> Here's a list of the new classes:
>
> FileSystemCheckpointIDCounter implements CheckpointIDCounter
> FileSystemCheckpointRecoveryFactory implements CheckpointRecoveryFactory
> FileSystemCompletedCheckpointStore implements CompletedCheckpointStore
> FileSystemHaServices extends StandaloneHaServices
> FileSystemHaServicesFactory implements HighAvailabilityServicesFactory
> FileSystemSubmittedJobGraphStore implements SubmittedJobGraphStore
>
> Testing so far has shown that bringing a JobManager down and back up does indeed restore all the running jobs. Job creation/destruction also works.
>
> Hope this helps!
>
> Thanks,
> Aleksandar Mastilovic
>
> On Aug 21, 2019, at 12:32 AM, Zili Chen <wander4...@gmail.com> wrote:
>
> Hi guys,
>
> We want to get an accurate idea of how users actually use high-availability services in Flink, especially how you customize high-availability services via HighAvailabilityServicesFactory.
>
> Basically there are the standalone impl., the ZooKeeper impl., the embedded impl. used in MiniCluster, the YARN impl. (not yet implemented), and a gate for customized implementations.
>
> Generally I think the standalone impl. and the ZooKeeper impl. are the most widely used. The embedded impl. is used, without users being aware of it, whenever they run a MiniCluster.
>
> Besides that, it would be helpful to know how you customize high-availability services using the HighAvailabilityServicesFactory interface, for the ongoing FLINK-10333[1], which will evolve the high-availability services in Flink, as well as whether any user is interested in the not-yet-implemented YARN impl.
>
> Any use case would be helpful. I really appreciate your time and your insight.
>
> Best,
> tison.
>
> [1] https://issues.apache.org/jira/browse/FLINK-10333
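To make the recovery cycle in that class list concrete, here is a stand-alone sketch in the spirit of FileSystemSubmittedJobGraphStore: serialized job graphs are persisted under their job ID so a restarted JobManager can enumerate and reload them. Flink's JobGraph type is replaced by byte[], and all names are illustrative assumptions rather than the actual implementation from this thread:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative file-backed store for serialized job graphs, keyed by
 * job ID. Names are hypothetical; the implementation discussed above
 * persists objects via MapDB instead.
 */
class FileJobGraphStore {
    private final Path dir;

    FileJobGraphStore(Path dir) throws IOException {
        this.dir = Files.createDirectories(dir);
    }

    /** Persist (or replace) the serialized graph for a job. */
    void put(String jobId, byte[] serializedGraph) throws IOException {
        Files.write(dir.resolve(jobId), serializedGraph);
    }

    /** Remove a finished or cancelled job. */
    void remove(String jobId) throws IOException {
        Files.deleteIfExists(dir.resolve(jobId));
    }

    /** On JobManager restart, list every persisted job ID for recovery. */
    List<String> recoverJobIds() throws IOException {
        List<String> ids = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                ids.add(p.getFileName().toString());
            }
        }
        return ids;
    }

    /** Reload one job's serialized graph from disk. */
    byte[] recover(String jobId) throws IOException {
        return Files.readAllBytes(dir.resolve(jobId));
    }
}
```

Because a second store instance constructed over the same directory sees the persisted entries, a recreated JobManager can rediscover and resubmit the jobs, matching the restart behavior Aleksandar reports.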