Hi Zili, Sorry for replying late, we had a holiday here in the US.
We are using the high-availability.storageDir but only for the Blob store, however job graphs, checkpoints and checkpoint IDs are stored in MapDB. > On Aug 28, 2019, at 7:48 PM, Zili Chen <wander4...@gmail.com> wrote: > > Thanks for your email Aleksandar! Sorry for reply late. > > May I ask a question, do you config high-availability.storageDir in your > case? > That is, do you persist and retrieve job graph & checkpoint totally in MapDB > or, as ZooKeeper implementation does, persist them in an external filesystem > and just store a handle in MapDB? > > Best, > tison. > > > Aleksandar Mastilovic <amastilo...@sightmachine.com > <mailto:amastilo...@sightmachine.com>> 于2019年8月24日周六 上午7:04写道: > Hi all, > > Since I’m currently working on an implementation of > HighAvailabilityServicesFactory I thought it would be good to report here > about my experience so far. > > Our use case is cloud based, where we package Flink and our supplementary > code into a docker image, then run those images through Kubernetes+Helm > orchestration. > > We don’t use Hadoop nor HDFS but rather Google Cloud Storage, and we don’t > run ZooKeepers. Our Flink setup consists of one JobManager and multiple > TaskManagers on-demand. > > Due to the nature of cloud computing there’s a possibility our JobManager > instance might go down, only to be automatically recreated through > Kubernetes. Since we don’t run ZooKeeper > We needed a way to run a variant of High Availability cluster where we would > keep JobManager information on our attached persistent k8s volume instead of > ZooKeeper. > We found this > (https://stackoverflow.com/questions/52104759/apache-flink-on-kubernetes-resume-job-if-jobmanager-crashes/52112538 > > <https://stackoverflow.com/questions/52104759/apache-flink-on-kubernetes-resume-job-if-jobmanager-crashes/52112538>) > post on StackOverflow and decided to give it a try. > > So far we have a setup that seems to be working on our local deployment, we > haven’t yet tried it in the actual cloud. > > As far as implementation goes, here’s what we did: > > We used MapDB (mapdb.org <http://mapdb.org/>) as our storage format, to > persist lists of objects onto disk. We partially relied on > StandaloneHaServices for our HaServices implementation. Otherwise we looked > at the ZooKeeperHaServices and related classes for inspiration and guidance. > > Here’s a list of new classes: > > FileSystemCheckpointIDCounter implements CheckpointIDCounter > FileSystemCheckpointRecoveryFactory implements CheckpointRecoveryFactory > FileSystemCompletedCheckpointStore implements CompletedCheckpointStore > FileSystemHaServices extends StandaloneHaServices > FileSystemHaServicesFactory implements HighAvailabilityServicesFactory > FileSystemSubmittedJobGraphStore implements SubmittedJobGraphStore > > Testing so far proved that bringing down a JobManager and bringing it back up > does indeed restore all the running jobs. Job creation/destruction also > works. > > Hope this helps! > > Thanks, > Aleksandar Mastilovic > >> On Aug 21, 2019, at 12:32 AM, Zili Chen <wander4...@gmail.com >> <mailto:wander4...@gmail.com>> wrote: >> >> Hi guys, >> >> We want to have an accurate idea of how users actually use >> high-availability services in Flink, especially how you customize >> high-availability services by HighAvailabilityServicesFactory. >> >> Basically there are standalone impl., zookeeper impl., embedded impl. >> used in MiniCluster, YARN impl. not yet implemented, and a gate to >> customized implementations. >> >> Generally I think standalone impl. and zookeeper impl. are the most >> widely used implementations. The embedded impl. is used without >> awareness when users run a MiniCluster. >> >> Besides that, it is helpful to know how you guys customize >> high-availability services using HighAvailabilityServicesFactory >> interface for the ongoing FLINK-10333[1] which would evolve >> high-availability services in Flink. As well as whether there is any >> user take interest in the not yet implemented YARN impl.. >> >> Any user case should be helpful. I really appreciate your time and your >> insight. >> >> Best, >> tison. >> >> [1] https://issues.apache.org/jira/browse/FLINK-10333 >> <https://issues.apache.org/jira/browse/FLINK-10333>