Thanks Yangze for starting this discussion. I have one comment: why do we need to abstract two services as `LeaderServices` and `PersistenceServices`?
From the content, the purpose of this FLIP is to make job failover more lightweight, so it would be more appropriate to abstract the two services as `ClusterHighAvailabilityService` and `JobHighAvailabilityService` instead of `LeaderServices` and `PersistenceServices`, which are split by leader and store. That way, we can create a `JobHighAvailabilityService` with a leader service and a store for the job that meet the requirements, based on the configuration in the zk/k8s high availability service.

WDYT?

Best,
Fang Yong

On Fri, Dec 29, 2023 at 8:10 PM xiangyu feng <xiangyu...@gmail.com> wrote:

> Thanks Yangze for restarting this discussion.
>
> +1 for the overall idea. By splitting the HighAvailabilityServices into
> LeaderServices and PersistenceServices, we may support configuring
> different storage behind them in the future.
>
> We did run into real problems in production where too much job metadata was
> being stored on ZK, causing system instability.
>
>
> Yangze Guo <karma...@gmail.com> wrote on Fri, Dec 29, 2023 at 10:21:
>
> > Thanks for the response, Zhanghao.
> >
> > PersistenceServices sounds good to me.
> >
> > Best,
> > Yangze Guo
> >
> > On Wed, Dec 27, 2023 at 11:30 AM Zhanghao Chen
> > <zhanghao.c...@outlook.com> wrote:
> > >
> > > Thanks for driving this effort, Yangze! The proposal overall LGTM. Apart
> > > from the throughput enhancement in the OLAP scenario, the separation of
> > > leader election/discovery services and the metadata persistence services
> > > will also make the HA impl clearer and easier to maintain. Just a minor
> > > comment on naming: would it be better to rename PersistentServices to
> > > PersistenceServices, as usually we put a noun before Services?
> > >
> > > Best,
> > > Zhanghao Chen
> > > ________________________________
> > > From: Yangze Guo <karma...@gmail.com>
> > > Sent: Tuesday, December 19, 2023 17:33
> > > To: dev <dev@flink.apache.org>
> > > Subject: [DISCUSS] FLIP-403: High Availability Services for OLAP Scenarios
> > >
> > > Hi, there,
> > >
> > > We would like to start a discussion thread on "FLIP-403: High
> > > Availability Services for OLAP Scenarios"[1].
> > >
> > > Currently, Flink's high availability service consists of two
> > > mechanisms: leader election/retrieval services for JobManager and
> > > persistent services for job metadata. However, these mechanisms are
> > > set up in an "all or nothing" manner. In OLAP scenarios, we typically
> > > only require leader election/retrieval services for JobManager
> > > components since jobs usually do not have a restart strategy.
> > > Additionally, the persistence of job states can negatively impact the
> > > cluster's throughput, especially for short query jobs.
> > >
> > > To address these issues, this FLIP proposes splitting the
> > > HighAvailabilityServices into LeaderServices and PersistentServices,
> > > and enabling users to independently configure the high availability
> > > strategies specifically related to jobs.
> > >
> > > Please find more details in the FLIP wiki document [1]. Looking
> > > forward to your feedback.
> > >
> > > [1]
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-403+High+Availability+Services+for+OLAP+Scenarios
> > >
> > > Best,
> > > Yangze Guo
> > >
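
P.S. To make the naming suggestion at the top of this mail a bit more concrete, here is a rough sketch of a cluster/job split. Everything in it is an assumption for illustration only: the method names, the `createJobService` factory, its boolean flag, and the in-memory map standing in for a zk/k8s-backed store are all made up and are not taken from the FLIP or from Flink. For brevity the job-scoped service only covers the store, not per-job leader election.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: none of these types come from FLIP-403 or the Flink codebase.
public class HaServiceSplitSketch {

    /** Cluster-scoped HA: leader election/retrieval for JobManager components. */
    interface ClusterHighAvailabilityService {
        String currentLeaderAddress();
    }

    /** Job-scoped HA: whatever must survive a failover for a single job. */
    interface JobHighAvailabilityService {
        void putJobMetadata(String jobId, String metadata);

        Optional<String> getJobMetadata(String jobId);
    }

    /** Lightweight variant for OLAP: job metadata is discarded, nothing is persisted. */
    static final class NoOpJobService implements JobHighAvailabilityService {
        @Override
        public void putJobMetadata(String jobId, String metadata) {
            // Intentionally empty: short query jobs are not recovered after failover.
        }

        @Override
        public Optional<String> getJobMetadata(String jobId) {
            return Optional.empty();
        }
    }

    /** Storing variant; the in-memory map stands in for a zk/k8s-backed store. */
    static final class StoringJobService implements JobHighAvailabilityService {
        private final Map<String, String> store = new ConcurrentHashMap<>();

        @Override
        public void putJobMetadata(String jobId, String metadata) {
            store.put(jobId, metadata);
        }

        @Override
        public Optional<String> getJobMetadata(String jobId) {
            return Optional.ofNullable(store.get(jobId));
        }
    }

    /** The boolean flag is an assumed stand-in for a real configuration option. */
    static JobHighAvailabilityService createJobService(boolean persistJobMetadata) {
        return persistJobMetadata ? new StoringJobService() : new NoOpJobService();
    }

    public static void main(String[] args) {
        // Cluster-level leader retrieval stays available in both modes (made-up address).
        ClusterHighAvailabilityService cluster = () -> "jobmanager-0:6123";
        JobHighAvailabilityService olapJobs = createJobService(false);

        olapJobs.putJobMetadata("job-1", "job graph bytes");
        System.out.println("leader: " + cluster.currentLeaderAddress());
        System.out.println("job-1 recoverable: " + olapJobs.getJobMetadata("job-1").isPresent());
    }
}
```

The point of the sketch is only that the job-level behaviour can be switched independently while the cluster-level leader service is untouched; whether the split should be by scope (cluster/job) or by function (leader/persistence) is exactly the open question.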