Thanks, Yangze, for restarting this discussion. +1 for the overall idea. By splitting HighAvailabilityServices into LeaderServices and PersistenceServices, we could support configuring a different storage backend for each of them in the future.
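
To make the idea concrete, here is a rough sketch of how the two halves could be backed by different storage once they are separated. All interfaces, classes, and method names below are simplified, hypothetical stand-ins for illustration only; they are not Flink's actual interfaces or the ones defined in the FLIP.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified stand-ins; NOT Flink's real interfaces.

/** Leader election/retrieval half of the HA services. */
interface LeaderServices {
    /** Returns the current leader address of a component, if one is known. */
    Optional<String> getLeaderAddress(String component);
}

/** Job metadata persistence half of the HA services. */
interface PersistenceServices {
    void putJobMetadata(String jobId, byte[] metadata);

    Optional<byte[]> getJobMetadata(String jobId);
}

/** Leader services backed by an in-memory map (stand-in for a coordination store). */
final class InMemoryLeaderServices implements LeaderServices {
    private final Map<String, String> leaders = new ConcurrentHashMap<>();

    void publishLeader(String component, String address) {
        leaders.put(component, address);
    }

    @Override
    public Optional<String> getLeaderAddress(String component) {
        return Optional.ofNullable(leaders.get(component));
    }
}

/** Persistence that stores nothing, e.g. for short OLAP queries without a restart strategy. */
final class NoOpPersistenceServices implements PersistenceServices {
    @Override
    public void putJobMetadata(String jobId, byte[] metadata) {
        // Intentionally empty: job metadata is not persisted.
    }

    @Override
    public Optional<byte[]> getJobMetadata(String jobId) {
        return Optional.empty();
    }
}

/** Persistence backed by an in-memory map (stand-in for a separate metadata store). */
final class InMemoryPersistenceServices implements PersistenceServices {
    private final Map<String, byte[]> store = new ConcurrentHashMap<>();

    @Override
    public void putJobMetadata(String jobId, byte[] metadata) {
        store.put(jobId, metadata.clone());
    }

    @Override
    public Optional<byte[]> getJobMetadata(String jobId) {
        return Optional.ofNullable(store.get(jobId)).map(byte[]::clone);
    }
}

/** Composes the two halves, so each can be configured independently. */
final class ComposedHaServices {
    final LeaderServices leaderServices;
    final PersistenceServices persistenceServices;

    ComposedHaServices(LeaderServices leaderServices, PersistenceServices persistenceServices) {
        this.leaderServices = leaderServices;
        this.persistenceServices = persistenceServices;
    }
}
```

With a shape like this, a cluster could keep leader election on one backend while choosing a no-op or a different store for job metadata, simply by passing a different PersistenceServices implementation to the composed object.
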
We did run into real problems in production where too much job metadata was stored in ZooKeeper, causing system instability.

Yangze Guo <karma...@gmail.com> wrote on Fri, Dec 29, 2023 at 10:21:

> Thanks for the response, Zhanghao.
>
> PersistenceServices sounds good to me.
>
> Best,
> Yangze Guo
>
> On Wed, Dec 27, 2023 at 11:30 AM Zhanghao Chen
> <zhanghao.c...@outlook.com> wrote:
> >
> > Thanks for driving this effort, Yangze! The proposal overall LGTM. Apart
> > from the throughput enhancement in the OLAP scenario, the separation of
> > leader election/discovery services and the metadata persistence services
> > will also make the HA impl clearer and easier to maintain. Just a minor
> > comment on naming: would it be better to rename PersistentServices to
> > PersistenceServices, as usually we put a noun before Services?
> >
> > Best,
> > Zhanghao Chen
> >
> > ________________________________
> > From: Yangze Guo <karma...@gmail.com>
> > Sent: Tuesday, December 19, 2023 17:33
> > To: dev <dev@flink.apache.org>
> > Subject: [DISCUSS] FLIP-403: High Availability Services for OLAP Scenarios
> >
> > Hi, there,
> >
> > We would like to start a discussion thread on "FLIP-403: High
> > Availability Services for OLAP Scenarios"[1].
> >
> > Currently, Flink's high availability service consists of two
> > mechanisms: leader election/retrieval services for JobManager and
> > persistent services for job metadata. However, these mechanisms are
> > set up in an "all or nothing" manner. In OLAP scenarios, we typically
> > only require leader election/retrieval services for JobManager
> > components since jobs usually do not have a restart strategy.
> > Additionally, the persistence of job states can negatively impact the
> > cluster's throughput, especially for short query jobs.
> >
> > To address these issues, this FLIP proposes splitting the
> > HighAvailabilityServices into LeaderServices and PersistentServices,
> > and enabling users to independently configure the high availability
> > strategies specifically related to jobs.
> >
> > Please find more details in the FLIP wiki document [1]. Looking
> > forward to your feedback.
> >
> > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-403+High+Availability+Services+for+OLAP+Scenarios
> >
> > Best,
> > Yangze Guo