Thanks for driving this effort, Yangze! The proposal overall LGTM. Apart from the throughput improvement in the OLAP scenario, separating the leader election/discovery services from the metadata persistence services will also make the HA implementation clearer and easier to maintain.

Just a minor comment on naming: would it be better to rename PersistentServices to PersistenceServices, since we usually put a noun before Services?

Best,
Zhanghao Chen
________________________________
From: Yangze Guo <karma...@gmail.com>
Sent: Tuesday, December 19, 2023 17:33
To: dev <dev@flink.apache.org>
Subject: [DISCUSS] FLIP-403: High Availability Services for OLAP Scenarios

Hi, there,

We would like to start a discussion thread on "FLIP-403: High Availability Services for OLAP Scenarios"[1].

Currently, Flink's high availability service consists of two mechanisms: leader election/retrieval services for JobManager and persistent services for job metadata. However, these mechanisms are set up in an "all or nothing" manner. In OLAP scenarios, we typically only require leader election/retrieval services for JobManager components since jobs usually do not have a restart strategy. Additionally, the persistence of job states can negatively impact the cluster's throughput, especially for short query jobs.

To address these issues, this FLIP proposes splitting the HighAvailabilityServices into LeaderServices and PersistentServices, and enable users to independently configure the high availability strategies specifically related to jobs.

Please find more details in the FLIP wiki document [1]. Looking forward to your feedback.

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-403+High+Availability+Services+for+OLAP+Scenarios

Best,
Yangze Guo