Thanks Yangze for starting this discussion. I have one comment: why do we
need to abstract two services as `LeaderServices` and
`PersistenceServices`?

From the content, the purpose of this FLIP is to make job failover more
lightweight, so it would be more appropriate to split the services by scope,
as `ClusterHighAvailabilityService` and `JobHighAvailabilityService`, instead
of by function, as `LeaderServices` and `PersistenceServices`. In this way,
we can create a `JobHighAvailabilityService` that has a leader service and a
store for the job that meet the requirements, based on the configuration of
the zk/k8s high availability service.
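
As a rough illustration of that split, here is a minimal Java sketch. Every
name in it (`LeaderService`, `JobStore`, `createJobHa`, the `jobHaEnabled`
flag, the "standalone" label) is hypothetical and is not taken from the FLIP
or from Flink's existing interfaces; the in-memory map only stands in for a
real zk/k8s-backed store.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

public class HaSplitSketch {

    /** Hypothetical: leader election/retrieval, identified here only by its backend. */
    interface LeaderService {
        String backend();
    }

    /** Hypothetical: storage for job metadata. */
    interface JobStore {
        void put(String jobId, String metadata);
        Optional<String> get(String jobId);
        boolean isPersistent();
    }

    /** Cluster-scoped HA: always uses the configured backend (e.g. zk/k8s). */
    interface ClusterHighAvailabilityService {
        LeaderService clusterLeaderService();
    }

    /** Job-scoped HA: bundles the leader service and the store that a job needs. */
    interface JobHighAvailabilityService {
        LeaderService jobLeaderService();
        JobStore jobStore();
    }

    /** Stand-in store; a real persistent store would write to zk/k8s or a DFS. */
    static final class MapJobStore implements JobStore {
        private final Map<String, String> entries = new ConcurrentHashMap<>();
        private final boolean persistent;

        MapJobStore(boolean persistent) {
            this.persistent = persistent;
        }

        @Override
        public void put(String jobId, String metadata) {
            entries.put(jobId, metadata);
        }

        @Override
        public Optional<String> get(String jobId) {
            return Optional.ofNullable(entries.get(jobId));
        }

        @Override
        public boolean isPersistent() {
            return persistent;
        }
    }

    static ClusterHighAvailabilityService createClusterHa(String backend) {
        LeaderService leader = () -> backend;
        return () -> leader;
    }

    /** jobHaEnabled is an assumed switch: false models the lightweight OLAP case. */
    static JobHighAvailabilityService createJobHa(String backend, boolean jobHaEnabled) {
        final LeaderService leader;
        if (jobHaEnabled) {
            leader = () -> backend;
        } else {
            leader = () -> "standalone";
        }
        final JobStore store = new MapJobStore(jobHaEnabled);
        return new JobHighAvailabilityService() {
            @Override
            public LeaderService jobLeaderService() {
                return leader;
            }

            @Override
            public JobStore jobStore() {
                return store;
            }
        };
    }

    public static void main(String[] args) {
        ClusterHighAvailabilityService cluster = createClusterHa("zookeeper");
        JobHighAvailabilityService olapJob = createJobHa("zookeeper", false);
        olapJob.jobStore().put("job-1", "job graph");

        System.out.println("cluster leader backend: " + cluster.clusterLeaderService().backend());
        System.out.println("job leader backend: " + olapJob.jobLeaderService().backend());
        System.out.println("job store persistent: " + olapJob.jobStore().isPersistent());
    }
}
```

With this shape, the cluster-level components keep HA leader election, while
the job-level service can be swapped for a non-persistent one without touching
the cluster-level service.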

WDYT?

Best,
Fang Yong

On Fri, Dec 29, 2023 at 8:10 PM xiangyu feng <xiangyu...@gmail.com> wrote:

> Thanks Yangze for restarting this discussion.
>
> +1 for the overall idea. By splitting the HighAvailabilityServices into
> LeaderServices and PersistenceServices, we may support configuring
> different storage behind them in the future.
>
> We did run into real problems in production where too much job metadata was
> being stored on ZK, causing system instability.
>
>
> Yangze Guo <karma...@gmail.com> wrote on Fri, Dec 29, 2023 at 10:21:
>
> > Thanks for the response, Zhanghao.
> >
> > PersistenceServices sounds good to me.
> >
> > Best,
> > Yangze Guo
> >
> > On Wed, Dec 27, 2023 at 11:30 AM Zhanghao Chen
> > <zhanghao.c...@outlook.com> wrote:
> > >
> > > > Thanks for driving this effort, Yangze! The proposal overall LGTM.
> > > > Apart from the throughput enhancement in the OLAP scenario, the
> > > > separation of leader election/discovery services and the metadata
> > > > persistence services will also make the HA impl clearer and easier to
> > > > maintain. Just a minor comment on naming: would it be better to rename
> > > > PersistentServices to PersistenceServices, as we usually put a noun
> > > > before Services?
> > >
> > > Best,
> > > Zhanghao Chen
> > > ________________________________
> > > From: Yangze Guo <karma...@gmail.com>
> > > Sent: Tuesday, December 19, 2023 17:33
> > > To: dev <dev@flink.apache.org>
> > > > Subject: [DISCUSS] FLIP-403: High Availability Services for OLAP Scenarios
> > >
> > > Hi, there,
> > >
> > > We would like to start a discussion thread on "FLIP-403: High
> > > Availability Services for OLAP Scenarios"[1].
> > >
> > > Currently, Flink's high availability service consists of two
> > > mechanisms: leader election/retrieval services for JobManager and
> > > persistent services for job metadata. However, these mechanisms are
> > > set up in an "all or nothing" manner. In OLAP scenarios, we typically
> > > only require leader election/retrieval services for JobManager
> > > components since jobs usually do not have a restart strategy.
> > > Additionally, the persistence of job states can negatively impact the
> > > cluster's throughput, especially for short query jobs.
> > >
> > > To address these issues, this FLIP proposes splitting the
> > > HighAvailabilityServices into LeaderServices and PersistentServices,
> > > > and enabling users to independently configure the high availability
> > > strategies specifically related to jobs.
> > >
> > > Please find more details in the FLIP wiki document [1]. Looking
> > > forward to your feedback.
> > >
> > > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-403+High+Availability+Services+for+OLAP+Scenarios
> > >
> > > Best,
> > > Yangze Guo
> >
>
