Re: [DISCUSS] FLIP-505: Flink History Server Scability Improvements, Remote Data Store Fetch and Per Job Fetch

Allison Mon, 03 Mar 2025 13:30:56 -0800

Hi Venkat,

To reply to your questions:
1. Correct. The remote storage and local cache limits are decoupled only if
remote fetch is enabled in the configuration. Otherwise, the system behaves
as it did before.
2. I've clarified the description in the FLIP.
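
As a rough illustration of the decoupling in point 1, here is a minimal
sketch in Python. It is not the FLIP's implementation. It assumes the reading
in Venkat's question, namely that historyserver.archive.remote-fetch-cached-jobs
bounds a local cache of the most recently accessed jobs while the remote store
itself is unbounded. The class `LocalArchiveCache`, its methods, and the dict
standing in for remote storage are all hypothetical names used only for
illustration.

```python
from collections import OrderedDict


class LocalArchiveCache:
    """Hypothetical LRU cache of job archives kept on the local node."""

    def __init__(self, max_cached_jobs, remote_store):
        # Assumption: max_cached_jobs plays the role of
        # historyserver.archive.remote-fetch-cached-jobs.
        self.max_cached_jobs = max_cached_jobs
        # Assumption: a plain dict stands in for the remote file store.
        self.remote_store = remote_store
        self._cache = OrderedDict()
        self.remote_fetches = 0

    def get(self, job_id):
        if job_id in self._cache:
            # Cache hit: mark the job as most recently accessed.
            self._cache.move_to_end(job_id)
            return self._cache[job_id]
        # Cache miss: fetch from remote storage (KeyError if the job is unknown).
        archive = self.remote_store[job_id]
        self.remote_fetches += 1
        self._cache[job_id] = archive
        if len(self._cache) > self.max_cached_jobs:
            # Evict the least recently accessed job from the local cache only;
            # the remote copy is left untouched.
            self._cache.popitem(last=False)
        return archive

    def cached_job_ids(self):
        return list(self._cache)


# Five archives exist remotely, but at most two are ever cached locally.
remote = {f"job-{i}": {"jid": f"job-{i}"} for i in range(5)}
cache = LocalArchiveCache(max_cached_jobs=2, remote_store=remote)
for job_id in ["job-0", "job-1", "job-0", "job-4"]:
    cache.get(job_id)
```

In this sketch the number of remote archives never affects the local limit:
eviction only removes the local copy, and a later request for an evicted job
simply triggers another remote fetch.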


Regarding decoupling the two features, would your suggestion be to split
them into two separate FLIPs?

Thank you for your feedback.
Best,
- Allison

On Thu, Jan 30, 2025 at 7:10 PM Venkatakrishnan Sowrirajan <vsowr...@asu.edu>
wrote:

> Thanks for the FLIP, Allison. Fetching job archives from remote storage
> will be a great feature addition, as will decoupling the local cache
> limits from the remote storage archive limits.
>
> A few questions:
>
> 1. In terms of backwards compatibility, are you saying that the remote
> storage and local cache limits are decoupled only if remote fetch is
> enabled, and not otherwise?
> 2. The description of what the historyserver.archive.remote-fetch-cached-jobs
> config is meant for is not very clear. Can you please clarify that in the
> FLIP? Basically, what I want to confirm is that there is no limit on how
> many remote archives can be fetched, but the above config is the local
> cache limit on the most recently accessed jobs, which can include both
> archives already cached locally and newly fetched remote archives. Is that
> correct?
>
> It looks like two new features are described. We should decouple them.
>
> 1. Support for fetching job archives from remote storage. This is an
> entirely new feature, with no backwards compatibility concerns.
> 2. Introduce local archive cache limits that are decoupled from the remote
> archive cache limits. This is required to tackle the Flink HistoryServer
> scaling issue caused by local inode exhaustion. It looks to be a new
> feature and improves the overall experience. But if the existing config
> historyserver.archive.retained-jobs is renamed to
> historyserver.archive.cached-retained-jobs, then it won't be backwards
> compatible with the older config. This should be stated clearly in the
> FLIP.
>
> Thanks
> Venkat
>
> On Thu, Jan 30, 2025, 3:21 AM Yanquan Lv <decq12y...@gmail.com> wrote:
>
> > Thank you for your explanation. It basically resolves my previous
> > questions.
> >
> > Regarding the second point, I would suggest stating the default values
> > for the newly added parameters in the `Public Interfaces` section.
> >
> > ---------- Forwarded message ---------
> > From: Allison <achang5...@gmail.com>
> > Date: Thu, Jan 30, 2025 at 3:42 AM
> > Subject: Re: [DISCUSS] FLIP-505: Flink History Server Scability
> > Improvements, Remote Data Store Fetch and Per Job Fetch
> > To: <dev@flink.apache.org>
> >
> >
> > Hi Yanquan,
> >
> > Thanks for taking a look at this. Re: your questions:
> >
> > 1. Yes. I've updated the FLIP to be clearer: it involves changing the
> > existing configuration historyserver.archive.retained-jobs to
> > historyserver.archive.cached-retained-jobs. The number of jobs stored
> > remotely can be unbounded. The thought behind this is that the remote
> > data storage can be cleaned up or limited by a separate protocol that can
> > be customized to each individual use case.
> > 2. Could you clarify this a bit? I'm not sure I understand this part. Do
> > you mean adding to the FLIP what the configurations would be set to when
> > they are not defined?
> > 3. historyserver.archive.fs.refresh-interval is the interval between
> > calls to the remote data storage to look for fresh data. In other words,
> > it configures how often the FHS polls the remote data store for new
> > files. The remote data store is written to whenever a job finishes.
> >
> > Hope this clarifies some things.
> >
> > Best,
> > - Allison
> >
> >
> > On Mon, Jan 27, 2025 at 7:10 PM Yanquan Lv <decq12y...@gmail.com> wrote:
> >
> > > Hi, Allison. Thanks for driving this FLIP.
> > > I have some questions to confirm:
> > >
> > > 1. I can't find any existing configuration named
> > > `historyserver.archive.cached-retained-jobs`. I guess what you mean is
> > > changing the existing configuration from
> > > `historyserver.archive.retained-jobs` to
> > > `historyserver.archive.cached-retained-jobs`. If so, and if we only
> > > limit the number of retained jobs stored locally, is the number of
> > > retained jobs stored remotely infinite?
> > > 2. I think it would be better to provide instructions for adding
> > > default values to HistoryServerOptions.
> > > 3. Does `historyserver.archive.fs.refresh-interval` apply to both local
> > > and remote storage simultaneously?
> > >
> > > Best,
> > > Yanquan
> > >
> > > On Fri, Jan 17, 2025 at 8:07 AM Allison <achang5...@gmail.com> wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > I would like to initiate a discussion for the FLIP below, which
> > > > enhances the Flink History Server to allow greater scalability of the
> > > > service.
> > > >
> > > > Motivation:
> > > >
> > > > Currently, the Flink History Server (FHS) is limited in the number of
> > > > job archives it can serve by the storage capacity of the node that
> > > > the FHS runs on. Job archives are stored locally in a cache, which
> > > > creates a local directory that is expanded from the contents of a
> > > > single JSON archive file. This not only uses up local storage space;
> > > > because the FHS expands the job archives into a nested directory
> > > > structure, inode space also often runs out for jobs with a large
> > > > number of taskmanagers or subtasks. To make the FHS more performant,
> > > > we would like to decouple the job archive storage for the FHS from
> > > > the local cache, so that it can store and fetch job archives from a
> > > > remote file store.
> > > >
> > > > FLIP proposal document:
> > > >
> > > > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP*505*3A*Flink*History*Server*Scability*Improvements*2C*Remote*Data*Store*Fetch*and*Per*Job*Fetch__;KyUrKysrKyUrKysrKysrKw!!IKRxdwAv5BmarQ!dDxa6ZtnqOQOh-MZudJbIFtJxYBO-Dc73ujAPM89F1wxWkL8MVzjAX4Q7tFDgjTZ03SsQU-bqBrKFPaK3znlq2A$
> > > >
> > > > Thanks!
> > > >
> > > > Best,
> > > > - Allison Chang
> > > >
> > >
> >
>
