Re: [DISCUSS] FLIP-505: Flink History Server Scability Improvements, Remote Data Store Fetch and Per Job Fetch

Venkatakrishnan Sowrirajan Thu, 30 Jan 2025 19:10:52 -0800

Thanks for the FLIP, Allison. Fetching job archives from remote storage
will be a great feature addition, as will decoupling the local cache
limits from the remote storage archive limits.


A few questions I have:

1. In terms of backwards compatibility, are you saying that the remote
storage and local cache limits are decoupled only when remote fetch is
enabled, and not otherwise?
2. The description of what the historyserver.archive.remote-fetch-cached-jobs
config is meant for is not very clear. Can you please clarify that in the
FLIP? Basically, what I want to confirm is that there is no limit on how
many remote archives can be fetched, but the above config is the local
cache limit on the most recently accessed jobs, which can include both
archives already cached locally and newly fetched remote archives, correct?
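
To make the reading in question 2 concrete, here is a minimal sketch of a
bounded, access-ordered local cache. It is only an illustration of the
behavior described above, not the FLIP's implementation: the class name
LocalArchiveCache, the method names, and the stubbed fetchFromRemote helper
are all hypothetical, and the limit stands in for whatever value
historyserver.archive.remote-fetch-cached-jobs would carry.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: local cache bounded by a job count, evicting the least recently accessed job. */
final class LocalArchiveCache {
    private final LinkedHashMap<String, String> cache;

    LocalArchiveCache(final int maxCachedJobs) {
        // accessOrder = true keeps entries ordered from least to most recently accessed.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                // Evict only when the local limit is exceeded; remote storage is untouched.
                return size() > maxCachedJobs;
            }
        };
    }

    /** Returns the archive for a job, fetching it from remote storage on a local miss. */
    String get(String jobId) {
        String archive = cache.get(jobId); // a hit also marks the job as recently accessed
        if (archive == null) {
            archive = fetchFromRemote(jobId);
            cache.put(jobId, archive);
        }
        return archive;
    }

    /** Checks local presence without changing the access order. */
    boolean isCachedLocally(String jobId) {
        return cache.containsKey(jobId);
    }

    // Assumption: stand-in for a real remote read; there is no limit on how often it is called.
    private static String fetchFromRemote(String jobId) {
        return "archive-of-" + jobId;
    }
}
```

With a limit of 2, accessing jobs a, b, a, c in that order leaves a and c
cached locally and evicts b, while b stays fetchable from remote storage.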

Looks like there are 2 new features or functionalities that are described.
We should decouple them.

1. Support for fetching job archives from remote storage. This is an
entirely new feature, so there are no concerns with respect to backwards
compatibility.
2. Introduce local archive cache limits that are decoupled from the remote
storage archive limits. This is required to tackle the Flink HistoryServer
scaling issue caused by local inode exhaustion. This looks to be a new
feature and improves the overall experience. But if the existing config
historyserver.archive.retained-jobs is renamed to
historyserver.archive.cached-retained-jobs, then it won't be backwards
compatible with the older config. This should be stated clearly in the
FLIP.
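
One way the rename could stay backwards compatible is to fall back to the
old key when the new one is absent. Flink's ConfigOption offers
withDeprecatedKeys for this purpose; whether the FLIP intends to use it is
not stated in this thread. The sketch below shows the same idea in plain
Java so that it runs without Flink on the classpath. The class
RetainedJobsResolver and its resolve method are hypothetical names.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: resolve the renamed option, falling back to the old key. */
final class RetainedJobsResolver {
    static final String NEW_KEY = "historyserver.archive.cached-retained-jobs";
    static final String OLD_KEY = "historyserver.archive.retained-jobs";

    /** Prefers the new key; uses the old key only if the new one is not set. */
    static Optional<Integer> resolve(Map<String, String> conf) {
        String raw = conf.get(NEW_KEY);
        if (raw == null) {
            raw = conf.get(OLD_KEY); // older deployments keep working unchanged
        }
        return Optional.ofNullable(raw).map(String::trim).map(Integer::valueOf);
    }
}
```

An existing setup that only sets historyserver.archive.retained-jobs would
then keep its limit, and a config that sets both would use the new key.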

Thanks
Venkat

On Thu, Jan 30, 2025, 3:21 AM Yanquan Lv <[email protected]> wrote:

> Thank you for your explanation. My previous questions have basically
> been resolved.
>
> Regarding the second point, I would like to suggest clarifying the default
> values for the newly added parameters in the `Public Interfaces` section.
>
> ---------- Forwarded message ---------
> From: Allison <[email protected]>
> Date: Thu, Jan 30, 2025 at 3:42 AM
> Subject: Re: [DISCUSS] FLIP-505: Flink History Server Scability
> Improvements, Remote Data Store Fetch and Per Job Fetch
> To: <[email protected]>
>
>
> Hi Yanquan,
>
> Thanks for taking a look at this. Re: your questions:
>
> 1. Yes, I've updated the FLIP to be more clear, but it involves renaming
> the existing configuration historyserver.archive.retained-jobs to
> historyserver.archive.cached-retained-jobs. The number of remote jobs
> stored can be infinite. The thought behind this is that the remote data
> storage can be cleaned up or limited by a separate protocol that can be
> customized to each individual use case.
> 2. Could you clarify this a bit? I'm not sure I understand this part. Do
> you mean adding to the FLIP what the configurations would be set to when
> they are not defined?
> 3. historyserver.archive.fs.refresh-interval is the time between calls to
> the remote data storage to find fresh data, that is, how often the FHS
> polls the remote data store for new files. The remote data store is
> written to whenever a job finishes.
>
> Hope this clarifies some things.
>
> Best,
> - Allison
>
>
> On Mon, Jan 27, 2025 at 7:10 PM Yanquan Lv <[email protected]> wrote:
>
> > Hi, Allison. Thanks for driving this FLIP.
> > I have some questions to confirm:
> >
> > 1. I can't find any existing configuration named
> > `historyserver.archive.cached-retained-jobs`. I guess what you mean is
> > modifying the existing configuration from
> > `historyserver.archive.retained-jobs` to
> > `historyserver.archive.cached-retained-jobs`. If so, and we only limit
> > the number of retained jobs stored locally, is the number of retained
> > jobs stored remotely infinite?
> > 2. I think it would be better to provide instructions for adding default
> > values to HistoryServerOptions.
> > 3. Does `historyserver.archive.fs.refresh-interval` apply to both local
> > and remote storage simultaneously?
> >
> > Best,
> > Yanquan
> >
> > On Fri, Jan 17, 2025 at 8:07 AM Allison <[email protected]> wrote:
> >
> > > Hi everyone,
> > >
> > > I would like to initiate a discussion for the FLIP below, which
> > > enhances the Flink History Server to allow greater scalability of the
> > > service.
> > >
> > > Motivation:
> > >
> > > Currently, the Flink History Server (FHS) is limited in the number of
> > > job archives it can serve by the storage capacity of the node that the
> > > FHS runs on. Job archives are stored locally in a cache, which creates
> > > a local directory that is expanded out based on the contents of a
> > > single JSON archive file. This not only uses up local storage space,
> > > but also, because of how the FHS expands the job archives into a nested
> > > directory structure, often exhausts inode space for jobs with a large
> > > number of taskmanagers or subtasks. In order to make the FHS more
> > > performant, we would like to decouple the job archive storage for the
> > > FHS from the local cache, so that job archives can be stored in and
> > > fetched from a remote file store.
> > >
> > > FLIP proposal document:
> > >
> > >
> > > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP*505*3A*Flink*History*Server*Scability*Improvements*2C*Remote*Data*Store*Fetch*and*Per*Job*Fetch__;KyUrKysrKyUrKysrKysrKw!!IKRxdwAv5BmarQ!dDxa6ZtnqOQOh-MZudJbIFtJxYBO-Dc73ujAPM89F1wxWkL8MVzjAX4Q7tFDgjTZ03SsQU-bqBrKFPaK3znlq2A$
> > >
> > > Thanks!
> > >
> > > Best,
> > > - Allison Chang
> > >
> >
>
