Re: [DISCUSS] FLIP-505: Flink History Server Scability Improvements, Remote Data Store Fetch and Per Job Fetch

Allison Thu, 22 May 2025 13:27:30 -0700

Hi Becket,

Thank you for your feedback. I have updated the FLIP-505 proposal to
reflect these comments.


Would appreciate any additional feedback.

Best,
- Allison Chang

On Tue, May 13, 2025 at 10:27 AM Becket Qin <[email protected]> wrote:

> Thanks for the FLIP, Allison. The proposal makes a lot of sense in general.
> The history server is critical to the Flink batch.
>
> A few suggestions:
> 1. It might make sense to keep the existing config
> *historyserver.archive.retained-jobs*. This will only be used to determine
> the total number of jobs to keep in the remote storage.
> 2. The new configuration *historyserver.archive.cached-retained-jobs* only
> determines the number of jobs cached locally. The default value is -1, and
> the valid range is *[1, historyserver.archive.retained-jobs]*. When not
> set, it basically caches everything, which is the current behavior.
> When set, that basically means the history server is in the "partial
> caching mode" rather than the "full mirror mode".
> 3. Maybe we don't need the *historyserver.archive.remote-fetch-enabled*
> config.
> This config is a little confusing because the job history is fetched
> remotely even now. The difference is whether we fetch everything as a whole
> or fetch individual jobs on demand. But this is an internal
> implementation detail and is not necessary to expose to the end users.
> 4. The config *historyserver.archive.num-cached-most-recently-viewed-jobs*
> always takes effect, regardless of whether the history server is running in
> a "full mirror mode" or "partial caching mode".
>
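As a rough illustration of suggestions 1 and 2 above, here is a minimal Python sketch of how the two settings might be resolved into a caching mode. The names `resolve_cache_policy` and `CachePolicy` are made up for this example and are not part of Flink or the FLIP. The sketch also assumes that -1 means "not set" for both options.

```python
from dataclasses import dataclass

UNSET = -1  # assumption: -1 is the "not set" sentinel for both options


@dataclass(frozen=True)
class CachePolicy:
    mode: str         # "full-mirror" or "partial-caching"
    local_limit: int  # max jobs kept locally; UNSET means no separate local limit


def resolve_cache_policy(retained_jobs: int = UNSET,
                         cached_retained_jobs: int = UNSET) -> CachePolicy:
    """Map the two settings onto the two modes described in the suggestions."""
    if cached_retained_jobs == UNSET:
        # Not set: mirror everything that is retained, which is today's behavior.
        return CachePolicy("full-mirror", retained_jobs)
    if cached_retained_jobs < 1:
        raise ValueError("cached-retained-jobs must be >= 1 when set")
    if retained_jobs != UNSET and cached_retained_jobs > retained_jobs:
        raise ValueError("cached-retained-jobs must not exceed retained-jobs")
    # Set and valid: only this many jobs are cached locally, the rest on demand.
    return CachePolicy("partial-caching", cached_retained_jobs)
```

For example, `resolve_cache_policy(100, 10)` yields the partial caching mode with a local limit of 10, while calling it with no arguments keeps the full mirror mode.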
> So with the above settings:
> 1. By default, users get the same behavior as today.
> 2. When users set *historyserver.archive.cached-retained-jobs*, the history
> server enters the partial caching mode and fetches the jobs on demand.
> 3. Some most recently viewed jobs are automatically pinned in the cache so
> they will not be evicted accidentally and cause cache thrashing.
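To make point 3 concrete, below is a small Python sketch of a bounded local cache in which the most recently viewed jobs are exempt from eviction. This is only one possible reading of the idea. The class `PinnedJobCache`, its methods, and the oldest-inserted-first eviction order are assumptions for illustration, not the FLIP's design.

```python
from collections import OrderedDict, deque


class PinnedJobCache:
    """Bounded cache that never evicts the most recently viewed jobs."""

    def __init__(self, capacity, num_pinned):
        self.capacity = capacity
        self._jobs = OrderedDict()               # job_id -> archive, oldest insert first
        self._recent = deque(maxlen=num_pinned)  # most recently viewed job ids

    def put(self, job_id, archive):
        """Cache a newly discovered job archive."""
        self._jobs[job_id] = archive
        self._evict()

    def view(self, job_id, fetch):
        """Return the archive, fetching it on demand, and pin it as recently viewed."""
        if job_id in self._recent:
            self._recent.remove(job_id)
        self._recent.append(job_id)
        if job_id not in self._jobs:
            self._jobs[job_id] = fetch(job_id)
        archive = self._jobs[job_id]
        self._evict()
        return archive

    def _evict(self):
        # Drop the oldest unpinned entries until the cache fits its capacity.
        # If every entry is pinned, the cache is allowed to stay over capacity.
        pinned = set(self._recent)
        for job_id in list(self._jobs):
            if len(self._jobs) <= self.capacity:
                break
            if job_id not in pinned:
                del self._jobs[job_id]

    def cached_ids(self):
        return list(self._jobs)
```

With a capacity of 2 and one pinned slot, caching jobs "a" and "b", viewing "a", and then caching "c" evicts "b" rather than "a", because "a" is pinned by the recent view.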
>
> BTW, it would be good to add a future work part to give a heads-up about
> the plan to use RocksDB for job history rather than raw files.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
> On Mon, May 12, 2025 at 11:02 AM Venkatakrishnan Sowrirajan <
> [email protected]> wrote:
>
> > > Regarding decoupling the two features, would your suggestion be to
> > > separate them into two separate FLIPs?
> >
> > Sorry for the late response.
> >
> > Yes, that is correct. If these 2 features are somewhat coupled with each
> > other, then it makes sense to address them in the same FLIP; otherwise I
> > think it will be better to tackle them as 2 different FLIPs.
> >
> > Regards
> > Venkata krishnan
> >
> >
> > On Mon, Mar 3, 2025 at 1:42 PM Allison <[email protected]> wrote:
> >
> > > Hi Yanquan,
> > >
> > > I've updated the FLIP to contain the default values, thanks for your
> > > help!
> > >
> > > Sincerely
> > > - Allison
> > >
> > > On Thu, Jan 30, 2025 at 3:21 AM Yanquan Lv <[email protected]> wrote:
> > >
> > > > Thank you for your explanation. It has basically resolved my previous
> > > > questions.
> > > >
> > > > Regarding the second point, I would like to suggest clarifying the
> > > > default values for the newly added parameters in the `Public Interfaces`
> > > > section.
> > > >
> > > > ---------- Forwarded message ---------
> > > > From: Allison <[email protected]>
> > > > Date: Thu, Jan 30, 2025 at 3:42 AM
> > > > Subject: Re: [DISCUSS] FLIP-505: Flink History Server Scability
> > > > Improvements, Remote Data Store Fetch and Per Job Fetch
> > > > To: <[email protected]>
> > > >
> > > >
> > > > Hi Yanquan,
> > > >
> > > > Thanks for taking a look at this. Re: your questions:
> > > >
> > > > 1. Yes, I've updated the FLIP to be clearer: it involves changing the
> > > > existing configuration historyserver.archive.retained-jobs to
> > > > historyserver.archive.cached-retained-jobs. The number of jobs stored
> > > > remotely can be unbounded; the thought behind this is that the remote
> > > > data storage can be cleaned up or limited by a separate mechanism that
> > > > can be customized to each individual use case.
> > > > 2. Could you clarify this a bit? I'm not sure I understand this part. Do
> > > > you mean adding to the FLIP what the configurations would be set to when
> > > > they are not defined?
> > > > 3. historyserver.archive.fs.refresh-interval is the interval between
> > > > calls to the remote data storage to find fresh data. What it configures
> > > > is how often the FHS polls the remote data store for new files. The
> > > > remote data store is written to whenever a job finishes.
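The polling behavior described in point 3 could be sketched roughly as follows in Python. The functions `poll_once` and `poll_remote_store`, the `list_remote_archives` callback, and the `max_rounds` parameter are all hypothetical names introduced for this example; the actual FHS implementation may differ.

```python
import time


def poll_once(list_remote_archives, known):
    """List the remote store once and return archive ids not seen before."""
    current = set(list_remote_archives())
    new = sorted(current - known)
    known |= current  # remember everything seen so far
    return new


def poll_remote_store(list_remote_archives, on_new, refresh_interval_s,
                      sleep=time.sleep, max_rounds=None):
    """Poll the remote store every refresh_interval_s seconds for new archives."""
    known = set()
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        for job_id in poll_once(list_remote_archives, known):
            on_new(job_id)  # e.g. register the job so it can be served or fetched
        rounds += 1
        sleep(refresh_interval_s)
```

The `sleep` and `max_rounds` parameters exist only so the loop can be exercised without waiting; a real service would simply run until shutdown.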
> > > >
> > > > Hope this clarifies some things.
> > > >
> > > > Best,
> > > > - Allison
> > > >
> > > >
> > > > On Mon, Jan 27, 2025 at 7:10 PM Yanquan Lv <[email protected]>
> > wrote:
> > > >
> > > > > Hi, Allison. Thanks for driving this FLIP.
> > > > > I have some questions to confirm:
> > > > >
> > > > > 1. I can't find any existing configuration named
> > > > > `historyserver.archive.cached-retained-jobs`. I guess that what you
> > > > > mean is modifying the existing configuration from
> > > > > `historyserver.archive.retained-jobs` to
> > > > > `historyserver.archive.cached-retained-jobs`. If so, and we only limit
> > > > > the number of retained jobs stored locally, is the number of
> > > > > retained jobs stored remotely infinite?
> > > > > 2. I think it would be better to provide instructions for adding
> > > > > default values to HistoryServerOptions.
> > > > > 3. Does `historyserver.archive.fs.refresh-interval` apply to both
> > > > > local and remote storage simultaneously?
> > > > >
> > > > > Best,
> > > > > Yanquan
> > > > >
> > > > > On Fri, Jan 17, 2025 at 8:07 AM Allison <[email protected]> wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > I would like to initiate a discussion for the FLIP below, which
> > > > > > enhances the Flink History Server to allow greater scalability of the
> > > > > > service.
> > > > > >
> > > > > > Motivation:
> > > > > >
> > > > > > Currently, the Flink History Server (FHS) is limited in the number of
> > > > > > job archives it can serve by the storage capacity of the node that
> > > > > > the FHS runs on. Job archives are stored locally in a cache, which
> > > > > > creates a local directory that is expanded out based on the contents
> > > > > > of a single JSON archive file. This not only uses up local storage
> > > > > > space; because the FHS expands each job archive into a nested
> > > > > > directory structure, jobs with a large number of taskmanagers or
> > > > > > subtasks often exhaust inode space as well. In order to make the FHS
> > > > > > more performant, we would like to decouple job archive storage for
> > > > > > the FHS from the local cache, so that job archives can be stored in
> > > > > > and fetched from a remote file store.
> > > > > >
> > > > > > FLIP proposal document:
> > > > > >
> > > > > > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP*505*3A*Flink*History*Server*Scability*Improvements*2C*Remote*Data*Store*Fetch*and*Per*Job*Fetch__;KyUrKysrKyUrKysrKysrKw!!IKRxdwAv5BmarQ!cy7YUT3RVhkz3ixGuldCgf5lTCb3IMzUuAUClyB3qRuI0vAjYfvNVmw2NOggm06YnRGkmQ-3hMpOp0Ot7yRPK54$
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > Best,
> > > > > > - Allison Chang
> > > > > >
> > > > >
> > > >
> > >
> >
>
