Re: [DISCUSS] FLIP-505: Flink History Server Scability Improvements, Remote Data Store Fetch and Per Job Fetch

Becket Qin Thu, 29 May 2025 13:20:07 -0700

Hi Allison,

Thanks for updating the FLIP. The latest version looks good to me. I think we
can move forward to voting.


Thanks,

Jiangjie (Becket) Qin

On Thu, May 22, 2025 at 1:26 PM Allison <[email protected]> wrote:

> Hi Becket,
>
> Thank you for your feedback. I have updated the FLIP-505 proposal to
> reflect these comments.
>
> Would appreciate any additional feedback.
>
> Best,
> - Allison Chang
>
> On Tue, May 13, 2025 at 10:27 AM Becket Qin <[email protected]> wrote:
>
> > Thanks for the FLIP, Allison. The proposal makes a lot of sense in general.
> > The history server is critical for Flink batch workloads.
> >
> > A few suggestions:
> > 1. It might make sense to keep the existing config
> > *historyserver.archive.retained-jobs*. This will only be used to determine
> > the total number of jobs to keep in the remote storage.
> > 2. The new configuration *historyserver.archive.cached-retained-jobs* only
> > determines the number of jobs cached locally. The default value is -1, and
> > the valid range is *[1, historyserver.archive.retained-jobs]*. When not
> > set, it caches everything, which is the current behavior.
> > When set, the history server is in the "partial caching mode" rather than
> > the "full mirror mode".
> > 3. Maybe we don't need the *historyserver.archive.remote-fetch-enabled*
> > config. This config is a little confusing because the job history is
> > fetched remotely even now. The difference is whether we fetch everything as
> > a whole or fetch individual jobs on demand. But this is an internal
> > implementation detail and does not need to be exposed to end users.
> > 4. The config *historyserver.archive.num-cached-most-recently-viewed-jobs*
> > always takes effect, regardless of whether the history server is running in
> > "full mirror mode" or "partial caching mode".
> >
> > So with the above settings:
> > 1. By default, users get the same behavior as today.
> > 2. When users set *historyserver.archive.cached-retained-jobs*, the history
> > server enters the partial caching mode and fetches jobs on demand.
> > 3. Some most recently viewed jobs are automatically pinned in the cache so
> > they will not be evicted accidentally and cause cache thrashing.
> >
> > BTW, it would be good to add a future work section to give a heads-up about
> > the plan to use RocksDB for job history rather than raw files.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> > On Mon, May 12, 2025 at 11:02 AM Venkatakrishnan Sowrirajan <[email protected]> wrote:
> >
> > > > Regarding decoupling the two features, would your suggestion be to
> > > > separate them into two separate FLIPs?
> > >
> > > Sorry for the late response.
> > >
> > > Yes, that is correct. If these 2 features are somewhat coupled with each
> > > other, then it makes sense to address them in the same FLIP; otherwise I
> > > think it will be better to tackle them as 2 different FLIPs.
> > >
> > > Regards
> > > Venkata krishnan
> > >
> > >
> > > On Mon, Mar 3, 2025 at 1:42 PM Allison <[email protected]> wrote:
> > >
> > > > Hi Yanquan,
> > > >
> > > > I've updated the FLIP to contain the default values, thanks for your
> > > > help!
> > > >
> > > > Sincerely
> > > > - Allison
> > > >
> > > > On Thu, Jan 30, 2025 at 3:21 AM Yanquan Lv <[email protected]> wrote:
> > > >
> > > > > Thank you for your explanation. My previous questions are basically
> > > > > resolved.
> > > > >
> > > > > Regarding the second point, I would like to suggest clarifying the
> > > > > default values for the newly added parameters in the `Public Interfaces`
> > > > > section.
> > > > >
> > > > > ---------- Forwarded message ---------
> > > > > From: Allison <[email protected]>
> > > > > Date: Thu, Jan 30, 2025 at 3:42 AM
> > > > > Subject: Re: [DISCUSS] FLIP-505: Flink History Server Scability
> > > > > Improvements, Remote Data Store Fetch and Per Job Fetch
> > > > > To: <[email protected]>
> > > > >
> > > > >
> > > > > Hi Yanquan,
> > > > >
> > > > > Thanks for taking a look at this. Re: your questions:
> > > > >
> > > > > 1. Yes, I've updated the FLIP to be clearer. It involves modifying
> > > > > the existing configuration historyserver.archive.retained-jobs to
> > > > > historyserver.archive.cached-retained-jobs. The number of remote jobs
> > > > > stored can be infinite; the thought behind this is that the remote data
> > > > > storage can be cleaned up or limited by a separate protocol that can be
> > > > > customized to each individual use case.
> > > > > 2. Could you clarify this a bit? I'm not sure I understand this part. Do
> > > > > you mean adding to the FLIP what the configurations would be set to
> > > > > when they are not defined?
> > > > > 3. historyserver.archive.fs.refresh-interval is the time duration between
> > > > > calls to the remote data storage to find fresh data. It configures
> > > > > how often the FHS polls the remote data store for new files. The remote
> > > > > data store is written to whenever a job finishes.
> > > > >
> > > > > Hope this clarifies some things.
> > > > >
> > > > > Best,
> > > > > - Allison
> > > > >
> > > > >
> > > > > On Mon, Jan 27, 2025 at 7:10 PM Yanquan Lv <[email protected]> wrote:
> > > > >
> > > > > > Hi, Allison. Thanks for driving this FLIP.
> > > > > > I have some questions to confirm:
> > > > > >
> > > > > > 1. I can't find any existing configuration named
> > > > > > `historyserver.archive.cached-retained-jobs`. I guess what you mean is
> > > > > > modifying the existing configuration from
> > > > > > `historyserver.archive.retained-jobs` to
> > > > > > `historyserver.archive.cached-retained-jobs`. If so, and we only limit
> > > > > > the number of retained jobs stored locally, is the number of retained
> > > > > > jobs stored remotely infinite?
> > > > > > 2. I think it would be better to provide instructions for adding
> > > > > > default values to HistoryServerOptions.
> > > > > > 3. Does `historyserver.archive.fs.refresh-interval` apply to both local
> > > > > > and remote storage simultaneously?
> > > > > >
> > > > > > Best,
> > > > > > Yanquan
> > > > > >
> > > > > > On Fri, Jan 17, 2025 at 8:07 AM Allison <[email protected]> wrote:
> > > > > >
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > I would like to initiate a discussion for the FLIP below, which
> > > > > > > enhances the Flink History Server to allow greater scalability of
> > > > > > > the service.
> > > > > > >
> > > > > > > Motivation:
> > > > > > >
> > > > > > > Currently, the Flink History Server (FHS) is limited in the number
> > > > > > > of job archives it can serve based on the storage capacity of the
> > > > > > > node that the FHS runs on. Job archives are stored locally in a
> > > > > > > cache, which creates a local directory that is expanded out based on
> > > > > > > the contents of a single JSON archive file. This not only uses up
> > > > > > > local memory space, but also, because of how the FHS expands the job
> > > > > > > archives into a nested directory structure, inode space often runs
> > > > > > > out for jobs with a large number of taskmanagers or subtasks. In
> > > > > > > order to make the FHS more performant, we would like to decouple the
> > > > > > > job archive storage for the FHS from the local cache, so that it can
> > > > > > > store and fetch job archives from a remote file store.
> > > > > > >
> > > > > > > FLIP proposal document:
> > > > > > >
> > > > > > > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP*505*3A*Flink*History*Server*Scability*Improvements*2C*Remote*Data*Store*Fetch*and*Per*Job*Fetch__;KyUrKysrKyUrKysrKysrKw!!IKRxdwAv5BmarQ!cy7YUT3RVhkz3ixGuldCgf5lTCb3IMzUuAUClyB3qRuI0vAjYfvNVmw2NOggm06YnRGkmQ-3hMpOp0Ot7yRPK54$
> > > > > > >
> > > > > > > Thanks!
> > > > > > >
> > > > > > > Best,
> > > > > > > - Allison Chang
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
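
The "partial caching mode" and pinning behavior discussed in the quoted
suggestions can be illustrated with a small toy model. This is only a sketch
under stated assumptions, not Flink's implementation and not part of the FLIP:
the class JobArchiveCache, its methods, and the fetch_remote callback are
hypothetical names; cached_retained_jobs=None stands in for "not set" (full
mirror mode); and the eviction rule (drop the oldest cached job that is not
pinned) is one plausible reading of the thread. The sketch also assumes the
number of pinned jobs is smaller than the cache size.

```python
from collections import OrderedDict


class JobArchiveCache:
    """Toy model of the local job cache; hypothetical, not Flink code."""

    def __init__(self, fetch_remote, cached_retained_jobs=None, num_pinned=0):
        # cached_retained_jobs=None models "not set": cache everything.
        if num_pinned < 0:
            raise ValueError("num_pinned must be >= 0")
        if cached_retained_jobs is not None:
            if cached_retained_jobs < 1:
                raise ValueError("cached_retained_jobs must be >= 1")
            if num_pinned >= cached_retained_jobs:
                # Assumption of this sketch: at least one slot stays evictable.
                raise ValueError("num_pinned must be < cached_retained_jobs")
        self._fetch_remote = fetch_remote
        self._capacity = cached_retained_jobs
        self._num_pinned = num_pinned
        self._cache = OrderedDict()   # job_id -> archive, oldest cached first
        self._viewed = OrderedDict()  # pinned ids, least recently viewed first

    def add(self, job_id, archive):
        """Cache an archive discovered by a background refresh."""
        self._cache[job_id] = archive
        self._evict()

    def view(self, job_id):
        """Return a job's archive, fetching it on demand on a cache miss."""
        if job_id not in self._cache:
            self._cache[job_id] = self._fetch_remote(job_id)
        self._mark_viewed(job_id)
        self._evict()
        return self._cache[job_id]

    def cached_ids(self):
        return list(self._cache)

    def _mark_viewed(self, job_id):
        # Move the job to the "most recently viewed" end, then unpin any
        # jobs beyond the configured number of pinned entries.
        self._viewed.pop(job_id, None)
        self._viewed[job_id] = True
        while len(self._viewed) > self._num_pinned:
            self._viewed.popitem(last=False)

    def _evict(self):
        if self._capacity is None:
            return  # full mirror mode: never evict
        while len(self._cache) > self._capacity:
            # Evict the oldest cached job that is not currently pinned.
            victim = next(j for j in self._cache if j not in self._viewed)
            del self._cache[victim]
```

In this model, a viewed job survives later refreshes that would otherwise push
it out, which is the cache-thrashing protection mentioned in the thread, while
leaving cached_retained_jobs unset keeps every archive locally as today.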
