Hi Becket,

Thank you for your feedback. I have updated the FLIP-505 proposal to reflect these comments.
Would appreciate any additional feedback.

Best,
- Allison Chang

On Tue, May 13, 2025 at 10:27 AM Becket Qin <becket....@gmail.com> wrote:

> Thanks for the FLIP, Allison. The proposal makes a lot of sense in general. The history server is critical to the Flink batch.
>
> A few suggestions:
> 1. It might make sense to keep the existing config *historyserver.archive.retained-jobs*. This will only be used to determine the total number of jobs to keep in the remote storage.
> 2. The new configuration *historyserver.archive.cached-retained-jobs* only determines the number of jobs cached locally. The default value is -1. And the valid range is *[1, historyserver.archive.retained-jobs]*. When not set, it basically caches everything, which is the current behavior. When set, that basically means the history server is in the "partial caching mode" rather than the "full mirror mode".
> 3. Maybe we don't need the *historyserver.archive.remote-fetch-enabled* config. This config is a little confusing because the jobs history is fetched remotely even now. The difference is whether we fetch everything as a whole or fetch individual jobs on demand. But this is an internal implementation detail and is not necessary to expose to the end users.
> 4. The config *historyserver.archive.num-cached-most-recently-viewed-jobs* always takes effect, regardless of whether the history server is running in a "full mirror mode" or "partial caching mode".
>
> So with the above settings:
> 1. By default, users get the same behavior as today.
> 2. When users set *historyserver.archive.cached-retained-jobs*, the history server enters the partial caching mode and fetches the jobs on demand.
> 3. Some most recently viewed jobs are automatically pinned in the cache so they will not be evicted accidentally and cause cache thrashing.
>
> BTW, it would be good to add a future work part to give a heads-up about the plan to use RocksDB for job history rather than raw files.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
> On Mon, May 12, 2025 at 11:02 AM Venkatakrishnan Sowrirajan <vsowr...@asu.edu> wrote:
>
> > > Regarding decoupling the two features, would your suggestion be to separate them into two separate FLIPs?
> >
> > Sorry for the late response.
> >
> > Yes, that is correct. If these 2 features are somewhat coupled with each other, then it makes sense to address them in the same FLIP; otherwise I think it will be better to tackle them as 2 different FLIPs.
> >
> > Regards
> > Venkata krishnan
> >
> >
> > On Mon, Mar 3, 2025 at 1:42 PM Allison <achang5...@gmail.com> wrote:
> >
> > > Hi Yanquan,
> > >
> > > I've updated the FLIP to contain the default values, thanks for your help!
> > >
> > > Sincerely
> > > - Allison
> > >
> > > On Thu, Jan 30, 2025 at 3:21 AM Yanquan Lv <decq12y...@gmail.com> wrote:
> > >
> > > > Thank you for your explanation. I have basically solved the previous questions.
> > > >
> > > > Regarding the second point, I would like to suggest clarifying the default values for the newly added parameters in the `Public Interfaces` section.
> > > >
> > > > ---------- Forwarded message ---------
> > > > From: Allison <achang5...@gmail.com>
> > > > Date: Thu, Jan 30, 2025 at 3:42 AM
> > > > Subject: Re: [DISCUSS] FLIP-505: Flink History Server Scability Improvements, Remote Data Store Fetch and Per Job Fetch
> > > > To: <dev@flink.apache.org>
> > > >
> > > >
> > > > Hi Yanquan,
> > > >
> > > > Thanks for taking a look at this. Re: your questions:
> > > >
> > > > 1. Yes, I've updated the FLIP to be more clear, but it involves modifying the existing configuration of historyserver.archive.retained-jobs to historyserver.archive.cached-retained-jobs.
> > > > The number of remote jobs stored can be infinite; the thought behind this is that the remote data storage can be cleaned up or limited by a separate protocol that can be customized to each individual use case.
> > > > 2. Could you clarify this a bit? I'm not sure I understand this part. Do you mean adding to the FLIP what the configurations would be set to when they are not defined?
> > > > 3. historyserver.archive.fs.refresh-interval is the time between calls to the remote data storage to find fresh data. What it configures is how often the FHS polls the remote data store for new files. The remote data store is written to whenever a job is finished.
> > > >
> > > > Hope this clarifies some things.
> > > >
> > > > Best,
> > > > - Allison
> > > >
> > > >
> > > > On Mon, Jan 27, 2025 at 7:10 PM Yanquan Lv <decq12y...@gmail.com> wrote:
> > > >
> > > > > Hi, Allison. Thanks for driving this FLIP.
> > > > > I have some questions to confirm:
> > > > >
> > > > > 1. I can't find any existing configuration named `historyserver.archive.cached-retained-jobs`. I guess that what you mean is modifying the existing configuration from `historyserver.archive.retained-jobs` to `historyserver.archive.cached-retained-jobs`. If so, and we only limit the number of retained-jobs stored locally, is the number of retained-jobs stored remotely infinite?
> > > > > 2. I think it would be better to provide instructions for adding default values to HistoryServerOptions.
> > > > > 3. Does `historyserver.archive.fs.refresh-interval` apply to both local and remote storage simultaneously?
> > > > >
> > > > > Best,
> > > > > Yanquan
> > > > >
> > > > > On Fri, Jan 17, 2025 at 8:07 AM Allison <achang5...@gmail.com> wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > I would like to initiate a discussion for the FLIP below, which enhances the Flink History Server to allow greater scalability of the service.
> > > > > >
> > > > > > Motivation:
> > > > > >
> > > > > > Currently, the Flink History Server (FHS) is limited in the number of job archives it can serve based on the storage capacity of the node that the FHS runs on. Job archives are stored locally in a cache, which creates a local directory that is expanded out based on the contents of a single JSON archive file. This not only uses up local memory space; because of how the FHS expands the job archives into a nested directory structure, inode space also often runs out for jobs with a large number of taskmanagers or subtasks. In order to make the FHS more performant, we would like to decouple the job archive storage for the FHS from the local cache, so that job archives can be stored in and fetched from a remote file store.
> > > > > >
> > > > > > FLIP proposal document:
> > > > > >
> > > > > > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP*505*3A*Flink*History*Server*Scability*Improvements*2C*Remote*Data*Store*Fetch*and*Per*Job*Fetch__;KyUrKysrKyUrKysrKysrKw!!IKRxdwAv5BmarQ!cy7YUT3RVhkz3ixGuldCgf5lTCb3IMzUuAUClyB3qRuI0vAjYfvNVmw2NOggm06YnRGkmQ-3hMpOp0Ot7yRPK54$
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > Best,
> > > > > > - Allison Chang