from:"Rui Fan"

Re: [ANNOUNCE] New Apache Flink PMC Member - Qingsheng Ren

2023-04-22 Thread Rui Fan

Congratulations, Qingsheng!

Best,
Rui Fan

On Sun, Apr 23, 2023 at 10:23 AM Benchao Li  wrote:

> Congratulations, Qingsheng!
>
> yuxia  于2023年4月23日周日 09:24写道：
>
> > Congratulations, Qingsheng!
> >
> > Best regards,
> > Yuxia
> >
> > - 原始邮件 -
> > 发件人: "lincoln 86xy" 
> > 收件人: "dev" 
> > 发送时间: 星期日, 2023年 4 月 23日 上午 9:11:07
> > 主题: Re: [ANNOUNCE] New Apache Flink PMC Member - Qingsheng Ren
> >
> > Congratulations, Qingsheng!
> >
> > Best,
> > Lincoln Lee
> >
> >
> > Yuxin Tan  于2023年4月22日周六 11:57写道：
> >
> > > Congratulations, Qingsheng!
> > >
> > > Best,
> > > Yuxin
> > >
> > >
> > > Ahmed Hamdy  于2023年4月22日周六 02:20写道：
> > >
> > > > Congratulations Qingsheng.
> > > > Best regards
> > > > Ahmed
> > > >
> > > > On Fri, 21 Apr 2023 at 17:22, Samrat Deb 
> > wrote:
> > > >
> > > > > congratulations !
> > > > >
> > > > > On Fri, 21 Apr 2023 at 9:45 PM, David Morávek 
> > wrote:
> > > > >
> > > > > > Congratulations, Qingsheng, well deserved!
> > > > > >
> > > > > > Best,
> > > > > > D.
> > > > > >
> > > > > > On Fri 21. 4. 2023 at 16:41, Feng Jin 
> > wrote:
> > > > > >
> > > > > > > Congratulations, Qingsheng
> > > > > > >
> > > > > > >
> > > > > > > 
> > > > > > > Best,
> > > > > > > Feng Jin
> > > > > > >
> > > > > > > On Fri, Apr 21, 2023 at 8:39 PM Mang Zhang  >
> > > > wrote:
> > > > > > >
> > > > > > > > Congratulations, Qingsheng.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Mang Zhang
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > At 2023-04-21 19:50:02, "Jark Wu"  wrote:
> > > > > > > > >Hi everyone,
> > > > > > > > >
> > > > > > > > >We are thrilled to announce that Qingsheng Ren has joined
> the
> > > > Flink
> > > > > > PMC!
> > > > > > > > >
> > > > > > > > >Qingsheng has been contributing to Apache Flink for a long
> > time.
> > > > He
> > > > > is
> > > > > > > the
> > > > > > > > >core contributor and maintainer of the Kafka connector and
> > > > > > > > >flink-cdc-connectors, bringing users stability and ease of
> use
> > > in
> > > > > both
> > > > > > > > >projects. He drove discussions and implementations in
> > FLIP-221,
> > > > > > > FLIP-288,
> > > > > > > > >and the connector testing framework. He is continuously
> > helping
> > > > with
> > > > > > the
> > > > > > > > >expansion of the Flink community and has given several talks
> > > about
> > > > > > Flink
> > > > > > > > >connectors at many conferences, such as Flink Forward Global
> > and
> > > > > Flink
> > > > > > > > >Forward Asia. Besides that, he is willing to help a lot in
> the
> > > > > > community
> > > > > > > > >work, such as being the release manager for both 1.17 and
> > 1.18,
> > > > > > > verifying
> > > > > > > > >releases, and answering questions on the mailing list.
> > > > > > > > >
> > > > > > > > >Congratulations and welcome Qingsheng!
> > > > > > > > >
> > > > > > > > >Best,
> > > > > > > > >Jark (on behalf of the Flink PMC)
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
>
> --
>
> Best,
> Benchao Li
>

Re: [ANNOUNCE] New Apache Flink PMC Member - Leonard Xu

2023-04-22 Thread Rui Fan

Congratulations, Leonard!

Best,
Rui Fan

On Sun, Apr 23, 2023 at 10:24 AM Benchao Li  wrote:

> Congratulations, Leonard!
>
> Lincoln Lee  于2023年4月23日周日 09:12写道：
>
> > Congratulations, Leonard!
> >
> > Best,
> > Lincoln Lee
> >
> >
> > Yuxin Tan  于2023年4月22日周六 11:57写道：
> >
> > > Congratulations, Leonard!
> > >
> > > Best,
> > > Yuxin
> > >
> > >
> > > Panagiotis Garefalakis  于2023年4月22日周六 08:08写道：
> > >
> > > > Congrats Leonard!
> > > >
> > > > On Fri, Apr 21, 2023 at 11:19 AM Ahmed Hamdy 
> > > wrote:
> > > >
> > > > > Congratulations Leonard.
> > > > > Best Regards
> > > > > Ahmed
> > > > >
> > > > > On Fri, 21 Apr 2023 at 17:23, Samrat Deb 
> > > wrote:
> > > > >
> > > > > > congratulations
> > > > > >
> > > > > > On Fri, 21 Apr 2023 at 9:44 PM, David Morávek 
> > > wrote:
> > > > > >
> > > > > > > Congratulations, Leonard, well deserved!
> > > > > > >
> > > > > > > Best,
> > > > > > > D.
> > > > > > >
> > > > > > > On Fri 21. 4. 2023 at 16:40, Feng Jin 
> > > wrote:
> > > > > > >
> > > > > > > > Congratulations, Leonard
> > > > > > > >
> > > > > > > >
> > > > > > > > 
> > > > > > > > Best,
> > > > > > > > Feng Jin
> > > > > > > >
> > > > > > > > On Fri, Apr 21, 2023 at 8:38 PM Mang Zhang <
> zhangma...@163.com
> > >
> > > > > wrote:
> > > > > > > >
> > > > > > > > > Congratulations, Leonard.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > > Mang Zhang
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > At 2023-04-21 19:47:52, "Jark Wu" 
> wrote:
> > > > > > > > > >Hi everyone,
> > > > > > > > > >
> > > > > > > > > >We are thrilled to announce that Leonard Xu has joined the
> > > Flink
> > > > > > PMC!
> > > > > > > > > >
> > > > > > > > > >Leonard has been an active member of the Apache Flink
> > > community
> > > > > for
> > > > > > > many
> > > > > > > > > >years and became a committer in Nov 2021. He has been
> > involved
> > > > in
> > > > > > > > various
> > > > > > > > > >areas of the project, from code contributions to community
> > > > > building.
> > > > > > > His
> > > > > > > > > >contributions are mainly focused on Flink SQL and
> > connectors,
> > > > > > > especially
> > > > > > > > > >leading the flink-cdc-connectors project to receive 3.8+K
> > > GitHub
> > > > > > > stars.
> > > > > > > > He
> > > > > > > > > >authored 150+ PRs, and reviewed 250+ PRs, and drove
> several
> > > > FLIPs
> > > > > > > (e.g.,
> > > > > > > > > >FLIP-132, FLIP-162). He has participated in plenty of
> > > > discussions
> > > > > in
> > > > > > > the
> > > > > > > > > >dev mailing list, answering questions about 500+ threads
> in
> > > the
> > > > > > > > > >user/user-zh mailing list. Besides that, he is community
> > > minded,
> > > > > > such
> > > > > > > as
> > > > > > > > > >being the release manager of 1.17, verifying releases,
> > > managing
> > > > > > > release
> > > > > > > > > >syncs, etc.
> > > > > > > > > >
> > > > > > > > > >Congratulations and welcome Leonard!
> > > > > > > > > >
> > > > > > > > > >Best,
> > > > > > > > > >Jark (on behalf of the Flink PMC)
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
>
> --
>
> Best,
> Benchao Li
>

Re: [VOTE] FLIP-288: Enable Dynamic Partition Discovery by Default in Kafka Source

2023-04-24 Thread Rui Fan

+1 (binding)

Best,
Rui Fan

On Tue, Apr 25, 2023 at 10:06 AM Biao Geng  wrote:

> +1 (non-binding)
> Best,
> Biao Geng
>
> Martijn Visser  于2023年4月24日周一 20:20写道：
>
> > +1 (binding)
> >
> > On Mon, Apr 24, 2023 at 4:10 AM Feng Jin  wrote:
> >
> > > +1 (non-binding)
> > >
> > >
> > > Best,
> > > Feng
> > >
> > > On Mon, Apr 24, 2023 at 9:55 AM Hang Ruan 
> > wrote:
> > >
> > > > +1 (non-binding)
> > > >
> > > > Best,
> > > > Hang
> > > >
> > > > Paul Lam  于2023年4月23日周日 11:58写道：
> > > >
> > > > > +1 (non-binding)
> > > > >
> > > > > Best,
> > > > > Paul Lam
> > > > >
> > > > > > 2023年4月23日 10:57，Shammon FY  写道：
> > > > > >
> > > > > > +1 (non-binding)
> > > > > >
> > > > > > Best,
> > > > > > Shammon FY
> > > > > >
> > > > > > On Sun, Apr 23, 2023 at 10:35 AM Qingsheng Ren <
> renqs...@gmail.com
> > > > > <mailto:renqs...@gmail.com>> wrote:
> > > > > >
> > > > > >> Thanks for pushing this FLIP forward, Hongshun!
> > > > > >>
> > > > > >> +1 (binding)
> > > > > >>
> > > > > >> Best,
> > > > > >> Qingsheng
> > > > > >>
> > > > > >> On Fri, Apr 21, 2023 at 2:52 PM Hongshun Wang <
> > > > loserwang1...@gmail.com>
> > > > > >> wrote:
> > > > > >>
> > > > > >>> Dear Flink Developers,
> > > > > >>>
> > > > > >>>
> > > > > >>> Thank you for providing feedback on FLIP-288: Enable Dynamic
> > > > Partition
> > > > > >>> Discovery by Default in Kafka Source[1] on the discussion
> > > thread[2].
> > > > > >>>
> > > > > >>> The goal of the FLIP is to enable partition discovery by
> default
> > > and
> > > > > set
> > > > > >>> EARLIEST offset strategy for later discovered partitions.
> > > > > >>>
> > > > > >>>
> > > > > >>> I am initiating a vote for this FLIP. The vote will be open for
> > at
> > > > > least
> > > > > >> 72
> > > > > >>> hours, unless there is an objection or insufficient votes.
> > > > > >>>
> > > > > >>>
> > > > > >>> [1]: [
> > > > > >>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-288%3A+Enable+Dynamic+Partition+Discovery+by+Default+in+Kafka+Source](https://cwiki.apache.org/confluence/display/FLINK/FLIP-288%3A+Enable+Dynamic+Partition+Discovery+by+Default+in+Kafka+Source)
> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-288%3A+Enable+Dynamic+Partition+Discovery+by+Default+in+Kafka+Source%5D(https://cwiki.apache.org/confluence/display/FLINK/FLIP-288%3A+Enable+Dynamic+Partition+Discovery+by+Default+in+Kafka+Source)>
> > <
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-288%3A+Enable+Dynamic+Partition+Discovery+by+Default+in+Kafka+Source%5D(https://cwiki.apache.org/confluence/display/FLINK/FLIP-288%3A+Enable+Dynamic+Partition+Discovery+by+Default+in+Kafka+Source)
> >
> > > <
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-288%3A+Enable+Dynamic+Partition+Discovery+by+Default+in+Kafka+Source%5D(https://cwiki.apache.org/confluence/display/FLINK/FLIP-288%3A+Enable+Dynamic+Partition+Discovery+by+Default+in+Kafka+Source)
> > >
> > > > <
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-288%3A+Enable+Dynamic+Partition+Discovery+by+Default+in+Kafka+Source%5D(https://cwiki.apache.org/confluence/display/FLINK/FLIP-288%3A+Enable+Dynamic+Partition+Discovery+by+Default+in+Kafka+Source)
> > > >
> > > > > <
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-288%3A+Enable+Dynamic+Partition+Discovery+by+Default+in+Kafka+Source%5D(https://cwiki.apache.org/confluence/display/FLINK/FLIP-288%3A+Enable+Dynamic+Partition+Discovery+by+Default+in+Kafka+Source)
> > > > >
> > > > > >> <
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-288%3A+Enable+Dynamic+Partition+Discovery+by+Default+in+Kafka+Source%5D(https://cwiki.apache.org/confluence/display/FLINK/FLIP-288%3A+Enable+Dynamic+Partition+Discovery+by+Default+in+Kafka+Source)
> > > > > >
> > > > > >>> <
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-288%3A+Enable+Dynamic+Partition+Discovery+by+Default+in+Kafka+Source%5D(https://cwiki.apache.org/confluence/display/FLINK/FLIP-288%3A+Enable+Dynamic+Partition+Discovery+by+Default+in+Kafka+Source)
> > > > > <
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-288%3A+Enable+Dynamic+Partition+Discovery+by+Default+in+Kafka+Source%5D(https://cwiki.apache.org/confluence/display/FLINK/FLIP-288%3A+Enable+Dynamic+Partition+Discovery+by+Default+in+Kafka+Source)
> > > > > >
> > > > > >>>
> > > > > >>> [2]: [
> > > > > >>>
> > > > > >>>
> > > > > >>
> https://lists.apache.org/thread/581f2xq5d1tlwc8gcr27gwkp3zp0wrg6]
> > <
> > > > > https://lists.apache.org/thread/581f2xq5d1tlwc8gcr27gwkp3zp0wrg6
> ]>(
> > > > > https://lists.apache.org/thread/581f2xq5d1tlwc8gcr27gwkp3zp0wrg6 <
> > > > > https://lists.apache.org/thread/581f2xq5d1tlwc8gcr27gwkp3zp0wrg6>)
> > > > > >>>
> > > > > >>>
> > > > > >>> Best regards,
> > > > > >>> Hongshun
> > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-306: Unified File Merging Mechanism for Checkpoints

2023-05-04 Thread Rui Fan

Hi Zakelly,

Sorry for the late reply, I still have some minor questions.

>> (3) When rescaling, do all shared files need to be copied?
>
> I agree with you that only sst files of the base DB need to be copied
> (or re-uploaded in the next checkpoint). However, section 4.2
> simplifies file copying issues (copying all files), following the
> concept of shared state.

Maybe re-uploaded in the next checkpoint is also a general solution
for shared state? If yes, could we consider it as an optimization?
And we can do it after the FLIP is done.

>> (5) How many physical files can a TM write at the same checkpoint at the
same time?
>
> This is a very good point. Actually, there is a file reuse pool as
> section 4.6 described. There could be multiple files within this pool,
> supporting concurrent writing by multiple writers. I suggest providing
> two configurations to control the file number:
>
>   state.checkpoints.file-merging.max-file-pool-size: Specifies the
> upper limit of the file pool size.
>   state.checkpoints.file-merging.max-subtasks-per-file: Specifies the
> lower limit of the file pool size based on the number of subtasks
> within each TM.
>
> The number of simultaneously open files is controlled by these two
> options, and the first option takes precedence over the second.

I'm not sure why we need 2 configurations, or whether 1 configuration
is enough here.

The `max-file-pool-size` is hard to give a default value. For jobs with
many tasks in a TM, it may be useful for file merging. However,
it doesn't work well for jobs with a small number of tasks in a TM.

I prefer just adding the `max-file-pool-size`, and
the `pool size = number of tasks / max-file-pool-size`. WDYT?

Maybe I missed some information. Please correct me if I'm wrong, thanks.

Best,
Rui Fan

On Fri, Apr 28, 2023 at 12:10 AM Zakelly Lan  wrote:

> Hi all,
>
> Thanks for all the feedback so far.
>
> The discussion has been going on for some time, and all the comments
> and suggestions are addressed. So I would like to start a vote on this
> FLIP, which begins a week later (May. 5th at 10:00 AM GMT).
>
> If you have any concerns, please don't hesitate to follow up on this
> discussion.
>
>
> Best regards,
> Zakelly
>
> On Fri, Apr 28, 2023 at 12:03 AM Zakelly Lan 
> wrote:
> >
> > Hi Yuan,
> >
> > Thanks for sharing your thoughts. Like you said, the code changes and
> > complexities are shaded in the newly introduced file management in TM,
> > while the old file management remains the same. It is safe for us to
> > take a small step towards decentralized file management in this way. I
> > put the POC branch here[1] so everyone can check the code change.
> >
> > Best regards,
> > Zakelly
> >
> > [1] https://github.com/Zakelly/flink/tree/flip306_poc
> >
> > On Thu, Apr 27, 2023 at 8:13 PM Yuan Mei  wrote:
> > >
> > > Hey all,
> > >
> > > Thanks @Zakelly for driving this effort and thanks everyone for the
> warm
> > > discussion. Sorry for the late response.
> > >
> > > As I and Zakelly have already discussed and reviewed the design
> carefully
> > > when drafting this FLIP, I do not have additional inputs here. But I
> want
> > > to highlight several points that I've been quoted and explain why I
> think
> > > the current design is a reasonable and clean one.
> > >
> > > *Why this FLIP is proposed*
> > > File Flooding is a problem for Flink I've seen many people bring up
> > > throughout the years, especially for large clusters. Unfortunately,
> there
> > > are not yet accepted solutions for the most commonly used state backend
> > > like RocksDB. This FLIP was originally targeted to address merging
> > > SST(KeyedState) checkpoint files.
> > >
> > > While we are comparing different design choices, we found that
> different
> > > types of checkpoint files (OPState, Unaligned CP channel state,
> Changelog
> > > incremental state) share similar considerations, for example, file
> > > management, file merging granularity, and e.t.c. That's why we want to
> > > abstract a unified framework for merging these different types of
> > > checkpoint files and provide flexibility to choose between merging
> > > efficiency, rescaling/restoring cost, File system capabilities
> (affecting
> > > File visibility), and e.t.c.
> > >
> > > *File Ownership moved from JM to TM, WHY*
> > > One of the major differences in the proposed design is moving file
> > > ownership from JM to TM. A lot of questions/concerns are coming

Re: [DISCUSS] FLIP-306: Unified File Merging Mechanism for Checkpoints

2023-05-05 Thread Rui Fan

Hi Zakelly,

Thanks for the clarification!

Currently, I understand what you mean, and LGTM.

Best,
Rui Fan

On Fri, May 5, 2023 at 12:27 PM Zakelly Lan  wrote:

> Hi all,
>
> @Yun Tang and I have an offline discussion, and we agreed that:
>
> 1. The design of this FLIP is pretty much like the option 3 in design
> doc[1] for FLINK-23342, and it is almost the best solution in general.
> Based on our production experience, this FLIP can solve the file flood
> problem very well.
> 2. There is a corner case that the directory may be left over when the
> job stops, so I added some content in section 4.8.
>
> Best,
> Zakelly
>
>
> [1]
> https://docs.google.com/document/d/1NJJQ30P27BmUvD7oa4FChvkYxMEgjRPTVdO1dHLl_9I/edit#
>
> On Fri, May 5, 2023 at 11:19 AM Zakelly Lan  wrote:
> >
> > Hi Rui Fan,
> >
> > Thanks for your reply.
> >
> > > Maybe re-uploaded in the next checkpoint is also a general solution
> > > for shared state? If yes, could we consider it as an optimization?
> > > And we can do it after the FLIP is done.
> >
> > Yes, it is a general solution for shared states. Maybe in the first
> > version we can let the shared states not re-use any previous state
> > handle after restoring, thus the state backend will do a full snapshot
> > and re-uploading the files it needs. This could cover the scenario
> > that rocksdb only uploads the base DB files. And later we could
> > consider performing fast copy in DFS to optimize the re-uploading.
> > WDYT?
> >
> >
> > > I'm not sure why we need 2 configurations, or whether 1 configuration
> > > is enough here.
> >
> > > The `max-file-pool-size` is hard to give a default value. For jobs with
> > > many tasks in a TM, it may be useful for file merging. However,
> > > it doesn't work well for jobs with a small number of tasks in a TM.
> >
> > > I prefer just adding the `max-file-pool-size`, and
> > > the `pool size = number of tasks / max-file-pool-size`. WDYT?
> >
> >
> > Sorry for not explaining clearly. The value of pool size is calculated
> by:
> >
> > 1. pool size = number of tasks / max-subtasks-per-file
> > 2. if pool size > max-file-pool-size then pool size = max-file-pool-size
> >
> > The `max-subtasks-per-file` addresses the issue of sequential file
> > writing, while the `max-file-pool-size` acts as a safeguard to prevent
> > an excessively large file pool. WDYT?
> >
> >
> > Thanks again for your thoughts.
> >
> > Best,
> > Zakelly
> >
> > On Thu, May 4, 2023 at 3:52 PM Rui Fan <1996fan...@gmail.com> wrote:
> > >
> > > Hi Zakelly,
> > >
> > > Sorry for the late reply, I still have some minor questions.
> > >
> > > >> (3) When rescaling, do all shared files need to be copied?
> > > >
> > > > I agree with you that only sst files of the base DB need to be copied
> > > > (or re-uploaded in the next checkpoint). However, section 4.2
> > > > simplifies file copying issues (copying all files), following the
> > > > concept of shared state.
> > >
> > > Maybe re-uploaded in the next checkpoint is also a general solution
> > > for shared state? If yes, could we consider it as an optimization?
> > > And we can do it after the FLIP is done.
> > >
> > > >> (5) How many physical files can a TM write at the same checkpoint
> at the
> > > same time?
> > > >
> > > > This is a very good point. Actually, there is a file reuse pool as
> > > > section 4.6 described. There could be multiple files within this
> pool,
> > > > supporting concurrent writing by multiple writers. I suggest
> providing
> > > > two configurations to control the file number:
> > > >
> > > >   state.checkpoints.file-merging.max-file-pool-size: Specifies the
> > > > upper limit of the file pool size.
> > > >   state.checkpoints.file-merging.max-subtasks-per-file: Specifies the
> > > > lower limit of the file pool size based on the number of subtasks
> > > > within each TM.
> > > >
> > > > The number of simultaneously open files is controlled by these two
> > > > options, and the first option takes precedence over the second.
> > >
> > > I'm not sure why we need 2 configurations, or whether 1 configuration
> > > is enough here.
> > >
> > > The `max-file-pool-size` is hard to give a default value. For jobs with
> > >

Re: [VOTE] FLIP-306: Unified File Merging Mechanism for Checkpoints

2023-05-09 Thread Rui Fan

Thanks for driving this proposal, Zakelly.

+1(binding)

Best,
Rui Fan

On Wed, May 10, 2023 at 11:04 AM Hangxiang Yu  wrote:

> Hi Zakelly.
> Thanks for driving this.
> +1 (no-binding)
>
> On Wed, May 10, 2023 at 10:52 AM Yuan Mei  wrote:
>
> > Thanks for driving this, Zakelly.
> >
> > As discussed in the thread,
> >
> > +1 for the proposal (binding)
> >
> > Best,
> >
> > Yuan
> >
> >
> >
> > On Wed, May 10, 2023 at 10:39 AM Zakelly Lan 
> > wrote:
> >
> > > Hi everyone,
> > >
> > > Sorry for the 4 duplicate emails. There was a problem with the dev
> > > mailing list blocking the mails from Gmail. I thought it was a network
> > > problem so I tried several times. The issue is addressed by
> > > INFRA-24572[1] and the piled up mails are delivered at one time.
> > >
> > > Based on the sending time, the vote will be open until May 12th at
> > > 11:00PM GMT. Please discuss and vote in the last thread (this one).
> > > Thanks!
> > >
> > >
> > > Best regards,
> > > Zakelly
> > >
> > > [1] https://issues.apache.org/jira/browse/INFRA-24572
> > >
> > > On Wed, May 10, 2023 at 10:30 AM Yanfei Lei 
> wrote:
> > > >
> > > > +1 (no-binding)
> > > >
> > > > Best,
> > > > Yanfei
> > > >
> > > >
> > > > Jing Ge  于2023年5月10日周三 07:03写道：
> > > >
> > > > >
> > > > > Hi Zakelly,
> > > > >
> > > > > I saw you sent at least 4 same emails for voting FLIP-306. I guess
> > > this one
> > > > > should be the last one and the right one for us to vote right? BTW,
> > > based
> > > > > on the sending time, 72 hours means to open the discussion until
> May
> > > 12th.
> > > > >
> > > > > Best regards,
> > > > > Jing
> > > > >
> > > > > On Tue, May 9, 2023 at 8:24 PM Zakelly Lan 
> > > wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > Thanks for all the feedback for FLIP-306: Unified File Merging
> > > > > > Mechanism for Checkpoints[1] on the discussion thread[2].
> > > > > >
> > > > > > I'd like to start a vote for it. The vote will be open for at
> least
> > > 72
> > > > > > hours (until May 11th, 12:00AM GMT) unless there is an objection
> or
> > > an
> > > > > > insufficient number of votes.
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > > Zakelly
> > > > > >
> > > > > > [1]
> > > > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-306%3A+Unified+File+Merging+Mechanism+for+Checkpoints
> > > > > > [2]
> > https://lists.apache.org/thread/56px3kvn3tlwpc7sl12kx6notfmk9g7q
> > > > > >
> > >
> >
>
>
> --
> Best,
> Hangxiang.
>

Re: [VOTE] FLIP-305: Support atomic for CREATE TABLE AS SELECT(CTAS) statement

2023-06-12 Thread Rui Fan

+1 (binding)

Best,
Rui Fan

On Mon, Jun 12, 2023 at 19:58 liu ron  wrote:

> +1 (no-binding)
>
> Best,
> Ron
>
> Jing Ge  于2023年6月12日周一 19:33写道：
>
> > +1(binding) Thanks!
> >
> > Best regards,
> > Jing
> >
> > On Mon, Jun 12, 2023 at 12:01 PM yuxia 
> > wrote:
> >
> > > +1 (binding)
> > > Thanks Mang driving it.
> > >
> > > Best regards,
> > > Yuxia
> > >
> > > - 原始邮件 -
> > > 发件人: "zhangmang1" 
> > > 收件人: "dev" 
> > > 发送时间: 星期一, 2023年 6 月 12日 下午 5:31:10
> > > 主题: [VOTE] FLIP-305: Support atomic for CREATE TABLE AS SELECT(CTAS)
> > > statement
> > >
> > > Hi everyone,
> > >
> > > Thanks for all the feedback about FLIP-305: Support atomic for CREATE
> > > TABLE AS SELECT(CTAS) statement[1].
> > > [2] is the discussion thread.
> > >
> > > I'd like to start a vote for it. The vote will be open for at least 72
> > > hours (until June 15th, 10:00AM GMT) unless there is an objection or an
> > > insufficient number of votes.[1]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-305%3A+Support+atomic+for+CREATE+TABLE+AS+SELECT%28CTAS%29+statement
> > > [2]https://lists.apache.org/thread/n6nsvbwhs5kwlj5kjgv24by2tk5mh9xd
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > --
> > >
> > > Best regards,
> > > Mang Zhang
> > >
> >
>

Re: [VOTE] FLIP-311: Support Call Stored Procedure

2023-06-12 Thread Rui Fan

+1 (binding)

Best,
Rui Fan

On Mon, Jun 12, 2023 at 22:20 Benchao Li  wrote:

> +1 (binding)
>
> yuxia  于2023年6月12日周一 17:58写道：
>
> > Hi everyone,
> > Thanks for all the feedback about FLIP-311: Support Call Stored
> > Procedure[1]. Based on the discussion [2], we have come to a consensus,
> so
> > I would like to start a vote.
> > The vote will be open for at least 72 hours (until June 15th, 10:00AM
> GMT)
> > unless there is an objection or an insufficient number of votes.
> >
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-311%3A+Support+Call+Stored+Procedure
> > [2] https://lists.apache.org/thread/k6s50gcgznon9v1oylyh396gb5kgrwmd
> >
> > Best regards,
> > Yuxia
> >
>
>
> --
>
> Best,
> Benchao Li
>

Re: [VOTE] FLIP-295: Support lazy initialization of catalogs and persistence of catalog configurations

2023-06-14 Thread Rui Fan

+1(binding)

Best,
Rui Fan

On Wed, Jun 14, 2023 at 16:24 Hang Ruan  wrote:

> +1 (non-binding)
>
> Thanks for Feng driving it.
>
> Best,
> Hang
>
> Feng Jin  于2023年6月14日周三 10:36写道：
>
> > Hi everyone
> >
> > Thanks for all the feedback about the FLIP-295: Support lazy
> initialization
> > of catalogs and persistence of catalog configurations[1].
> > [2] is the discussion thread.
> >
> >
> > I'd like to start a vote for it. The vote will be open for at least 72
> > hours(excluding weekends,until June 19, 10:00AM GMT) unless there is an
> > objection or an insufficient number of votes.
> >
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-295%3A+Support+lazy+initialization+of+catalogs+and+persistence+of+catalog+configurations
> > [2]https://lists.apache.org/thread/dcwgv0gmngqt40fl3694km53pykocn5s
> >
> >
> > Best,
> > Feng
> >
>

Re: [VOTE] FLIP-246: Dynamic Kafka Source (originally Multi Cluster Kafka Source)

2023-06-22 Thread Rui Fan

+1 (binding)

+1 for DynamicKafkaSource

Best,
Rui Fan

On Wed, Jun 21, 2023 at 6:57 PM Thomas Weise  wrote:

> +1 (binding)
>
>
> On Mon, Jun 19, 2023 at 8:09 AM Ryan van Huuksloot
>  wrote:
>
> > +1 (non-binding)
> >
> > +1 for DynamicKafkaSource
> >
> > Ryan van Huuksloot
> > Sr. Production Engineer | Streaming Platform
> > [image: Shopify]
> > <https://www.shopify.com/?utm_medium=salessignatures&utm_source=hs_email
> >
> >
> >
> > On Mon, Jun 19, 2023 at 8:15 AM Martijn Visser  >
> > wrote:
> >
> > > +1 (binding)
> > >
> > > +1 for DynamicKafkaSource
> > >
> > >
> > > On Sat, Jun 17, 2023 at 5:31 AM Tzu-Li (Gordon) Tai <
> tzuli...@apache.org
> > >
> > > wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > +1 for either DynamicKafkaSource or DiscoveringKafkaSource
> > > >
> > > > Cheers,
> > > > Gordon
> > > >
> > > > On Thu, Jun 15, 2023, 10:56 Mason Chen 
> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > Thank you to everyone for the feedback on FLIP-246 [1]. Based on
> the
> > > > > discussion thread [2], we have come to a consensus on the design
> and
> > > are
> > > > > ready to take a vote to contribute this to Flink.
> > > > >
> > > > > This voting thread will be open for at least 72 hours (excluding
> > > > weekends,
> > > > > until June 20th 10:00AM PST) unless there is an objection or an
> > > > > insufficient number of votes.
> > > > >
> > > > > (Optional) If you have an opinion on the naming of the connector,
> > > please
> > > > > include it in your vote:
> > > > > 1. DynamicKafkaSource
> > > > > 2. MultiClusterKafkaSource
> > > > > 3. DiscoveringKafkaSource
> > > > >
> > > > > [1]
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=217389320
> > > > > [2]
> https://lists.apache.org/thread/vz7nw5qzvmxwnpktnofc9p13s1dzqm6z
> > > > >
> > > > > Best,
> > > > > Mason
> > > > >
> > > >
> > >
> >
>

Re: [VOTE] FLIP-324: Introduce Runtime Filter for Flink Batch Jobs

2023-06-23 Thread Rui Fan

+1(binding), thanks for driving this improvement.

Best,
Rui Fan

On Sat, Jun 24, 2023 at 4:55 AM Jing Ge  wrote:

> +1(binding)
>
> Best Regards,
> Jing
>
> On Fri, Jun 23, 2023 at 5:50 PM Lijie Wang 
> wrote:
>
> > Hi all,
> >
> > Thanks for all the feedback about the FLIP-324: Introduce Runtime Filter
> > for Flink Batch Jobs[1]. This FLIP was discussed in [2].
> >
> > I'd like to start a vote for it. The vote will be open for at least 72
> > hours (until June 29th 12:00 GMT) unless there is an objection or
> > insufficient votes.
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-324%3A+Introduce+Runtime+Filter+for+Flink+Batch+Jobs
> > [2] https://lists.apache.org/thread/mm0o8fv7x7k13z11htt88zhy7lo8npmg
> >
> > Best,
> > Lijie
> >
>

Re: Re: [ANNOUNCE] Apache Flink has won the 2023 SIGMOD Systems Award

2023-07-04 Thread Rui Fan

Congratulations!

Best,
Rui Fan

On Tue, Jul 4, 2023 at 2:08 PM Zhu Zhu  wrote:

> Congratulations everyone!
>
> Thanks,
> Zhu
>
> Hang Ruan  于2023年7月4日周二 14:06写道：
> >
> > Congratulations!
> >
> > Best,
> > Hang
> >
> > Jingsong Li  于2023年7月4日周二 13:47写道：
> >
> > > Congratulations!
> > >
> > > Thank you! All of the Flink community!
> > >
> > > Best,
> > > Jingsong
> > >
> > > On Tue, Jul 4, 2023 at 1:24 PM tison  wrote:
> > > >
> > > > Congrats and with honor :D
> > > >
> > > > Best,
> > > > tison.
> > > >
> > > >
> > > > Mang Zhang  于2023年7月4日周二 11:08写道：
> > > >
> > > > > Congratulations!--
> > > > >
> > > > > Best regards,
> > > > > Mang Zhang
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > 在 2023-07-04 01:53:46，"liu ron"  写道：
> > > > > >Congrats everyone
> > > > > >
> > > > > >Best,
> > > > > >Ron
> > > > > >
> > > > > >Jark Wu  于2023年7月3日周一 22:48写道：
> > > > > >
> > > > > >> Congrats everyone!
> > > > > >>
> > > > > >> Best,
> > > > > >> Jark
> > > > > >>
> > > > > >> > 2023年7月3日 22:37，Yuval Itzchakov  写道：
> > > > > >> >
> > > > > >> > Congrats team!
> > > > > >> >
> > > > > >> > On Mon, Jul 3, 2023, 17:28 Jing Ge via user <
> > > u...@flink.apache.org
> > > > > >> <mailto:u...@flink.apache.org>> wrote:
> > > > > >> >> Congratulations!
> > > > > >> >>
> > > > > >> >> Best regards,
> > > > > >> >> Jing
> > > > > >> >>
> > > > > >> >>
> > > > > >> >> On Mon, Jul 3, 2023 at 3:21 PM yuxia <
> > > luoyu...@alumni.sjtu.edu.cn
> > > > > >> <mailto:luoyu...@alumni.sjtu.edu.cn>> wrote:
> > > > > >> >>> Congratulations!
> > > > > >> >>>
> > > > > >> >>> Best regards,
> > > > > >> >>> Yuxia
> > > > > >> >>>
> > > > > >> >>> 发件人: "Pushpa Ramakrishnan"  > >  > > > > >> pushpa.ramakrish...@icloud.com>>
> > > > > >> >>> 收件人: "Xintong Song"  > > > > >> tonysong...@gmail.com>>
> > > > > >> >>> 抄送: "dev"  dev@flink.apache.org>>,
> > > > > >> "User" mailto:u...@flink.apache.org>>
> > > > > >> >>> 发送时间: 星期一, 2023年 7 月 03日 下午 8:36:30
> > > > > >> >>> 主题: Re: [ANNOUNCE] Apache Flink has won the 2023 SIGMOD
> Systems
> > > > > Award
> > > > > >> >>>
> > > > > >> >>> Congratulations \uD83E\uDD73
> > > > > >> >>>
> > > > > >> >>> On 03-Jul-2023, at 3:30 PM, Xintong Song <
> tonysong...@gmail.com
> > > > > >> <mailto:tonysong...@gmail.com>> wrote:
> > > > > >> >>>
> > > > > >> >>> 
> > > > > >> >>> Dear Community,
> > > > > >> >>>
> > > > > >> >>> I'm pleased to share this good news with everyone. As some
> of
> > > you
> > > > > may
> > > > > >> have already heard, Apache Flink has won the 2023 SIGMOD Systems
> > > Award
> > > > > [1].
> > > > > >> >>>
> > > > > >> >>> "Apache Flink greatly expanded the use of stream
> > > data-processing."
> > > > > --
> > > > > >> SIGMOD Awards Committee
> > > > > >> >>>
> > > > > >> >>> SIGMOD is one of the most influential data management
> research
> > > > > >> conferences in the world. The Systems Award is awarded to an
> > > individual
> > > > > or
> > > > > >> set of individuals to recognize the development of a software or
> > > > > hardware
> > > > > >> system whose technical contributions have had significant
> impact on
> > > the
> > > > > >> theory or practice of large-scale data management systems.
> Winning
> > > of
> > > > > the
> > > > > >> award indicates the high recognition of Flink's technological
> > > > > advancement
> > > > > >> and industry influence from academia.
> > > > > >> >>>
> > > > > >> >>> As an open-source project, Flink wouldn't have come this far
> > > without
> > > > > >> the wide, active and supportive community behind it. Kudos to
> all
> > > of us
> > > > > who
> > > > > >> helped make this happen, including the over 1,400 contributors
> and
> > > many
> > > > > >> others who contributed in ways beyond code.
> > > > > >> >>>
> > > > > >> >>> Best,
> > > > > >> >>> Xintong (on behalf of the Flink PMC)
> > > > > >> >>>
> > > > > >> >>> [1] https://sigmod.org/2023-sigmod-systems-award/
> > > > > >> >>>
> > > > > >>
> > > > > >>
> > > > >
> > >
>

Re: [ANNOUNCE] Flink 1.18 Feature Freeze Extended until July 24th, 2023

2023-07-06 Thread Rui Fan

Thanks for the update, and thank you for your efforts for the 1.18 release!

Best,
Rui Fan

On Thu, Jul 6, 2023 at 2:40 PM Qingsheng Ren  wrote:

> Hi devs,
>
> Recently we collected some feedback from developers, and in order to give
> more time for polishing some important features in 1.18, we decide to
> extend the feature freezing date to:
>
> July 24th, 2023, at 00:00 CEST(UTC+2)
>
> which gives us ~2 weeks for development from now. There will be no
> extension after Jul 24, so please arrange new features in the next release
> if they cannot be finished before the closing date.
>
> Thanks everyone for your work in 1.18!
>
> Best regards,
> Qingsheng, Jing, Konstantin and Sergey
>

Re: [VOTE] FLIP-309: Support using larger checkpointing interval when source is processing backlog

2023-07-17 Thread Rui Fan

+1(binding)

Best,
Rui Fan


On Tue, Jul 18, 2023 at 12:04 PM Dong Lin  wrote:

> Hi all,
>
> We would like to start the vote for FLIP-309: Support using larger
> checkpointing interval when source is processing backlog [1]. This FLIP was
> discussed in this thread [2].
>
> The vote will be open until at least July 21st (at least 72 hours),
> following
> the consensus voting process.
>
> Cheers,
> Yunfeng and Dong
>
> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-309
>
> %3A+Support+using+larger+checkpointing+interval+when+source+is+processing+backlog
> [2] https://lists.apache.org/thread/l1l7f30h7zldjp6ow97y70dcthx7tl37
>

Re: Kubernetes Operator 1.6.0 release planning

2023-07-19 Thread Rui Fan

Thanks Gyula for driving this release.

+1 for the timeline

Best,
Rui Fan

On Wed, Jul 19, 2023 at 11:03 PM Gyula Fóra  wrote:

> Hi Devs!
>
> Based on our release schedule, it is about time for the next Flink K8s
> Operator minor release.
>
> There are still some minor work items to be completed this week, but I
> suggest aiming for next Wednesday (July 26th) as the 1.6.0 release-cut -
> RC1 date.
>
> I am volunteering as the release manager but if someone else wants to do
> it, I would also be happy to simply give assistance :)
>
> Please let me know if you agree or disagree with the suggested timeline.
>
> Cheers,
> Gyula
>

Re: [ANNOUNCE] New Apache Flink Committer - Yong Fang

2023-07-24 Thread Rui Fan

Congratulations, Yong Fang

Best,
Rui Fan

On Mon, Jul 24, 2023 at 3:42 PM Matt Wang  wrote:

> Congratulations, Yong Fang
>
>
> --
>
> Best,
> Matt Wang
>
>
>  Replied Message 
> | From | Feng Jin |
> | Date | 07/24/2023 15:27 |
> | To |  |
> | Cc | Shammon FY |
> | Subject | Re: [ANNOUNCE] New Apache Flink Committer - Yong Fang |
> Congratulations, Yong Fang
>
> Best,
> Feng
>
>
> On Mon, Jul 24, 2023 at 3:12 PM Leonard Xu  wrote:
>
> Congratulations, Yong Fang
>
> Best,
> Leonard
>
> On Jul 24, 2023, at 2:01 PM, yuxia  wrote:
>
> Congrats, Shammon!
>
> Best regards,
> Yuxia
>
> - 原始邮件 -
> 发件人: "Benchao Li" 
> 收件人: "dev" 
> 抄送: "Shammon FY" 
> 发送时间: 星期一, 2023年 7 月 24日 下午 1:23:55
> 主题: Re: [ANNOUNCE] New Apache Flink Committer - Yong Fang
>
> Congratulations, Yong! Well deserved!
>
> Yangze Guo  于2023年7月24日周一 12:16写道：
>
> Congrats, Yong!
>
> Best,
> Yangze Guo
>
> On Mon, Jul 24, 2023 at 12:02 PM xiangyu feng 
> wrote:
>
> Congratulations, Yong!
>
> Best,
> Xiangyu
>
> liu ron  于2023年7月24日周一 11:48写道：
>
> Congratulations,
>
> Best,
> Ron
>
> Qingsheng Ren  于2023年7月24日周一 11:18写道：
>
> Congratulations and welcome aboard, Yong!
>
> Best,
> Qingsheng
>
> On Mon, Jul 24, 2023 at 11:14 AM Chen Zhanghao <
> zhanghao.c...@outlook.com>
> wrote:
>
> Congrats, Shammon!
>
> Best,
> Zhanghao Chen
> 
> 发件人: Weihua Hu 
> 发送时间: 2023年7月24日 11:11
> 收件人: dev@flink.apache.org 
> 抄送: Shammon FY 
> 主题: Re: [ANNOUNCE] New Apache Flink Committer - Yong Fang
>
> Congratulations!
>
> Best,
> Weihua
>
>
> On Mon, Jul 24, 2023 at 11:04 AM Paul Lam 
> wrote:
>
> Congrats, Shammon!
>
> Best,
> Paul Lam
>
> 2023年7月24日 10:56，Jingsong Li  写道：
>
> Shammon
>
>
>
>
>
>
>
>
> --
>
> Best,
> Benchao Li
>
>
>

Re: [VOTE] FLIP-322 Cooldown period for adaptive scheduler. Second vote.

2023-07-29 Thread Rui Fan

+1(binding)

Thanks for driving this proposal, it will be useful for rescale.

I’m preparing the FLIP-334[1], it will decouple the autoscaler and
kubernetes. In the end, we hope all kind of flink jobs work well with
rescale and autoscaler.

[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=263424711

Best
Rui Fan

On Fri, 28 Jul 2023 at 19:21, Martijn Visser 
wrote:

> +1 (binding)
>
> On Fri, Jul 14, 2023 at 11:59 AM Prabhu Joseph  >
> wrote:
>
> > *+1 (non-binding)*
> >
> > Thanks for working on this. We have seen good improvement during the cool
> > down period with this feature.
> > Below are details on the test results from one of our clusters:
> >
> > On a scale-out operation, 8 new nodes were added one by one with a gap of
> > ~30 seconds. There were 8 restarts within 4 minutes with the default
> > behaviour,
> > whereas only one with this feature (cooldown period of 4 minutes).
> >
> > The number of records processed by the job with this feature during the
> > restart window is higher (2909764), whereas it is only 1323960 with the
> > default
> > behaviour due to multiple restarts, where it spends most of the time
> > recovering, and also whatever work progressed by the tasks after the last
> > successful completed checkpoint is lost.
> >
> > Metrics Default Adaptive Scheduler Adaptive Scheduler With Cooldown
> Period
> > Remarks
> > NumRecordsProcessed 1323960 2909764 1. NumRecordsProcessed metric
> indicates
> > the difference the cool down period brings in. When the job is doing
> > multiple restarts, the task spends most of the time recovering, and the
> > progress the task made will be lost during the restart.
> >
> > 2. There is only one restart with Cool Down Period which happened when
> the
> > 8th node got added back.
> >
> > Job Parallelism 13 -> 20 -> 27 -> 34 -> 41 -> 48 -> 55 → 62 → 69 13 → 69
> > NumRestarts 8 1
> >
> >
> >
> >
> >
> >
> >
> >
> > On Wed, Jul 12, 2023 at 8:03 PM Etienne Chauchot 
> > wrote:
> >
> > > Hi all,
> > >
> > > I'm going on vacation tonight for 3 weeks.
> > >
> > > Even if the vote is not finished, as the implementation is rather quick
> > > and the design discussion had settled, I preferred I implementing
> > > FLIP-322 [1] to allow people to take a look while I'm off.
> > >
> > > [1] https://github.com/apache/flink/pull/22985
> > >
> > > Best
> > >
> > > Etienne
> > >
> > > Le 12/07/2023 à 09:56, Etienne Chauchot a écrit :
> > > >
> > > > Hi all,
> > > >
> > > > Would you mind casting your vote to this second vote thread (opened
> > > > after new discussions) so that the subject can move forward ?
> > > >
> > > > @David, @Chesnay, @Robert you took part to the discussions, can you
> > > > please sent your vote ?
> > > >
> > > > Thank you very much
> > > >
> > > > Best
> > > >
> > > > Etienne
> > > >
> > > > Le 06/07/2023 à 13:02, Etienne Chauchot a écrit :
> > > >>
> > > >> Hi all,
> > > >>
> > > >> Thanks for your feedback about the FLIP-322: Cooldown period for
> > > >> adaptive scheduler [1].
> > > >>
> > > >> This FLIP was discussed in [2].
> > > >>
> > > >> I'd like to start a vote for it. The vote will be open for at least
> 72
> > > >> hours (until July 9th 15:00 GMT) unless there is an objection or
> > > >> insufficient votes.
> > > >>
> > > >> [1]
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-322+Cooldown+period+for+adaptive+scheduler
> > > >> [2]
> https://lists.apache.org/thread/qvgxzhbp9rhlsqrybxdy51h05zwxfns6
> > > >>
> > > >> Best,
> > > >>
> > > >> Etienne
> >
>

Re: [VOTE] Apache Flink Kubernetes Operator Release 1.6.0, release candidate #1

2023-07-29 Thread Rui Fan

+1 (non-binding)

- Compiled and tested the source code via mvn verify
- Verified the signatures
- Downloaded the image : docker pull
ghcr.io/apache/flink-kubernetes-operator:e7045a6
- Deployed helm chart to test cluster
- Ran example job

Best,
Rui Fan

On Thu, Jul 27, 2023 at 10:53 PM Gyula Fóra  wrote:

> Hi Everyone,
>
> Please review and vote on the release candidate #1 for the version 1.6.0 of
> Apache Flink Kubernetes Operator,
> as follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
> **Release Overview**
>
> As an overview, the release consists of the following:
> a) Kubernetes Operator canonical source distribution (including the
> Dockerfile), to be deployed to the release repository at dist.apache.org
> b) Kubernetes Operator Helm Chart to be deployed to the release repository
> at dist.apache.org
> c) Maven artifacts to be deployed to the Maven Central Repository
> d) Docker image to be pushed to dockerhub
>
> **Staging Areas to Review**
>
> The staging areas containing the above mentioned artifacts are as follows,
> for your review:
> * All artifacts for a,b) can be found in the corresponding dev repository
> at dist.apache.org [1]
> * All artifacts for c) can be found at the Apache Nexus Repository [2]
> * The docker image for d) is staged on github [3]
>
> All artifacts are signed with the key 21F06303B87DAFF1 [4]
>
> Other links for your review:
> * JIRA release notes [5]
> * source code tag "release-1.6.0-rc1" [6]
> * PR to update the website Downloads page to
> include Kubernetes Operator links [7]
>
> **Vote Duration**
>
> The voting time will run for at least 72 hours.
> It is adopted by majority approval, with at least 3 PMC affirmative votes.
>
> **Note on Verification**
>
> You can follow the basic verification guide here[8].
> Note that you don't need to verify everything yourself, but please make
> note of what you have tested together with your +- vote.
>
> Cheers!
> Gyula Fora
>
> [1]
>
> https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.6.0-rc1/
> [2]
> https://repository.apache.org/content/repositories/orgapacheflink-1646/
> [3] ghcr.io/apache/flink-kubernetes-operator:e7045a6
> [4] https://dist.apache.org/repos/dist/release/flink/KEYS
> [5]
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353230
> [6]
> https://github.com/apache/flink-kubernetes-operator/tree/release-1.6.0-rc1
> [7] https://github.com/apache/flink-web/pull/666
> [8]
>
> https://cwiki.apache.org/confluence/display/FLINK/Verifying+a+Flink+Kubernetes+Operator+Release
>

[DISCUSS] FLIP-334 : Decoupling autoscaler and kubernetes

2023-08-01 Thread Rui Fan

Hi all,

I and Samrat(cc'ed) created the FLIP-334[1] to decoupling the autoscaler
and kubernetes.

Currently, the flink-autoscaler is tightly integrated with Kubernetes.
There are compelling reasons to extend the use of flink-autoscaler to
more types of Flink jobs:
1. With the recent merge of the Externalized Declarative Resource
Management (FLIP-291[2]), in-place scaling is now supported
across all types of Flink jobs. This development has made scaling Flink on
YARN a straightforward process.
2. Several discussions[3] within the Flink user community, as observed in
the mail list , have emphasized the necessity of flink-autoscaler
supporting
Flink on YARN.

Please refer to the FLIP[1] document for more details about the proposed
design and implementation. We welcome any feedback and opinions on
this proposal.

[1] https://cwiki.apache.org/confluence/x/x4qzDw
[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
[3] https://lists.apache.org/thread/pr0r8hq8kqpzk3q1zrzkl3rp1lz24v7v

Re: [DISCUSS] FLIP-334 : Decoupling autoscaler and kubernetes

2023-08-01 Thread Rui Fan

Hi Max,

Thanks for your quick response!

> 1. Handle state in the AutoScalerEventHandler which will receive
> all related scaling and metric collection events, and can keep
> track of any state.

If I understand correctly, you mean that updating state is just part of
handling events, right?

If yes, sounds make sense. However, I have some concerns:

- Currently, we have 3 key-values that need to be stored. And the
autoscaler needs to get them first, then update them, and sometimes
remove them. If we use AutoScalerEventHandler, we need to provided
9 methods, right? Every key has 3 methods.
- Do we add the persistState interface for AutoScalerEventHandler to
persist in-memory state to kubernetes?

> 2. In the long run, the autoscaling logic can move to a
> separate repository, although this will complicate the release
> process, so I would defer this unless there is strong interest.

I also agree to leave it in flink-k8s-operator for now. Unless moving it
out of flink-k8s-operator is necessary in the future.

> 3. The proposal mentions some removal of tests.

Sorry, I didn't express clearly in FLIP. POC just check whether these
interfaces work well. It will take more time if I develop all the tests
during POC. So I removed these tests in my POC.

These tests will be completed in the final PR, and the test is very useful
for less bugs.

Best,
Rui Fan

On Tue, Aug 1, 2023 at 10:10 PM Maximilian Michels  wrote:

> Hi Rui,
>
> Thanks for the proposal. I think it makes a lot of sense to decouple
> the autoscaler from Kubernetes-related dependencies. A couple of notes
> when I read the proposal:
>
> 1. You propose AutoScalerEventHandler, AutoScalerStateStore,
> AutoScalerStateStoreFactory, and AutoScalerEventHandler.
> AutoscalerStateStore is a generic key/value database (methods:
> "get"/"put"/"delete"). I would propose to refine this interface and
> make it less general purpose, e.g. add a method for persisting scaling
> decisions as well as any metrics gathered for the current metric
> window. For simplicity, I'd even go so far to remove the state store
> entirely, but rather handle state in the AutoScalerEventHandler which
> will receive all related scaling and metric collection events, and can
> keep track of any state.
>
> 2. You propose to make the current autoscaler module
> Kubernetes-agnostic by moving the Kubernetes parts into the main
> operator module. I think that makes sense since the Kubernetes
> implementation will continue to be tightly coupled with Kubernetes.
> The goal of the separate module was to make the autoscaler logic
> pluggable, but this will continue to be possible with the new
> "flink-autoscaler" module which contains the autoscaling logic and
> interfaces. In the long run, the autoscaling logic can move to a
> separate repository, although this will complicate the release
> process, so I would defer this unless there is strong interest.
>
> 3. The proposal mentions some removal of tests. It is critical for us
> that all test coverage of the current implementation remains active.
> It is ok if some of the test coverage only covers the Kubernetes
> implementation. We can eventually move more tests without Kubernetes
> significance into the implementation-agnostic autoscaler tests.
>
> -Max
>
> On Tue, Aug 1, 2023 at 9:46 AM Rui Fan  wrote:
> >
> > Hi all,
> >
> > I and Samrat(cc'ed) created the FLIP-334[1] to decoupling the autoscaler
> > and kubernetes.
> >
> > Currently, the flink-autoscaler is tightly integrated with Kubernetes.
> > There are compelling reasons to extend the use of flink-autoscaler to
> > more types of Flink jobs:
> > 1. With the recent merge of the Externalized Declarative Resource
> > Management (FLIP-291[2]), in-place scaling is now supported
> > across all types of Flink jobs. This development has made scaling Flink
> on
> > YARN a straightforward process.
> > 2. Several discussions[3] within the Flink user community, as observed in
> > the mail list , have emphasized the necessity of flink-autoscaler
> > supporting
> > Flink on YARN.
> >
> > Please refer to the FLIP[1] document for more details about the proposed
> > design and implementation. We welcome any feedback and opinions on
> > this proposal.
> >
> > [1] https://cwiki.apache.org/confluence/x/x4qzDw
> > [2]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> > [3] https://lists.apache.org/thread/pr0r8hq8kqpzk3q1zrzkl3rp1lz24v7v
>

Re: [VOTE] FLIP-333: Redesign Apache Flink website

2023-08-03 Thread Rui Fan

+1(binding), thanks for driving this proposal, it's cool !

Best,
Rui Fan

On Thu, Aug 3, 2023 at 6:06 PM Jing Ge  wrote:

> +1, thanks for driving it!
>
> Best regards,
> Jing
>
> On Thu, Aug 3, 2023 at 4:49 AM Mohan, Deepthi 
> wrote:
>
> > Hi,
> >
> > Thank you all for your feedback on FLIP-333. I’d like to start a vote.
> >
> > Discussion thread:
> > https://lists.apache.org/thread/z9j0rqt61ftgbkr37gzwbjg0n4fl1hsf
> > FLIP:
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-333%3A+Redesign+Apache+Flink+website
> >
> >
> > Thanks,
> > Deepthi
> >
>

Re: [ANNOUNCE] New Apache Flink Committer - Weihua Hu

2023-08-03 Thread Rui Fan

Congratulations Weihua, well deserved!

Best,
Rui Fan

On Fri, Aug 4, 2023 at 11:19 AM Xintong Song  wrote:

> Hi everyone,
>
> On behalf of the PMC, I'm very happy to announce Weihua Hu as a new Flink
> Committer!
>
> Weihua has been consistently contributing to the project since May 2022. He
> mainly works in Flink's distributed coordination areas. He is the main
> contributor of FLIP-298 and many other improvements in large-scale job
> scheduling and improvements. He is also quite active in mailing lists,
> participating discussions and answering user questions.
>
> Please join me in congratulating Weihua!
>
> Best,
>
> Xintong (on behalf of the Apache Flink PMC)
>

Re: [ANNOUNCE] New Apache Flink PMC Member - Matthias Pohl

2023-08-03 Thread Rui Fan

Congratulations Matthias, well deserved!

Best,
Rui Fan

On Fri, Aug 4, 2023 at 11:30 AM Leonard Xu  wrote:

> Congratulations,  Matthias.
>
> Well deserved ^_^
>
> Best,
> Leonard
>
>
> > On Aug 4, 2023, at 11:18 AM, Xintong Song  wrote:
> >
> > Hi everyone,
> >
> > On behalf of the PMC, I'm very happy to announce that Matthias Pohl has
> > joined the Flink PMC!
> >
> > Matthias has been consistently contributing to the project since Sep
> 2020,
> > and became a committer in Dec 2021. He mainly works in Flink's
> distributed
> > coordination and high availability areas. He has worked on many FLIPs
> > including FLIP195/270/285. He helped a lot with the release management,
> > being one of the Flink 1.17 release managers and also very active in
> Flink
> > 1.18 / 2.0 efforts. He also contributed a lot to improving the build
> > stability.
> >
> > Please join me in congratulating Matthias!
> >
> > Best,
> >
> > Xintong (on behalf of the Apache Flink PMC)
>
>

Re: [DISCUSS] FLIP-334 : Decoupling autoscaler and kubernetes

2023-08-03 Thread Rui Fan

Hi Max,

After careful consideration, I prefer to keep the AutoScalerStateStore
instead of AutoScalerEventHandler taking over the work of
AutoScalerStateStore. And the following are some reasons:

1. Keeping the AutoScalerStateStore to make StateStore easy to plug in.

Currently, the kubernetes-operator-autoscaler uses the ConfigMap as the
state store. However, users may use a different state store for
yarn-autoscaler or generic autoscaler. Such as: MySQL StateStore,
Heaped StateStore or PostgreSQL StateStore, etc.

Of course, kubernetes autoscaler can also use the MySQL StateStore.
If the AutoScalerEventHandler is responsible for recording events,
scaling job and accessing state, whenever users or community want to
create a new state store, they must also implement the new
AutoScalerEventHandler, it includes recording events and scaling job.

If we decouple AutoScalerEventHandler and AutoScalerStateStore,
it's easy to implement a new state store.

2. AutoScalerEventHandler isn't suitable for access state.

Sometimes the generic autoscaler needs to query state,
AutoScalerEventHandler can update the state during handling events.
However, it's wired to provide a series of interfaces to query state.

What do you think?

And looking forward to more thoughts from the community, thanks!

Best,
Rui Fan

On Tue, Aug 1, 2023 at 11:47 PM Rui Fan <1996fan...@gmail.com> wrote:

> Hi Max,
>
> Thanks for your quick response!
>
> > 1. Handle state in the AutoScalerEventHandler which will receive
> > all related scaling and metric collection events, and can keep
> > track of any state.
>
> If I understand correctly, you mean that updating state is just part of
> handling events, right?
>
> If yes, sounds make sense. However, I have some concerns:
>
> - Currently, we have 3 key-values that need to be stored. And the
> autoscaler needs to get them first, then update them, and sometimes
> remove them. If we use AutoScalerEventHandler, we need to provided
> 9 methods, right? Every key has 3 methods.
> - Do we add the persistState interface for AutoScalerEventHandler to
> persist in-memory state to kubernetes?
>
> > 2. In the long run, the autoscaling logic can move to a
> > separate repository, although this will complicate the release
> > process, so I would defer this unless there is strong interest.
>
> I also agree to leave it in flink-k8s-operator for now. Unless moving it
> out of flink-k8s-operator is necessary in the future.
>
> > 3. The proposal mentions some removal of tests.
>
> Sorry, I didn't express clearly in FLIP. POC just check whether these
> interfaces work well. It will take more time if I develop all the tests
> during POC. So I removed these tests in my POC.
>
> These tests will be completed in the final PR, and the test is very useful
> for less bugs.
>
> Best,
> Rui Fan
>
> On Tue, Aug 1, 2023 at 10:10 PM Maximilian Michels  wrote:
>
>> Hi Rui,
>>
>> Thanks for the proposal. I think it makes a lot of sense to decouple
>> the autoscaler from Kubernetes-related dependencies. A couple of notes
>> when I read the proposal:
>>
>> 1. You propose AutoScalerEventHandler, AutoScalerStateStore,
>> AutoScalerStateStoreFactory, and AutoScalerEventHandler.
>> AutoscalerStateStore is a generic key/value database (methods:
>> "get"/"put"/"delete"). I would propose to refine this interface and
>> make it less general purpose, e.g. add a method for persisting scaling
>> decisions as well as any metrics gathered for the current metric
>> window. For simplicity, I'd even go so far to remove the state store
>> entirely, but rather handle state in the AutoScalerEventHandler which
>> will receive all related scaling and metric collection events, and can
>> keep track of any state.
>>
>> 2. You propose to make the current autoscaler module
>> Kubernetes-agnostic by moving the Kubernetes parts into the main
>> operator module. I think that makes sense since the Kubernetes
>> implementation will continue to be tightly coupled with Kubernetes.
>> The goal of the separate module was to make the autoscaler logic
>> pluggable, but this will continue to be possible with the new
>> "flink-autoscaler" module which contains the autoscaling logic and
>> interfaces. In the long run, the autoscaling logic can move to a
>> separate repository, although this will complicate the release
>> process, so I would defer this unless there is strong interest.
>>
>> 3. The proposal mentions some removal of tests. It is critical for us
>> that all test coverage of the current implementation remains active.
>> It is ok if some of the test coverage onl

Re: [DISCUSS] FLIP-334 : Decoupling autoscaler and kubernetes

2023-08-06 Thread Rui Fan

Hi Ron:

Thanks for the feedback! The goal is indeed to turn the autoscaler into
a general tool that can support other resource management.


Hi Max, Gyula:

My proposed `AutoScalerStateStore` is similar to Map, it can really be
improved.

> public interface AutoScalerStateStore {
> Map getState(KEY jobKey)
> void updateState(KEY jobKey, Map state);
> }

>From the method parameter, the StateStore is shared by all jobs, right?
If yes, the `KEY jobKey` isn't enough because the CR is needed during
creating the ConfigMap. The `jobKey` is ResourceID, the extraJobInfo is CR.

So, this parameter may need to be changed from `KEY jobKey` to
`JobAutoScalerContext`. Does that make sense?
If yes, I can update the interface in the FLIP doc.

Best,
Rui

On Mon, Aug 7, 2023 at 10:18 AM liu ron  wrote:

> Hi, Rui
>
> Thanks for driving the FLIP.
>
> The tuning of streaming jobs by autoscaler is very important. Although the
> mainstream trend now is cloud-native, many companies still run their Flink
> jobs on Yarn for historical reasons. If we can decouple autoscaler from K8S
> and turn it into a common tool that can support other resource management
> frameworks such as Yarn, I think it will be very meaningful.
> +1 for this proposal.
>
> Best,
> Ron
>
>
> Gyula Fóra  于2023年8月5日周六 15:03写道：
>
> > Hi Rui!
> >
> > Thanks for the proposal.
> >
> > I agree with Max on that the state store abstractions could be improved
> and
> > be more specific as we know what goes into the state. It could simply be
> >
> > public interface AutoScalerStateStore {
> > Map getState(KEY jobKey)
> > void updateState(KEY jobKey, Map state);
> > }
> >
> >
> > We could also remove the entire recommended parallelism logic from the
> > interface and make it internal to the implementation somehow because it's
> > not very nice in the current form.
> >
> > Cheers,
> > Gyula
> >
> > On Fri, Aug 4, 2023 at 7:05 AM Rui Fan <1996fan...@gmail.com> wrote:
> >
> > > Hi Max,
> > >
> > > After careful consideration, I prefer to keep the AutoScalerStateStore
> > > instead of AutoScalerEventHandler taking over the work of
> > > AutoScalerStateStore. And the following are some reasons:
> > >
> > > 1. Keeping the AutoScalerStateStore to make StateStore easy to plug in.
> > >
> > > Currently, the kubernetes-operator-autoscaler uses the ConfigMap as the
> > > state store. However, users may use a different state store for
> > > yarn-autoscaler or generic autoscaler. Such as: MySQL StateStore,
> > > Heaped StateStore or PostgreSQL StateStore, etc.
> > >
> > > Of course, kubernetes autoscaler can also use the MySQL StateStore.
> > > If the AutoScalerEventHandler is responsible for recording events,
> > > scaling job and accessing state, whenever users or community want to
> > > create a new state store, they must also implement the new
> > > AutoScalerEventHandler, it includes recording events and scaling job.
> > >
> > > If we decouple AutoScalerEventHandler and AutoScalerStateStore,
> > > it's easy to implement a new state store.
> > >
> > > 2. AutoScalerEventHandler isn't suitable for access state.
> > >
> > > Sometimes the generic autoscaler needs to query state,
> > > AutoScalerEventHandler can update the state during handling events.
> > > However, it's wired to provide a series of interfaces to query state.
> > >
> > > What do you think?
> > >
> > > And looking forward to more thoughts from the community, thanks!
> > >
> > > Best,
> > > Rui Fan
> > >
> > > On Tue, Aug 1, 2023 at 11:47 PM Rui Fan <1996fan...@gmail.com> wrote:
> > >
> > > > Hi Max,
> > > >
> > > > Thanks for your quick response!
> > > >
> > > > > 1. Handle state in the AutoScalerEventHandler which will receive
> > > > > all related scaling and metric collection events, and can keep
> > > > > track of any state.
> > > >
> > > > If I understand correctly, you mean that updating state is just part
> of
> > > > handling events, right?
> > > >
> > > > If yes, sounds make sense. However, I have some concerns:
> > > >
> > > > - Currently, we have 3 key-values that need to be stored. And the
> > > > autoscaler needs to get them first, then update them, and sometimes
> > > > remove them. If we use AutoScalerEventHandler, we ne

Re: [ANNOUNCE] New Apache Flink Committer - Yanfei Lei

2023-08-07 Thread Rui Fan

Congratulations Yanfei!

Best,
Rui

On Mon, Aug 7, 2023 at 2:56 PM Yuan Mei  wrote:

> On behalf of the PMC, I'm happy to announce Yanfei Lei as a new Flink
> Committer.
>
> Yanfei has been active in the Flink community for almost two years and has
> played an important role in developing and maintaining State and Checkpoint
> related features/components, including RocksDB Rescaling Performance
> Improvement and Generic Incremental Checkpoints.
>
> Yanfei also helps improve community infrastructure in many ways, including
> migrating the Flink Daily performance benchmark to the Apache Flink slack
> channel. She is the maintainer of the benchmark and has improved its
> detection stability significantly. She is also one of the major maintainers
> of the FrocksDB Repo and released FRocksDB 6.20.3 (part of Flink 1.17
> release). Yanfei is a very active community member, supporting users and
> participating
> in tons of discussions on the mailing lists.
>
> Please join me in congratulating Yanfei for becoming a Flink Committer!
>
> Thanks,
> Yuan Mei (on behalf of the Flink PMC)
>

Re: [ANNOUNCE] New Apache Flink Committer - Hangxiang Yu

2023-08-07 Thread Rui Fan

Congratulations Hangxiang!

Best,
Rui

On Mon, Aug 7, 2023 at 2:58 PM Yuan Mei  wrote:

> On behalf of the PMC, I'm happy to announce Hangxiang Yu as a new Flink
> Committer.
>
> Hangxiang has been active in the Flink community for more than 1.5 years
> and has played an important role in developing and maintaining State and
> Checkpoint related features/components, including Generic Incremental
> Checkpoints (take great efforts to make the feature prod-ready). Hangxiang
> is also the main driver of the FLIP-263: Resolving schema compatibility.
>
> Hangxiang is passionate about the Flink community. Besides the technical
> contribution above, he is also actively promoting Flink: talks about
> Generic
> Incremental Checkpoints in Flink Forward and Meet-up. Hangxiang also spent
> a good amount of time supporting users, participating in Jira/mailing list
> discussions, and reviewing code.
>
> Please join me in congratulating Hangxiang for becoming a Flink Committer!
>
> Thanks,
> Yuan Mei (on behalf of the Flink PMC)
>

Re: [DISCUSS] FLIP-334 : Decoupling autoscaler and kubernetes

2023-08-08 Thread Rui Fan

Hi Matt Wang,

Thanks for your discussion here.

> it is recommended to unify the descriptions of AutoScalerHandler
> and AutoScalerEventHandler in the FLIP

Good catch, I have updated all AutoScalerHandler to
AutoScalerEventHandler.

> Can it support the use of zookeeper (zookeeper is a relatively
> common use of flink HA)?

In my opinion, it's a good suggestion. However, I prefer we
implement other state stores in the other FLINK JIRA, and
this FLIP focus on the decoupling and implementing the
necessary state store. Does that make sense?

> Regarding each scaling information, can it be persisted in
>  the shared file system through the filesystem? I think it will
>  be a more valuable requirement to support viewing
>  Autoscaling info on the UI in the future, which can provide
>  some foundations in advance;

This is a good suggestion as well. It's useful for users to check
the scaling information. I propose to add a CompositeEventHandler,
it can include multiple EventHandlers.

However, as the last question, I prefer we implement other
event handler in the other FLINK JIRA. What do you think?

> A solution mentioned in FLIP is to initialize the
>  AutoScalerEventHandler object every time an event is
>  processed.

No, the FLIP mentioned `The AutoScalerEventHandler  object is shared for
all flink jobs`,
So the AutoScalerEventHandler is only initialized once.

And we call the AutoScalerEventHandler#handlerXXX
every time an event is processed.

Best,
Rui

On Tue, Aug 8, 2023 at 9:40 PM Matt Wang  wrote:

> Hi Rui
>
> Thanks for driving the FLIP.
>
> I agree with the point fo this FLIP. This FLIP first provides a
> general function of Autoscaler in Flink repo, and there is no
> need to move kubernetes-autoscaler from kubernetes-operator
> to Flink repo in this FLIP(it is recommended to unify the
> descriptions of AutoScalerHandler and AutoScalerEventHandler
> in the FLIP). Here I still have a few questions:
>
> 1. AutoScalerStateStore mainly records the state information
> during Scaling. In addition to supporting the use of configmap,
> can it support the use of zookeeper (zookeeper is a relatively
> common use of flink HA)?
> 2. Regarding each scaling information, can it be persisted in
> the shared file system through the filesystem? I think it will
> be a more valuable requirement to support viewing
> Autoscaling info on the UI in the future, which can provide
> some foundations in advance;
> 3. A solution mentioned in FLIP is to initialize the
> AutoScalerEventHandler object every time an event is
> processed. What is the main purpose of this solution?
>
>
>
> --
>
> Best,
> Matt Wang
>
>
>  Replied Message 
> | From | Rui Fan<1996fan...@gmail.com> |
> | Date | 08/7/2023 11:34 |
> | To |  |
> | Cc | m...@apache.org ,
> Gyula Fóra |
> | Subject | Re: [DISCUSS] FLIP-334 : Decoupling autoscaler and kubernetes |
> Hi Ron:
>
> Thanks for the feedback! The goal is indeed to turn the autoscaler into
> a general tool that can support other resource management.
>
>
> Hi Max, Gyula:
>
> My proposed `AutoScalerStateStore` is similar to Map, it can really be
> improved.
>
> public interface AutoScalerStateStore {
> Map getState(KEY jobKey)
> void updateState(KEY jobKey, Map state);
> }
>
> From the method parameter, the StateStore is shared by all jobs, right?
> If yes, the `KEY jobKey` isn't enough because the CR is needed during
> creating the ConfigMap. The `jobKey` is ResourceID, the extraJobInfo is CR.
>
> So, this parameter may need to be changed from `KEY jobKey` to
> `JobAutoScalerContext`. Does that make sense?
> If yes, I can update the interface in the FLIP doc.
>
> Best,
> Rui
>
> On Mon, Aug 7, 2023 at 10:18 AM liu ron  wrote:
>
> Hi, Rui
>
> Thanks for driving the FLIP.
>
> The tuning of streaming jobs by autoscaler is very important. Although the
> mainstream trend now is cloud-native, many companies still run their Flink
> jobs on Yarn for historical reasons. If we can decouple autoscaler from K8S
> and turn it into a common tool that can support other resource management
> frameworks such as Yarn, I think it will be very meaningful.
> +1 for this proposal.
>
> Best,
> Ron
>
>
> Gyula Fóra  于2023年8月5日周六 15:03写道：
>
> Hi Rui!
>
> Thanks for the proposal.
>
> I agree with Max on that the state store abstractions could be improved
> and
> be more specific as we know what goes into the state. It could simply be
>
> public interface AutoScalerStateStore {
> Map getState(KEY jobKey)
> void updateState(KEY jobKey, Map state);
> }
>
>
> We could also remove the entire recommended parallelism logic from

Re: [VOTE] Apache Flink Kubernetes Operator Release 1.6.0, release candidate #2

2023-08-14 Thread Rui Fan

Thanks Gyula for the release!

+1 (non-binding)

- Compiled and tested the source code via mvn verify
- Verified the signatures
- Downloaded the image
- Deployed helm chart to test cluster
- Ran example job

Best,
Rui

On Mon, Aug 14, 2023 at 3:58 PM Gyula Fóra  wrote:

> +1 (binding)
>
> Verified:
>  - Hashes, signatures, source files contain no binaries
>  - Maven repo contents look good
>  - Verified helm chart, image, deployed stateful and autoscaling examples.
> Operator logs look good
>
> Cheers,
> Gyula
>
> On Thu, Aug 10, 2023 at 3:03 PM Gyula Fóra  wrote:
>
> > Hi Everyone,
> >
> > Please review and vote on the release candidate #2 for the
> > version 1.6.0 of Apache Flink Kubernetes Operator,
> > as follows:
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> > **Release Overview**
> >
> > As an overview, the release consists of the following:
> > a) Kubernetes Operator canonical source distribution (including the
> > Dockerfile), to be deployed to the release repository at dist.apache.org
> > b) Kubernetes Operator Helm Chart to be deployed to the release
> repository
> > at dist.apache.org
> > c) Maven artifacts to be deployed to the Maven Central Repository
> > d) Docker image to be pushed to dockerhub
> >
> > **Staging Areas to Review**
> >
> > The staging areas containing the above mentioned artifacts are as
> follows,
> > for your review:
> > * All artifacts for a,b) can be found in the corresponding dev repository
> > at dist.apache.org [1]
> > * All artifacts for c) can be found at the Apache Nexus Repository [2]
> > * The docker image for d) is staged on github [3]
> >
> > All artifacts are signed with the key 21F06303B87DAFF1 [4]
> >
> > Other links for your review:
> > * JIRA release notes [5]
> > * source code tag "release-1.6.0-rc2" [6]
> > * PR to update the website Downloads page to
> > include Kubernetes Operator links [7]
> >
> > **Vote Duration**
> >
> > The voting time will run for at least 72 hours.
> > It is adopted by majority approval, with at least 3 PMC affirmative
> votes.
> >
> >
> > **Note on Verification**
> >
> > You can follow the basic verification guide here[8].
> > Note that you don't need to verify everything yourself, but please make
> > note of what you have tested together with your +- vote.
> >
> > Cheers!
> > Gyula Fora
> >
> > [1]
> >
> https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.6.0-rc2/
> > [2]
> > https://repository.apache.org/content/repositories/orgapacheflink-1649/
> > [3] ghcr.io/apache/flink-kubernetes-operator:ebb1fed
> > [4] https://dist.apache.org/repos/dist/release/flink/KEYS
> > [5]
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353230
> > [6]
> >
> https://github.com/apache/flink-kubernetes-operator/tree/release-1.6.0-rc2
> > [7] https://github.com/apache/flink-web/pull/666
> > [8]
> >
> https://cwiki.apache.org/confluence/display/FLINK/Verifying+a+Flink+Kubernetes+Operator+Release
> >
>

Re: [DISCUSS] FLIP-334 : Decoupling autoscaler and kubernetes

2023-08-21 Thread Rui Fan

Hi Max, Gyula and Matt,

Do you have any other comments?

The flink-kubernetes-operator 1.6 has been released recently,
it's a good time to kick off this FLIP.

Please let me know if you have any questions or concerns,
looking forward to your feedback, thanks!

Best,
Rui

On Wed, Aug 9, 2023 at 11:55 AM Rui Fan <1996fan...@gmail.com> wrote:

> Hi Matt Wang,
>
> Thanks for your discussion here.
>
> > it is recommended to unify the descriptions of AutoScalerHandler
> > and AutoScalerEventHandler in the FLIP
>
> Good catch, I have updated all AutoScalerHandler to
> AutoScalerEventHandler.
>
> > Can it support the use of zookeeper (zookeeper is a relatively
> > common use of flink HA)?
>
> In my opinion, it's a good suggestion. However, I prefer we
> implement other state stores in the other FLINK JIRA, and
> this FLIP focus on the decoupling and implementing the
> necessary state store. Does that make sense?
>
> > Regarding each scaling information, can it be persisted in
> >  the shared file system through the filesystem? I think it will
> >  be a more valuable requirement to support viewing
> >  Autoscaling info on the UI in the future, which can provide
> >  some foundations in advance;
>
> This is a good suggestion as well. It's useful for users to check
> the scaling information. I propose to add a CompositeEventHandler,
> it can include multiple EventHandlers.
>
> However, as the last question, I prefer we implement other
> event handler in the other FLINK JIRA. What do you think?
>
> > A solution mentioned in FLIP is to initialize the
> >  AutoScalerEventHandler object every time an event is
> >  processed.
>
> No, the FLIP mentioned `The AutoScalerEventHandler  object is shared for
> all flink jobs`,
> So the AutoScalerEventHandler is only initialized once.
>
> And we call the AutoScalerEventHandler#handlerXXX
> every time an event is processed.
>
> Best,
> Rui
>
> On Tue, Aug 8, 2023 at 9:40 PM Matt Wang  wrote:
>
>> Hi Rui
>>
>> Thanks for driving the FLIP.
>>
>> I agree with the point fo this FLIP. This FLIP first provides a
>> general function of Autoscaler in Flink repo, and there is no
>> need to move kubernetes-autoscaler from kubernetes-operator
>> to Flink repo in this FLIP(it is recommended to unify the
>> descriptions of AutoScalerHandler and AutoScalerEventHandler
>> in the FLIP). Here I still have a few questions:
>>
>> 1. AutoScalerStateStore mainly records the state information
>> during Scaling. In addition to supporting the use of configmap,
>> can it support the use of zookeeper (zookeeper is a relatively
>> common use of flink HA)?
>> 2. Regarding each scaling information, can it be persisted in
>> the shared file system through the filesystem? I think it will
>> be a more valuable requirement to support viewing
>> Autoscaling info on the UI in the future, which can provide
>> some foundations in advance;
>> 3. A solution mentioned in FLIP is to initialize the
>> AutoScalerEventHandler object every time an event is
>> processed. What is the main purpose of this solution?
>>
>>
>>
>> --
>>
>> Best,
>> Matt Wang
>>
>>
>>  Replied Message 
>> | From | Rui Fan<1996fan...@gmail.com> |
>> | Date | 08/7/2023 11:34 |
>> | To |  |
>> | Cc | m...@apache.org ,
>> Gyula Fóra |
>> | Subject | Re: [DISCUSS] FLIP-334 : Decoupling autoscaler and kubernetes
>> |
>> Hi Ron:
>>
>> Thanks for the feedback! The goal is indeed to turn the autoscaler into
>> a general tool that can support other resource management.
>>
>>
>> Hi Max, Gyula:
>>
>> My proposed `AutoScalerStateStore` is similar to Map, it can really be
>> improved.
>>
>> public interface AutoScalerStateStore {
>> Map getState(KEY jobKey)
>> void updateState(KEY jobKey, Map state);
>> }
>>
>> From the method parameter, the StateStore is shared by all jobs, right?
>> If yes, the `KEY jobKey` isn't enough because the CR is needed during
>> creating the ConfigMap. The `jobKey` is ResourceID, the extraJobInfo is
>> CR.
>>
>> So, this parameter may need to be changed from `KEY jobKey` to
>> `JobAutoScalerContext`. Does that make sense?
>> If yes, I can update the interface in the FLIP doc.
>>
>> Best,
>> Rui
>>
>> On Mon, Aug 7, 2023 at 10:18 AM liu ron  wrote:
>>
>> Hi, Rui
>>
>> Thanks for driving the FLIP.
>>
>> The tuning of streaming jobs by

Re: [ANNOUNCE] release-1.18 branch cut

2023-08-23 Thread Rui Fan

Hi Jing,

Thanks for the effort and update!

It means the PR of flink-1.19 can be merged to master branch, right?

Best,
Rui

On Wed, Aug 23, 2023 at 9:29 PM Jing Ge  wrote:

> Hi devs, The release-1.18 branch has been forked out from the master
> branch, with commit ID cfa4a9c35563fd8a5973ec2f35251190e365be14. The
> version on the master branch has been upgraded to 1.19-SNAPSHOT. From now
> on, for PRs that should be presented in 1.18.0, please make sure: Merge the
> PR into both master and release-1.18 branches The JIRA ticket should be
> closed with the correct fix-versions (1.18.0). The umbrella issue [1] for
> release testing has been created. Please create subtasks for your new
> features under this issue, and make a detailed description on how to verify
> it. We plan to finish all release testing in the next two weeks (until Sept
> 05, 2023), and please update the “X-team verified” column in the 1.18
> release wiki page [2] in the meantime. Also, we’d like to thank all
> contributors who put effort into stabilizing the CI on the master branch in
> the past week, and look forward to stabilizing new features in the coming
> weeks. Good luck with your release testing! Best regards,
> Konstantin, Sergey, Qingsheng, and Jing
>
>
> [1] https://issues.apache.org/jira/browse/FLINK-32726
> [2] https://cwiki.apache.org/confluence/display/FLINK/1.18+Release
>

Re: [ANNOUNCE] release-1.18 branch cut

2023-08-23 Thread Rui Fan

Thanks for the quick response!

Best,
Rui

On Wed, Aug 23, 2023 at 9:46 PM Jing Ge  wrote:

> yes please
>
> On Wed, Aug 23, 2023 at 3:35 PM Rui Fan <1996fan...@gmail.com> wrote:
>
> > Hi Jing,
> >
> > Thanks for the effort and update!
> >
> > It means the PR of flink-1.19 can be merged to master branch, right?
> >
> > Best,
> > Rui
> >
> > On Wed, Aug 23, 2023 at 9:29 PM Jing Ge 
> > wrote:
> >
> > > Hi devs, The release-1.18 branch has been forked out from the master
> > > branch, with commit ID cfa4a9c35563fd8a5973ec2f35251190e365be14. The
> > > version on the master branch has been upgraded to 1.19-SNAPSHOT. From
> now
> > > on, for PRs that should be presented in 1.18.0, please make sure: Merge
> > the
> > > PR into both master and release-1.18 branches The JIRA ticket should be
> > > closed with the correct fix-versions (1.18.0). The umbrella issue [1]
> for
> > > release testing has been created. Please create subtasks for your new
> > > features under this issue, and make a detailed description on how to
> > verify
> > > it. We plan to finish all release testing in the next two weeks (until
> > Sept
> > > 05, 2023), and please update the “X-team verified” column in the 1.18
> > > release wiki page [2] in the meantime. Also, we’d like to thank all
> > > contributors who put effort into stabilizing the CI on the master
> branch
> > in
> > > the past week, and look forward to stabilizing new features in the
> coming
> > > weeks. Good luck with your release testing! Best regards,
> > > Konstantin, Sergey, Qingsheng, and Jing
> > >
> > >
> > > [1] https://issues.apache.org/jira/browse/FLINK-32726
> > > [2] https://cwiki.apache.org/confluence/display/FLINK/1.18+Release
> > >
> >
>

Re: [DISCUSS] FLIP-334 : Decoupling autoscaler and kubernetes

2023-09-05 Thread Rui Fan

After discussing this FLIP-334[1] offline with Gyula and Max,
I updated the FLIP based on the latest conclusion.

Big thanks to Gyula and Max for their professional advice!

> Does the interface function of handlerRecommendedParallelism
> in AutoScalerEventHandler conflict with
> handlerScalingFailure/handlerScalingReport (one of the
> handles the event of scale failure, and the other handles
> the event of scale success).
Hi Matt,

You can take a look at the FLIP, I think the issue has been fixed.
Currently, we introduced the ScalingRealizer and
AutoScalerEventHandler interface.

The ScalingRealizer handles scaling action.
- The AutoScalerEventHandler  interface handles loggable events.

Looking forward to your feedback, thanks!

[1] https://cwiki.apache.org/confluence/x/x4qzDw

Best,
Rui

On Thu, Aug 24, 2023 at 10:55 AM Matt Wang  wrote:

> Sorry for the late reply, I still have a small question here:
> Does the interface function of handlerRecommendedParallelism
> in AutoScalerEventHandler conflict with
> handlerScalingFailure/handlerScalingReport (one of the
> handles the event of scale failure, and the other handles
> the event of scale success).
>
>
>
> --
>
> Best,
> Matt Wang
>
>
>  Replied Message 
> | From | Rui Fan<1996fan...@gmail.com> |
> | Date | 08/21/2023 17:41 |
> | To |  |
> | Cc | Maximilian Michels ,
> Gyula Fóra ,
> Matt Wang |
> | Subject | Re: [DISCUSS] FLIP-334 : Decoupling autoscaler and kubernetes |
> Hi Max, Gyula and Matt,
>
> Do you have any other comments?
>
> The flink-kubernetes-operator 1.6 has been released recently,
> it's a good time to kick off this FLIP.
>
> Please let me know if you have any questions or concerns,
> looking forward to your feedback, thanks!
>
> Best,
> Rui
>
> On Wed, Aug 9, 2023 at 11:55 AM Rui Fan <1996fan...@gmail.com> wrote:
>
> Hi Matt Wang,
>
> Thanks for your discussion here.
>
> it is recommended to unify the descriptions of AutoScalerHandler
> and AutoScalerEventHandler in the FLIP
>
> Good catch, I have updated all AutoScalerHandler to
> AutoScalerEventHandler.
>
> Can it support the use of zookeeper (zookeeper is a relatively
> common use of flink HA)?
>
> In my opinion, it's a good suggestion. However, I prefer we
> implement other state stores in the other FLINK JIRA, and
> this FLIP focus on the decoupling and implementing the
> necessary state store. Does that make sense?
>
> Regarding each scaling information, can it be persisted in
> the shared file system through the filesystem? I think it will
> be a more valuable requirement to support viewing
> Autoscaling info on the UI in the future, which can provide
> some foundations in advance;
>
> This is a good suggestion as well. It's useful for users to check
> the scaling information. I propose to add a CompositeEventHandler,
> it can include multiple EventHandlers.
>
> However, as the last question, I prefer we implement other
> event handler in the other FLINK JIRA. What do you think?
>
> A solution mentioned in FLIP is to initialize the
> AutoScalerEventHandler object every time an event is
> processed.
>
> No, the FLIP mentioned `The AutoScalerEventHandler  object is shared for
> all flink jobs`,
> So the AutoScalerEventHandler is only initialized once.
>
> And we call the AutoScalerEventHandler#handlerXXX
> every time an event is processed.
>
> Best,
> Rui
>
> On Tue, Aug 8, 2023 at 9:40 PM Matt Wang  wrote:
>
> Hi Rui
>
> Thanks for driving the FLIP.
>
> I agree with the point fo this FLIP. This FLIP first provides a
> general function of Autoscaler in Flink repo, and there is no
> need to move kubernetes-autoscaler from kubernetes-operator
> to Flink repo in this FLIP(it is recommended to unify the
> descriptions of AutoScalerHandler and AutoScalerEventHandler
> in the FLIP). Here I still have a few questions:
>
> 1. AutoScalerStateStore mainly records the state information
> during Scaling. In addition to supporting the use of configmap,
> can it support the use of zookeeper (zookeeper is a relatively
> common use of flink HA)?
> 2. Regarding each scaling information, can it be persisted in
> the shared file system through the filesystem? I think it will
> be a more valuable requirement to support viewing
> Autoscaling info on the UI in the future, which can provide
> some foundations in advance;
> 3. A solution mentioned in FLIP is to initialize the
> AutoScalerEventHandler object every time an event is
> processed. What is the main purpose of this solution?
>
>
>
> --
>
> Best,
> Matt Wang
>
>
>  Replied Message 
> | From | Rui Fan<1996fan...@gmail.com> |

Re: [DISCUSS] FLIP-361: Improve GC Metrics

2023-09-05 Thread Rui Fan

Hi Gyula,

+1 for this proposal. The current GC metric is really unfriendly.

I have a concern with your proposed rate metric: the rate is perSecond
instead of per minute. I'm unsure whether it's suitable for GC metric.

There are two reasons why I suspect perSecond may not be well
compatible with GC metric:

1. GCs are usually infrequent and may only occur for a small number
of time periods within a minute.

Metrics are collected periodically, for example, reported every minute.
If the result reported by the GC metric is 1s/perSecond, it does not
mean that the GC of the TM is serious, because there may be no GC
in the remaining 59s.

On the contrary, the GC metric reports 0s/perSecond, which does not
mean that the GC of the TM is not serious, and the GC may be very
serious in the remaining 59s.

2. Stop-the-world may cause the metric to fail(delay) to report

The TM will stop the world during GC, especially full GC. It means
the metric cannot be collected or reported during full GC.

So the collected GC metric may never be 1s/perSecond. This metric
may always be good because the metric will only be reported when
the GC is not severe.

If these concerns make sense, how about updating the GC rate
at minute level?

We can define the type to Gauge for TimeMsPerMiunte, and updating
this Gauge every second, it is:
GC Total.Time of current time - GC total time of one miunte ago.

Best,
Rui

On Tue, Sep 5, 2023 at 11:05 PM Maximilian Michels  wrote:

> Hi Gyula,
>
> +1 The proposed changes make sense and are in line with what is
> available for other metrics, e.g. number of records processed.
>
> -Max
>
> On Tue, Sep 5, 2023 at 2:43 PM Gyula Fóra  wrote:
> >
> > Hi Devs,
> >
> > I would like to start a discussion on FLIP-361: Improve GC Metrics [1].
> >
> > The current Flink GC metrics [2] are not very useful for monitoring
> > purposes as they require post processing logic that is also dependent on
> > the current runtime environment.
> >
> > Problems:
> >  - Total time is not very relevant for long running applications, only
> the
> > rate of change (msPerSec)
> >  - In most cases it's best to simply aggregate the time/count across the
> > different GabrageCollectors, however the specific collectors are
> dependent
> > on the current Java runtime
> >
> > We propose to improve the current situation by:
> >  - Exposing rate metrics per GarbageCollector
> >  - Exposing aggregated Total time/count/rate metrics
> >
> > These new metrics are all derived from the existing ones with minimal
> > overhead.
> >
> > Looking forward to your feedback.
> >
> > Cheers,
> > Gyula
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-361%3A+Improve+GC+Metrics
> > [2]
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#garbagecollection
>

Re: [DISCUSS] FLIP-361: Improve GC Metrics

2023-09-05 Thread Rui Fan

Thanks for the clarification!

By default the meterview measures for 1 minute sounds good to me!

+1 for this proposal.

Best,
Rui

On Wed, Sep 6, 2023 at 1:27 PM Gyula Fóra  wrote:

> Thanks for the feedback Rui,
>
> The rates would be computed using the MeterView class (like for any other
> rate metric), just because we report the value per second it doesn't mean
> that we measure in a second granularity.
> By default the meterview measures for 1 minute and then we calculate the
> per second rates, but we can increase the timespan if necessary.
>
> So I don't think we run into this problem in practice and we can keep the
> metric aligned with other time rate metrics like busyTimeMsPerSec etc.
>
> Cheers,
> Gyula
>
> On Wed, Sep 6, 2023 at 4:55 AM Rui Fan <1996fan...@gmail.com> wrote:
>
> > Hi Gyula,
> >
> > +1 for this proposal. The current GC metric is really unfriendly.
> >
> > I have a concern with your proposed rate metric: the rate is perSecond
> > instead of per minute. I'm unsure whether it's suitable for GC metric.
> >
> > There are two reasons why I suspect perSecond may not be well
> > compatible with GC metric:
> >
> > 1. GCs are usually infrequent and may only occur for a small number
> > of time periods within a minute.
> >
> > Metrics are collected periodically, for example, reported every minute.
> > If the result reported by the GC metric is 1s/perSecond, it does not
> > mean that the GC of the TM is serious, because there may be no GC
> > in the remaining 59s.
> >
> > On the contrary, the GC metric reports 0s/perSecond, which does not
> > mean that the GC of the TM is not serious, and the GC may be very
> > serious in the remaining 59s.
> >
> > 2. Stop-the-world may cause the metric to fail(delay) to report
> >
> > The TM will stop the world during GC, especially full GC. It means
> > the metric cannot be collected or reported during full GC.
> >
> > So the collected GC metric may never be 1s/perSecond. This metric
> > may always be good because the metric will only be reported when
> > the GC is not severe.
> >
> >
> > If these concerns make sense, how about updating the GC rate
> > at minute level?
> >
> > We can define the type to Gauge for TimeMsPerMiunte, and updating
> > this Gauge every second, it is:
> > GC Total.Time of current time - GC total time of one miunte ago.
> >
> > Best,
> > Rui
> >
> > On Tue, Sep 5, 2023 at 11:05 PM Maximilian Michels 
> wrote:
> >
> > > Hi Gyula,
> > >
> > > +1 The proposed changes make sense and are in line with what is
> > > available for other metrics, e.g. number of records processed.
> > >
> > > -Max
> > >
> > > On Tue, Sep 5, 2023 at 2:43 PM Gyula Fóra 
> wrote:
> > > >
> > > > Hi Devs,
> > > >
> > > > I would like to start a discussion on FLIP-361: Improve GC Metrics
> [1].
> > > >
> > > > The current Flink GC metrics [2] are not very useful for monitoring
> > > > purposes as they require post processing logic that is also dependent
> > on
> > > > the current runtime environment.
> > > >
> > > > Problems:
> > > >  - Total time is not very relevant for long running applications,
> only
> > > the
> > > > rate of change (msPerSec)
> > > >  - In most cases it's best to simply aggregate the time/count across
> > the
> > > > different GabrageCollectors, however the specific collectors are
> > > dependent
> > > > on the current Java runtime
> > > >
> > > > We propose to improve the current situation by:
> > > >  - Exposing rate metrics per GarbageCollector
> > > >  - Exposing aggregated Total time/count/rate metrics
> > > >
> > > > These new metrics are all derived from the existing ones with minimal
> > > > overhead.
> > > >
> > > > Looking forward to your feedback.
> > > >
> > > > Cheers,
> > > > Gyula
> > > >
> > > > [1]
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-361%3A+Improve+GC+Metrics
> > > > [2]
> > > >
> > >
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#garbagecollection
> > >
> >
>

Re: [DISCUSS] FLIP-334 : Decoupling autoscaler and kubernetes

2023-09-06 Thread Rui Fan

Hi Max,

As the FLIP mentioned, we have the plan to add the
alternative implementation.

First of all, we will develop a generic autoscaler. This generic
autoscaler will not have knowledge of specific jobs, and users
will have the flexibility to pass the JobAutoScalerContext
when utilizing the generic autoscaler. Communication with
Flink jobs can be achieved through the RestClusterClient.

   - The generic ScalingRealizer is based on the rescale API (FLIP-291).
   - The generic EventHandler is based on the logger.
   - The generic StateStore is based on the Heap. This means that the state
   information is stored in memory and can be lost if the autoscaler restarts.


Secondly, for yarn implementation, as Samrat mentioned,
There is currently no flink-yarn-operator, and we cannot
easily obtain the job list. We are not yet sure how to manage
yarn's flink jobs. In order to prevent the FLIP from being too huge,
after confirming with Gyula and Samrat before, it is decided
that the current FLIP will not implement the automated
yarn-autoscaler. And it will be a separate FLIP in the future.


After this part is finished, flink users or other flink platforms can easy
to use the autoscaler, they just pass the Context, and the autoscaler
can find the flink job using the RestClient.

The first part will be done in this FLIP. And we can discuss
whether the second part should be done in this FLIP as well.

Best,
Rui

On Wed, Sep 6, 2023 at 4:34 AM Samrat Deb  wrote:

> Hi Max,
>
> > are we planning to add an alternative implementation
> against the new interfaces?
>
> Yes, we are simultaneously working on the YARN implementation using the
> interface. During the initial interface design, we encountered some
> anomalies while implementing it in YARN.
>
> Once the interfaces are finalized, we will proceed to raise a pull request
> (PR) for YARN as well.
>
> Our initial approach was to create a decoupled interface as part of
> FLIP-334 and then implement it for YARN in the subsequent phase.
> However, if you recommend combining both phases, we can certainly consider
> that option.
>
> We look forward to hearing your thoughts on whether to have YARN
> implementation as part of FLIP-334 or seperate one ?
>
> Bests
> Samrat
>
>
>
> On Tue, Sep 5, 2023 at 8:41 PM Maximilian Michels  wrote:
>
> > Thanks Rui for the update!
> >
> > Alongside with the refactoring to decouple autoscaler logic from the
> > deployment logic, are we planning to add an alternative implementation
> > against the new interfaces? I think the best way to get the interfaces
> > right, is to have an alternative implementation in addition to
> > Kubernetes. YARN or a standalone mode implementation were already
> > mentioned. Ultimately, this is the reason we are doing the
> > refactoring. Without a new implementation, it becomes harder to
> > justify the refactoring work.
> >
> > Cheers,
> > Max
> >
> > On Tue, Sep 5, 2023 at 9:48 AM Rui Fan  wrote:
> > >
> > > After discussing this FLIP-334[1] offline with Gyula and Max,
> > > I updated the FLIP based on the latest conclusion.
> > >
> > > Big thanks to Gyula and Max for their professional advice!
> > >
> > > > Does the interface function of handlerRecommendedParallelism
> > > > in AutoScalerEventHandler conflict with
> > > > handlerScalingFailure/handlerScalingReport (one of the
> > > > handles the event of scale failure, and the other handles
> > > > the event of scale success).
> > > Hi Matt,
> > >
> > > You can take a look at the FLIP, I think the issue has been fixed.
> > > Currently, we introduced the ScalingRealizer and
> > > AutoScalerEventHandler interface.
> > >
> > > The ScalingRealizer handles scaling action.
> > >
> > > The AutoScalerEventHandler  interface handles loggable events.
> > >
> > >
> > > Looking forward to your feedback, thanks!
> > >
> > > [1] https://cwiki.apache.org/confluence/x/x4qzDw
> > >
> > > Best,
> > > Rui
> > >
> > > On Thu, Aug 24, 2023 at 10:55 AM Matt Wang  wrote:
> > >>
> > >> Sorry for the late reply, I still have a small question here:
> > >> Does the interface function of handlerRecommendedParallelism
> > >> in AutoScalerEventHandler conflict with
> > >> handlerScalingFailure/handlerScalingReport (one of the
> > >> handles the event of scale failure, and the other handles
> > >> the event of scale success).
> > >>
> > >>
> > >>
> > >> --
> > >>
> > >> Best,

Re: [DISCUSS] FLIP-334 : Decoupling autoscaler and kubernetes

2023-09-06 Thread Rui Fan

Hi Max,

Thanks for your feedback!

> We need to go through these phases for the FLIP to be meaningful:
> 1. Decouple autoscaler from current autoscaler (generalization)
> 2. Ensure 100% functionality and test coverage of Kubernetes
implementation
> 3. Interface with another backend (e.g. YARN or standalone)

These phases make sense to me.

I have updated the FLIP[1] based on our offline discussion.
I added the `5. Standalone AutoScaler` to explain the generic
autoscaler.

Please correct me if anything is wrong or missed, thanks again!

[1] https://cwiki.apache.org/confluence/x/x4qzDw

Best,
Rui

On Wed, Sep 6, 2023 at 7:00 PM Maximilian Michels  wrote:

> Hey Rui, hey Samrat,
>
> I want to ensure this is not just an exercise but has actual benefits
> for the community. In the past, I've seen that the effort stops half
> way through, the refactoring gets done with some regressions, but
> actual alternative implementations based on the new design never
> follow.
>
> We need to go through these phases for the FLIP to be meaningful:
>
> 1. Decouple autoscaler from current autoscaler (generalization)
> 2. Ensure 100% functionality and test coverage of Kubernetes implementation
> 3. Interface with another backend (e.g. YARN or standalone)
>
> If we don't follow through with this plan, I'm not sure we are better
> off than with the current implementation. Apologies if I'm being a bit
> strict here but the autoscaling code has become a critical
> infrastructure component. We need to carefully weigh the pros and cons
> here to avoid risks for our users, some of them using this code in
> production and relying on it on a day to day basis.
>
> That said, we are open to following through with the FLIP and we can
> definitely help review code changes and build on the new design.
>
> -Max
>
>
> On Wed, Sep 6, 2023 at 11:26 AM Rui Fan <1996fan...@gmail.com> wrote:
> >
> > Hi Max,
> >
> > As the FLIP mentioned, we have the plan to add the
> > alternative implementation.
> >
> > First of all, we will develop a generic autoscaler. This generic
> > autoscaler will not have knowledge of specific jobs, and users
> > will have the flexibility to pass the JobAutoScalerContext
> > when utilizing the generic autoscaler. Communication with
> > Flink jobs can be achieved through the RestClusterClient.
> >
> >- The generic ScalingRealizer is based on the rescale API (FLIP-291).
> >- The generic EventHandler is based on the logger.
> >- The generic StateStore is based on the Heap. This means that the
> state
> >information is stored in memory and can be lost if the autoscaler
> restarts.
> >
> >
> > Secondly, for yarn implementation, as Samrat mentioned,
> > There is currently no flink-yarn-operator, and we cannot
> > easily obtain the job list. We are not yet sure how to manage
> > yarn's flink jobs. In order to prevent the FLIP from being too huge,
> > after confirming with Gyula and Samrat before, it is decided
> > that the current FLIP will not implement the automated
> > yarn-autoscaler. And it will be a separate FLIP in the future.
> >
> >
> > After this part is finished, flink users or other flink platforms can
> easy
> > to use the autoscaler, they just pass the Context, and the autoscaler
> > can find the flink job using the RestClient.
> >
> > The first part will be done in this FLIP. And we can discuss
> > whether the second part should be done in this FLIP as well.
> >
> > Best,
> > Rui
> >
> > On Wed, Sep 6, 2023 at 4:34 AM Samrat Deb  wrote:
> >
> > > Hi Max,
> > >
> > > > are we planning to add an alternative implementation
> > > against the new interfaces?
> > >
> > > Yes, we are simultaneously working on the YARN implementation using the
> > > interface. During the initial interface design, we encountered some
> > > anomalies while implementing it in YARN.
> > >
> > > Once the interfaces are finalized, we will proceed to raise a pull
> request
> > > (PR) for YARN as well.
> > >
> > > Our initial approach was to create a decoupled interface as part of
> > > FLIP-334 and then implement it for YARN in the subsequent phase.
> > > However, if you recommend combining both phases, we can certainly
> consider
> > > that option.
> > >
> > > We look forward to hearing your thoughts on whether to have YARN
> > > implementation as part of FLIP-334 or seperate one ?
> > >
> > > Bests
> > > Samrat
> > >
> > >
> > &

Re: [DISCUSS] FLIP-363: Unify the Representation of TaskManager Location in REST API and Web UI

2023-09-10 Thread Rui Fan

Thanks Zhanghao driving this FLIP, adding the port in Web UI
seems good to me.

Hi Shammon and Zhanghao,

I would like to clarify the difference between Public Interfaces
in FLIP and @Public in code.

As I understand, the `Public Interfaces in FLIP` means these
changes will be used in user side, such as: @Public class,
Configuration settings, User-facing scripts/command-line tools,
and rest api, etc.

You can refer to  "What are the "public interfaces" of the project?"
part in Flink Improvement Proposals doc[1].

@Public class means the user will use this class directly, and
these rest classes won't be depended on directly. So I think
these classes related to rest don't need to be marked @Public.

Please correct me if anything is wrong, thanks~

[1]
https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals

Best,
Rui

On Mon, Sep 11, 2023 at 11:09 AM Weihua Hu  wrote:

> Hi, Zhanghao
>
> Thanks for bringing this proposal.
>
> I have a concern:
>
> I prefer to keep the "host" field and add a "location" field in future
> versions.
> Consider a scenario where a machine (host) with multiple TaskManagers has
> poor processing performance due to some problems.
> By using a host field aggregation, I can identify the problems with this
> machine and take it offline.
>
> Best,
> Weihua
>
>
> On Mon, Sep 11, 2023 at 10:34 AM Chen Zhanghao 
> wrote:
>
> > Hi Shammon,
> >
> > I think all REST API response messages (e.g.
> > SubtaskExecutionAttemptDetailsInfo) should be considered as part of the
> > public APIs and therefore be marked as @Public. It is true though none of
> > them are marked as @public yet. Maybe we should do that. ccing
> > @chesnay for confirmation.
> >
> > Best,
> > Zhanghao Chen
> > 
> > 发件人: Shammon FY 
> > 发送时间: 2023年9月11日 10:22
> > 收件人: dev@flink.apache.org 
> > 主题: Re: [DISCUSS] FLIP-363: Unify the Representation of TaskManager
> > Location in REST API and Web UI
> >
> > Thanks Zhanghao for initialing this discussion, I have just one comment:
> >
> > I checked the classes `SubtasksAllAccumulatorsHandler`,
> > `SubtasksTimesHandler`, `SubtaskCurrentAttemptDetailsHandler`,
> > `JobVertexTaskManagersHandler` and `JobExceptionsHandler` you mentioned
> in
> > `Public Interfaces` and they are not annotated as `Public`. So do you
> want
> > to annotate them as `Plublic`? If not, I think you may need to move them
> > from `Public Interfaces` to `Proposed Changes`.
> >
> > Best,
> > Shammon FY
> >
> > On Sat, Sep 9, 2023 at 12:11 PM Chen Zhanghao  >
> > wrote:
> >
> > > Hi Devs,
> > >
> > > I would like to start a discussion on FLIP-363: Unify the
> Representation
> > > of TaskManager Location in REST API and Web UI [1].
> > >
> > > The TaskManager location of subtasks is important for identifying
> > > TM-related problems. There are a number of places in REST API and Web
> UI
> > > where TaskManager location is returned/displayed.
> > >
> > > Problems:
> > >
> > >   *   Only hostname is provided to represent TaskManager location in
> some
> > > places (e.g. SubtaskCurrentAttemptDetailsHandler). However, in a
> > > containerized era, it is common to have multiple TMs on the same host,
> > and
> > > port info is crucial to distinguish different TMs.
> > >   *   Inconsistent naming of the field to represent TaskManager
> location:
> > > "host" is used in most places but "location" is also used in
> > > JobExceptions-related places.
> > >   *   Inconsistent semantics of the "host" field: The semantics of the
> > > host field are inconsistent, sometimes it denotes hostname only while
> in
> > > other times it denotes hostname + port (which is also inconsistent with
> > the
> > > name of "host").
> > >
> > > We propose to improve the current situation by:
> > >
> > >   *   Use a field named "location" that represents TaskManager location
> > in
> > > the form of "${hostname}:${port}" in a consistent manner across REST
> APIs
> > > and the front-end.
> > >   *   Rename the column name from "Host" to "Location" on the Web UI to
> > > reflect the change that both hostname and port are displayed.
> > >   *   Keep the old "host" fields untouched for compatibility. They can
> be
> > > removed in the next major version.
> > >
> > > Looking forward to your feedback.
> > >
> > > [1] FLIP-363: Unify the Representation of TaskManager Location in REST
> > API
> > > and Web UI - Apache Flink - Apache Software Foundation<
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-363%3A+Unify+the+Representation+of+TaskManager+Location+in+REST+API+and+Web+UI
> > > >
> > >
> > > Best,
> > > Zhanghao Chen
> > >
> >
>

[VOTE] FLIP-334: Decoupling autoscaler and kubernetes and support the Standalone Autoscaler

2023-09-12 Thread Rui Fan

Hi all,

Thanks for all the feedback about the FLIP-334:
Decoupling autoscaler and kubernetes and
support the Standalone Autoscaler[1].
This FLIP was discussed in [2].

I'd like to start a vote for it. The vote will be open for at least 72
hours (until Sep 16th 11:00 UTC+8) unless there is an objection or
insufficient votes.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-334+%3A+Decoupling+autoscaler+and+kubernetes+and+support+the+Standalone+Autoscaler
[2] https://lists.apache.org/thread/kmm03gls1vw4x6vk1ypr9ny9q9522495

Best,
Rui

Re: [VOTE] FLIP-361: Improve GC Metrics

2023-09-13 Thread Rui Fan

+1(binding)

Best,
Rui

On Wed, Sep 13, 2023 at 9:16 PM Gyula Fóra  wrote:

> Hi All!
>
> Thanks for all the feedback on FLIP-361: Improve GC Metrics [1][2]
>
> I'd like to start a vote for it. The vote will be open for at least 72
> hours unless there is an objection or insufficient votes.
>
> Cheers,
> Gyula
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-361%3A+Improve+GC+Metrics
> [2] https://lists.apache.org/thread/qqqv54vyr4gbp63wm2d12q78m8h95xb2
>

Re: [VOTE] FLIP-334: Decoupling autoscaler and kubernetes and support the Standalone Autoscaler

2023-09-15 Thread Rui Fan

+1(binding)

Best,
Rui

On Thu, Sep 14, 2023 at 10:06 AM liu ron  wrote:

> +1(non-binding)
>
> Best,
> Ron
>
> Dong Lin  于2023年9月14日周四 09:01写道：
>
> > Thank you Rui for the proposal.
> >
> > +1 (binding)
> >
> > On Wed, Sep 13, 2023 at 10:52 AM Rui Fan <1996fan...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > Thanks for all the feedback about the FLIP-334:
> > > Decoupling autoscaler and kubernetes and
> > > support the Standalone Autoscaler[1].
> > > This FLIP was discussed in [2].
> > >
> > > I'd like to start a vote for it. The vote will be open for at least 72
> > > hours (until Sep 16th 11:00 UTC+8) unless there is an objection or
> > > insufficient votes.
> > >
> > > [1]
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-334+%3A+Decoupling+autoscaler+and+kubernetes+and+support+the+Standalone+Autoscaler
> > > [2] https://lists.apache.org/thread/kmm03gls1vw4x6vk1ypr9ny9q9522495
> > >
> > > Best,
> > > Rui
> > >
> >
>

[RESULT][VOTE] FLIP-334: Decoupling autoscaler and kubernetes and support the Standalone Autoscaler

2023-09-15 Thread Rui Fan

Hi all,

Thanks everyone for your review and the votes!

I am happy to announce that FLIP-334:
Decoupling autoscaler and kubernetes and
support the Standalone Autoscaler[1]
has been accepted.

There are 4 binding votes and 8 non-binding vote [2]:

- Gyula Fora (binding)
- Maximilian Michels (binding)
- Dong Lin (binding)
- Rui Fan(binding)
- Ahmed Hamdy
- ConradJam
- Matt Wang
- Ferenc Csaky
- Zhanghao Chen
- Feng Jin
- Samrat Deb
- Ron Liu

There is no disapproving vote.

[1] https://cwiki.apache.org/confluence/x/x4qzDw
[2] https://lists.apache.org/thread/3wmhhqgkkg1l7ncxnzwqnjqyhqz545gl

Best,
Rui

Re: [VOTE] FLIP-363: Unify the Representation of TaskManager Location in REST API and Web UI

2023-09-17 Thread Rui Fan

A gentle reminder about the location naming.
The naming of location is a little unclear, but
I can't think of any other better naming.

So I +1(binding) first.

Ping @Jing Ge  to help double check the name again.

Sorry for mentioning naming in the VOTE thread,
I didn't know this VOTE would be so early.

Best,
Rui

On Mon, Sep 18, 2023 at 11:44 AM Yangze Guo  wrote:

> +1 (binding)
>
> Best,
> Yangze Guo
>
> On Mon, Sep 18, 2023 at 11:37 AM Chen Zhanghao
>  wrote:
> >
> > Hi All,
> >
> > Thanks for all the feedback on FLIP-363: Unify the Representation of
> TaskManager Location in REST API and Web UI [1][2]
> >
> > I'd like to start a vote for FLIP-363. The vote will be open for at
> least 72 hours unless there is an objection or insufficient votes.
> >
> > [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-363%3A+Unify+the+Representation+of+TaskManager+Location+in+REST+API+and+Web+UI
> > [2] https://lists.apache.org/thread/sls1196mmk25w8nm2qf585254nbjr9hd
> >
> > Best,
> > Zhanghao Chen
>

Re: [DISCUSS] FLIP-363: Unify the Representation of TaskManager Location in REST API and Web UI

2023-09-17 Thread Rui Fan

Hi Zhanghao,

> Use a field named "location" (already used in JobExceptionsInfoWithHistory)
that represents TaskManager location using the newly added string formatter
method.

How about using the taskManagerLocation instead of location
for all Rest response related classes?

The taskManagerLocation may be clearer.

Best,
Rui


On Mon, Sep 18, 2023 at 10:11 AM Chen Zhanghao 
wrote:

> Hi all,
>
> I've updated the FLIP to incorporate Yangze's advice:
>
> 1. Add a new string formatter method to TaskManagerLocation and
> ArchivedTaskManagerLocation that prints in the form of
> "${hostname}:${port}" to align the string formatter used by REST API.
> 2. Highlight that the old host field will be kept for at least 2 minor
> versions before removal.
>
> Best,
> Zhanghao Chen
> 
> 发件人: Yangze Guo 
> 发送时间: 2023年9月15日 17:26
> 收件人: dev@flink.apache.org 
> 主题: Re: [DISCUSS] FLIP-363: Unify the Representation of TaskManager
> Location in REST API and Web UI
>
> Thanks for driving this, Zhanghao. +1 for the overall proposal.
>
> Some cents from my side:
>
> 1. Since most of the rest api get the location from
> TaskManagerLocation, we can align the string formatter in this class.
> E.g. add something like toHumanRealableString to this class.
>
> 2. From my understanding of FLIP-321, if we want to deprecate the host
> field, we should annotate the related field / getter / setter with
> @Deprecated in this version, and keep them at least 2 minor releases.
>
> Best,
> Yangze Guo
>
> On Wed, Sep 13, 2023 at 8:52 PM Chen Zhanghao 
> wrote:
> >
> > Hi Jing,
> >
> > Thanks for the suggestion. Endpoint is indeed a more professional word
> in the networking world but I think location is more suited here for two
> reasons:
> >
> >   1.  The term here is for uniquely identifying the TaskManager where
> the task is deployed while providing the host machine info as well to help
> identify taskmanager- and host-aggregative problems. So strictly speaking,
> it is not used in a pure networking context.
> >   2.  The term "location" is already used widely in the codebase, e.g.
> TaskManagerLocation and JobExceptions-related classes.
> >
> > WDYT?
> >
> > Best,
> > Zhanghao Chen
> > 
> > 发件人: Jing Ge 
> > 发送时间: 2023年9月13日 4:52
> > 收件人: dev@flink.apache.org 
> > 主题: Re: [DISCUSS] FLIP-363: Unify the Representation of TaskManager
> Location in REST API and Web UI
> >
> > Hi Zhanghao,
> >
> > Thanks for bringing this to our attention. It is a good proposal to
> improve
> > data consistency.
> >
> > Speaking of naming conventions of choosing location over host, how about
> > "endpoint" with the following thoughts:
> >
> > 1. endpoint is a more professional word than location in the network
> > context.
> > 2. I know commonly endpoints mean the URLs of services. Using
> Hostname:port
> > as the endpoint follows exactly the same rule, because TaskManager is the
> > top level service that aligns with the top level endpoint.
> >
> > WDYT?
> >
> > Best regards,
> > Jing
> >
> >
> > On Mon, Sep 11, 2023 at 6:01 AM Weihua Hu 
> wrote:
> >
> > > Hi, Zhanghao
> > >
> > > Since the meaning of "host" is not aligned, it seems good for me to
> remove
> > > it in the future release.
> > >
> > > Best,
> > > Weihua
> > >
> > >
> > > On Mon, Sep 11, 2023 at 11:48 AM Chen Zhanghao <
> zhanghao.c...@outlook.com>
> > > wrote:
> > >
> > > > Hi Fan Rui,
> > > >
> > > > Thanks for clarifying the definition of "public interfaces", that
> helps a
> > > > lot!
> > > >
> > > > Best,
> > > > Zhanghao Chen
> > > > 
> > > > 发件人: Rui Fan <1996fan...@gmail.com>
> > > > 发送时间: 2023年9月11日 11:18
> > > > 收件人: dev@flink.apache.org 
> > > > 主题: Re: [DISCUSS] FLIP-363: Unify the Representation of TaskManager
> > > > Location in REST API and Web UI
> > > >
> > > > Thanks Zhanghao driving this FLIP, adding the port in Web UI
> > > > seems good to me.
> > > >
> > > > Hi Shammon and Zhanghao,
> > > >
> > > > I would like to clarify the difference between Public Interfaces
> > > > in FLIP and @Public in code.
> > > >
> > > > As I understand, t

Re: [DISCUSS] FLIP-363: Unify the Representation of TaskManager Location in REST API and Web UI

2023-09-18 Thread Rui Fan

Hi Yangze,

Thanks for your reply!

For a subtask, the taskManagerLocation is the physical location.
And the task id, operator id and subtask index may be the
logical locations of a subtask.

Actually, I don't think the taskManagerLocation is
a good name here.

Unlike task id, operator id, and subtask index, both location
and taskManagerLocation are ambiguous. subtask_index is
an unambiguous name in the Flink world.

That's why I think it's not a good name, especially we want to
unify the TaskManager Location in REST API and Web UI.

Of course, if we can't define a better name then location is fine
with me, thank you~

Best,
Rui

On Mon, Sep 18, 2023 at 9:06 PM Yangze Guo  wrote:

> Hi, Rui,
>
> I think the term "location" might be sufficiently clear in the
> specific context, e.g. SubtaskXXXInfo and TaskManagerXXXInfo. Could
> you elaborate more on what concept you think one might confuse it
> with?
>
> Best,
> Yangze Guo
>
> On Mon, Sep 18, 2023 at 12:07 PM Rui Fan <1996fan...@gmail.com> wrote:
> >
> > Hi Zhanghao,
> >
> > > Use a field named "location" (already used in
> JobExceptionsInfoWithHistory)
> > that represents TaskManager location using the newly added string
> formatter
> > method.
> >
> > How about using the taskManagerLocation instead of location
> > for all Rest response related classes?
> >
> > The taskManagerLocation may be clearer.
> >
> > Best,
> > Rui
> >
> >
> > On Mon, Sep 18, 2023 at 10:11 AM Chen Zhanghao <
> zhanghao.c...@outlook.com>
> > wrote:
> >
> > > Hi all,
> > >
> > > I've updated the FLIP to incorporate Yangze's advice:
> > >
> > > 1. Add a new string formatter method to TaskManagerLocation and
> > > ArchivedTaskManagerLocation that prints in the form of
> > > "${hostname}:${port}" to align the string formatter used by REST API.
> > > 2. Highlight that the old host field will be kept for at least 2 minor
> > > versions before removal.
> > >
> > > Best,
> > > Zhanghao Chen
> > > 
> > > 发件人: Yangze Guo 
> > > 发送时间: 2023年9月15日 17:26
> > > 收件人: dev@flink.apache.org 
> > > 主题: Re: [DISCUSS] FLIP-363: Unify the Representation of TaskManager
> > > Location in REST API and Web UI
> > >
> > > Thanks for driving this, Zhanghao. +1 for the overall proposal.
> > >
> > > Some cents from my side:
> > >
> > > 1. Since most of the rest api get the location from
> > > TaskManagerLocation, we can align the string formatter in this class.
> > > E.g. add something like toHumanRealableString to this class.
> > >
> > > 2. From my understanding of FLIP-321, if we want to deprecate the host
> > > field, we should annotate the related field / getter / setter with
> > > @Deprecated in this version, and keep them at least 2 minor releases.
> > >
> > > Best,
> > > Yangze Guo
> > >
> > > On Wed, Sep 13, 2023 at 8:52 PM Chen Zhanghao <
> zhanghao.c...@outlook.com>
> > > wrote:
> > > >
> > > > Hi Jing,
> > > >
> > > > Thanks for the suggestion. Endpoint is indeed a more professional
> word
> > > in the networking world but I think location is more suited here for
> two
> > > reasons:
> > > >
> > > >   1.  The term here is for uniquely identifying the TaskManager where
> > > the task is deployed while providing the host machine info as well to
> help
> > > identify taskmanager- and host-aggregative problems. So strictly
> speaking,
> > > it is not used in a pure networking context.
> > > >   2.  The term "location" is already used widely in the codebase,
> e.g.
> > > TaskManagerLocation and JobExceptions-related classes.
> > > >
> > > > WDYT?
> > > >
> > > > Best,
> > > > Zhanghao Chen
> > > > 
> > > > 发件人: Jing Ge 
> > > > 发送时间: 2023年9月13日 4:52
> > > > 收件人: dev@flink.apache.org 
> > > > 主题: Re: [DISCUSS] FLIP-363: Unify the Representation of TaskManager
> > > Location in REST API and Web UI
> > > >
> > > > Hi Zhanghao,
> > > >
> > > > Thanks for bringing this to our attention. It is a good proposal to
> > > improve
> > > > data consistency.
> > > >
> > > > Speaking of naming conventions of

Re: [DISCUSS] FLIP-363: Unify the Representation of TaskManager Location in REST API and Web UI

2023-09-18 Thread Rui Fan

+1, thanks everyone who discussed here.

Best,
Rui


On Tue, 19 Sep 2023 at 11:41, Chen Zhanghao 
wrote:

> Hi Jing,
>
> Thanks for the clarification, I now see the point. +1 for using endpoint
> now. @fan...@apache.org  WDYT?
>
> Best,
> Zhanghao Chen
> --
> *发件人:* Yangze Guo 
> *发送时间:* 2023年9月19日 11:18
>
> *收件人:* dev@flink.apache.org 
> *主题:* Re: [DISCUSS] FLIP-363: Unify the Representation of TaskManager
> Location in REST API and Web UI
>
> Thanks for the clarification, Jing. I agree that using another term
> like endpoint can help us to distinguish it from the existing concept
> of "location". +1 for using the term endpoint and introducing
> TaskManagerLocation.getEndpoint().
>
> Best,
> Yangze Guo
>
> On Mon, Sep 18, 2023 at 11:52 PM Jing Ge 
> wrote:
> >
> > Hi Zhanghao,
> >
> > That is exactly the reason why location should not be used, because there
> > is a clear definition of location in Flink, e.g. TaskManagerLocation
> which
> > contains more information than hostname+port. If you think endpoint is
> too
> > generic, how about locationEndpoint? But if we build that format logic
> into
> > Location classes, it will look like
> > TaskManagerLocation.getLocationEndpoint() with redundant "location".
> > TaskManagerLocation.getEndpoint() is better.
> > TaskManagerLocation.getLocation(),
> > TaskManagerLocation.getLocationAsString(), or similar names in that
> > direction are even worse.
> >
> > Best regards,
> > Jing
> >
> > On Wed, Sep 13, 2023 at 2:52 PM Chen Zhanghao  >
> > wrote:
> >
> > > Hi Jing,
> > >
> > > Thanks for the suggestion. Endpoint is indeed a more professional word
> in
> > > the networking world but I think location is more suited here for two
> > > reasons:
> > >
> > >   1.  The term here is for uniquely identifying the TaskManager where
> the
> > > task is deployed while providing the host machine info as well to help
> > > identify taskmanager- and host-aggregative problems. So strictly
> speaking,
> > > it is not used in a pure networking context.
> > >   2.  The term "location" is already used widely in the codebase, e.g.
> > > TaskManagerLocation and JobExceptions-related classes.
> > >
> > > WDYT?
> > >
> > > Best,
> > > Zhanghao Chen
> > > 
> > > 发件人: Jing Ge 
> > > 发送时间: 2023年9月13日 4:52
> > > 收件人: dev@flink.apache.org 
> > > 主题: Re: [DISCUSS] FLIP-363: Unify the Representation of TaskManager
> > > Location in REST API and Web UI
> > >
> > > Hi Zhanghao,
> > >
> > > Thanks for bringing this to our attention. It is a good proposal to
> improve
> > > data consistency.
> > >
> > > Speaking of naming conventions of choosing location over host, how
> about
> > > "endpoint" with the following thoughts:
> > >
> > > 1. endpoint is a more professional word than location in the network
> > > context.
> > > 2. I know commonly endpoints mean the URLs of services. Using
> Hostname:port
> > > as the endpoint follows exactly the same rule, because TaskManager is
> the
> > > top level service that aligns with the top level endpoint.
> > >
> > > WDYT?
> > >
> > > Best regards,
> > > Jing
> > >
> > >
> > > On Mon, Sep 11, 2023 at 6:01 AM Weihua Hu 
> wrote:
> > >
> > > > Hi, Zhanghao
> > > >
> > > > Since the meaning of "host" is not aligned, it seems good for me to
> > > remove
> > > > it in the future release.
> > > >
> > > > Best,
> > > > Weihua
> > > >
> > > >
> > > > On Mon, Sep 11, 2023 at 11:48 AM Chen Zhanghao <
> > > zhanghao.c...@outlook.com>
> > > > wrote:
> > > >
> > > > > Hi Fan Rui,
> > > > >
> > > > > Thanks for clarifying the definition of "public interfaces", that
> > > helps a
> > > > > lot!
> > > > >
> > > > > Best,
> > > > > Zhanghao Chen
> > > > > 
> > > > > 发件人: Rui Fan <1996fan...@gmail.com>
> > > > > 发送时间: 2023年9月11日 11:18
> > > > > 收件人: dev@flink.apache.org 
> > > > > 主题: Re: [DISCUSS] FLIP-363: Unify t

Re: [VOTE] FLIP-327: Support switching from batch to stream mode to improve throughput when processing backlog data

2023-09-22 Thread Rui Fan

+1(binding), thanks for driving this proposal.

Best,
Rui

On Fri, Sep 22, 2023 at 5:16 PM Jing Ge  wrote:

> +1(binding) Thanks!
>
> Best regards,
> Jing
>
> On Fri, Sep 22, 2023 at 8:08 AM Dong Lin  wrote:
>
> > Hi all,
> >
> > We would like to start the vote for FLIP-327: Support switching from
> batch
> > to stream mode to improve throughput when processing backlog data [1].
> This
> > FLIP was discussed in this thread [2].
> >
> > The vote will be open until at least Sep 27th (at least 72
> > hours), following the consensus voting process.
> >
> > Cheers,
> > Xuannan and Dong
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-327%3A+Support+switching+from+batch+to+stream+mode+to+improve+throughput+when+processing+backlog+data
> > [2] https://lists.apache.org/thread/29nvjt9sgnzvs90browb8r6ng31dcs3n
> >
>

Re: [DISCUSS] FLIP-370 : Support Balanced Tasks Scheduling

2023-09-27 Thread Rui Fan

Hi Zhu Zhu,

Thanks for your feedback here!

You are right, user needs to set 2 options:
- cluster.evenly-spread-out-slots=true
- slot.sharing-strategy=TASK_BALANCED_PREFERRED

Update it to one option is useful at user side, so
`taskmanager.load-balance.mode` sounds good to me.
I want to check some points and behaviors about this option:

1. The default value is None, right?
2. When it's set to Tasks, how to assign slots to TM?
- Option1: It's just check task number
- Option2: It''s check the slot number first, then check the
task number when the slot number is the same.

Giving an example to explain what's the difference between them:

- A session cluster has 2 flink jobs, they are jobA and jobB
- Each TM has 4 slots.
- The task number of one slot of jobA is 3
- The task number of one slot of jobB is 1
- We have 2 TaskManagers:
  - tm1 runs 3 slots of jobB, so tm1 runs 3 tasks
  - tm2 runs 1 slot of jobA, and 1 slot of jobB, so tm2 runs 4 tasks.

Now, we need to run a new slot, which tm should offer it?
- Option1: If we just check the task number, the tm1 is better.
- Option2: If we check the slot number first, and then check task, the tm2
is better

The original FLIP selected option2, that's why we didn't add the
third option. The option2 didn't break the semantics when
`cluster.evenly-spread-out-slots` is true, and it just improve the
behavior without the semantics is changed.

In the other hands, if we choose option2, when user set
`taskmanager.load-balance.mode` is Tasks. It also can achieve
the goal when it's Slots.

So I think the `Slots` enum isn't needed if we choose option2.
Of course, If we choose the option1, the enum is needed.

Looking forward to your feedback, thanks~

Best,
Rui

On Wed, Sep 27, 2023 at 9:11 PM Zhu Zhu  wrote:

> Thanks Yuepeng and Rui for creating this FLIP.
>
> +1 in general
> The idea is straight forward: best-effort gather all the slot requests
> and offered slots to form an overview before assigning slots, trying to
> balance the loads of task managers when assigning slots.
>
> I have one comment regarding the configuration for ease of use:
>
> IIUC, this FLIP uses an existing config 'cluster.evenly-spread-out-slots'
> as the main switch of the new feature. That is, from user perspective,
> with this improvement, the 'cluster.evenly-spread-out-slots' feature not
> only balances the number of slots on task managers, but also balances the
> number of tasks. This is a behavior change anyway. Besides that, it also
> requires users to set 'slot.sharing-strategy' to 'TASK_BALANCED_PREFERRED'
> to balance the tasks in each slot.
>
> I think we can introduce a new config option
> `taskmanager.load-balance.mode`,
> which accepts "None"/"Slots"/"Tasks". `cluster.evenly-spread-out-slots`
> can be superseded by the "Slots" mode and get deprecated. In the future
> it can support more mode, e.g. "CpuCores", to work better for jobs with
> fine-grained resources. The proposed config option
> `slot.request.max-interval`
> then can be renamed to
> `taskmanager.load-balance.request-stablizing-timeout`
> to show its relation with the feature. The proposed `slot.sharing-strategy`
> is not needed, because the configured "Tasks" mode will do the work.
>
> WDYT?
>
> Thanks,
> Zhu Zhu
>
> Yuepeng Pan  于2023年9月25日周一 16:26写道：
>
>> Hi all,
>>
>>
>> I and Fan Rui(CC’ed) created the FLIP-370[1] to support balanced tasks
>> scheduling.
>>
>>
>> The current strategy of Flink to deploy tasks sometimes leads some
>> TMs(TaskManagers) to have more tasks while others have fewer tasks,
>> resulting in excessive resource utilization at some TMs that contain more
>> tasks and becoming a bottleneck for the entire job processing. Developing
>> strategies to achieve task load balancing for TMs and reducing job
>> bottlenecks becomes very meaningful.
>>
>>
>> The raw design and discussions could be found in the Flink JIRA[2] and
>> Google doc[3]. We really appreciate Zhu Zhu(CC’ed) for providing some
>> valuable help and suggestions in advance.
>>
>>
>> Please refer to the FLIP[1] document for more details about the proposed
>> design and implementation. We welcome any feedback and opinions on this
>> proposal.
>>
>>
>> [1]
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-370%3A+Support+Balanced+Tasks+Scheduling
>>
>> [2] https://issues.apache.org/jira/browse/FLINK-31757
>>
>> [3]
>> https://docs.google.com/document/d/14WhrSNGBdcsRl3IK7CZO-RaZ5KXU2X1dWqxPEFr3iS8
>>
>>
>> Best,
>>
>> Yuepeng Pan
>>
>

Re: [VOTE] FLIP-314: Support Customized Job Lineage Listener

2023-09-28 Thread Rui Fan

+1(binding)

Best,
Rui

On Thu, 28 Sep 2023 at 14:41, Chen Zhanghao 
wrote:

> +1 (non-binding), thanks for driving this.
>
> Best,
> Zhanghao Chen
> 
> 发件人: Shammon FY 
> 发送时间: 2023年9月25日 13:28
> 收件人: dev 
> 主题: [VOTE] FLIP-314: Support Customized Job Lineage Listener
>
> Hi devs,
>
> Thanks for all the feedback on FLIP-314: Support Customized Job Lineage
> Listener [1] in thread [2].
>
> I would like to start a vote for it. The vote will be opened for at least
> 72 hours unless there is an objection or insufficient votes.
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-314%3A+Support+Customized+Job+Lineage+Listener
> [2] https://lists.apache.org/thread/wopprvp3ww243mtw23nj59p57cghh7mc
>
> Best,
> Shammon FY
>

Re: [DISCUSS] FLIP-370 : Support Balanced Tasks Scheduling

2023-10-01 Thread Rui Fan

Hi Yangze,

Thanks for your feedback!

> 1. Is it possible for the SlotPool to get the slot allocation results
> from the SlotManager in advance instead of waiting for the actual
> physical slots to be registered, and perform pre-allocation? The
> benefit of doing this is to make the task deployment process smoother,
> especially when there are a large number of tasks in the job.

Could you elaborate on that? I didn't understand what's the benefit and
smoother.

> 2. If user enable the cluster.evenly-spread-out-slots, the issue in
> example 2 of section 2.2.3 can be resolved. Do I understand it
> correctly?

The example assigned result is the final allocation result when flink
user enables the cluster.evenly-spread-out-slots. We think the
assigned result is expected, so I think your understanding is right.

Best,
Rui

On Thu, Sep 28, 2023 at 1:10 PM Shammon FY  wrote:

> Thanks Yuepeng for initiating this discussion.
>
> +1 in general too, in fact we have implemented a similar mechanism
> internally to ensure a balanced allocation of tasks to slots, it works
> well.
>
> Some comments about the mechanism
>
> 1. This mechanism will be only supported in `SlotPool` or both `SlotPool`
> and `DeclarativeSlotPool`? Currently the two slot pools are used in
> different schedulers. I think this will also bring value to
> `DeclarativeSlotPool`, but currently FLIP content seems to be based on
> `SlotPool`, right?
>
> 2. In fine-grained resource management, we can set different resource
> requirements for different nodes, which means that the resources of each
> slot are different. What should be done when the slot selected by the
> round-robin strategy cannot meet the resource requirements? Will this lead
> to the failure of the balance strategy?
>
> 3. Is the assignment of tasks to slots balanced based on region or job
> level? When multiple TMs fail over, will it cause the balancing strategy to
> fail or even worse? What is the current processing strategy?
>
> For Zhuzhu and Rui:
>
> IIUC, the overall balance is divided into two parts: slot to TM and task to
> slot.
> 1. Slot to TM is guaranteed by SlotManager in ResourceManager
> 2. Task to slot is guaranteed by the slot pool in JM
>
> These two are completely independent, what are the benefits of unifying
> these two into one option? Also, do we want to share the same
> option between SlotPool in JM and SlotManager in RM? This sounds a bit
> strange.
>
> Best,
> Shammon FY
>
>
>
> On Thu, Sep 28, 2023 at 12:08 PM Rui Fan <1996fan...@gmail.com> wrote:
>
> > Hi Zhu Zhu,
> >
> > Thanks for your feedback here!
> >
> > You are right, user needs to set 2 options:
> > - cluster.evenly-spread-out-slots=true
> > - slot.sharing-strategy=TASK_BALANCED_PREFERRED
> >
> > Update it to one option is useful at user side, so
> > `taskmanager.load-balance.mode` sounds good to me.
> > I want to check some points and behaviors about this option:
> >
> > 1. The default value is None, right?
> > 2. When it's set to Tasks, how to assign slots to TM?
> > - Option1: It's just check task number
> > - Option2: It''s check the slot number first, then check the
> > task number when the slot number is the same.
> >
> > Giving an example to explain what's the difference between them:
> >
> > - A session cluster has 2 flink jobs, they are jobA and jobB
> > - Each TM has 4 slots.
> > - The task number of one slot of jobA is 3
> > - The task number of one slot of jobB is 1
> > - We have 2 TaskManagers:
> >   - tm1 runs 3 slots of jobB, so tm1 runs 3 tasks
> >   - tm2 runs 1 slot of jobA, and 1 slot of jobB, so tm2 runs 4 tasks.
> >
> > Now, we need to run a new slot, which tm should offer it?
> > - Option1: If we just check the task number, the tm1 is better.
> > - Option2: If we check the slot number first, and then check task, the
> tm2
> > is better
> >
> > The original FLIP selected option2, that's why we didn't add the
> > third option. The option2 didn't break the semantics when
> > `cluster.evenly-spread-out-slots` is true, and it just improve the
> > behavior without the semantics is changed.
> >
> > In the other hands, if we choose option2, when user set
> > `taskmanager.load-balance.mode` is Tasks. It also can achieve
> > the goal when it's Slots.
> >
> > So I think the `Slots` enum isn't needed if we choose option2.
> > Of course, If we choose the option1, the enum is needed.
> >
> > Looking forward to your feedback, thanks~
> >
> > Best,
> > Rui
> >
> > On Wed, Sep

Re: [DISCUSS] FLIP-370 : Support Balanced Tasks Scheduling

2023-10-01 Thread Rui Fan

Hi Shammon,

Thanks for your feedback as well!

> IIUC, the overall balance is divided into two parts: slot to TM and task
to slot.
> 1. Slot to TM is guaranteed by SlotManager in ResourceManager
> 2. Task to slot is guaranteed by the slot pool in JM
>
> These two are completely independent, what are the benefits of unifying
> these two into one option? Also, do we want to share the same
> option between SlotPool in JM and SlotManager in RM? This sounds a bit
> strange.

Your understanding is totally right, the balance needs 2 parts: slot to TM
and task to slot.

As I understand, the following are benefits of unifying them into one
option:

- Flink users don't care about these principles inside of flink, they don't
know these 2 parts.
- If flink provides 2 options, flink users need to set 2 options for their
job.
- If one option is missed, the final result may not be good. (Users may
have questions when using)
- If flink just provides 1 option, enabling one option is enough. (Reduce
the probability of misconfiguration)

Also, Flink’s options are user-oriented. Each option represents a switch or
parameter of a feature.
A feature may be composed of multiple components inside Flink.
It might be better to keep only one switch per feature.

Actually, the cluster.evenly-spread-out-slots option is used between
SlotPool in JM and SlotManager in RM. 2 components to ensure
this feature works well.

Please correct me if my understanding is wrong,
and looking forward to your feedback, thanks!

Best,
Rui

On Sun, Oct 1, 2023 at 5:52 PM Rui Fan <1996fan...@gmail.com> wrote:

> Hi Yangze,
>
> Thanks for your feedback!
>
> > 1. Is it possible for the SlotPool to get the slot allocation results
> > from the SlotManager in advance instead of waiting for the actual
> > physical slots to be registered, and perform pre-allocation? The
> > benefit of doing this is to make the task deployment process smoother,
> > especially when there are a large number of tasks in the job.
>
> Could you elaborate on that? I didn't understand what's the benefit and
> smoother.
>
> > 2. If user enable the cluster.evenly-spread-out-slots, the issue in
> > example 2 of section 2.2.3 can be resolved. Do I understand it
> > correctly?
>
> The example assigned result is the final allocation result when flink
> user enables the cluster.evenly-spread-out-slots. We think the
> assigned result is expected, so I think your understanding is right.
>
> Best,
> Rui
>
> On Thu, Sep 28, 2023 at 1:10 PM Shammon FY  wrote:
>
>> Thanks Yuepeng for initiating this discussion.
>>
>> +1 in general too, in fact we have implemented a similar mechanism
>> internally to ensure a balanced allocation of tasks to slots, it works
>> well.
>>
>> Some comments about the mechanism
>>
>> 1. This mechanism will be only supported in `SlotPool` or both `SlotPool`
>> and `DeclarativeSlotPool`? Currently the two slot pools are used in
>> different schedulers. I think this will also bring value to
>> `DeclarativeSlotPool`, but currently FLIP content seems to be based on
>> `SlotPool`, right?
>>
>> 2. In fine-grained resource management, we can set different resource
>> requirements for different nodes, which means that the resources of each
>> slot are different. What should be done when the slot selected by the
>> round-robin strategy cannot meet the resource requirements? Will this lead
>> to the failure of the balance strategy?
>>
>> 3. Is the assignment of tasks to slots balanced based on region or job
>> level? When multiple TMs fail over, will it cause the balancing strategy
>> to
>> fail or even worse? What is the current processing strategy?
>>
>> For Zhuzhu and Rui:
>>
>> IIUC, the overall balance is divided into two parts: slot to TM and task
>> to
>> slot.
>> 1. Slot to TM is guaranteed by SlotManager in ResourceManager
>> 2. Task to slot is guaranteed by the slot pool in JM
>>
>> These two are completely independent, what are the benefits of unifying
>> these two into one option? Also, do we want to share the same
>> option between SlotPool in JM and SlotManager in RM? This sounds a bit
>> strange.
>>
>> Best,
>> Shammon FY
>>
>>
>>
>> On Thu, Sep 28, 2023 at 12:08 PM Rui Fan <1996fan...@gmail.com> wrote:
>>
>> > Hi Zhu Zhu,
>> >
>> > Thanks for your feedback here!
>> >
>> > You are right, user needs to set 2 options:
>> > - cluster.evenly-spread-out-slots=true
>> > - slot.sharing-strategy=TASK_BALANCED_PREFERRED
>> >
>> > Update it to one option is useful at user side,

Re: [DISCUSS] FLIP-370 : Support Balanced Tasks Scheduling

2023-10-07 Thread Rui Fan

Hi Shammon,

IIUC, you want more flexibility in controlling the two-phase strategy,
right?

> I want this because we would like to add a new slot to TM strategy such
as SLOTS_NUM in the future for OLAP to improve the performance for olap
jobs, which will use TASKS strategy for task to slot. cc Guoyangze

Actually, one option can achieve your requirement, it can control two-phase.
We can add a new enum for this option, and it will use the new strategy for
slot to TM, and use task_balanced strategy for task to slot.

Of course, I think 2 options is more flexible. If the strategy is too many,
2 options are easy for users.

Also, I have a question: What is SLOTS_NUM strategy? Isn't it slot balanced
at tm level?
I want to check whether it's similar to `cluster.evenly-spread-out-slots`.
If they are similar or same, the strategy isn't too many, and one option
may be enough.

Best,
Rui

On Sat, Oct 7, 2023 at 11:29 AM Shammon FY  wrote:

> Thanks Rui, I check the codes and you're right.
>
> As you described above, the entire process is actually two independent
> steps from slot to TM and task to slot. Currenlty we use option
> `cluster.evenly-spread-out-slots` for both of them. Can we provide
> different options for the two steps, such as ANY/SLOTS for RM and ANY/TASKS
> for slot pool?
>
> I want this because we would like to add a new slot to TM strategy such as
> SLOTS_NUM in the future for OLAP to improve the performance for olap jobs,
> which will use TASKS strategy for task to slot. cc Guoyangze
>
> Best,
> Shammon FY
>
> On Fri, Oct 6, 2023 at 6:19 PM xiangyu feng  wrote:
>
>> Thanks Yuepeng and Rui for driving this Discussion.
>>
>> Internally when we try to use Flink 1.17.1 in production, we are also
>> suffering from the unbalanced task distribution problem for jobs with high
>> qps and complex dag. So +1 for the overall proposal.
>>
>> Some questions about the details:
>>
>> 1, About the waiting mechanism: Will the waiting mechanism happen only in
>> the second level 'assigning slots to TM'?  IIUC, the first level
>> 'assigning
>> Tasks to Slots' needs only the asynchronous slot result from slotpool.
>>
>> 2, About the slot LoadingWeight: it is reasonable to use the number of
>> tasks by default in the beginning, but it would be better if this could be
>> easily extended in future to distinguish between CPU-intensive and
>> IO-intensive workloads. In some cases, TMs may have IO bottlenecks but
>> others have CPU bottlenecks.
>>
>> Regards,
>> Xiangyu
>>
>>
>> Yuepeng Pan  于2023年10月5日周四 18:34写道：
>>
>> > Hi, Zhu Zhu,
>> >
>> > Thanks for your feedback!
>> >
>> > > I think we can introduce a new config option
>> > > `taskmanager.load-balance.mode`,
>> > > which accepts "None"/"Slots"/"Tasks".
>> `cluster.evenly-spread-out-slots`
>> > > can be superseded by the "Slots" mode and get deprecated. In the
>> future
>> > > it can support more mode, e.g. "CpuCores", to work better for jobs
>> with
>> > > fine-grained resources. The proposed config option
>> > > `slot.request.max-interval`
>> > > then can be renamed to
>> > `taskmanager.load-balance.request-stablizing-timeout`
>> > > to show its relation with the feature. The proposed
>> > `slot.sharing-strategy`
>> > > is not needed, because the configured "Tasks" mode will do the work.
>> >
>> > The new proposed configuration option sounds good to me.
>> >
>> > I have a small question, If we set our configuration value to 'Tasks,'
>> it
>> > will initiate two processes: balancing the allocation of task
>> quantities at
>> > the slot level and balancing the number of tasks across TaskManagers
>> (TMs).
>> > Alternatively, if we configure it as 'Slots,' the system will employ the
>> > LocalPreferred allocation policy (which is the default) when assigning
>> > tasks to slots, and it will ensure that the number of slots used across
>> TMs
>> > is balanced.
>> > Does  this configuration essentially combine a balanced selection
>> strategy
>> > across two dimensions into fixed configuration items, right?
>> >
>> > I would appreciate it if you could correct me if I've made any errors.
>> >
>> > Best,
>> > Yuepeng.
>> >
>>
>

Re: [DISCUSS] FLIP-370 : Support Balanced Tasks Scheduling

2023-10-07 Thread Rui Fan

Hi Yangze,

> 2. From my understanding, if user enable the
> cluster.evenly-spread-out-slots,
> LeastUtilizationResourceMatchingStrategy will be used to determine the
> slot distribution and the slot allocation in the three TM will be
> (taskmanager.numberOfTaskSlots=3):
> TM1: 3 slot
> TM2: 2 slot
> TM3: 2 slot

When all tms are ready in advance, the three TM will be:
TM1: 3 slot
TM2: 2 slot
TM3: 2 slot

For application mode, the resource manager doesn't apply for
TM in advance, and slots aren't enough before the third TM is ready.
So all slots of the second TM will be used up. The three TM will be:
TM1: 3 slot
TM2: 3 slot
TM3: 1 slot

That's why the FLIP add some notes:

   - All *free* slots are in the last TM, because ResourceManager doesn’t
   have the waiting mechanism, and it just requests 7 slots for this JobMaster.
   - Why is it acceptable?


   -
  - If we just add the waiting mechanism to JobMaster but not in
  ResourceManager, all *free* slots will be in the last TM. All slots
  of other TMs are offered to JM.
  - That is, only one TM may have fewer tasks than the other TMs. The
  difference between the number of tasks of other TMs is at most 1.So When
  *p* >> *slotsPerTM*, the problem can be ignored.
  - We can also suggest users, in cases that p is small, it's better to
  configure *slotsPerTM* to 1, or let *p % slotsPerTM* == 0.

Please correct me if my understanding is wrong, thanks~

Best,
Rui

On Sun, Oct 1, 2023 at 7:38 PM Yangze Guo  wrote:

> Hi, Rui,
>
> 1. With the current mechanism, when physical slots are offered from
> TM, the JobMaster will start deploying tasks and synchronizing their
> states. With the addition of the waiting mechanism, IIUC, the
> JobMaster will deploy and synchronize the states of all tasks only
> after all resources are available. The task deployment and state
> synchronization both occupy the JobMaster's RPC main thread. In
> complex jobs with a lot of tasks, this waiting mechanism may increase
> the pressure on the JobMaster and increase the end-to-end job
> deployment time.
>
> 2. From my understanding, if user enable the
> cluster.evenly-spread-out-slots,
> LeastUtilizationResourceMatchingStrategy will be used to determine the
> slot distribution and the slot allocation in the three TM will be
> (taskmanager.numberOfTaskSlots=3):
> TM1: 3 slot
> TM2: 2 slot
> TM3: 2 slot
>
> Best,
> Yangze Guo
>
> On Sun, Oct 1, 2023 at 6:14 PM Rui Fan <1996fan...@gmail.com> wrote:
> >
> > Hi Shammon,
> >
> > Thanks for your feedback as well!
> >
> > > IIUC, the overall balance is divided into two parts: slot to TM and
> task
> > to slot.
> > > 1. Slot to TM is guaranteed by SlotManager in ResourceManager
> > > 2. Task to slot is guaranteed by the slot pool in JM
> > >
> > > These two are completely independent, what are the benefits of unifying
> > > these two into one option? Also, do we want to share the same
> > > option between SlotPool in JM and SlotManager in RM? This sounds a bit
> > > strange.
> >
> > Your understanding is totally right, the balance needs 2 parts: slot to
> TM
> > and task to slot.
> >
> > As I understand, the following are benefits of unifying them into one
> > option:
> >
> > - Flink users don't care about these principles inside of flink, they
> don't
> > know these 2 parts.
> > - If flink provides 2 options, flink users need to set 2 options for
> their
> > job.
> > - If one option is missed, the final result may not be good. (Users may
> > have questions when using)
> > - If flink just provides 1 option, enabling one option is enough. (Reduce
> > the probability of misconfiguration)
> >
> > Also, Flink’s options are user-oriented. Each option represents a switch
> or
> > parameter of a feature.
> > A feature may be composed of multiple components inside Flink.
> > It might be better to keep only one switch per feature.
> >
> > Actually, the cluster.evenly-spread-out-slots option is used between
> > SlotPool in JM and SlotManager in RM. 2 components to ensure
> > this feature works well.
> >
> > Please correct me if my understanding is wrong,
> > and looking forward to your feedback, thanks!
> >
> > Best,
> > Rui
> >
> > On Sun, Oct 1, 2023 at 5:52 PM Rui Fan <1996fan...@gmail.com> wrote:
> >
> > > Hi Yangze,
> > >
> > > Thanks for your feedback!
> > >
> > > > 1. Is it possible for the SlotPool to get the slot allocation results
> > > > from the SlotManager in advance instead of w

Re: [DISCUSS] FLIP-370 : Support Balanced Tasks Scheduling

2023-10-07 Thread Rui Fan

Hi Yangze,

Thanks for your quick response!

Sorry, I re-read the 2.2.2 part[1] about the Waiting Mechanism, I found
it isn't clear. The root cause of introducing the waiting mechanism is
that the slot requests are sent from JobMaster to SlotPool is
one by one instead of one whole batch. I have rewritten the 2.2.2 part,
please read it again in your free time.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-370%3A+Support+Balanced+Tasks+Scheduling#FLIP370:SupportBalancedTasksScheduling-2.2.2Waitingmechanism

Best,
Rui

On Sat, Oct 7, 2023 at 4:34 PM Yangze Guo  wrote:

> Thanks for the clarification, Rui.
>
> I believe the root cause of this issue is that in the current
> DefaultResourceAllocationStrategy, slot allocation begins before the
> decision to PendingTaskManagers requesting is made. That can be fixed
> within the strategy without introducing another waiting mechanism. I
> think it would be better to address this issue within the scope of
> this FLIP. However, I don't have a strong opinion on it, it depends on
> your bandwidth.
>
>
> Best,
> Yangze Guo
>
> On Sat, Oct 7, 2023 at 4:16 PM Rui Fan <1996fan...@gmail.com> wrote:
> >
> > Hi Yangze,
> >
> > > 2. From my understanding, if user enable the
> > > cluster.evenly-spread-out-slots,
> > > LeastUtilizationResourceMatchingStrategy will be used to determine the
> > > slot distribution and the slot allocation in the three TM will be
> > > (taskmanager.numberOfTaskSlots=3):
> > > TM1: 3 slot
> > > TM2: 2 slot
> > > TM3: 2 slot
> >
> > When all tms are ready in advance, the three TM will be:
> > TM1: 3 slot
> > TM2: 2 slot
> > TM3: 2 slot
> >
> > For application mode, the resource manager doesn't apply for
> > TM in advance, and slots aren't enough before the third TM is ready.
> > So all slots of the second TM will be used up. The three TM will be:
> > TM1: 3 slot
> > TM2: 3 slot
> > TM3: 1 slot
> >
> > That's why the FLIP add some notes:
> >
> > All free slots are in the last TM, because ResourceManager doesn’t have
> the waiting mechanism, and it just requests 7 slots for this JobMaster.
> > Why is it acceptable?
> >
> > If we just add the waiting mechanism to JobMaster but not in
> ResourceManager, all free slots will be in the last TM. All slots of other
> TMs are offered to JM.
> > That is, only one TM may have fewer tasks than the other TMs. The
> difference between the number of tasks of other TMs is at most 1.So When p
> >> slotsPerTM, the problem can be ignored.
> > We can also suggest users, in cases that p is small, it's better to
> configure slotsPerTM to 1, or let p % slotsPerTM == 0.
> >
> > Please correct me if my understanding is wrong, thanks~
> >
> > Best,
> > Rui
> >
> > On Sun, Oct 1, 2023 at 7:38 PM Yangze Guo  wrote:
> >>
> >> Hi, Rui,
> >>
> >> 1. With the current mechanism, when physical slots are offered from
> >> TM, the JobMaster will start deploying tasks and synchronizing their
> >> states. With the addition of the waiting mechanism, IIUC, the
> >> JobMaster will deploy and synchronize the states of all tasks only
> >> after all resources are available. The task deployment and state
> >> synchronization both occupy the JobMaster's RPC main thread. In
> >> complex jobs with a lot of tasks, this waiting mechanism may increase
> >> the pressure on the JobMaster and increase the end-to-end job
> >> deployment time.
> >>
> >> 2. From my understanding, if user enable the
> >> cluster.evenly-spread-out-slots,
> >> LeastUtilizationResourceMatchingStrategy will be used to determine the
> >> slot distribution and the slot allocation in the three TM will be
> >> (taskmanager.numberOfTaskSlots=3):
> >> TM1: 3 slot
> >> TM2: 2 slot
> >> TM3: 2 slot
> >>
> >> Best,
> >> Yangze Guo
> >>
> >> On Sun, Oct 1, 2023 at 6:14 PM Rui Fan <1996fan...@gmail.com> wrote:
> >> >
> >> > Hi Shammon,
> >> >
> >> > Thanks for your feedback as well!
> >> >
> >> > > IIUC, the overall balance is divided into two parts: slot to TM and
> task
> >> > to slot.
> >> > > 1. Slot to TM is guaranteed by SlotManager in ResourceManager
> >> > > 2. Task to slot is guaranteed by the slot pool in JM
> >> > >
> >> > > These two are completely independent, what are the benefits of
> unifying
> >> >

Re: [DISCUSS] FLIP-374: Adding a separate configuration for specifying Java Options of the SQL Gateway

2023-10-07 Thread Rui Fan

Thanks to Yangze driving this proposal.

`env.java.opts.xxx` is already supported for client, historyserver,
jobmanager and taskmanager. And it's very useful for troubleshooting.
So +1 for `env.java.opts.sql-gateway`.

I have a minor question: doesn't the `env.java.opts.all` support
sql-gateway?
If yes, it's fine. If no, it's better to consider it to be the subtask of
this FLIP.

Best,
Rui


On Sat, Oct 7, 2023 at 4:35 PM xiangyu feng  wrote:

> Thanks for initiating this discussion. Within the development towards
> Streaming Warehousing, SQL Gateway will become more and more important.
> Big +1 to specify Java Options separately for SQL Gateway.
>
> Regards,
> Xiangyu
>
> Yangze Guo  于2023年10月7日周六 15:24写道：
>
> > Hi, there,
> >
> > We'd like to start a discussion thread on "FLIP-374: Adding a separate
> > configuration for specifying Java Options of the SQL Gateway"[1],
> > where we propose adding a separate configuration option to specify the
> > Java options for the SQL Gateway. This would allow users to fine-tune
> > the memory settings, garbage collection behavior, and other relevant
> > Java parameters specific to the SQL Gateway, ensuring optimal
> > performance and stability in production environments.
> >
> > Looking forward to your feedback.
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-374%3A+Adding+a+separate+configuration+for+specifying+Java+Options+of+the+SQL+Gateway
> >
> > Best,
> > Yangze Guo
> >
>

Re: [DISCUSS] FLIP-375: Built-in cross-platform powerful java profiler on taskmanagers

2023-10-09 Thread Rui Fan

Thanks Yun and Yu for driving this proposal!

It's very useful for troubleshooting why the CPU usage is high.
+1

Best,
Rui

On Mon, Oct 9, 2023 at 7:21 PM Zhanghao Chen 
wrote:

> Hi Yun and Yu,
>
> Thanks for driving this. This would definitely help users identify
> performance bottlenecks, especially for the cases where the bottleneck lies
> in the system stack (e.g. GC), and big +1 for the downloadable flamegraph
> to ease sharing. I'm wondering if we could add this for the job manager as
> well. In the OLAP scenario and sometimes in the streaming scenario (when
> there're some heavy operations during execution plan generation or in
> operator coordinators), the JM can have bottleneck as well.
>
> Best,
> Zhanghao Chen
> 
> From: Yu Chen 
> Sent: Monday, October 9, 2023 17:24
> To: dev@flink.apache.org 
> Subject: [DISCUSS] FLIP-375: Built-in cross-platform powerful java
> profiler on taskmanagers
>
> Hi all,
>
> Yun Tang and I are opening this thread to discuss our proposal to integrate
> async-profiler's capabilities for profiling taskmananger (e.g., generating
> flame graphs) in the Flink Web [1].
>
>
> Currently, Flink provides ThreadDump and Operator-Level Flame Graphs by
> sampling task threads. The results generated in such way missing the
> relevant stack of java threads and system calls. The async-profiler[2] is a
> low-overhead sampling profiler for Java, but the steps to use it in the
> production environment are cumbersome and suffer from permissions and
> security risks.
>
> Therefore, we propose adding rest APIs to provide the capability to invoke
> async-profiler on multiple platforms through JNI, which can be easily
> operated on Web UI. This enhancement will improve the efficiency and
> experience of Flink users in identifying performance bottlenecks.
>
>
>
> Please refer to the FLIP document for more details about the proposed
> design
> and implementation. We welcome any feedback and opinions on this proposal.
>
>
>
> [1] FLIP-375: Built-in cross-platform powerful java profiler on
> taskmanagers - Apache Flink - Apache Software Foundation
> <
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-375%3A+Built-in+cross-platform+powerful+java+profiler+on+taskmanagers
> >
>
> [2] GitHub - async-profiler/async-profiler: Sampling CPU and HEAP profiler
> for Java featuring AsyncGetCallTrace + perf_events
> 
>
>
>
> Best regards,
>
> Yun Tang and Yu Chen
>

Re: [DISCUSS] FLIP-375: Built-in cross-platform powerful java profiler on taskmanagers

2023-10-09 Thread Rui Fan

Hi Jing,

> 1. will it replace the current flame graph, i.e. the current flame graph
will be deprecated and removed?

I think the current flame graph cannot be removed.

As a core contributor to the current flame graph, and I use it almost
every week. I would like to clarify the difference between the current
flame graph and the flame graph proposed by FLIP-375.

@The current flame graph

The current flame graph is the operator level or task level, when one
operator is the bottleneck of current job. We can see the current
flamegraph to check what the operator is doing.

It includes three types: On-CPU, Off-CPU and Mixed-Type. The Mixed-Type
is very useful, it can detect why operator is slow even if the operator
doesn't use CPU. For example, the operator is blocked on querying hbase.

It just support the task thread, it means it cannot detect the cpu usage of
other threads, such as: RocksDB Flush or compaction. This's the
limitation of current flamegraph.

@The flame graph proposed by FLIP-375.

The flamegraph proposed by FLIP-375 works on process level, such as
JobManager or TaskManager, so it can monitor all threads. Such as:
rocksdb background threads.

When the CPU usage of one TM is high, and all tasks are not busy.
The new flamegraph will be useful.

Back to the question: It includes task or operator thread,
why the current flamegraph is still needed?

1. The flamegraph of process level cannot easily distinguish tasks.
Especially if there are multiple slots in a TM, and different subtasks of
the
same task running in multiple slots, their stacks are very similar.

2. The Mixed-Type of current flamegraph may not be replaced by the
process-level flame graph.

Please correct me if anything is wrong, thanks~

Hi Yu,

> Jobmanager allows the user to download the results of the corresponding
files on taskmanager with the blob service.

Are all process-level flamegraphs stored at BlobStore?
Are they maintained by JobManager after sampling?
Is there cleanup strategy? Or max-save-count strategy?

Best,
Rui

On Tue, Oct 10, 2023 at 1:24 AM Jing Ge  wrote:

> Hi Yu, Hi Yun,
>
> Brilliant idea! People are keen to use it. Thanks for your proposal! I was
> wondering:
>
> 1. will it replace the current flame graph, i.e. the current flame graph
> will be deprecated and removed?
> 2. does it make sense to provide the performance difference between enable
> and disable it?
>
> Best regards,
> Jing
>
> On Mon, Oct 9, 2023 at 1:50 PM Yu Chen  wrote:
>
> > Hi zhanghao,
> >
> > Yes, agree with you. We'll take Jobmanager into consideration and update
> > the FLIP later!
> >
> > Best,
> > Yu Chen
> >
> > Zhanghao Chen  于2023年10月9日周一 19:22写道：
> >
> > > Hi Yun and Yu,
> > >
> > > Thanks for driving this. This would definitely help users identify
> > > performance bottlenecks, especially for the cases where the bottleneck
> > lies
> > > in the system stack (e.g. GC), and big +1 for the downloadable
> flamegraph
> > > to ease sharing. I'm wondering if we could add this for the job manager
> > as
> > > well. In the OLAP scenario and sometimes in the streaming scenario
> (when
> > > there're some heavy operations during execution plan generation or in
> > > operator coordinators), the JM can have bottleneck as well.
> > >
> > > Best,
> > > Zhanghao Chen
> > > 
> > > From: Yu Chen 
> > > Sent: Monday, October 9, 2023 17:24
> > > To: dev@flink.apache.org 
> > > Subject: [DISCUSS] FLIP-375: Built-in cross-platform powerful java
> > > profiler on taskmanagers
> > >
> > > Hi all,
> > >
> > > Yun Tang and I are opening this thread to discuss our proposal to
> > integrate
> > > async-profiler's capabilities for profiling taskmananger (e.g.,
> > generating
> > > flame graphs) in the Flink Web [1].
> > >
> > >
> > > Currently, Flink provides ThreadDump and Operator-Level Flame Graphs by
> > > sampling task threads. The results generated in such way missing the
> > > relevant stack of java threads and system calls. The async-profiler[2]
> > is a
> > > low-overhead sampling profiler for Java, but the steps to use it in the
> > > production environment are cumbersome and suffer from permissions and
> > > security risks.
> > >
> > > Therefore, we propose adding rest APIs to provide the capability to
> > invoke
> > > async-profiler on multiple platforms through JNI, which can be easily
> > > operated on Web UI. This enhancement will improve the efficiency and
> > > experience of Flink users in identifying performance bottlenecks.
> > >
> > >
> > >
> > > Please refer to the FLIP document for more details about the proposed
> > > design
> > > and implementation. We welcome any feedback and opinions on this
> > proposal.
> > >
> > >
> > >
> > > [1] FLIP-375: Built-in cross-platform powerful java profiler on
> > > taskmanagers - Apache Flink - Apache Software Foundation
> > > <
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-375%3A+Built-in+cross-platform+powerful+java+profiler+on+taskmanagers
> >

Re: [DISCUSS] FLIP-370 : Support Balanced Tasks Scheduling

2023-10-09 Thread Rui Fan

Hi Zhu,

Thanks for your feedback!

>> 2. When it's set to Tasks, how to assign slots to TM?
> It's option2 at the moment. However, I think it's just implementation
> details and can be changed/refined later.
>
> As you mentioned in another comment, 'taskmanager.load-balance.mode' is
> a user oriented configuration. The goal is to achieve load balance, while
> the load can be defined as allocated slots or assigned tasks.
> The 'Tasks' mode, just the same as what is proposed in the FLIP, currently
> use the mechanism of 'cluster.evenly-spread-out-slots' to help to achieve
> balanced number of tasks. It's not perfect, but has acceptable
effectiveness
> and lower implementation complexity.
>
> The 'Slots' mode is needed for compatible reasons. Users that are
satisfied
> with the current ability of 'cluster.evenly-spread-out-slots' can continue
> using it after the config 'cluster.evenly-spread-out-slots' is deprecated.

IIUC, the 'Slots' mode is needed for compatibility with
'cluster.evenly-spread-out-slots'.
The reason I ask this question is: if the behavior and logic of 'Slots' and
'Tasks' are exactly the same, it feels a bit strange to define two
enumerations.
And it may cause confusion to users.

If they are totally the same, how about combining them to SlotsAndTasks?
It can be compatible with 'cluster.evenly-spread-out-slots', and avoid
the redundant enum. Of course, if the name(SlotsAndTasks) is ugly,
we can discuss it. The core idea is combining them.

WDYT?

Best,
Rui

On Mon, Oct 9, 2023 at 3:24 PM Zhu Zhu  wrote:

> Thanks for the response, Rui and Yuepeng.
>
> >> Rui
> > 1. The default value is None, right?
> Exactly.
>
> > 2. When it's set to Tasks, how to assign slots to TM?
> It's option2 at the moment. However, I think it's just implementation
> details and can be changed/refined later.
>
> As you mentioned in another comment, 'taskmanager.load-balance.mode' is
> a user oriented configuration. The goal is to achieve load balance, while
> the load can be defined as allocated slots or assigned tasks.
> The 'Tasks' mode, just the same as what is proposed in the FLIP, currently
> use the mechanism of 'cluster.evenly-spread-out-slots' to help to achieve
> balanced number of tasks. It's not perfect, but has acceptable
> effectiveness
> and lower implementation complexity.
>
> The 'Slots' mode is needed for compatible reasons. Users that are satisfied
> with the current ability of 'cluster.evenly-spread-out-slots' can continue
> using it after the config 'cluster.evenly-spread-out-slots' is deprecated.
>
>
> >> Yuepeng
> I think what users want is load balance. The combination is implementation
> details and should be transparent to users.
>
> Meanwhile, I think locality does not entirely conflict with load balance.
> In fact,
> they should be both considered when assigning tasks. Usually, state
> locality
> should have the highest priority, and input locality can also be taken care
> of when trying to balance tasks to slots and TMs. We can see that the most
> important input locality, i.e. forward, is always covered in this FLIP when
> computing slot sharing groups. It can be further optimized if we find it
> problematic.
>
> Thanks,
> Zhu
>
> Yangze Guo  于2023年10月8日周日 13:53写道：
>
>> Thanks for the updates, Rui.
>>
>> It does seem challenging to ensure evenness in slot deployment unless
>> we introduce batch slot requests in SlotPool. However, one possibility
>> is to add a delay of around 50ms during the SlotPool's resource
>> requirement declaration to the ResourceManager, similar to the
>> checkResourceRequirementsWithDelay in the SlotManager. In most cases,
>> this delay would allow the SlotManager to see all resource
>> requirements, then it can allocate the slot more evenly. As a side
>> effect, it could also significantly reduce the number of RPC messages
>> to the ResourceManager, which could become a single-point bottleneck
>> in OLAP scenarios. WDYT?
>>
>> Best,
>> Yangze Guo
>>
>> On Sat, Oct 7, 2023 at 5:52 PM Rui Fan <1996fan...@gmail.com> wrote:
>> >
>> > Hi Yangze,
>> >
>> > Thanks for your quick response!
>> >
>> > Sorry, I re-read the 2.2.2 part[1] about the Waiting Mechanism, I found
>> > it isn't clear. The root cause of introducing the waiting mechanism is
>> > that the slot requests are sent from JobMaster to SlotPool is
>> > one by one instead of one whole batch. I have rewritten the

Re: [DISCUSS] FLIP-370 : Support Balanced Tasks Scheduling

2023-10-10 Thread Rui Fan

Hi Zhu,

Thanks for your clarification!

I misunderstood before, it's clear now.

Best,
Rui

On Tue, Oct 10, 2023 at 6:17 PM Zhu Zhu  wrote:

> Hi Rui,
>
> Not sure if I understand your question correctly. The two modes are not
> the same:
> {taskmanager.load-balance.mode: Slots} = {cluster.evenly-spread-out-slots:
> true, slot.sharing-strategy: LOCAL_INPUT_PREFERRED}
> {taskmanager.load-balance.mode: Tasks} = {cluster.evenly-spread-out-slots:
> true, slot.sharing-strategy: TASK_BALANCED_PREFERRED}
>
> Thanks,
> Zhu
>
> Rui Fan <1996fan...@gmail.com> 于2023年10月10日周二 10:27写道：
>
>> Hi Zhu,
>>
>> Thanks for your feedback!
>>
>> >> 2. When it's set to Tasks, how to assign slots to TM?
>> > It's option2 at the moment. However, I think it's just implementation
>> > details and can be changed/refined later.
>> >
>> > As you mentioned in another comment, 'taskmanager.load-balance.mode' is
>> > a user oriented configuration. The goal is to achieve load balance,
>> while
>> > the load can be defined as allocated slots or assigned tasks.
>> > The 'Tasks' mode, just the same as what is proposed in the FLIP,
>> currently
>> > use the mechanism of 'cluster.evenly-spread-out-slots' to help to
>> achieve
>> > balanced number of tasks. It's not perfect, but has acceptable
>> effectiveness
>> > and lower implementation complexity.
>> >
>> > The 'Slots' mode is needed for compatible reasons. Users that are
>> satisfied
>> > with the current ability of 'cluster.evenly-spread-out-slots' can
>> continue
>> > using it after the config 'cluster.evenly-spread-out-slots' is
>> deprecated.
>>
>> IIUC, the 'Slots' mode is needed for compatibility with
>> 'cluster.evenly-spread-out-slots'.
>> The reason I ask this question is: if the behavior and logic of 'Slots'
>> and
>> 'Tasks' are exactly the same, it feels a bit strange to define two
>> enumerations.
>> And it may cause confusion to users.
>>
>> If they are totally the same, how about combining them to SlotsAndTasks?
>> It can be compatible with 'cluster.evenly-spread-out-slots', and avoid
>> the redundant enum. Of course, if the name(SlotsAndTasks) is ugly,
>> we can discuss it. The core idea is combining them.
>>
>> WDYT?
>>
>> Best,
>> Rui
>>
>> On Mon, Oct 9, 2023 at 3:24 PM Zhu Zhu  wrote:
>>
>>> Thanks for the response, Rui and Yuepeng.
>>>
>>> >> Rui
>>> > 1. The default value is None, right?
>>> Exactly.
>>>
>>> > 2. When it's set to Tasks, how to assign slots to TM?
>>> It's option2 at the moment. However, I think it's just implementation
>>> details and can be changed/refined later.
>>>
>>> As you mentioned in another comment, 'taskmanager.load-balance.mode' is
>>> a user oriented configuration. The goal is to achieve load balance,
>>> while
>>> the load can be defined as allocated slots or assigned tasks.
>>> The 'Tasks' mode, just the same as what is proposed in the FLIP,
>>> currently
>>> use the mechanism of 'cluster.evenly-spread-out-slots' to help to achieve
>>> balanced number of tasks. It's not perfect, but has acceptable
>>> effectiveness
>>> and lower implementation complexity.
>>>
>>> The 'Slots' mode is needed for compatible reasons. Users that are
>>> satisfied
>>> with the current ability of 'cluster.evenly-spread-out-slots' can
>>> continue
>>> using it after the config 'cluster.evenly-spread-out-slots' is
>>> deprecated.
>>>
>>>
>>> >> Yuepeng
>>> I think what users want is load balance. The combination is
>>> implementation
>>> details and should be transparent to users.
>>>
>>> Meanwhile, I think locality does not entirely conflict with load
>>> balance. In fact,
>>> they should be both considered when assigning tasks. Usually, state
>>> locality
>>> should have the highest priority, and input locality can also be taken
>>> care
>>> of when trying to balance tasks to slots and TMs. We can see that the
>>> most
>>> important input locality, i.e. forward, is always covered in this FLIP
>>> when
>>> computing slot sharing groups. It can be f

Re: [VOTE] FLIP-374: Adding a separate configuration for specifying Java Options of the SQL Gateway

2023-10-10 Thread Rui Fan

+1(binding)

Best,
Rui

On Wed, Oct 11, 2023 at 10:07 AM Yangze Guo  wrote:

> Hi everyone,
>
> I'd like to start the vote of FLIP-374 [1]. This FLIP is discussed in
> the thread [2].
>
> The vote will be open for at least 72 hours. Unless there is an
> objection, I'll try to close it by October 16, 2023 if we have
> received sufficient votes.
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-374%3A+Adding+a+separate+configuration+for+specifying+Java+Options+of+the+SQL+Gateway
> [2] https://lists.apache.org/thread/g4vl8mgnwgl7vjyvjy6zrc8w54b2lthv
>
> Best,
> Yangze Guo
>

Re: [VOTE] FLIP-366: Support standard YAML for FLINK configuration

2023-10-12 Thread Rui Fan

+1(binding)

Best,
Rui Fan

On Fri, Oct 13, 2023 at 10:12 AM Junrui Lee  wrote:

> Hi all,
>
> Thank you to everyone for the feedback on FLIP-366[1]: Support standard
> YAML for FLINK configuration in the discussion thread [2].
> I would like to start a vote for it. The vote will be open for at least 72
> hours (excluding weekends, unless there is an objection or an insufficient
> number of votes).
>
> Thanks,
> Junrui
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-366%3A+Support+standard+YAML+for+FLINK+configuration
> [2]https://lists.apache.org/thread/qfhcm7h8r5xkv38rtxwkghkrcxg0q7k5
>

Re: [DISCUSS] FLIP-375: Built-in cross-platform powerful java profiler on taskmanagers

2023-10-13 Thread Rui Fan

One minor comment:

In general, the generic java profiler includes memory analysis,
cpu, thread, deadlock, etc. The FLIP title is java profiler, but
the FLIP just supports flamegraph at process level.
So the `powerful java profiler` title may not be suitable.
Would you mind updating the FLIP title?

Best,
Rui

On Fri, Oct 13, 2023 at 4:34 PM Yu Chen  wrote:

> Hi all.
> If there are no further questions, we will start a vote on FLIP-375 next
> week.
>
> Best regards,
> Yu Chen
>
>
> Yu Chen  于2023年10月9日周一 17:24写道：
>
> > Hi all,
> >
> > Yun Tang and I are opening this thread to discuss our proposal to
> > integrate async-profiler's capabilities for profiling taskmananger (e.g.,
> > generating flame graphs) in the Flink Web [1].
> >
> >
> > Currently, Flink provides ThreadDump and Operator-Level Flame Graphs by
> > sampling task threads. The results generated in such way missing the
> > relevant stack of java threads and system calls. The async-profiler[2]
> is a
> > low-overhead sampling profiler for Java, but the steps to use it in the
> > production environment are cumbersome and suffer from permissions and
> > security risks.
> >
> > Therefore, we propose adding rest APIs to provide the capability to
> invoke
> > async-profiler on multiple platforms through JNI, which can be easily
> > operated on Web UI. This enhancement will improve the efficiency and
> > experience of Flink users in identifying performance bottlenecks.
> >
> >
> >
> > Please refer to the FLIP document for more details about the proposed
> design
> > and implementation. We welcome any feedback and opinions on this
> proposal.
> >
> >
> >
> > [1] FLIP-375: Built-in cross-platform powerful java profiler on
> > taskmanagers - Apache Flink - Apache Software Foundation
> > <
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-375%3A+Built-in+cross-platform+powerful+java+profiler+on+taskmanagers
> >
> >
> > [2] GitHub - async-profiler/async-profiler: Sampling CPU and HEAP
> > profiler for Java featuring AsyncGetCallTrace + perf_events
> > 
> >
> >
> >
> > Best regards,
> >
> > Yun Tang and Yu Chen
> >
>

Re: [ANNOUNCE] New Apache Flink Committer - Ron Liu

2023-10-15 Thread Rui Fan

Congratulations Ron !

Best,
Rui

On Mon, Oct 16, 2023 at 10:05 AM Lijie Wang 
wrote:

> Congratulations Ron !
>
> Best,
> Lijie
>
> Samrat Deb  于2023年10月16日周一 10:03写道：
>
> > Congratulations Ron Liu :)
> >
> > On Mon, 16 Oct 2023 at 7:29 AM, tison  wrote:
> >
> > > Congrats! Glad to see more and more committers on board :D
> > >
> > > Enjoy your journey ;-)
> > >
> > > Best,
> > > tison.
> > >
> > >
> > > Jark Wu  于2023年10月16日周一 09:57写道：
> > >
> > > > Hi, everyone
> > > >
> > > > On behalf of the PMC, I'm very happy to announce Ron Liu as a new
> Flink
> > > > Committer.
> > > >
> > > > Ron has been continuously contributing to the Flink project for many
> > > years,
> > > > authored and reviewed a lot of codes. He mainly works on Flink SQL
> > parts
> > > > and drove several important FLIPs, e.g., USING JAR (FLIP-214),
> Operator
> > > > Fusion CodeGen (FLIP-315), Runtime Filter (FLIP-324). He has a great
> > > > knowledge of the Batch SQL and improved a lot of batch performance in
> > the
> > > > past several releases. He is also quite active in mailing lists,
> > > > participating in discussions and answering user questions.
> > > >
> > > > Please join me in congratulating Ron Liu for becoming a Flink
> > Committer!
> > > >
> > > > Best,
> > > > Jark Wu (on behalf of the Flink PMC)
> > > >
> > >
> >
>

Re: [ANNOUNCE] New Apache Flink Committer - Jane Chan

2023-10-15 Thread Rui Fan

Congratulations Jane!

Best,
Rui

On Mon, Oct 16, 2023 at 10:15 AM yu zelin  wrote:

> Congratulations!
>
> Best,
> Yu Zelin
>
> > 2023年10月16日 09:58，Jark Wu  写道：
> >
> > Hi, everyone
> >
> > On behalf of the PMC, I'm very happy to announce Jane Chan as a new Flink
> > Committer.
> >
> > Jane started code contribution in Jan 2021 and has been active in the
> Flink
> > community since. She authored more than 60 PRs and reviewed more than 40
> > PRs. Her contribution mainly revolves around Flink SQL, including Plan
> > Advice (FLIP-280), operator-level state TTL (FLIP-292), and ALTER TABLE
> > statements (FLINK-21634). Jane participated deeply in development
> > discussions and also helped answer user question emails. Jane was also a
> > core contributor of Flink Table Store (now Paimon) when the project was
> in
> > the early days.
> >
> > Please join me in congratulating Jane Chan for becoming a Flink
> Committer!
> >
> > Best,
> > Jark Wu (on behalf of the Flink PMC)
>
>

Re: [DISCUSS] FLIP-370 : Support Balanced Tasks Scheduling

2023-10-16 Thread Rui Fan

gt; >
> > For Zhuzhu and Rui:
> >
> > IIUC, the overall balance is divided into two parts: slot to TM and task
> to
> > slot.
> > 1. Slot to TM is guaranteed by SlotManager in ResourceManager
> > 2. Task to slot is guaranteed by the slot pool in JM
> >
> > These two are completely independent, what are the benefits of unifying
> > these two into one option? Also, do we want to share the same
> > option between SlotPool in JM and SlotManager in RM? This sounds a bit
> > strange.
> >
> > Best,
> > Shammon FY
> >
> >
> >
> > On Thu, Sep 28, 2023 at 12:08 PM Rui Fan <1996fan...@gmail.com> wrote:
> >
> > > Hi Zhu Zhu,
> > >
> > > Thanks for your feedback here!
> > >
> > > You are right, user needs to set 2 options:
> > > - cluster.evenly-spread-out-slots=true
> > > - slot.sharing-strategy=TASK_BALANCED_PREFERRED
> > >
> > > Update it to one option is useful at user side, so
> > > `taskmanager.load-balance.mode` sounds good to me.
> > > I want to check some points and behaviors about this option:
> > >
> > > 1. The default value is None, right?
> > > 2. When it's set to Tasks, how to assign slots to TM?
> > > - Option1: It's just check task number
> > > - Option2: It''s check the slot number first, then check the
> > > task number when the slot number is the same.
> > >
> > > Giving an example to explain what's the difference between them:
> > >
> > > - A session cluster has 2 flink jobs, they are jobA and jobB
> > > - Each TM has 4 slots.
> > > - The task number of one slot of jobA is 3
> > > - The task number of one slot of jobB is 1
> > > - We have 2 TaskManagers:
> > >   - tm1 runs 3 slots of jobB, so tm1 runs 3 tasks
> > >   - tm2 runs 1 slot of jobA, and 1 slot of jobB, so tm2 runs 4 tasks.
> > >
> > > Now, we need to run a new slot, which tm should offer it?
> > > - Option1: If we just check the task number, the tm1 is better.
> > > - Option2: If we check the slot number first, and then check task, the
> tm2
> > > is better
> > >
> > > The original FLIP selected option2, that's why we didn't add the
> > > third option. The option2 didn't break the semantics when
> > > `cluster.evenly-spread-out-slots` is true, and it just improve the
> > > behavior without the semantics is changed.
> > >
> > > In the other hands, if we choose option2, when user set
> > > `taskmanager.load-balance.mode` is Tasks. It also can achieve
> > > the goal when it's Slots.
> > >
> > > So I think the `Slots` enum isn't needed if we choose option2.
> > > Of course, If we choose the option1, the enum is needed.
> > >
> > > Looking forward to your feedback, thanks~
> > >
> > > Best,
> > > Rui
> > >
> > > On Wed, Sep 27, 2023 at 9:11 PM Zhu Zhu  wrote:
> > >
> > > > Thanks Yuepeng and Rui for creating this FLIP.
> > > >
> > > > +1 in general
> > > > The idea is straight forward: best-effort gather all the slot
> requests
> > > > and offered slots to form an overview before assigning slots, trying
> to
> > > > balance the loads of task managers when assigning slots.
> > > >
> > > > I have one comment regarding the configuration for ease of use:
> > > >
> > > > IIUC, this FLIP uses an existing config
> 'cluster.evenly-spread-out-slots'
> > > > as the main switch of the new feature. That is, from user
> perspective,
> > > > with this improvement, the 'cluster.evenly-spread-out-slots' feature
> not
> > > > only balances the number of slots on task managers, but also
> balances the
> > > > number of tasks. This is a behavior change anyway. Besides that, it
> also
> > > > requires users to set 'slot.sharing-strategy' to
> > > 'TASK_BALANCED_PREFERRED'
> > > > to balance the tasks in each slot.
> > > >
> > > > I think we can introduce a new config option
> > > > `taskmanager.load-balance.mode`,
> > > > which accepts "None"/"Slots"/"Tasks".
> `cluster.evenly-spread-out-slots`
> > > > can be superseded by the "Slots" mode and get deprecated. In the
> future
> > > > it can support more mode, e.g. "CpuCores", to work better for jobs
&g

[DISCUSS] FLIP-364: Improve the restart-strategy

2023-10-16 Thread Rui Fan

Hi all,

I would like to start a discussion on FLIP-364: Improve the
restart-strategy[1]

As we know, the restart-strategy is critical for flink jobs, it mainly
has two functions:
1. When an exception occurs in the flink job, quickly restart the job
so that the job can return to the running state.
2. When a job cannot be recovered after frequent restarts within
a certain period of time, Flink will not retry but will fail the job.

The current restart-strategy support for function 2 has some issues:
1. The exponential-delay doesn't have the max attempts mechanism,
it means that flink will restart indefinitely even if it fails frequently.
2. For multi-region streaming jobs and all batch jobs, the failure of
each region will increase the total number of job failures by +1,
even if these failures occur at the same time. If the number of
failures increases too quickly, it will be difficult to set a reasonable
number of retries.
If the maximum number of failures is set too low, the job can easily
reach the retry limit, causing the job to fail. If set too high, some jobs
will never fail.

In addition, when the above two problems are solved, we can also
discuss whether exponential-delay can replace fixed-delay as the
default restart-strategy. In theory, exponential-delay is smarter and
friendlier than fixed-delay.

I also thank Zhu Zhu for his suggestions on the option name in
FLINK-32895[2] in advance.

Looking forward to and welcome everyone's feedback and suggestions, thank
you.

[1] https://cwiki.apache.org/confluence/x/uJqzDw
[2] https://issues.apache.org/jira/browse/FLINK-32895

Best,
Rui

Re: [DISCUSS] FLIP-375: Built-in cross-platform powerful java profiler on taskmanagers

2023-10-16 Thread Rui Fan

ther than the way the profiler is best used, it feels like this FLIP
> > should support async-profiler's regular modes of operation & the other
> > most-common configuration options. From my own experience, `cpu`
> (requiring
> > perf_events) is a bit more accurate than `itimer`, and if I recall, and
> > samples once per thread. `wall` is very useful to debug blocks on I/O or
> > locks. Getting per-thread information is nice to drill down into specific
> > parts of the Flink application, e.g. the flame graph lets me ignore the
> > many other tasks running on TM & drill down into just the Source threads,
> > when debugging a Source issue.
> >
> > Kind regards,
> > David
> >
> > On Fri, Oct 13, 2023 at 1:45 AM Rui Fan <1996fan...@gmail.com> wrote:
> >
> > > One minor comment:
> > >
> > > In general, the generic java profiler includes memory analysis,
> > > cpu, thread, deadlock, etc. The FLIP title is java profiler, but
> > > the FLIP just supports flamegraph at process level.
> > > So the `powerful java profiler` title may not be suitable.
> > > Would you mind updating the FLIP title?
> > >
> > > Best,
> > > Rui
> > >
> > > On Fri, Oct 13, 2023 at 4:34 PM Yu Chen  wrote:
> > >
> > > > Hi all.
> > > > If there are no further questions, we will start a vote on FLIP-375
> > next
> > > > week.
> > > >
> > > > Best regards,
> > > > Yu Chen
> > > >
> > > >
> > > > Yu Chen  于2023年10月9日周一 17:24写道：
> > > >
> > > > > Hi all,
> > > > >
> > > > > Yun Tang and I are opening this thread to discuss our proposal to
> > > > > integrate async-profiler's capabilities for profiling taskmananger
> > > (e.g.,
> > > > > generating flame graphs) in the Flink Web [1].
> > > > >
> > > > >
> > > > > Currently, Flink provides ThreadDump and Operator-Level Flame
> Graphs
> > by
> > > > > sampling task threads. The results generated in such way missing
> the
> > > > > relevant stack of java threads and system calls. The
> > async-profiler[2]
> > > > is a
> > > > > low-overhead sampling profiler for Java, but the steps to use it in
> > the
> > > > > production environment are cumbersome and suffer from permissions
> and
> > > > > security risks.
> > > > >
> > > > > Therefore, we propose adding rest APIs to provide the capability to
> > > > invoke
> > > > > async-profiler on multiple platforms through JNI, which can be
> easily
> > > > > operated on Web UI. This enhancement will improve the efficiency
> and
> > > > > experience of Flink users in identifying performance bottlenecks.
> > > > >
> > > > >
> > > > >
> > > > > Please refer to the FLIP document for more details about the
> proposed
> > > > design
> > > > > and implementation. We welcome any feedback and opinions on this
> > > > proposal.
> > > > >
> > > > >
> > > > >
> > > > > [1] FLIP-375: Built-in cross-platform powerful java profiler on
> > > > > taskmanagers - Apache Flink - Apache Software Foundation
> > > > > <
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-375%3A+Built-in+cross-platform+powerful+java+profiler+on+taskmanagers
> > > > >
> > > > >
> > > > > [2] GitHub - async-profiler/async-profiler: Sampling CPU and HEAP
> > > > > profiler for Java featuring AsyncGetCallTrace + perf_events
> > > > > <https://github.com/async-profiler/async-profiler>
> > > > >
> > > > >
> > > > >
> > > > > Best regards,
> > > > >
> > > > > Yun Tang and Yu Chen
> > > > >
> > > >
> > >
> >
>

Re: [VOTE] FLIP-375: Built-in cross-platform powerful java profiler

2023-10-17 Thread Rui Fan

+1(binding)

Best,
Rui

On Tue, Oct 17, 2023 at 3:52 PM Yu Chen  wrote:

> Hi all,
>
> Thank you to everyone for the feedback and detailed comments on
> FLIP-375[1].
> Based on the discussion thread [2], I think we are ready to take a vote to
> contribute this to Flink.
> I'd like to start a vote for it.
> The vote will be open for at least 72 hours (excluding weekends, unless
> there is an objection or an insufficient number of votes).
>
> Thanks,
> Yu Chen
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-375%3A+Built-in+cross-platform+powerful+java+profiler
> [2] https://lists.apache.org/thread/tp5vqgspsdko66dr6vm7cgtod9k2pct7
>

Re: [DISCUSS] Creating Kubernetes Operator release 1.6.1

2023-10-18 Thread Rui Fan

Hi Gyula,

Thank you for driving this discussion!

This release seems good to me, I have a question:
I see some bugfix commits have been merged into
the 1.6-release branch. Does it already contain all
recent bugfix commits?

Also, you said in the `Kubernetes Operator 1.6.0 release planning`[1]:

> I am volunteering as the release manager but if someone else wants to do
it, I would also be happy to simply give assistance :)

It's a minor version. For those who have never released before,
a minor version may be a good entry point. Would you mind
if I volunteer as the release manager for 1.6.1?

[1] https://lists.apache.org/thread/5ynjv18nfoj6rvyhlz1g5y8qtxx6v1gd

Best,
Rui

On Thu, Oct 19, 2023 at 1:06 PM Gyula Fóra  wrote:

> Hi All!
>
> I would like to propose to release the 1.6.1 patch version for the
> Kubernetes operator. The release branch currently contains 2-3 critical
> fixes for issues that many users have hit over time.
>
> Making this release now would allow us more time to wrap up and finalize
> the 1.7.0 release changes (some of which are quite big regarding the
> autoscaler)
>
> If there are no objections I will prepare the release candidate. The
> changeset is minimal but very important.
>
> Cheers,
> Gyula
>

Re: [DISCUSS] FLIP-364: Improve the restart-strategy

2023-10-19 Thread Rui Fan

Hi Konstantin and Max,

Thanks for your feedback!

Sorry, I forgot to mention the default value of
`restart-strategy.exponential-delay.max-attempts-before-reset-backoff`.

Retrying forever sounds good to me, I have added it to the FLIP:

The default value of
`restart-strategy.exponential-delay.max-attempts-before-reset-backoff` is
Integer.MAX_VALUE.

Best,
Rui

On Thu, Oct 19, 2023 at 6:29 PM Maximilian Michels  wrote:

> Hey Rui,
>
> +1 for making exponential backoff the default. I agree with Konstantin
> that retrying forever is a good default for exponential backoff
> because oftentimes the issue will resolve eventually. The purpose of
> exponential backoff is precisely to continue to retry without causing
> too much load. However, I'm not against adding an optional max number
> of retries.
>
> -Max
>
> On Thu, Oct 19, 2023 at 11:35 AM Konstantin Knauf 
> wrote:
> >
> > Hi Rui,
> >
> > Thank you for this proposal and working on this. I also agree that
> > exponential back off makes sense as a new default in general. I think
> > restarting indefinitely (no max attempts) makes sense by default, though,
> > but of course allowing users to change is valuable.
> >
> > So, overall +1.
> >
> > Cheers,
> >
> > Konstantin
> >
> > Am Di., 17. Okt. 2023 um 07:11 Uhr schrieb Rui Fan <1996fan...@gmail.com
> >:
> >
> > > Hi all,
> > >
> > > I would like to start a discussion on FLIP-364: Improve the
> > > restart-strategy[1]
> > >
> > > As we know, the restart-strategy is critical for flink jobs, it mainly
> > > has two functions:
> > > 1. When an exception occurs in the flink job, quickly restart the job
> > > so that the job can return to the running state.
> > > 2. When a job cannot be recovered after frequent restarts within
> > > a certain period of time, Flink will not retry but will fail the job.
> > >
> > > The current restart-strategy support for function 2 has some issues:
> > > 1. The exponential-delay doesn't have the max attempts mechanism,
> > > it means that flink will restart indefinitely even if it fails
> frequently.
> > > 2. For multi-region streaming jobs and all batch jobs, the failure of
> > > each region will increase the total number of job failures by +1,
> > > even if these failures occur at the same time. If the number of
> > > failures increases too quickly, it will be difficult to set a
> reasonable
> > > number of retries.
> > > If the maximum number of failures is set too low, the job can easily
> > > reach the retry limit, causing the job to fail. If set too high, some
> jobs
> > > will never fail.
> > >
> > > In addition, when the above two problems are solved, we can also
> > > discuss whether exponential-delay can replace fixed-delay as the
> > > default restart-strategy. In theory, exponential-delay is smarter and
> > > friendlier than fixed-delay.
> > >
> > > I also thank Zhu Zhu for his suggestions on the option name in
> > > FLINK-32895[2] in advance.
> > >
> > > Looking forward to and welcome everyone's feedback and suggestions,
> thank
> > > you.
> > >
> > > [1] https://cwiki.apache.org/confluence/x/uJqzDw
> > > [2] https://issues.apache.org/jira/browse/FLINK-32895
> > >
> > > Best,
> > > Rui
> > >
> >
> >
> > --
> > https://twitter.com/snntrable
> > https://github.com/knaufk
>

Re: [ANNOUNCE] The Flink Speed Center and benchmark daily run are back online

2023-10-20 Thread Rui Fan

Thanks for your effort! It's very useful when some new commits affect
performance.

Best,
Rui

On Fri, Oct 20, 2023 at 4:42 PM Yanfei Lei  wrote:

> Thanks for your hard work!
> Looking forward to the daily monitoring being available again soon.
>
> Best,
> Yanfei
>
> Yuan Mei  于2023年10月20日周五 16:19写道：
> >
> > Thank you for your great efforts!
> >
> > Best
> > Yuan
> >
> > On Fri, Oct 20, 2023 at 4:08 PM Sergey Nuyanzin 
> wrote:
> >
> > > Thanks a lot for working on this!
> > >
> > > On Fri, Oct 20, 2023 at 9:27 AM Yangze Guo  wrote:
> > >
> > > > Thanks for the effort, Zhaoqian!
> > > >
> > > > Best,
> > > > Yangze Guo
> > > >
> > > > On Fri, Oct 20, 2023 at 2:55 PM Leonard Xu 
> wrote:
> > > > >
> > > > > Thanks Zakelly for the great work.
> > > > >
> > > > > Best,
> > > > > Leonard
> > > > >
> > > > > > 2023年10月19日 下午7:39，Jing Ge  写道：
> > > > > >
> > > > > > Hi Zakelly,
> > > > > >
> > > > > > Thank you for your effort! Really appreciate it!
> > > > > >
> > > > > > Best regards,
> > > > > > Jing
> > > > > >
> > > > > > On Thu, Oct 19, 2023 at 12:02 PM Yun Tang 
> wrote:
> > > > > >
> > > > > >> Thanks for Zakelly's great work!
> > > > > >>
> > > > > >>
> > > > > >> Best
> > > > > >> Yun Tang
> > > > > >> 
> > > > > >> From: Piotr Nowojski 
> > > > > >> Sent: Thursday, October 19, 2023 17:56
> > > > > >> To: dev@flink.apache.org 
> > > > > >> Subject: Re: [ANNOUNCE] The Flink Speed Center and benchmark
> daily
> > > > run are
> > > > > >> back online
> > > > > >>
> > > > > >> Thank you!
> > > > > >>
> > > > > >> czw., 19 paź 2023 o 11:31 Konstantin Knauf 
> > > > napisał(a):
> > > > > >>
> > > > > >>> Thanks a lot for working on this!
> > > > > >>>
> > > > > >>> Am Do., 19. Okt. 2023 um 10:24 Uhr schrieb Zakelly Lan <
> > > > > >>> zakelly@gmail.com>:
> > > > > >>>
> > > > >  Hi everyone,
> > > > > 
> > > > >  Flink benchmarks [1] generate daily performance reports in the
> > > > Apache
> > > > >  Flink slack channel (#flink-dev-benchmarks) to detect
> performance
> > > > >  regression [2]. Those benchmarks previously were running on
> > > several
> > > > >  machines donated and maintained by Ververica. Unfortunately,
> those
> > > > >  machines were gone due to account issues [3] and the
> benchmarks
> > > > daily
> > > > >  run stopped since August 24th delaying the release of Flink
> 1.18 a
> > > > >  bit. [4].
> > > > > 
> > > > >  Ververica donated several new machines! After several weeks of
> > > > work, I
> > > > >  have successfully re-established the codespeed panel and
> benchmark
> > > > >  daily run pipelines on them. At this time, we are pleased to
> > > > announce
> > > > >  that the Flink Speed Center and benchmark pipelines are back
> > > online.
> > > > >  These new machines have a more formal management to ensure
> that
> > > > >  previous accidents will not occur in the future.
> > > > > 
> > > > >  What's more, I successfully recovered historical data backed
> up by
> > > > >  Yanfei Lei [5]. So with the old domain [6] redirected to the
> new
> > > > >  machines, the old links that existed in previous records will
> > > still
> > > > be
> > > > >  valid. Besides the benchmarks with Java8 and Java11, I also
> added
> > > a
> > > > >  pipeline for Java17 running daily.
> > > > > 
> > > > >  How to use it:
> > > > >  We also registered a new domain name 'flink-speed.xyz' for
> the
> > > > Flink
> > > > >  Speed Center [7]. It is recommended to use the new domain in
> the
> > > > >  future. Currently, the self-service method of triggering
> > > benchmarks
> > > > is
> > > > >  unavailable considering the lack of resources and potential
> > > > >  vulnerabilities of Jenkins. Please contact one of Apache Flink
> > > PMCs
> > > > to
> > > > >  submit a benchmark. More info is updated on the wiki[8].
> > > > > 
> > > > >  Daily Monitoring:
> > > > >  The performance daily monitoring on the Apache Flink slack
> channel
> > > > [2]
> > > > >  is still unavailable as the benchmark results need more time
> to
> > > > >  stabilize in the new environment. Once the baseline results
> become
> > > > >  available for regression detection, I will enable the daily
> > > > >  monitoring.
> > > > > 
> > > > >  Please feel free to reach out to me if you have any
> suggestions or
> > > > >  questions. Thanks Ververica again for denoting machines!
> > > > > 
> > > > > 
> > > > >  Best,
> > > > >  Zakelly
> > > > > 
> > > > >  [1] https://github.com/apache/flink-benchmarks
> > > > >  [2]
> > > > https://lists.apache.org/thread/zok62sx4m50c79htfp18ymq5vmtgbgxj
> > > > >  [3] https://issues.apache.org/jira/browse/FLINK-33052
> > > > >  [4]
> > > > https://lists.apache.org//thread/5x28rp3zct4p603hm4zdwx6kfr101w38
> > > > >  [5] https://issues.apache.org/jira/browse/FLINK-30890
>

Re: [DISCUSS] Creating Kubernetes Operator release 1.6.1

2023-10-20 Thread Rui Fan

Hi devs,

I checked all commits that the main branch has but release-1.6 branch
doesn't,
all of them are improvements instead of bug fixes. So I will prepare the
1.6.1-rc1 based on the latest release-1.6 branch.

Please let me know if any critical commit is missed at release-1.6 branch,
thanks~

And thanks Gyula's help!

Best,
Rui

On Thu, Oct 19, 2023 at 2:25 PM Gyula Fóra  wrote:

> Thanks Rui,
> I would appreciate your help. Let's sync on the tasks offline.
>
> As for the bugfix commits. We can take a quick look and backport some other
> fixes if needed but we should focus on critical fixes / regressions to make
> the changes minimal given that the 1.7.0 is not so far anyway :)
>
> Cheers,
> Gyula
>
>
> On Thu, Oct 19, 2023 at 8:20 AM Rui Fan <1996fan...@gmail.com> wrote:
>
> > Hi Gyula,
> >
> > Thank you for driving this discussion!
> >
> > This release seems good to me, I have a question:
> > I see some bugfix commits have been merged into
> > the 1.6-release branch. Does it already contain all
> > recent bugfix commits?
> >
> > Also, you said in the `Kubernetes Operator 1.6.0 release planning`[1]:
> >
> > > I am volunteering as the release manager but if someone else wants to
> do
> > it, I would also be happy to simply give assistance :)
> >
> > It's a minor version. For those who have never released before,
> > a minor version may be a good entry point. Would you mind
> > if I volunteer as the release manager for 1.6.1?
> >
> > [1] https://lists.apache.org/thread/5ynjv18nfoj6rvyhlz1g5y8qtxx6v1gd
> >
> > Best,
> > Rui
> >
> > On Thu, Oct 19, 2023 at 1:06 PM Gyula Fóra  wrote:
> >
> > > Hi All!
> > >
> > > I would like to propose to release the 1.6.1 patch version for the
> > > Kubernetes operator. The release branch currently contains 2-3 critical
> > > fixes for issues that many users have hit over time.
> > >
> > > Making this release now would allow us more time to wrap up and
> finalize
> > > the 1.7.0 release changes (some of which are quite big regarding the
> > > autoscaler)
> > >
> > > If there are no objections I will prepare the release candidate. The
> > > changeset is minimal but very important.
> > >
> > > Cheers,
> > > Gyula
> > >
> >
>

Re: [VOTE] Release 1.18.0, release candidate #3

2023-10-21 Thread Rui Fan

+1(non-binding)

- Downloaded artifacts from dist[1]
- Verified SHA512 checksums
- Verified GPG signatures
- Build the source with java-1.8 and verified the licenses together
- Verified web PR

[1] https://dist.apache.org/repos/dist/dev/flink/flink-1.18.0-rc3/

Best,
Rui

On Fri, Oct 20, 2023 at 10:31 PM Martijn Visser 
wrote:

> +1 (binding)
>
> - Validated hashes
> - Verified signature
> - Verified that no binaries exist in the source archive
> - Build the source with Maven
> - Verified licenses
> - Verified web PR
> - Started a cluster and the Flink SQL client, successfully read and
> wrote with the Kafka connector to Confluent Cloud with AVRO and Schema
> Registry enabled
>
> On Fri, Oct 20, 2023 at 2:55 PM Matthias Pohl
>  wrote:
> >
> > +1 (binding)
> >
> >  * Downloaded artifacts
> >  * Built Flink from sources
> >  * Verified SHA512 checksums GPG signatures
> >  * Compared checkout with provided sources
> >  * Verified pom file versions
> >  * Verified that there are no pom/NOTICE file changes since RC1
> >  * Deployed standalone session cluster and ran WordCount example in batch
> > and streaming: Nothing suspicious in log files found
> >
> > On Thu, Oct 19, 2023 at 3:00 PM Piotr Nowojski 
> wrote:
> >
> > > +1 (binding)
> > >
> > > Best,
> > > Piotrek
> > >
> > > czw., 19 paź 2023 o 09:55 Yun Tang  napisał(a):
> > >
> > > > +1 (non-binding)
> > > >
> > > >
> > > >   *   Build from source code
> > > >   *   Verify the pre-built jar packages were built with JDK8
> > > >   *   Verify FLIP-291 with a standalone cluster, and it works fine
> with
> > > > StateMachine example.
> > > >   *   Checked the signature
> > > >   *   Viewed the PRs.
> > > >
> > > > Best
> > > > Yun Tang
> > > > 
> > > > From: Cheng Pan 
> > > > Sent: Thursday, October 19, 2023 14:38
> > > > To: dev@flink.apache.org 
> > > > Subject: RE: [VOTE] Release 1.18.0, release candidate #3
> > > >
> > > > +1 (non-binding)
> > > >
> > > > We(the Apache Kyuubi community), verified that the Kyuubi Flink
> engine
> > > > works well[1] with Flink 1.18.0 RC3.
> > > >
> > > > [1] https://github.com/apache/kyuubi/pull/5465
> > > >
> > > > Thanks,
> > > > Cheng Pan
> > > >
> > > >
> > > > On 2023/10/19 00:26:24 Jing Ge wrote:
> > > > > Hi everyone,
> > > > >
> > > > > Please review and vote on the release candidate #3 for the version
> > > > > 1.18.0, as follows:
> > > > > [ ] +1, Approve the release
> > > > > [ ] -1, Do not approve the release (please provide specific
> comments)
> > > > >
> > > > > The complete staging area is available for your review, which
> includes:
> > > > >
> > > > > * JIRA release notes [1], and the pull request adding release note
> for
> > > > > users [2]
> > > > > * the official Apache source release and binary convenience
> releases to
> > > > be
> > > > > deployed to dist.apache.org [3], which are signed with the key
> with
> > > > > fingerprint 96AE0E32CBE6E0753CE6 [4],
> > > > > * all artifacts to be deployed to the Maven Central Repository [5],
> > > > > * source code tag "release-1.18.0-rc3" [6],
> > > > > * website pull request listing the new release and adding
> announcement
> > > > blog
> > > > > post [7].
> > > > >
> > > > > The vote will be open for at least 72 hours. It is adopted by
> majority
> > > > > approval, with at least 3 PMC affirmative votes.
> > > > >
> > > > > Best regards,
> > > > > Konstantin, Sergey, Qingsheng, and Jing
> > > > >
> > > > > [1]
> > > > >
> > > >
> > >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12352885
> > > > > [2] https://github.com/apache/flink/pull/23527
> > > > > [3] https://dist.apache.org/repos/dist/dev/flink/flink-1.18.0-rc3/
> > > > > [4] https://dist.apache.org/repos/dist/release/flink/KEYS
> > > > > [5]
> > > >
> https://repository.apache.org/content/repositories/orgapacheflink-1662
> > > > > [6]
> https://github.com/apache/flink/releases/tag/release-1.18.0-rc3
> > > > > [7] https://github.com/apache/flink-web/pull/680
> > > > >
> > > >
> > > >
> > > >
> > > >
> > >
>

[VOTE] Apache Flink Kubernetes Operator Release 1.6.1, release candidate #1

2023-10-21 Thread Rui Fan

Hi Everyone,

Please review and vote on the release candidate #1 for the version 1.6.1 of
Apache Flink Kubernetes Operator,
as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

**Release Overview**

As an overview, the release consists of the following:
a) Kubernetes Operator canonical source distribution (including the
Dockerfile), to be deployed to the release repository at dist.apache.org
b) Kubernetes Operator Helm Chart to be deployed to the release repository
at dist.apache.org
c) Maven artifacts to be deployed to the Maven Central Repository
d) Docker image to be pushed to dockerhub

**Staging Areas to Review**

The staging areas containing the above mentioned artifacts are as follows,
for your review:
* All artifacts for a,b) can be found in the corresponding dev repository
at dist.apache.org [1]
* All artifacts for c) can be found at the Apache Nexus Repository [2]
* The docker image for d) is staged on github [3]

All artifacts are signed with the
key B2D64016B940A7E0B9B72E0D7D0528B28037D8BC [4]

Other links for your review:
* source code tag "release-1.6.1-rc1" [5]
* PR to update the website Downloads page to
include Kubernetes Operator links [6]
* PR to update the doc version of flink-kubernetes-operator[7]

**Vote Duration**

The voting time will run for at least 72 hours.
It is adopted by majority approval, with at least 3 PMC affirmative votes.

**Note on Verification**

You can follow the basic verification guide here[8].
Note that you don't need to verify everything yourself, but please make
note of what you have tested together with your +- vote.

[1]
https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.6.1-rc1/
[2] https://repository.apache.org/content/repositories/orgapacheflink-1663/
[3]
https://github.com/apache/flink-kubernetes-operator/pkgs/container/flink-kubernetes-operator/139454270?tag=51eeae1
[4] https://dist.apache.org/repos/dist/release/flink/KEYS
[5]
https://github.com/apache/flink-kubernetes-operator/tree/release-1.6.1-rc1
[6] https://github.com/apache/flink-web/pull/690
[7] https://github.com/apache/flink-kubernetes-operator/pull/687
[8]
https://cwiki.apache.org/confluence/display/FLINK/Verifying+a+Flink+Kubernetes+Operator+Release

Best,
Rui

Re: [VOTE] Apache Flink Kubernetes Operator Release 1.6.1, release candidate #1

2023-10-21 Thread Rui Fan

+1(non-binding)

- Downloaded artifacts from dist
- Verified SHA512 checksums
- Verified GPG signatures
- Build the source with java-11 and verified the licenses together
- Verified that all POM files point to the same version.
- Verified that chart and appVersion matches the target release
- Verified that helm chart / values.yaml points to the RC docker image
- Verified that RC repo works as Helm repo (helm repo add
flink-operator-repo-1.6.1-rc1
https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.6.1-rc1/
)
- Verified Helm chart can be installed  (helm install
flink-kubernetes-operator
flink-operator-repo-1.6.1-rc1/flink-kubernetes-operator --set
webhook.create=false)
- Submitted the autoscaling demo, the autoscaler works well (kubectl apply
-f autoscaling.yaml)
- Triggered a manual savepoint (update the yaml: savepointTriggerNonce: 101)

Best,
Rui

On Sat, Oct 21, 2023 at 7:33 PM Rui Fan <1996fan...@gmail.com> wrote:

> Hi Everyone,
>
> Please review and vote on the release candidate #1 for the version 1.6.1 of
> Apache Flink Kubernetes Operator,
> as follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
> **Release Overview**
>
> As an overview, the release consists of the following:
> a) Kubernetes Operator canonical source distribution (including the
> Dockerfile), to be deployed to the release repository at dist.apache.org
> b) Kubernetes Operator Helm Chart to be deployed to the release repository
> at dist.apache.org
> c) Maven artifacts to be deployed to the Maven Central Repository
> d) Docker image to be pushed to dockerhub
>
> **Staging Areas to Review**
>
> The staging areas containing the above mentioned artifacts are as follows,
> for your review:
> * All artifacts for a,b) can be found in the corresponding dev repository
> at dist.apache.org [1]
> * All artifacts for c) can be found at the Apache Nexus Repository [2]
> * The docker image for d) is staged on github [3]
>
> All artifacts are signed with the
> key B2D64016B940A7E0B9B72E0D7D0528B28037D8BC [4]
>
> Other links for your review:
> * source code tag "release-1.6.1-rc1" [5]
> * PR to update the website Downloads page to
> include Kubernetes Operator links [6]
> * PR to update the doc version of flink-kubernetes-operator[7]
>
> **Vote Duration**
>
> The voting time will run for at least 72 hours.
> It is adopted by majority approval, with at least 3 PMC affirmative votes.
>
> **Note on Verification**
>
> You can follow the basic verification guide here[8].
> Note that you don't need to verify everything yourself, but please make
> note of what you have tested together with your +- vote.
>
> [1]
> https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.6.1-rc1/
> [2]
> https://repository.apache.org/content/repositories/orgapacheflink-1663/
> [3]
> https://github.com/apache/flink-kubernetes-operator/pkgs/container/flink-kubernetes-operator/139454270?tag=51eeae1
> [4] https://dist.apache.org/repos/dist/release/flink/KEYS
> [5]
> https://github.com/apache/flink-kubernetes-operator/tree/release-1.6.1-rc1
> [6] https://github.com/apache/flink-web/pull/690
> [7] https://github.com/apache/flink-kubernetes-operator/pull/687
> [8]
> https://cwiki.apache.org/confluence/display/FLINK/Verifying+a+Flink+Kubernetes+Operator+Release
>
> Best,
> Rui
>

Re: [VOTE] FLIP-370: Support Balanced Tasks Scheduling

2023-10-22 Thread Rui Fan

+1(binding)

Thanks to Yuepeng and to everyone who participated in the discussion!

Best,
Rui

On Mon, Oct 23, 2023 at 11:55 AM Roc Marshal  wrote:

> Hi all,
>
> Thanks for all the feedback on FLIP-370[1][2].
> I'd like to start a vote for  FLIP-370. The vote will last for at least 72
> hours (Oct. 26th at 10:00 A.M. GMT) unless there is an objection or
> insufficient votes.
>
> Thanks,
> Yuepeng Pan
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-370%3A+Support+Balanced+Tasks+Scheduling
> [2] https://lists.apache.org/thread/mx3ot0fmk6zr02ccdby0s8oqxqm2pn1x
>

Re: Request to release flink 1.6.3

2023-10-24 Thread Rui Fan

Hi Vikas,

Thanks for your feedback!

Do you mean flink 1.16.3 instead of 1.6.3?

The 1.16.2 and 1.17.1 were released on 2023-05-25,
it’s been 5 months. And the flink community has fixed
many bugs in the past 5 months. Usually, there is a
fix(minor) version every three or four months, so I propose
to release 1.16.3 and 1.17.2 now.

If the community agrees to create this new patch release, I could
volunteer as the release manager for one of the 1.16.3 or 1.17.2.

Looking forward to feedback from the community, thank you

[1]
https://flink.apache.org/2023/05/25/apache-flink-1.16.2-release-announcement/
[2]
https://flink.apache.org/2023/05/25/apache-flink-1.17.1-release-announcement/

Best,
Rui

On Tue, Oct 24, 2023 at 9:50 PM vikas patil  wrote:

> Hello All,
>
> Facing this FLINK-28185  >
> issue for one of the flink jobs. We are running flink version 1.6.1 but it
> looks like the backport  to
> 1.6
> was never released as 1.6.3. The latest that was released is 1.6.2
> . By any chance we can get
> the 1.6.3 released ?
>
> Also we use the official flink docker 
> image. Not sure if that needs to be updated as well manually. Thanks.
>
> -Vikas
>

Re: [VOTE] Apache Flink Kubernetes Operator Release 1.6.1, release candidate #1

2023-10-25 Thread Rui Fan

Hi Thomas,

Thanks for your verification and feedback!

I tried to build the flink-kubernetes-operator project with Java 17,
it's really not supported right now.

Offline discussion with Gyula, we hope Kubernetes operator supports
compiling with Java 17 as a critical ticket in 1.7.0. I created the
FLINK-33359[1] to follow it.

[1] https://issues.apache.org/jira/browse/FLINK-33359

Best,
Rui

On Wed, Oct 25, 2023 at 8:30 AM Thomas Weise  wrote:

> +1 (binding)
>
> - Verified checksums, signatures, source release content
> - Run unit tests
>
> Side note:   mvn clean verifyfails with Java 17 compiler. While the
> build target version may be 11, preferably a higher JDK version can be used
> to build the source.
>
>  Caused by: java.lang.IllegalAccessError: class
> com.google.googlejavaformat.java.RemoveUnusedImports (in unnamed module
> @0x44f433db) cannot access class com.sun.tools.javac.util.Context (in
> module jdk.compiler) because module jdk.compiler does not export
> com.sun.tools.javac.util to unnamed module @0x44f433db
>
> at
>
> com.google.googlejavaformat.java.RemoveUnusedImports.removeUnusedImports(RemoveUnusedImports.java:187)
>
> Thanks,
> Thomas
>
>
> On Sat, Oct 21, 2023 at 7:35 AM Rui Fan <1996fan...@gmail.com> wrote:
>
> > Hi Everyone,
> >
> > Please review and vote on the release candidate #1 for the version 1.6.1
> of
> > Apache Flink Kubernetes Operator,
> > as follows:
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> > **Release Overview**
> >
> > As an overview, the release consists of the following:
> > a) Kubernetes Operator canonical source distribution (including the
> > Dockerfile), to be deployed to the release repository at dist.apache.org
> > b) Kubernetes Operator Helm Chart to be deployed to the release
> repository
> > at dist.apache.org
> > c) Maven artifacts to be deployed to the Maven Central Repository
> > d) Docker image to be pushed to dockerhub
> >
> > **Staging Areas to Review**
> >
> > The staging areas containing the above mentioned artifacts are as
> follows,
> > for your review:
> > * All artifacts for a,b) can be found in the corresponding dev repository
> > at dist.apache.org [1]
> > * All artifacts for c) can be found at the Apache Nexus Repository [2]
> > * The docker image for d) is staged on github [3]
> >
> > All artifacts are signed with the
> > key B2D64016B940A7E0B9B72E0D7D0528B28037D8BC [4]
> >
> > Other links for your review:
> > * source code tag "release-1.6.1-rc1" [5]
> > * PR to update the website Downloads page to
> > include Kubernetes Operator links [6]
> > * PR to update the doc version of flink-kubernetes-operator[7]
> >
> > **Vote Duration**
> >
> > The voting time will run for at least 72 hours.
> > It is adopted by majority approval, with at least 3 PMC affirmative
> votes.
> >
> > **Note on Verification**
> >
> > You can follow the basic verification guide here[8].
> > Note that you don't need to verify everything yourself, but please make
> > note of what you have tested together with your +- vote.
> >
> > [1]
> >
> >
> https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.6.1-rc1/
> > [2]
> > https://repository.apache.org/content/repositories/orgapacheflink-1663/
> > [3]
> >
> >
> https://github.com/apache/flink-kubernetes-operator/pkgs/container/flink-kubernetes-operator/139454270?tag=51eeae1
> > [4] https://dist.apache.org/repos/dist/release/flink/KEYS
> > [5]
> >
> https://github.com/apache/flink-kubernetes-operator/tree/release-1.6.1-rc1
> > [6] https://github.com/apache/flink-web/pull/690
> > [7] https://github.com/apache/flink-kubernetes-operator/pull/687
> > [8]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/Verifying+a+Flink+Kubernetes+Operator+Release
> >
> > Best,
> > Rui
> >
>

Re: [VOTE] Apache Flink Kubernetes Operator Release 1.6.1, release candidate #1

2023-10-26 Thread Rui Fan

@David, thanks for the reminder, I will figure out the steps
I missed later and pay more attention in the future.

@Max, My key has already been added to KEYS.

@EveryOne Thank you everyone for testing this!

Closing the vote now, the results will be announced in a separate email.

Best,
Rui

On Thu, Oct 26, 2023 at 9:44 PM Maximilian Michels  wrote:

> +1 (binding)
>
> 1. Downloaded the archives, checksums, and signatures
> 2. Verified the signatures and checksums ( gpg --recv-keys
> B2D64016B940A7E0B9B72E0D7D0528B28037D8BC )
> 3. Extract and inspect the source code for binaries
> 4. Compiled and tested the source code via mvn verify
> 5. Verified license files / headers
> 6. Deployed helm chart to test cluster
> 7. Ran example job
> 8. Tested autoscaling without rescaling API
>
> @Rui Can you add your key to
> https://dist.apache.org/repos/dist/release/flink/KEYS ?
>
> -Max
>
> On Thu, Oct 26, 2023 at 1:53 PM Márton Balassi 
> wrote:
> >
> > Thank you, team. @David Radley: Not having Rui's key signed is not ideal,
> > but is acceptable for the release.
> >
> > +1 (binding)
> >
> > - Verified Helm repo works as expected, points to correct image tag,
> build,
> > version
> > - Verified basic examples + checked operator logs everything looks as
> > expected
> > - Verified hashes, signatures and source release contains no binaries
> > - Ran built-in tests, built jars + docker image from source successfully
> >
> > Best,
> > Marton
> >
> > On Thu, Oct 26, 2023 at 12:24 PM David Radley 
> > wrote:
> >
> > > Hi,
> > > I downloaded the artifacts.
> > >
> > >   *   I did an install of the operator and ran the basic sample
> > >   *   I checked the checksums
> > >   *   Checked the GPG signatures
> > >   *   Ran the UI
> > >   *   Ran a Twistlock scan
> > >   *   I installed 1.6 then did a helm upgrade
> > >   *   I have not managed to do the source build and subsequent install
> yet.
> > > I wanted to check these 2 things are what you would expect:
> > >
> > >   1.  I followed link
> > >
> https://github.com/apache/flink-kubernetes-operator/pkgs/container/flink-kubernetes-operator/139454270?tag=51eeae1
> > > And notice that it does not have a description . Is this correct?
> > >
> > >   1.  I get this in the gpg verification . Is this ok?
> > >
> > >
> > > gpg --verify flink-kubernetes-operator-1.6.1-src.tgz.asc
> > >
> > > gpg: assuming signed data in 'flink-kubernetes-operator-1.6.1-src.tgz'
> > >
> > > gpg: Signature made Fri 20 Oct 2023 04:07:48 PDT
> > >
> > > gpg:using RSA key
> B2D64016B940A7E0B9B72E0D7D0528B28037D8BC
> > >
> > > gpg: Good signature from "Rui Fan fan...@apache.org > > fan...@apache.org>" [unknown]
> > >
> > > gpg: WARNING: This key is not certified with a trusted signature!
> > >
> > > gpg:  There is no indication that the signature belongs to the
> > > owner.
> > >
> > > Primary key fingerprint: B2D6 4016 B940 A7E0 B9B7  2E0D 7D05 28B2 8037
> D8BC
> > >
> > >
> > >
> > >
> > > Hi Everyone,
> > >
> > > Please review and vote on the release candidate #1 for the version
> 1.6.1 of
> > > Apache Flink Kubernetes Operator,
> > > as follows:
> > > [ ] +1, Approve the release
> > > [ ] -1, Do not approve the release (please provide specific comments)
> > >
> > > **Release Overview**
> > >
> > > As an overview, the release consists of the following:
> > > a) Kubernetes Operator canonical source distribution (including the
> > > Dockerfile), to be deployed to the release repository at
> dist.apache.org
> > > b) Kubernetes Operator Helm Chart to be deployed to the release
> repository
> > > at dist.apache.org
> > > c) Maven artifacts to be deployed to the Maven Central Repository
> > > d) Docker image to be pushed to dockerhub
> > >
> > > **Staging Areas to Review**
> > >
> > > The staging areas containing the above mentioned artifacts are as
> follows,
> > > for your review:
> > > * All artifacts for a,b) can be found in the corresponding dev
> repository
> > > at dist.apache.org [1]
> > > * All artifacts for c) can be found at the Apache Nexus Repository [2]
> > > * The docker image for d) is staged on github [3]
> > >
> > > All artifacts are signed w

[RESULT] [VOTE] Apache Flink Kubernetes Operator Release 1.6.1, release candidate #1

2023-10-26 Thread Rui Fan

I'm happy to announce that we have unanimously approved this release.

There are 7 approving votes, 4 of which are binding:

* Gyula Fora (binding)
* Thomas Weise (binding)
* Marton Balassi (binding)
* Maximilian Mixhels (binding)
* Mate Czagany (non-binding)
* Samrat Deb (non-binding)
* Rui Fan (non-binding)

And David Radley (non-binding) +0.

There are no disapproving votes.

Thanks everyone!

Best,
Rui

Re: [ANNOUNCE] Apache Flink 1.18.0 released

2023-10-26 Thread Rui Fan

Thanks for the great work!

Best,
Rui

On Fri, Oct 27, 2023 at 10:03 AM Paul Lam  wrote:

> Finally! Thanks to all!
>
> Best,
> Paul Lam
>
> > 2023年10月27日 03:58，Alexander Fedulov  写道：
> >
> > Great work, thanks everyone!
> >
> > Best,
> > Alexander
> >
> > On Thu, 26 Oct 2023 at 21:15, Martijn Visser 
> > wrote:
> >
> >> Thank you all who have contributed!
> >>
> >> Op do 26 okt 2023 om 18:41 schreef Feng Jin 
> >>
> >>> Thanks for the great work! Congratulations
> >>>
> >>>
> >>> Best,
> >>> Feng Jin
> >>>
> >>> On Fri, Oct 27, 2023 at 12:36 AM Leonard Xu  wrote:
> >>>
>  Congratulations, Well done!
> 
>  Best,
>  Leonard
> 
>  On Fri, Oct 27, 2023 at 12:23 AM Lincoln Lee 
>  wrote:
> 
> > Thanks for the great work! Congrats all!
> >
> > Best,
> > Lincoln Lee
> >
> >
> > Jing Ge  于2023年10月27日周五 00:16写道：
> >
> >> The Apache Flink community is very happy to announce the release of
> > Apache
> >> Flink 1.18.0, which is the first release for the Apache Flink 1.18
> > series.
> >>
> >> Apache Flink® is an open-source unified stream and batch data
>  processing
> >> framework for distributed, high-performing, always-available, and
> > accurate
> >> data applications.
> >>
> >> The release is available for download at:
> >> https://flink.apache.org/downloads.html
> >>
> >> Please check out the release blog post for an overview of the
> > improvements
> >> for this release:
> >>
> >>
> >
> 
> >>>
> >>
> https://flink.apache.org/2023/10/24/announcing-the-release-of-apache-flink-1.18/
> >>
> >> The full release notes are available in Jira:
> >>
> >>
> >
> 
> >>>
> >>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12352885
> >>
> >> We would like to thank all contributors of the Apache Flink
> >> community
>  who
> >> made this release possible!
> >>
> >> Best regards,
> >> Konstantin, Qingsheng, Sergey, and Jing
> >>
> >
> 
> >>>
> >>
>
>

[ANNOUNCE] Apache Flink Kubernetes Operator 1.6.1 released

2023-10-30 Thread Rui Fan

The Apache Flink community is very happy to announce the release of Apache
Flink Kubernetes Operator 1.6.1.

Please check out the release blog post for an overview of the release:
https://flink.apache.org/2023/10/27/apache-flink-kubernetes-operator-1.6.1-release-announcement/

The release is available for download at:
https://flink.apache.org/downloads.html

Maven artifacts for Flink Kubernetes Operator can be found at:
https://search.maven.org/artifact/org.apache.flink/flink-kubernetes-operator

Official Docker image for Flink Kubernetes Operator applications can be
found at:
https://hub.docker.com/r/apache/flink-kubernetes-operator

The full release notes are available in Jira:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353784

We would like to thank all contributors of the Apache Flink community who
made this release possible!

Regards,
Rui Fan

Re: [DISCUSS] Kubernetes Operator 1.7.0 release planning

2023-10-31 Thread Rui Fan

Thanks Gyula for driving this release!

I'd like to check with you and community, could we
postpone the code freeze by a week?

I'm developing the FLINK-33099[1], and the prod code is done.
I need some time to develop the tests. I hope this feature is included in
1.7.0 for two main reasons:

1. We have completed the decoupling of the autoscaler and
kubernetes-operator in 1.7.0. During the decoupling period, we modified
a large number of autoscaler-related interfaces. The standalone autoscaler
is an autoscaler process that can run independently. It can help us confirm
whether the new interface is reasonable.
2. 1.18.0 was recently released, standalone autoscaler allows more users to
play autoscaler and in-place rescale.

I have created a draft PR[2] for FLINK-33099, it just includes prod code.
I have run it manually, it works well. And I will try my best to finish all
unit tests before Friday, but must finish all unit tests before next Monday
at the latest.

WDYT?

I'm deeply sorry for the request to postpone the release.

[1] https://issues.apache.org/jira/browse/FLINK-33099
[2] https://github.com/apache/flink-kubernetes-operator/pull/698

Best,
Rui

On Tue, Oct 31, 2023 at 4:10 PM Samrat Deb  wrote:

> Thank you Gyula
>
> (+1 non-binding) in support of you taking on the role of release manager.
>
> > I think this is reasonable as I am not aware of any big features / bug
> fixes being worked on right now. Given the size of the changes related to
> the autoscaler module refactor we should try to focus the remaining time on
> testing.
>
> I completely agree with you. Since the changes are quite extensive, it's
> crucial to allocate more time for thorough testing and verification.
>
> Regarding working with you for the release, I might not have the necessary
> privileges for that.
>
> However, I'd be more than willing to assist with testing the changes,
> validating various features, and checking for any potential regressions in
> the flink-kubernetes-operator.
> Just let me know how I can support the testing efforts.
>
> Bests,
> Samrat
>
>
> On Tue, 31 Oct 2023 at 12:59 AM, Gyula Fóra  wrote:
>
> > Hi all!
> >
> > I would like to kick off the release planning for the operator 1.7.0
> > release. We have added quite a lot of new functionality over the last few
> > weeks and I think the operator is in a good state to kick this off.
> >
> > Based on the original release schedule we had Nov 1 as the proposed
> feature
> > freeze date and Nov 7 as the date for the release cut / rc1.
> >
> > I think this is reasonable as I am not aware of any big features / bug
> > fixes being worked on right now. Given the size of the changes related to
> > the autoscaler module refactor we should try to focus the remaining time
> on
> > testing.
> >
> > I am happy to volunteer as a release manager but I am of course open to
> > working together with someone on this.
> >
> > What do you think?
> >
> > Cheers,
> > Gyula
> >
>

Re: Request to release flink 1.6.3

2023-10-31 Thread Rui Fan

Thanks Vikas for the ask!

Hi devs,

Is anyone willing to pick up the release of 1.16.3 and 1.17.2 with me? If
so, I can volunteer to release one of the versions. If no one picks it up
for more than three days, I volunteer to release two versions. After it’s
determined, the official discussion can be started.

Looking forward to other committer or  PMC join this release!


Hi Matthias,

Thank you for sorting out these 2 lists. 1.16.3 may be the final version of
1.16 series, it makes sense to sort out all left over issues.

Also, some of release processes need PMC  permission, would you mind
helping it? Thanks~

Best,
Rui

On Tue, 31 Oct 2023 at 21:42, vikas patil  wrote:

> Hello Rui,
>
> Do we need more votes for this or are we good to go with the release of
> 1.6.3 ? Please let me know. Thanks.
>
> -Vikas
>
> On Tue, Oct 24, 2023 at 9:27 AM Rui Fan <1996fan...@gmail.com> wrote:
>
> > Hi Vikas,
> >
> > Thanks for your feedback!
> >
> > Do you mean flink 1.16.3 instead of 1.6.3?
> >
> > The 1.16.2 and 1.17.1 were released on 2023-05-25,
> > it’s been 5 months. And the flink community has fixed
> > many bugs in the past 5 months. Usually, there is a
> > fix(minor) version every three or four months, so I propose
> > to release 1.16.3 and 1.17.2 now.
> >
> > If the community agrees to create this new patch release, I could
> > volunteer as the release manager for one of the 1.16.3 or 1.17.2.
> >
> > Looking forward to feedback from the community, thank you
> >
> > [1]
> >
> >
> https://flink.apache.org/2023/05/25/apache-flink-1.16.2-release-announcement/
> > [2]
> >
> >
> https://flink.apache.org/2023/05/25/apache-flink-1.17.1-release-announcement/
> >
> > Best,
> > Rui
> >
> > On Tue, Oct 24, 2023 at 9:50 PM vikas patil 
> > wrote:
> >
> > > Hello All,
> > >
> > > Facing this FLINK-28185 <
> > https://issues.apache.org/jira/browse/FLINK-28185
> > > >
> > > issue for one of the flink jobs. We are running flink version 1.6.1 but
> > it
> > > looks like the backport <https://github.com/apache/flink/pull/22505>
> to
> > > 1.6
> > > was never released as 1.6.3. The latest that was released is 1.6.2
> > > <https://dlcdn.apache.org/flink/flink-1.16.2/>. By any chance we can
> get
> > > the 1.6.3 released ?
> > >
> > > Also we use the official flink docker <https://hub.docker.com/_/flink/
> >
> > > image. Not sure if that needs to be updated as well manually. Thanks.
> > >
> > > -Vikas
> > >
> >
>

Re: [DISCUSS] FLIP-381: Deprecate configuration getters/setters that return/set complex Java objects

2023-11-01 Thread Rui Fan

Thanks Junrui for driving this proposal!

ConfigOption is easy to use for flink users, easy to manage options
for flink platform maintainers, and easy to maintain for flink developers
and flink community.

So big +1 for this proposal!

Best,
Rui

On Thu, Nov 2, 2023 at 10:10 AM Junrui Lee  wrote:

> Hi devs,
>
> I would like to start a discussion on FLIP-381: Deprecate configuration
> getters/setters that return/set complex Java objects[1].
>
> Currently, the job configuration in FLINK is spread out across different
> components, which leads to inconsistencies and confusion. To address this
> issue, it is necessary to migrate non-ConfigOption complex Java objects to
> use ConfigOption and adopt a single Configuration object to host all the
> configuration.
> However, there is a significant blocker in implementing this solution.
> These complex Java objects in StreamExecutionEnvironment, CheckpointConfig,
> and ExecutionConfig have already been exposed through the public API,
> making it challenging to modify the existing implementation.
>
> Therefore, I propose to deprecate these Java objects and their
> corresponding getter/setter interfaces, ultimately removing them in
> FLINK-2.0.
>
> Your feedback and thoughts on this proposal are highly appreciated.
>
> Best regards,
> Junrui Lee
>
> [1]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=278464992
>

Re: Request to release flink 1.6.3

2023-11-03 Thread Rui Fan

Thanks Yun and Yu,

That would be great! Let's move forward!

Best,
Rui

On Sat, Nov 4, 2023 at 10:10 AM Yun Tang  wrote:

> Hi Rui,
>
> I could help to join the release management of 1.17.2, and I will also ask
> Yu Chen for help with some working items.
>
> Let's move forward and make it.
>
> Best
> Yun Tang
> ________
> From: Rui Fan <1996fan...@gmail.com>
> Sent: Tuesday, October 31, 2023 23:06
> To: dev@flink.apache.org 
> Subject: Re: Request to release flink 1.6.3
>
> Thanks Vikas for the ask!
>
> Hi devs,
>
> Is anyone willing to pick up the release of 1.16.3 and 1.17.2 with me? If
> so, I can volunteer to release one of the versions. If no one picks it up
> for more than three days, I volunteer to release two versions. After it’s
> determined, the official discussion can be started.
>
> Looking forward to other committer or  PMC join this release!
>
>
> Hi Matthias,
>
> Thank you for sorting out these 2 lists. 1.16.3 may be the final version of
> 1.16 series, it makes sense to sort out all left over issues.
>
> Also, some of release processes need PMC  permission, would you mind
> helping it? Thanks~
>
> Best,
> Rui
>
> On Tue, 31 Oct 2023 at 21:42, vikas patil  wrote:
>
> > Hello Rui,
> >
> > Do we need more votes for this or are we good to go with the release of
> > 1.6.3 ? Please let me know. Thanks.
> >
> > -Vikas
> >
> > On Tue, Oct 24, 2023 at 9:27 AM Rui Fan <1996fan...@gmail.com> wrote:
> >
> > > Hi Vikas,
> > >
> > > Thanks for your feedback!
> > >
> > > Do you mean flink 1.16.3 instead of 1.6.3?
> > >
> > > The 1.16.2 and 1.17.1 were released on 2023-05-25,
> > > it’s been 5 months. And the flink community has fixed
> > > many bugs in the past 5 months. Usually, there is a
> > > fix(minor) version every three or four months, so I propose
> > > to release 1.16.3 and 1.17.2 now.
> > >
> > > If the community agrees to create this new patch release, I could
> > > volunteer as the release manager for one of the 1.16.3 or 1.17.2.
> > >
> > > Looking forward to feedback from the community, thank you
> > >
> > > [1]
> > >
> > >
> >
> https://flink.apache.org/2023/05/25/apache-flink-1.16.2-release-announcement/
> > > [2]
> > >
> > >
> >
> https://flink.apache.org/2023/05/25/apache-flink-1.17.1-release-announcement/
> > >
> > > Best,
> > > Rui
> > >
> > > On Tue, Oct 24, 2023 at 9:50 PM vikas patil 
> > > wrote:
> > >
> > > > Hello All,
> > > >
> > > > Facing this FLINK-28185 <
> > > https://issues.apache.org/jira/browse/FLINK-28185
> > > > >
> > > > issue for one of the flink jobs. We are running flink version 1.6.1
> but
> > > it
> > > > looks like the backport <https://github.com/apache/flink/pull/22505>
> > to
> > > > 1.6
> > > > was never released as 1.6.3. The latest that was released is 1.6.2
> > > > <https://dlcdn.apache.org/flink/flink-1.16.2/>. By any chance we can
> > get
> > > > the 1.6.3 released ?
> > > >
> > > > Also we use the official flink docker <
> https://hub.docker.com/_/flink/
> > >
> > > > image. Not sure if that needs to be updated as well manually. Thanks.
> > > >
> > > > -Vikas
> > > >
> > >
> >
>

Re: Request to release flink 1.6.3

2023-11-04 Thread Rui Fan

Hi Jing,

Thanks for your feedback.

Yun, Yu and me have a offline discussion about release 1.16.3 and 1.17.2,
we will start 2 discuss tomorrow. And I will consider 1.16.3 as the final
release of 1.16.

Best,
Rui

On Sun, 5 Nov 2023 at 00:49, Jing Ge  wrote:

> Hi folks,
>
> Just like Matthias mentioned in his reply and Rui followed. Since 1.18.0
> has been released, I'd suggest that we vote to make 1.16.3 the final bugix
> release of 1.16, ask and wait for any objections. WDYT? Background info
> could be found at [1].
>
> Best regards,
> Jing
>
> [1] https://lists.apache.org/thread/szq23kr3rlkm80rw7k9n95js5vqpsnbv
>
>
> On Fri, Nov 3, 2023 at 7:38 PM Rui Fan <1996fan...@gmail.com> wrote:
>
> > Thanks Yun and Yu,
> >
> > That would be great! Let's move forward!
> >
> > Best,
> > Rui
> >
> > On Sat, Nov 4, 2023 at 10:10 AM Yun Tang  wrote:
> >
> > > Hi Rui,
> > >
> > > I could help to join the release management of 1.17.2, and I will also
> > ask
> > > Yu Chen for help with some working items.
> > >
> > > Let's move forward and make it.
> > >
> > > Best
> > > Yun Tang
> > > 
> > > From: Rui Fan <1996fan...@gmail.com>
> > > Sent: Tuesday, October 31, 2023 23:06
> > > To: dev@flink.apache.org 
> > > Subject: Re: Request to release flink 1.6.3
> > >
> > > Thanks Vikas for the ask!
> > >
> > > Hi devs,
> > >
> > > Is anyone willing to pick up the release of 1.16.3 and 1.17.2 with me?
> If
> > > so, I can volunteer to release one of the versions. If no one picks it
> up
> > > for more than three days, I volunteer to release two versions. After
> it’s
> > > determined, the official discussion can be started.
> > >
> > > Looking forward to other committer or  PMC join this release!
> > >
> > >
> > > Hi Matthias,
> > >
> > > Thank you for sorting out these 2 lists. 1.16.3 may be the final
> version
> > of
> > > 1.16 series, it makes sense to sort out all left over issues.
> > >
> > > Also, some of release processes need PMC  permission, would you mind
> > > helping it? Thanks~
> > >
> > > Best,
> > > Rui
> > >
> > > On Tue, 31 Oct 2023 at 21:42, vikas patil 
> > wrote:
> > >
> > > > Hello Rui,
> > > >
> > > > Do we need more votes for this or are we good to go with the release
> of
> > > > 1.6.3 ? Please let me know. Thanks.
> > > >
> > > > -Vikas
> > > >
> > > > On Tue, Oct 24, 2023 at 9:27 AM Rui Fan <1996fan...@gmail.com>
> wrote:
> > > >
> > > > > Hi Vikas,
> > > > >
> > > > > Thanks for your feedback!
> > > > >
> > > > > Do you mean flink 1.16.3 instead of 1.6.3?
> > > > >
> > > > > The 1.16.2 and 1.17.1 were released on 2023-05-25,
> > > > > it’s been 5 months. And the flink community has fixed
> > > > > many bugs in the past 5 months. Usually, there is a
> > > > > fix(minor) version every three or four months, so I propose
> > > > > to release 1.16.3 and 1.17.2 now.
> > > > >
> > > > > If the community agrees to create this new patch release, I could
> > > > > volunteer as the release manager for one of the 1.16.3 or 1.17.2.
> > > > >
> > > > > Looking forward to feedback from the community, thank you
> > > > >
> > > > > [1]
> > > > >
> > > > >
> > > >
> > >
> >
> https://flink.apache.org/2023/05/25/apache-flink-1.16.2-release-announcement/
> > > > > [2]
> > > > >
> > > > >
> > > >
> > >
> >
> https://flink.apache.org/2023/05/25/apache-flink-1.17.1-release-announcement/
> > > > >
> > > > > Best,
> > > > > Rui
> > > > >
> > > > > On Tue, Oct 24, 2023 at 9:50 PM vikas patil <
> vikas.a.pa...@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > > > Hello All,
> > > > > >
> > > > > > Facing this FLINK-28185 <
> > > > > https://issues.apache.org/jira/browse/FLINK-28185
> > > > > > >
> > > > > > issue for one of the flink jobs. We are running flink version
> 1.6.1
> > > but
> > > > > it
> > > > > > looks like the backport <
> > https://github.com/apache/flink/pull/22505>
> > > > to
> > > > > > 1.6
> > > > > > was never released as 1.6.3. The latest that was released is
> 1.6.2
> > > > > > <https://dlcdn.apache.org/flink/flink-1.16.2/>. By any chance we
> > can
> > > > get
> > > > > > the 1.6.3 released ?
> > > > > >
> > > > > > Also we use the official flink docker <
> > > https://hub.docker.com/_/flink/
> > > > >
> > > > > > image. Not sure if that needs to be updated as well manually.
> > Thanks.
> > > > > >
> > > > > > -Vikas
> > > > > >
> > > > >
> > > >
> > >
> >
>

[DISCUSS] Release Flink 1.16.3

2023-11-05 Thread Rui Fan

Hi all,

I would like to discuss creating a new 1.16 patch release (1.16.3). The
last 1.16 release is over five months old, and since then, 50 tickets have
been closed [1], of which 10 are blocker/critical [2]. Some
of them are quite important, such as FLINK-32296 [3], FLINK-32548 [4]
and FLINK-33010[5].

In addition to this, FLINK-33149 [6] is important to bump snappy-java to
1.1.10.4.
Although FLINK-33149 is unresolved, it was done in 1.16.3.

I am not aware of any unresolved blockers and there are no in-progress
tickets [7]. Please let me know if there are any issues you'd like to be
included in this release but still not merged.

Since 1.18.0 has been released, I'd suggest that we vote to make 1.16.3
the final bugix release of 1.16, looking forward to any feedback from you.
Background info could be found at [8], and thanks Jing for the information.

If the community agrees to create this new patch release, I could
volunteer as the release manager.

Since there will be another flink-1.17.2 release request during the same
time,
I will work with Yun and Yu since many issues will be fixed in both
releases.

[1]
https://issues.apache.org/jira/browse/FLINK-32231?jql=project%20%3D%20FLINK%20AND%20fixVersion%20%3D%201.16.3%20%20and%20resolution%20%20!%3D%20%20Unresolved%20order%20by%20priority%20DESC
[2]
https://issues.apache.org/jira/browse/FLINK-32231?jql=project%20%3D%20FLINK%20AND%20fixVersion%20%3D%201.16.3%20and%20resolution%20%20!%3D%20Unresolved%20%20and%20priority%20in%20(Blocker%2C%20Critical)%20ORDER%20by%20priority%20%20DESC
[3] https://issues.apache.org/jira/browse/FLINK-32296
[4] https://issues.apache.org/jira/browse/FLINK-32548
[5] https://issues.apache.org/jira/browse/FLINK-33010
[6] https://issues.apache.org/jira/browse/FLINK-33149
[7] https://issues.apache.org/jira/projects/FLINK/versions/12353259
[8] https://lists.apache.org/thread/szq23kr3rlkm80rw7k9n95js5vqpsnbv

Best,
Rui

Re: [DISCUSS] FLIP-384: Introduce TraceReporter and use it to create checkpointing and recovery traces

2023-11-07 Thread Rui Fan

Hi Piotr,

Thanks for driving this proposal! The trace reporter is useful to
check a lot of duration monitors inside of Flink.

I have some questions about this proposal:

1. I see the trace just supports Span? Does it support trace events?
I'm not sure whether tracing events is reasonable for TraceReporter.
If it supports, flink can report checkpoint and checkpoint path proactively.
Currently, checkpoint lists or the latest checkpoint can only be fetched
by external components or platforms. And report is more timely and
efficient than fetch.

2. This FLIP just monitors the checkpoint and task recovery, right?
Could we add more operations in this FLIP? In our production, we
added a lot of trace reporters for job starts and scheduler operation.
They are useful if some jobs start slowly, because they will affect
the job availability. For example:
- From JobManager process is started to JobGraph is created
- From JobGraph is created to JobMaster is created
- From JobMaster is created to job is running
- From start request tm from yarn or kubernetes to all tms are ready
- etc

Of course, this FLIP doesn't include them is fine for me. The first version
only initializes the interface and common operations, and we can add
more operations in the future.

Best,
Rui

On Tue, Nov 7, 2023 at 4:31 PM Piotr Nowojski  wrote:

> Hi all!
>
> I would like to start a discussion on FLIP-384: Introduce TraceReporter and
> use it to create checkpointing and recovery traces [1].
>
> This proposal intends to improve observability of Flink's Checkpointing and
> Recovery/Initialization operations, by adding support for reporting traces
> from Flink. In the future, reporting traces can be of course used for other
> use cases and also by users.
>
> There are also two other follow up FLIPS, FLIP-385 [2] and FLIP-386 [3],
> which expand the basic functionality introduced in FLIP-384 [1].
>
> Please let me know what you think!
>
> Best,
> Piotr Nowojski
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-384%3A+Introduce+TraceReporter+and+use+it+to+create+checkpointing+and+recovery+traces
> [2]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-385%3A+Add+OpenTelemetryTraceReporter+and+OpenTelemetryMetricReporter
> [3]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-386%3A+Support+adding+custom+metrics+in+Recovery+Spans
>

[DISCUSS] FLIP-390: Support System out and err to be redirected to LOG or discarded

2023-11-08 Thread Rui Fan

Hi all!

I would like to start a discussion of FLIP-390: Support System out and err
to be redirected to LOG or discarded[1].

In various production environments, either cloud native or physical
machines, the disk space that Flink TaskManager can use is limited.

In general, the flink users shouldn't use the `System.out.println` in
production,
however this may happen when the number of Flink jobs and job developers
is very large. Flink job may use System.out to output a large amount of
data
to the taskmanager.out file. This file will not roll, it will always
increment.
Eventually the upper limit of what the TM can be used for is reached.

We can support System out and err to be redirected to LOG or discarded,
the LOG can roll and won't increment forever.

This feature is useful for SREs who maintain Flink environments, they can
redirect System.out to LOG by default. Although the cause of this problem
is
that the user's code is not standardized, for SRE, pushing users to modify
the code one by one is usually a very time-consuming operation. It's also
useful for job stability where System.out is accidentally used.

Looking forward to your feedback, thanks~

[1] https://cwiki.apache.org/confluence/x/4guZE

Best,
Rui

Re: [DISCUSS] Release Flink 1.16.3

2023-11-08 Thread Rui Fan

Hi All,

Thank you for your feedback!

As there are no other concerns or objections, and currently
I am not aware of any unresolved blockers.

I will kick off the release process and start preparing for the
RC1 version from today.

Best,
Rui

On Wed, Nov 8, 2023 at 4:23 PM ConradJam  wrote:

> +1
>
> Sergey Nuyanzin  于2023年11月8日周三 13:08写道：
>
> > +1 for the final release
> > and thanks for the efforts
> >
> > On Wed, Nov 8, 2023 at 4:09 AM Leonard Xu  wrote:
> >
> > > Thanks Rui for driving this.
> > >
> > > +1 to release 1.16.3 and make it as the  final bugix release of 1.16
> > > series.
> > >
> > > Best,
> > > Leonard
> > >
> >
> >
> > --
> > Best regards,
> > Sergey
> >
>
>
> --
> Best
>
> ConradJam
>

Re: [DISCUSS] FLIP-384: Introduce TraceReporter and use it to create checkpointing and recovery traces

2023-11-08 Thread Rui Fan

could
> > > re-generate those ids
> > > and subtasks' checkpoint span would have an id of
> > > `jobId#attemptId#checkpointId#subTaskId`.
> > > Note that this is just an example, as most likely distributed spans for
> > > checkpointing do not make
> > > sense, as we can generate them much easier on the JM anyway.
> > >
> > > > In addition to checkpoint and recovery, I believe the trace would
> also
> > be
> > > > valuable for performance tuning. If Flink can trace and visualize the
> > > time
> > > > cost of each operator and stage for a sampled record, users would be
> > able
> > > > to easily determine the end-to-end latency and identify performance
> > > issues
> > > > for optimization. Looking forward to seeing these in the future.
> > >
> > > I'm not sure if I understand the proposal - I don't know how traces
> could
> > > be used for this purpose?
> > > Traces are perfect for one of events (like checkpointing, recovery,
> etc),
> > > not for continuous monitoring
> > > (like processing records). That's what metrics are. Creating trace and
> > > span(s) per each record would
> > > be prohibitively expensive.
> > >
> > > Unless you mean in batch/bounded jobs? Then yes, we could create a
> > bounded
> > > job trace, with spans
> > > for every stage/task/subtask.
> > >
> > > Best,
> > > Piotrek
> > >
> > >
> > > śr., 8 lis 2023 o 05:30 Zakelly Lan 
> napisał(a):
> > >
> > > > Hi Piotr,
> > > >
> > > > Happy to see the trace! Thanks for this proposal.
> > > >
> > > > One minor question: It is mentioned in the interface of Span:
> > > >
> > > > Currently we don't support traces with multiple spans. Each span is
> > > > > self-contained and represents things like a checkpoint or recovery.
> > > >
> > > >
> > > > Does it mean the inclusion and subdivision relationships of spans
> > defined
> > > > by "parent_id" are not supported? I think it is a very necessary
> > feature
> > > > for the trace.
> > > >
> > > > In addition to checkpoint and recovery, I believe the trace would
> also
> > be
> > > > valuable for performance tuning. If Flink can trace and visualize the
> > > time
> > > > cost of each operator and stage for a sampled record, users would be
> > able
> > > > to easily determine the end-to-end latency and identify performance
> > > issues
> > > > for optimization. Looking forward to seeing these in the future.
> > > >
> > > > Best,
> > > > Zakelly
> > > >
> > > >
> > > > On Tue, Nov 7, 2023 at 6:27 PM Piotr Nowojski 
> > > > wrote:
> > > >
> > > > > Hi Rui,
> > > > >
> > > > > Thanks for the comments!
> > > > >
> > > > > > 1. I see the trace just supports Span? Does it support trace
> > events?
> > > > > > I'm not sure whether tracing events is reasonable for
> > TraceReporter.
> > > > > > If it supports, flink can report checkpoint and checkpoint path
> > > > > proactively.
> > > > > > Currently, checkpoint lists or the latest checkpoint can only be
> > > > fetched
> > > > > > by external components or platforms. And report is more timely
> and
> > > > > > efficient than fetch.
> > > > >
> > > > > No, currently the `TraceReporter` that I'm introducing supports
> only
> > > > single
> > > > > span traces.
> > > > > So currently neither events on their own, nor events inside spans
> are
> > > not
> > > > > supported.
> > > > > This is done just for the sake of simplicity, and test out the
> basic
> > > > > functionality. But I think,
> > > > > those currently missing features should be added at some point in
> > > > > the future.
> > > > >
> > > > > About structured logging (basically events?) I vaguely remember
> some
> > > > > discussions about
> > > > > that. It might be a much larger topic, so I would prefer to leave
> it
> > > out
> > > > of
> > > > > the scope of this
> > &

Re: [DISCUSS] FLIP-390: Support System out and err to be redirected to LOG or discarded

2023-11-08 Thread Rui Fan

wo options to manage
> this,
> > > seems reasonable.
> > >
> > > +1
> > >
> > > Thanks,
> > > Archit Goyal
> > >
> > >
> > > From: Piotr Nowojski 
> > > Date: Wednesday, November 8, 2023 at 7:38 AM
> > > To: dev@flink.apache.org 
> > > Subject: Re: [DISCUSS] FLIP-390: Support System out and err to be
> > > redirected to LOG or discarded
> > > Hi Rui,
> > >
> > > Thanks for the proposal.
> > >
> > > +1 I don't have any major comments :)
> > >
> > > One nit. In `SystemOutRedirectToLog` in this code:
> > >
> > >System.arraycopy(buf, count - LINE_SEPARATOR_LENGTH, bytes,
> 0,
> > > LINE_SEPARATOR_LENGTH);
> > > return Arrays.equals(LINE_SEPARATOR_BYTES, bytes)
> > >
> > > Is there a reason why you are suggesting to copy out bytes from `buf`
> to
> > > `bytes`,
> > > instead of using `Arrays.equals(int[] a, int aFromIndex, int aToIndex,
> > > int[] b, int bFromIndex, int bToIndex)`?
> > >
> > > Best,
> > > Piotrek
> > >
> > > śr., 8 lis 2023 o 11:53 Rui Fan <1996fan...@gmail.com> napisał(a):
> > >
> > > > Hi all!
> > > >
> > > > I would like to start a discussion of FLIP-390: Support System out
> and
> > > err
> > > > to be redirected to LOG or discarded[1].
> > > >
> > > > In various production environments, either cloud native or physical
> > > > machines, the disk space that Flink TaskManager can use is limited.
> > > >
> > > > In general, the flink users shouldn't use the `System.out.println` in
> > > > production,
> > > > however this may happen when the number of Flink jobs and job
> > developers
> > > > is very large. Flink job may use System.out to output a large amount
> of
> > > > data
> > > > to the taskmanager.out file. This file will not roll, it will always
> > > > increment.
> > > > Eventually the upper limit of what the TM can be used for is reached.
> > > >
> > > > We can support System out and err to be redirected to LOG or
> discarded,
> > > > the LOG can roll and won't increment forever.
> > > >
> > > > This feature is useful for SREs who maintain Flink environments, they
> > can
> > > > redirect System.out to LOG by default. Although the cause of this
> > problem
> > > > is
> > > > that the user's code is not standardized, for SRE, pushing users to
> > > modify
> > > > the code one by one is usually a very time-consuming operation. It's
> > also
> > > > useful for job stability where System.out is accidentally used.
> > > >
> > > > Looking forward to your feedback, thanks~
> > > >
> > > > [1]
> > >
> >
> https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcwiki.apache.org%2Fconfluence%2Fx%2F4guZE&data=05%7C01%7Cargoyal%40linkedin.com%7C937821de7bd846e6b97408dbe070beae%7C72f988bf86f141af91ab2d7cd011db47%7C0%7C0%7C638350547072823674%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=zEv6B0Xiq2SNuU6Fm%2BAXnH%2BRILbm6Q0uDRbN7h6iaPM%3D&reserved=0
> > > <https://cwiki.apache.org/confluence/x/4guZE>
> > > >
> > > > Best,
> > > > Rui
> > > >
> > >
> >
>
>
> --
> Best,
> Hangxiang.
>

Re: [DISCUSS] FLIP-390: Support System out and err to be redirected to LOG or discarded

2023-11-09 Thread Rui Fan

Hi Piotr,

Thanks for your feedback!

> Or implement your own loop? It shouldn't be more than a couple of lines.

Implementing it directly is fine, I have updated the FLIP.
And this logic can be found in the  `isLineEnded` method.

Best,
Rui

On Thu, Nov 9, 2023 at 11:00 PM Piotr Nowojski 
wrote:

> Hi Rui,
>
> > I see java8 doesn't have `Arrays.equals(int[] a, int aFromIndex, int
> > aToIndex, int[] b, int bFromIndex, int bToIndex)`,
> > and java11 has it. Do you have any other suggestions for java8?
>
> Maybe use `ByteBuffer.wrap`?
>
> ByteBuffer.wrap(array, ..., ...).equals(ByteBuffer.wrap(array2, ..., ...))
>
> This shouldn't have overheads as far as I remember.
>
> Or implement your own loop? It shouldn't be more than a couple of lines.
>
> Best,
> Piotrek
>
> czw., 9 lis 2023 o 06:43 Rui Fan <1996fan...@gmail.com> napisał(a):
>
> > Hi Piotr, Archit, Feng and Hangxiang:
> >
> > Thanks a lot for your feedback!
> >
> > Following is my comment, please correct me if I misunderstood anything!
> >
> > To Piotr:
> >
> > > Is there a reason why you are suggesting to copy out bytes from `buf`
> to
> > `bytes`,
> > > instead of using `Arrays.equals(int[] a, int aFromIndex, int aToIndex,
> > int[] b, int bFromIndex, int bToIndex)`?
> >
> > I see java8 doesn't have `Arrays.equals(int[] a, int aFromIndex, int
> > aToIndex, int[] b, int bFromIndex, int bToIndex)`,
> > and java11 has it. Do you have any other suggestions for java8?
> >
> > Also, this code doesn't run in production. As the comment of
> > System.lineSeparator():
> >
> > > On UNIX systems, it returns {@code "\n"}; on Microsoft
> > > Windows systems it returns {@code "\r\n"}.
> >
> > So Mac and Linux just return one character, we will compare
> > one byte directly.
> >
> >
> >
> > To Feng:
> >
> > > Will they be written to the taskManager.log file by default
> > > or the taskManager.out file?
> >
> > I prefer LOG as the default value for taskmanager.system-out.mode.
> > It's useful for job stability and doesn't introduce significant impact to
> > users. Also, our production has already used this feature for
> > more than 1 years, it works well.
> >
> > However, I write the DEFAULT as the default value for
> > taskmanager.system-out.mode, because when the community introduces
> > new options, the default value often selects the original behavior.
> >
> > Looking forward to hear more thoughts from community about this
> > default value.
> >
> > > If we can make taskmanager.out splittable and rolling, would it be
> > > easier for users to use this feature?
> >
> > Making taskmanager.out splittable and rolling is a good choice!
> > I have some concerns about it:
> >
> > 1. Users may also want to use LOG.info in their code and just
> >   accidentally use System.out.println. It is possible that they will
> >   also find the logs directly in taskmanager.log.
> > 2. I'm not sure whether the rolling strategy is easy to implement.
> >   If we do it, it's necessary to define a series of flink options similar
> >   to log options, such as: fileMax(how many files should be retained),
> >   fileSize(The max size each file), fileNamePatten (The suffix of file
> > name),
> > 3. Check the file size periodically: all logs are written by log plugin,
> >   they can check the log file size after writing. However, System.out
> >   are written directly. And flink must start a thread to check the latest
> >   taskmanager.out size periodically. If it's too quick, most of job
> aren't
> >   necessary. If it's too slow, the file size cannot be controlled
> properly.
> >
> > Redirect it to LOG.info may be a reasonable and easy choice.
> > The user didn't really want to log into taskmanager.out, it just
> > happened by accident.
> >
> >
> >
> > To Hangxiang:
> >
> > > 1. I have a similar concern as Feng. Will we redirect to another log
> file
> > > not taskManager.log ?
> >
> > Please see my last comment, thanks!
> >
> > > taskManager.log contains lots of important information like init log.
> It
> > > will be rolled quickly if we redirect out and error here.
> >
> > IIUC, this issue isn't caused by System.out, and it can happen if user
> > call a lot of LOG.info. As I mentioned before: the user didn't really
> want
> > to log into taskmanager.out, it just ha

Re: [VOTE] FLIP-381: Deprecate configuration getters/setters that return/set complex Java objects

2023-11-10 Thread Rui Fan

+1(binding)

Best,
Rui

On Fri, Nov 10, 2023 at 11:58 AM Junrui Lee  wrote:

> Hi everyone,
>
> Thank you to everyone for the feedback on FLIP-381: Deprecate configuration
> getters/setters that return/set complex Java objects[1] which has been
> discussed in this thread [2].
>
> I would like to start a vote for it. The vote will be open for at least 72
> hours (excluding weekends) unless there is an objection or not enough
> votes.
>
> [1]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=278464992
> [2]https://lists.apache.org/thread/y5owjkfxq3xs9lmpdbl6d6jmqdgbjqxo
>

Re: [DISCUSS] FLIP-364: Improve the restart-strategy

2023-11-10 Thread Rui Fan

I'll start voting next Monday if there isn't any other comment.

Best,
Rui

On Thu, Oct 19, 2023 at 6:59 PM Rui Fan <1996fan...@gmail.com> wrote:

> Hi Konstantin and Max,
>
> Thanks for your feedback!
>
> Sorry, I forgot to mention the default value of
> `restart-strategy.exponential-delay.max-attempts-before-reset-backoff`.
>
> Retrying forever sounds good to me, I have added it to the FLIP:
>
> The default value of
> `restart-strategy.exponential-delay.max-attempts-before-reset-backoff` is
> Integer.MAX_VALUE.
>
> Best,
> Rui
>
> On Thu, Oct 19, 2023 at 6:29 PM Maximilian Michels  wrote:
>
>> Hey Rui,
>>
>> +1 for making exponential backoff the default. I agree with Konstantin
>> that retrying forever is a good default for exponential backoff
>> because oftentimes the issue will resolve eventually. The purpose of
>> exponential backoff is precisely to continue to retry without causing
>> too much load. However, I'm not against adding an optional max number
>> of retries.
>>
>> -Max
>>
>> On Thu, Oct 19, 2023 at 11:35 AM Konstantin Knauf 
>> wrote:
>> >
>> > Hi Rui,
>> >
>> > Thank you for this proposal and working on this. I also agree that
>> > exponential back off makes sense as a new default in general. I think
>> > restarting indefinitely (no max attempts) makes sense by default,
>> though,
>> > but of course allowing users to change is valuable.
>> >
>> > So, overall +1.
>> >
>> > Cheers,
>> >
>> > Konstantin
>> >
>> > Am Di., 17. Okt. 2023 um 07:11 Uhr schrieb Rui Fan <
>> 1996fan...@gmail.com>:
>> >
>> > > Hi all,
>> > >
>> > > I would like to start a discussion on FLIP-364: Improve the
>> > > restart-strategy[1]
>> > >
>> > > As we know, the restart-strategy is critical for flink jobs, it mainly
>> > > has two functions:
>> > > 1. When an exception occurs in the flink job, quickly restart the job
>> > > so that the job can return to the running state.
>> > > 2. When a job cannot be recovered after frequent restarts within
>> > > a certain period of time, Flink will not retry but will fail the job.
>> > >
>> > > The current restart-strategy support for function 2 has some issues:
>> > > 1. The exponential-delay doesn't have the max attempts mechanism,
>> > > it means that flink will restart indefinitely even if it fails
>> frequently.
>> > > 2. For multi-region streaming jobs and all batch jobs, the failure of
>> > > each region will increase the total number of job failures by +1,
>> > > even if these failures occur at the same time. If the number of
>> > > failures increases too quickly, it will be difficult to set a
>> reasonable
>> > > number of retries.
>> > > If the maximum number of failures is set too low, the job can easily
>> > > reach the retry limit, causing the job to fail. If set too high, some
>> jobs
>> > > will never fail.
>> > >
>> > > In addition, when the above two problems are solved, we can also
>> > > discuss whether exponential-delay can replace fixed-delay as the
>> > > default restart-strategy. In theory, exponential-delay is smarter and
>> > > friendlier than fixed-delay.
>> > >
>> > > I also thank Zhu Zhu for his suggestions on the option name in
>> > > FLINK-32895[2] in advance.
>> > >
>> > > Looking forward to and welcome everyone's feedback and suggestions,
>> thank
>> > > you.
>> > >
>> > > [1] https://cwiki.apache.org/confluence/x/uJqzDw
>> > > [2] https://issues.apache.org/jira/browse/FLINK-32895
>> > >
>> > > Best,
>> > > Rui
>> > >
>> >
>> >
>> > --
>> > https://twitter.com/snntrable
>> > https://github.com/knaufk
>>
>

[VOTE] FLIP-364: Improve the restart-strategy

2023-11-12 Thread Rui Fan

Hi everyone,

Thank you to everyone for the feedback on FLIP-364: Improve the
restart-strategy[1]
which has been discussed in this thread [2].

I would like to start a vote for it. The vote will be open for at least 72
hours unless there is an objection or not enough votes.

[1] https://cwiki.apache.org/confluence/x/uJqzDw
[2] https://lists.apache.org/thread/5cgrft73kgkzkgjozf9zfk0w2oj7rjym

Best,
Rui

[VOTE] Release 1.16.3, release candidate #1

2023-11-13 Thread Rui Fan

Hi everyone,

Please review and vote on the release candidate #1 for the version 1.16.3,

as follows:

[ ] +1, Approve the release

[ ] -1, Do not approve the release (please provide specific comments)


The complete staging area is available for your review, which includes:
* JIRA release notes [1],

* the official Apache source release and binary convenience releases to be
deployed to dist.apache.org [2], which are signed with the key with
fingerprint B2D64016B940A7E0B9B72E0D7D0528B28037D8BC [3],

* all artifacts to be deployed to the Maven Central Repository [4],

* source code tag "release-1.16.3-rc1" [5],

* website pull request listing the new release and adding announcement blog
post [6].


The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.


[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353259

[2] https://dist.apache.org/repos/dist/dev/flink/flink-1.16.3-rc1/

[3] https://dist.apache.org/repos/dist/release/flink/KEYS

[4] https://repository.apache.org/content/repositories/orgapacheflink-1670/

[5] https://github.com/apache/flink/releases/tag/release-1.16.3-rc1

[6] https://github.com/apache/flink-web/pull/698

Thanks,
Release Manager

Re: [DISCUSS] FLIP-364: Improve the restart-strategy

2023-11-14 Thread Rui Fan

> > Flink currently will try to recognize concurrent failures and group them
> > together, which can be seen in the web UI. So how about to align the
> > failure counting with the concurrent failures computing? This can make it
> > more consistent and easier for understanding. It will require changes to
> > the concurrent failures computing though, i.e. taking the backoff time
> > into consideration. So maybe we can open a seperate FLIP for this change.
> >
> > Thanks,
> > Zhu
> >
> > Rui Fan <1996fan...@gmail.com> 于2023年11月10日周五 18:22写道：
> >
> > > I'll start voting next Monday if there isn't any other comment.
> > >
> > > Best,
> > > Rui
> > >
> > > On Thu, Oct 19, 2023 at 6:59 PM Rui Fan <1996fan...@gmail.com> wrote:
> > >
> > > > Hi Konstantin and Max,
> > > >
> > > > Thanks for your feedback!
> > > >
> > > > Sorry, I forgot to mention the default value of
> > > >
> `restart-strategy.exponential-delay.max-attempts-before-reset-backoff`.
> > > >
> > > > Retrying forever sounds good to me, I have added it to the FLIP:
> > > >
> > > > The default value of
> > > >
> `restart-strategy.exponential-delay.max-attempts-before-reset-backoff`
> > is
> > > > Integer.MAX_VALUE.
> > > >
> > > > Best,
> > > > Rui
> > > >
> > > > On Thu, Oct 19, 2023 at 6:29 PM Maximilian Michels 
> > > wrote:
> > > >
> > > >> Hey Rui,
> > > >>
> > > >> +1 for making exponential backoff the default. I agree with
> Konstantin
> > > >> that retrying forever is a good default for exponential backoff
> > > >> because oftentimes the issue will resolve eventually. The purpose of
> > > >> exponential backoff is precisely to continue to retry without
> causing
> > > >> too much load. However, I'm not against adding an optional max
> number
> > > >> of retries.
> > > >>
> > > >> -Max
> > > >>
> > > >> On Thu, Oct 19, 2023 at 11:35 AM Konstantin Knauf <
> kna...@apache.org>
> > > >> wrote:
> > > >> >
> > > >> > Hi Rui,
> > > >> >
> > > >> > Thank you for this proposal and working on this. I also agree that
> > > >> > exponential back off makes sense as a new default in general. I
> > think
> > > >> > restarting indefinitely (no max attempts) makes sense by default,
> > > >> though,
> > > >> > but of course allowing users to change is valuable.
> > > >> >
> > > >> > So, overall +1.
> > > >> >
> > > >> > Cheers,
> > > >> >
> > > >> > Konstantin
> > > >> >
> > > >> > Am Di., 17. Okt. 2023 um 07:11 Uhr schrieb Rui Fan <
> > > >> 1996fan...@gmail.com>:
> > > >> >
> > > >> > > Hi all,
> > > >> > >
> > > >> > > I would like to start a discussion on FLIP-364: Improve the
> > > >> > > restart-strategy[1]
> > > >> > >
> > > >> > > As we know, the restart-strategy is critical for flink jobs, it
> > > mainly
> > > >> > > has two functions:
> > > >> > > 1. When an exception occurs in the flink job, quickly restart
> the
> > > job
> > > >> > > so that the job can return to the running state.
> > > >> > > 2. When a job cannot be recovered after frequent restarts within
> > > >> > > a certain period of time, Flink will not retry but will fail the
> > > job.
> > > >> > >
> > > >> > > The current restart-strategy support for function 2 has some
> > issues:
> > > >> > > 1. The exponential-delay doesn't have the max attempts
> mechanism,
> > > >> > > it means that flink will restart indefinitely even if it fails
> > > >> frequently.
> > > >> > > 2. For multi-region streaming jobs and all batch jobs, the
> failure
> > > of
> > > >> > > each region will increase the total number of job failures by
> +1,
> > > >> > > even if these failures occur at the same time. If the number of
> > > >> > > failures increases too quickly, it will be difficult to set a
> > > >> reasonable
> > > >> > > number of retries.
> > > >> > > If the maximum number of failures is set too low, the job can
> > easily
> > > >> > > reach the retry limit, causing the job to fail. If set too high,
> > > some
> > > >> jobs
> > > >> > > will never fail.
> > > >> > >
> > > >> > > In addition, when the above two problems are solved, we can also
> > > >> > > discuss whether exponential-delay can replace fixed-delay as the
> > > >> > > default restart-strategy. In theory, exponential-delay is
> smarter
> > > and
> > > >> > > friendlier than fixed-delay.
> > > >> > >
> > > >> > > I also thank Zhu Zhu for his suggestions on the option name in
> > > >> > > FLINK-32895[2] in advance.
> > > >> > >
> > > >> > > Looking forward to and welcome everyone's feedback and
> > suggestions,
> > > >> thank
> > > >> > > you.
> > > >> > >
> > > >> > > [1] https://cwiki.apache.org/confluence/x/uJqzDw
> > > >> > > [2] https://issues.apache.org/jira/browse/FLINK-32895
> > > >> > >
> > > >> > > Best,
> > > >> > > Rui
> > > >> > >
> > > >> >
> > > >> >
> > > >> > --
> > > >> > https://twitter.com/snntrable
> > > >> > https://github.com/knaufk
> > > >>
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-364: Improve the restart-strategy

2023-11-14 Thread Rui Fan

Hi Mingliang:

Thanks you for the feedback here!

Glad to hear Netflix have made exponential-delay as the
default restart strategy. Our production(Shopee) also makes
exponential-delay as the default since May 2021, and the
current number of flink jobs far exceeds tens of thousands.
These jobs work well.

Note: Our internal exponential-delay solves the problem
of a large number of tasks failing in a short period of time
causing restartAttempts to increase rapidly.

Based on your production, do you have any suggestions
about default values of exponential-delay configuration?

Zhu and Jing may also be interested in this question.

Following are FLIP-364 proposed default values:

restart-strategy.exponential-delay.max-attempts-before-reset-backoff :
Integer.MAX_VALUE
restart-strategy.exponential-delay.initial-backoff : 1s
restart-strategy.exponential-delay.backoff-multiplier : 1.2
restart-strategy.exponential-delay.jitter-factor : 0.1
restart-strategy.exponential-delay.max-backoff : 1 min
restart-strategy.exponential-delay.reset-backoff-threshold : 1h

Looking forward to your feedback! And I will start a discussion
on user mail list to collect more feedback.

In addition, I understand that the community needs to consider
a lot of compatibility and risks when modifying the default value.
If this is very difficult to reach consensus on, I can remove
this item from FLIP.

Best,
Rui

On Wed, Nov 15, 2023 at 6:40 AM Mingliang Liu  wrote:

> Thanks Rui for driving this. I just call out that making exponential-delay
> the default is a good change. At Netflix, we have enabled this as the
> default restart strategy 2 quarters ago and it has been working well.
> Keeping it restarting indefinitely by default makes sense to me.
>
> On Mon, Oct 16, 2023 at 10:11 PM Rui Fan <1996fan...@gmail.com> wrote:
>
> > Hi all,
> >
> > I would like to start a discussion on FLIP-364: Improve the
> > restart-strategy[1]
> >
> > As we know, the restart-strategy is critical for flink jobs, it mainly
> > has two functions:
> > 1. When an exception occurs in the flink job, quickly restart the job
> > so that the job can return to the running state.
> > 2. When a job cannot be recovered after frequent restarts within
> > a certain period of time, Flink will not retry but will fail the job.
> >
> > The current restart-strategy support for function 2 has some issues:
> > 1. The exponential-delay doesn't have the max attempts mechanism,
> > it means that flink will restart indefinitely even if it fails
> frequently.
> > 2. For multi-region streaming jobs and all batch jobs, the failure of
> > each region will increase the total number of job failures by +1,
> > even if these failures occur at the same time. If the number of
> > failures increases too quickly, it will be difficult to set a reasonable
> > number of retries.
> > If the maximum number of failures is set too low, the job can easily
> > reach the retry limit, causing the job to fail. If set too high, some
> jobs
> > will never fail.
> >
> > In addition, when the above two problems are solved, we can also
> > discuss whether exponential-delay can replace fixed-delay as the
> > default restart-strategy. In theory, exponential-delay is smarter and
> > friendlier than fixed-delay.
> >
> > I also thank Zhu Zhu for his suggestions on the option name in
> > FLINK-32895[2] in advance.
> >
> > Looking forward to and welcome everyone's feedback and suggestions, thank
> > you.
> >
> > [1] https://cwiki.apache.org/confluence/x/uJqzDw
> > [2] https://issues.apache.org/jira/browse/FLINK-32895
> >
> > Best,
> > Rui
> >
>

1 2 3 4 5 6 >

1 - 100 of 570 matches

Mail list logo