Re: [ANNOUNCE] New Apache Flink PMC Member - Qingsheng Ren

2023-04-23 Thread Zhanghao Chen
Congratulations, Qingsheng!

Best,
Zhanghao Chen

From: Shammon FY 
Sent: Sunday, April 23, 2023 17:22
To: dev@flink.apache.org 
Subject: Re: [ANNOUNCE] New Apache Flink PMC Member - Qingsheng Ren

Congratulations, Qingsheng!

Best,
Shammon FY

On Sun, Apr 23, 2023 at 4:40 PM Weihua Hu  wrote:

> Congratulations, Qingsheng!
>
> Best,
> Weihua
>
>
> On Sun, Apr 23, 2023 at 3:53 PM Yun Tang  wrote:
>
> > Congratulations, Qingsheng!
> >
> > Best
> > Yun Tang
> > 
> > From: weijie guo 
> > Sent: Sunday, April 23, 2023 14:50
> > To: dev@flink.apache.org 
> > Subject: Re: [ANNOUNCE] New Apache Flink PMC Member - Qingsheng Ren
> >
> > Congratulations, Qingsheng!
> >
> > Best regards,
> >
> > Weijie
> >
> >
> > Geng Biao  wrote on Sun, Apr 23, 2023 at 14:29:
> >
> > > Congrats, Qingsheng!
> > > Best,
> > > Biao Geng
> > >
> > > Get Outlook for iOS<https://aka.ms/o0ukef>
> > > 
> > > From: Wencong Liu 
> > > Sent: Sunday, April 23, 2023 11:06:39 AM
> > > To: dev@flink.apache.org 
> > > Subject: Re: [ANNOUNCE] New Apache Flink PMC Member - Qingsheng Ren
> > >
> > > Congratulations, Qingsheng!
> > >
> > > Best,
> > > Wencong Liu
> > >
> > > At 2023-04-21 19:47:52, "Jark Wu"  wrote:
> > > >Hi everyone,
> > > >
> > > >We are thrilled to announce that Leonard Xu has joined the Flink PMC!
> > > >
> > > >Leonard has been an active member of the Apache Flink community for many
> > > >years and became a committer in Nov 2021. He has been involved in various
> > > >areas of the project, from code contributions to community building. His
> > > >contributions are mainly focused on Flink SQL and connectors, especially
> > > >leading the flink-cdc-connectors project, which has received 3.8K+ GitHub
> > > >stars. He has authored 150+ PRs, reviewed 250+ PRs, and driven several
> > > >FLIPs (e.g., FLIP-132, FLIP-162). He has participated in plenty of
> > > >discussions on the dev mailing list and answered questions in 500+
> > > >threads on the user/user-zh mailing lists. Besides that, he is
> > > >community-minded, serving as the release manager of 1.17, verifying
> > > >releases, managing release syncs, etc.
> > > >
> > > >Congratulations and welcome Leonard!
> > > >
> > > >Best,
> > > >Jark (on behalf of the Flink PMC)
> > >
> > > At 2023-04-21 19:50:02, "Jark Wu"  wrote:
> > > >Hi everyone,
> > > >
> > > >We are thrilled to announce that Qingsheng Ren has joined the Flink PMC!
> > > >
> > > >Qingsheng has been contributing to Apache Flink for a long time. He is
> > > >the core contributor and maintainer of the Kafka connector and
> > > >flink-cdc-connectors, bringing stability and ease of use to users of
> > > >both projects. He drove the discussions and implementations of FLIP-221,
> > > >FLIP-288, and the connector testing framework. He is continuously
> > > >helping with the expansion of the Flink community and has given several
> > > >talks about Flink connectors at conferences such as Flink Forward Global
> > > >and Flink Forward Asia. Besides that, he is willing to help a lot with
> > > >community work, such as being the release manager for both 1.17 and
> > > >1.18, verifying releases, and answering questions on the mailing list.
> > > >
> > > >Congratulations and welcome Qingsheng!
> > > >
> > > >Best,
> > > >Jark (on behalf of the Flink PMC)
> > >
> >
>


Re: [ANNOUNCE] New Apache Flink PMC Member - Leonard Xu

2023-04-23 Thread Zhanghao Chen
Congratulations, Leonard!


Best,
Zhanghao Chen

From: Shammon FY 
Sent: Sunday, April 23, 2023 17:22
To: dev@flink.apache.org 
Subject: Re: [ANNOUNCE] New Apache Flink PMC Member - Leonard Xu

Congratulations, Leonard!

Best,
Shammon FY

On Sun, Apr 23, 2023 at 5:07 PM Xianxun Ye  wrote:

> Congratulations, Leonard!
>
> Best regards,
>
> Xianxun
>
> > On Apr 23, 2023, at 09:10, Lincoln Lee  wrote:
> >
> > Congratulations, Leonard!
>
>


Re: Permission issue : Flink Code Push

2023-05-03 Thread Zhanghao Chen
Hi Archit,

The procedure is:

  1.  Fork the flink repo
  2.  Implement your change and push to your fork
  3.  Create a pull request from your fork
  4.  Wait for reviews

You can refer to 
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request-from-a-fork
 for more details.
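The four steps above boil down to a handful of git commands. Below is a minimal local simulation of the workflow — the bare repository stands in for your GitHub fork of apache/flink, and the branch and commit names are made up for illustration:

```shell
# Simulate the fork-based workflow locally: a bare repo plays the role of
# your GitHub fork; you push your feature branch there, never to apache/flink.
set -e
tmp=$(mktemp -d)
git init --bare --quiet "$tmp/fork.git"          # stands in for your fork
git clone --quiet "$tmp/fork.git" "$tmp/work"    # step 1: clone your fork
cd "$tmp/work"
git config user.email "you@example.com"          # illustrative identity
git config user.name "Your Name"
git checkout --quiet -b my-feature               # step 2: implement on a branch
git commit --quiet --allow-empty -m "[FLINK-XXXXX] My change"
git push --quiet origin my-feature               # push to the fork
git ls-remote --heads origin                     # branch now lives on the fork
```

From there, GitHub offers to open a pull request from `my-feature` on your fork against the upstream repository (steps 3 and 4).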



Best,
Zhanghao Chen

From: Archit Goyal 
Sent: Wednesday, May 3, 2023 1:50
To: dev@flink.apache.org 
Subject: Permission issue : Flink Code Push

Hi,

I am new to the Flink community and am working on the 
FLINK-12869<https://issues.apache.org/jira/browse/FLINK-12869>.

I have a PR ready to upload but I am unable to push; I get a 403 error:

argoyal@argoyal flink % git push --set-upstream origin YarnAcl
remote: Permission to apache/flink.git denied to architgyl.
fatal: unable to access 'https://github.com/apache/flink.git/': The requested 
URL returned error: 403

Looks like it is a permission issue. Can someone please grant me access so that 
I can raise a PR?

Thanks,
Archit Goyal


[VOTE] FLIP-367: Support Setting Parallelism for Table/SQL Sources

2023-10-08 Thread Zhanghao Chen
Hi All,

Thanks for all the feedback on FLIP-367: Support Setting Parallelism for 
Table/SQL Sources [1][2].

I'd like to start a vote for FLIP-367. The vote will be open until Oct 12th, 
12:00 PM GMT, unless there is an objection or insufficient votes.

[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=263429150
[2] https://lists.apache.org/thread/gtpswl42jzv0c9o3clwqskpllnw8rh87

Best,
Zhanghao Chen


Re: [DISCUSS] FLIP-375: Built-in cross-platform powerful java profiler on taskmanagers

2023-10-09 Thread Zhanghao Chen
Hi Yun and Yu,

Thanks for driving this. This would definitely help users identify performance 
bottlenecks, especially for the cases where the bottleneck lies in the system 
stack (e.g. GC), and big +1 for the downloadable flamegraph to ease sharing. 
I'm wondering if we could add this for the job manager as well. In the OLAP 
scenario, and sometimes in the streaming scenario (when there are heavy 
operations during execution plan generation or in operator coordinators), the 
JM can become a bottleneck as well.

Best,
Zhanghao Chen

From: Yu Chen 
Sent: Monday, October 9, 2023 17:24
To: dev@flink.apache.org 
Subject: [DISCUSS] FLIP-375: Built-in cross-platform powerful java profiler on 
taskmanagers

Hi all,

Yun Tang and I are opening this thread to discuss our proposal to integrate
async-profiler's capabilities for profiling taskmanagers (e.g., generating
flame graphs) into the Flink Web UI [1].


Currently, Flink provides ThreadDump and Operator-Level Flame Graphs by
sampling task threads. The results generated this way miss the relevant
stacks of other Java threads and of system calls. The async-profiler[2] is a
low-overhead sampling profiler for Java, but the steps to use it in a
production environment are cumbersome and suffer from permission and
security risks.

Therefore, we propose adding REST APIs to provide the capability to invoke
async-profiler on multiple platforms through JNI, which can be easily
operated from the Web UI. This enhancement will improve the efficiency and
experience of Flink users in identifying performance bottlenecks.



Please refer to the FLIP document for more details about the proposed design
and implementation. We welcome any feedback and opinions on this proposal.



[1] FLIP-375: Built-in cross-platform powerful java profiler on
taskmanagers - Apache Flink - Apache Software Foundation
<https://cwiki.apache.org/confluence/display/FLINK/FLIP-375%3A+Built-in+cross-platform+powerful+java+profiler+on+taskmanagers>

[2] GitHub - async-profiler/async-profiler: Sampling CPU and HEAP profiler
for Java featuring AsyncGetCallTrace + perf_events
<https://github.com/async-profiler/async-profiler>



Best regards,

Yun Tang and Yu Chen


[DISCUSS][FLINK-33240] Document deprecated options as well

2023-10-11 Thread Zhanghao Chen
Hi Flink users and developers,

Currently, Flink won't generate docs for deprecated options. This might 
confuse users upgrading from an older version of Flink: they have to either 
carefully read the release notes or check the source code for upgrade 
guidance on deprecated options.

I propose to document deprecated options as well, with a "(deprecated)" tag 
placed at the beginning of the option description to highlight the deprecation 
status [1].
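As a rough sketch of the proposed rendering (the helper function and the sample description below are hypothetical, not actual flink-docs code):

```shell
# Hypothetical sketch of the proposed doc rendering: a "(deprecated)" tag
# is prepended to the description of any deprecated option.
render_description() {
  deprecated="$1"
  desc="$2"
  if [ "$deprecated" = "true" ]; then
    printf '(deprecated) %s\n' "$desc"
  else
    printf '%s\n' "$desc"
  fi
}
render_description true "Use the new restart-strategy options instead."
render_description false "Delay between two consecutive restart attempts."
```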

Looking forward to your feedback on it.

[1] https://issues.apache.org/jira/browse/FLINK-33240

Best,
Zhanghao Chen


[RESULT][VOTE] FLIP-367: Support Setting Parallelism for Table/SQL Sources

2023-10-12 Thread Zhanghao Chen
Hi devs,

Thanks everyone for your review and the votes!

I am happy to announce that FLIP-367: Support Setting Parallelism for Table/SQL 
Sources [1] has been accepted.

There are 10 binding votes and 5 non-binding votes [2]:
- Benchao Li (binding)
- Shammon FY (binding)
- Weihua Hu (binding)
- Yun Tang (binding)
- Yangze Guo (binding)
- Feng Jing
- xiangyu feng
- Ahmed Hamd
- Jing Ge (binding)
- Leonard Xu (binding)
- Lincoln Lee (binding)
- Sergey Nuyanzin (binding)
- Martijn Visser (binding)
- Samrat Deb
- Jane Chan

There is no disapproving vote.

[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=263429150
[2] https://lists.apache.org/thread/bzlcw28jn0xx1hh45q0ry8wnxf0xoptg

Best,
Zhanghao Chen


Re: [VOTE] FLIP-366: Support standard YAML for FLINK configuration

2023-10-12 Thread Zhanghao Chen
+1 (non-binding)

Best,
Zhanghao Chen

From: Junrui Lee 
Sent: Friday, October 13, 2023 10:12
To: dev@flink.apache.org 
Subject: [VOTE] FLIP-366: Support standard YAML for FLINK configuration

Hi all,

Thank you to everyone for the feedback on FLIP-366[1]: Support standard
YAML for FLINK configuration in the discussion thread [2].
I would like to start a vote for it. The vote will be open for at least 72
hours (excluding weekends, unless there is an objection or an insufficient
number of votes).

Thanks,
Junrui

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-366%3A+Support+standard+YAML+for+FLINK+configuration
[2]https://lists.apache.org/thread/qfhcm7h8r5xkv38rtxwkghkrcxg0q7k5


Re: [VOTE] FLIP-375: Built-in cross-platform powerful java profiler

2023-10-17 Thread Zhanghao Chen
+1 (non-binding)

Best,
Zhanghao Chen

From: Yu Chen 
Sent: Tuesday, October 17, 2023 15:51
To: dev@flink.apache.org 
Cc: myas...@live.com 
Subject: [VOTE] FLIP-375: Built-in cross-platform powerful java profiler

Hi all,

Thank you to everyone for the feedback and detailed comments on FLIP-375[1].
Based on the discussion thread [2], I think we are ready to take a vote to
contribute this to Flink.
I'd like to start a vote for it.
The vote will be open for at least 72 hours (excluding weekends, unless
there is an objection or an insufficient number of votes).

Thanks,
Yu Chen

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-375%3A+Built-in+cross-platform+powerful+java+profiler
[2] https://lists.apache.org/thread/tp5vqgspsdko66dr6vm7cgtod9k2pct7


Re: [DISCUSS][FLINK-33240] Document deprecated options as well

2023-11-01 Thread Zhanghao Chen
Hi Alexander,

I haven't done a complete analysis yet, but through a simple code search, 
roughly 35 options would be added with this change. Also note that some old 
options defined in a ConfigConstants class won't be added here, as flink-docs 
won't discover these constant-based options.

Best,
Zhanghao Chen

From: Alexander Fedulov 
Sent: Tuesday, October 31, 2023 18:12
To: dev@flink.apache.org 
Cc: u...@flink.apache.org 
Subject: Re: [DISCUSS][FLINK-33240] Document deprecated options as well

Hi Zhanghao,

Thanks for the proposition.
In general +1, this sounds like a good idea as long as it is clear that the 
usage of these settings is discouraged.
Just one minor concern - the configuration page is already very long, do you 
have a rough estimate of how many more options would be added with this change?

Best,
Alexander Fedulov

On Mon, 30 Oct 2023 at 18:24, Matthias Pohl  
wrote:
Thanks for your proposal, Zhanghao Chen. I think it adds more transparency
to the configuration documentation.

+1 from my side on the proposal

On Wed, Oct 11, 2023 at 2:09 PM Zhanghao Chen 
mailto:zhanghao.c...@outlook.com>>
wrote:

> Hi Flink users and developers,
>
> Currently, Flink won't generate doc for the deprecated options. This might
> confuse users when upgrading from an older version of Flink: they have to
> either carefully read the release notes or check the source code for
> upgrade guidance on deprecated options.
>
> I propose to document deprecated options as well, with a "(deprecated)"
> tag placed at the beginning of the option description to highlight the
> deprecation status [1].
>
> Looking forward to your feedbacks on it.
>
> [1] https://issues.apache.org/jira/browse/FLINK-33240
>
> Best,
> Zhanghao Chen
>


Re: [DISCUSS][FLINK-33240] Document deprecated options as well

2023-11-01 Thread Zhanghao Chen
Hi Samrat and Ruan,

Thanks for the suggestion. I'm actually in favor of adding the deprecated 
options in the same section as the non-deprecated ones. This would make it 
easier for users to find the descriptions of the replacement options. It would 
be a different story for options deprecated because the related API/module is 
entirely deprecated, e.g. the DataSet API. In that case, users would not look 
for a replacement for an individual option but rather need to migrate to a new 
API, and it would be better to move these options to a separate section. WDYT?

Best,
Zhanghao Chen

From: Samrat Deb 
Sent: Wednesday, November 1, 2023 15:31
To: dev@flink.apache.org 
Cc: u...@flink.apache.org 
Subject: Re: [DISCUSS][FLINK-33240] Document deprecated options as well

Thanks for the proposal ,
+1 for adding deprecated identifier

[Thought] Can we have a separate section/page for deprecated configs? WDYT?


Bests,
Samrat


On Tue, 31 Oct 2023 at 3:44 PM, Alexander Fedulov <
alexander.fedu...@gmail.com> wrote:

> Hi Zhanghao,
>
> Thanks for the proposition.
> In general +1, this sounds like a good idea as long as it is clear that the
> usage of these settings is discouraged.
> Just one minor concern - the configuration page is already very long, do
> you have a rough estimate of how many more options would be added with this
> change?
>
> Best,
> Alexander Fedulov
>
> On Mon, 30 Oct 2023 at 18:24, Matthias Pohl  .invalid>
> wrote:
>
> > Thanks for your proposal, Zhanghao Chen. I think it adds more
> transparency
> > to the configuration documentation.
> >
> > +1 from my side on the proposal
> >
> > On Wed, Oct 11, 2023 at 2:09 PM Zhanghao Chen  >
> > wrote:
> >
> > > Hi Flink users and developers,
> > >
> > > Currently, Flink won't generate doc for the deprecated options. This
> > might
> > > confuse users when upgrading from an older version of Flink: they have
> to
> > > either carefully read the release notes or check the source code for
> > > upgrade guidance on deprecated options.
> > >
> > > I propose to document deprecated options as well, with a "(deprecated)"
> > > tag placed at the beginning of the option description to highlight the
> > > deprecation status [1].
> > >
> > > Looking forward to your feedbacks on it.
> > >
> > > [1] https://issues.apache.org/jira/browse/FLINK-33240
> > >
> > > Best,
> > > Zhanghao Chen
> > >
> >
>


Re: [VOTE] FLIP-376: Add DISTRIBUTED BY clause

2023-11-07 Thread Zhanghao Chen
+1 (non-binding)

Best,
Zhanghao Chen

From: Timo Walther 
Sent: Monday, November 6, 2023 19:38
To: dev 
Subject: [VOTE] FLIP-376: Add DISTRIBUTED BY clause

Hi everyone,

I'd like to start a vote on FLIP-376: Add DISTRIBUTED BY clause[1] which
has been discussed in this thread [2].

The vote will be open for at least 72 hours unless there is an objection
or not enough votes.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-376%3A+Add+DISTRIBUTED+BY+clause
[2] https://lists.apache.org/thread/z1p4f38x9ohv29y4nlq8g1wdxwfjwxd1

Cheers,
Timo


[DISCUSS] FLIP-395: Deprecate Global Aggregator Manager

2023-11-20 Thread Zhanghao Chen
Hi all,

I'd like to start a discussion of FLIP-395: Deprecate Global Aggregator Manager 
[1].

Global Aggregate Manager was introduced in [2] to support event time 
synchronization across sources and more generally, coordination of parallel 
tasks. AFAIK, this was only used in the Kinesis source for an early version of 
watermark alignment. Operator Coordinator, introduced in FLIP-27, provides a 
more powerful and elegant solution for that need and is part of the new source 
API standard. FLIP-217 further provides a complete solution for watermark 
alignment of source splits on top of the Operator Coordinator mechanism. 
Furthermore, Global Aggregate Manager manages state in the JobMaster object, 
causing problems for adaptive parallelism changes [3].

Therefore, I propose to deprecate the use of Global Aggregate Manager, which 
can improve the maintainability of the Flink codebase without compromising its 
functionality.

Looking forward to your feedback. Thanks!

[1] 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-395%3A+Deprecate+Global+Aggregator+Manager
[2] https://issues.apache.org/jira/browse/FLINK-10886
[3] https://issues.apache.org/jira/browse/FLINK-31245

Best,
Zhanghao Chen


[DISCUSS] FLIP-397: Add config options for administrator JVM options

2023-11-27 Thread Zhanghao Chen
Hi devs,

I'd like to start a discussion on FLIP-397: Add config options for 
administrator JVM options [1].

In production environments, users typically develop and operate their Flink 
jobs through a managed platform. Users may need to add JVM options to their 
Flink applications (e.g. to tune GC options). They typically use the 
env.java.opts.x series of options to do so. Platform administrators also have a 
set of JVM options to apply by default, e.g. to use JVM 17, enable GC logging, 
or apply pretuned GC options. Both use cases need to set the same series of 
options, so one will clobber the other. Similar issues have been described in 
SPARK-23472 [2].

Therefore, I propose adding a set of default JVM options for administrator use 
that is prepended to the user-set extra JVM options.

Looking forward to hearing from you.

[1] 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-397%3A+Add+config+options+for+administrator+JVM+options
[2] https://issues.apache.org/jira/browse/SPARK-23472

Best,
Zhanghao Chen


Re: [DISCUSS] Contribute Flink Doris Connector to the Flink community

2023-11-27 Thread Zhanghao Chen
Sounds great. Thanks for driving this.

Best,
Zhanghao Chen

From: wudi <676366...@qq.com.INVALID>
Sent: Sunday, November 26, 2023 18:22
To: dev@flink.apache.org 
Subject: [DISCUSS] Contribute Flink Doris Connector to the Flink community

Hi all,

At present, Flink connectors have been decoupled from Flink's repository [1].
Meanwhile, the Flink-Doris-Connector [3] has been maintained in the Apache 
Doris [2] community. I think the Flink Doris Connector can be migrated to the 
Flink community, as it is part of the Flink connectors and would also expand 
the Flink connector ecosystem.

I volunteer to move this forward if I can.

[1] 
https://cwiki.apache.org/confluence/display/FLINK/Externalized+Connector+development
[2] https://doris.apache.org/
[3] https://github.com/apache/doris-flink-connector

--

Brs,
di.wu


Re: [VOTE] FLIP-391: Deprecate RuntimeContext#getExecutionConfig

2023-11-28 Thread Zhanghao Chen
+1 (non-binding)

Best,
Zhanghao Chen

From: Junrui Lee 
Sent: Tuesday, November 28, 2023 12:34
To: dev@flink.apache.org 
Subject: [VOTE] FLIP-391: Deprecate RuntimeContext#getExecutionConfig

Hi everyone,

Thank you to everyone for the feedback on FLIP-391: Deprecate
RuntimeContext#getExecutionConfig[1] which has been discussed in this
thread [2].

I would like to start a vote for it. The vote will be open for at least 72
hours unless there is an objection or not enough votes.

[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=278465937
[2]https://lists.apache.org/thread/cv4x3372g5txw1j20r5l1vwlqrvjoqq5


Re: [VOTE] FLIP-364: Improve the restart-strategy

2023-11-30 Thread Zhanghao Chen
+1 (non-binding)

Best,
Zhanghao Chen

From: Rui Fan <1996fan...@gmail.com>
Sent: Monday, November 13, 2023 11:01
To: dev 
Subject: [VOTE] FLIP-364: Improve the restart-strategy

Hi everyone,

Thank you to everyone for the feedback on FLIP-364: Improve the
restart-strategy[1]
which has been discussed in this thread [2].

I would like to start a vote for it. The vote will be open for at least 72
hours unless there is an objection or not enough votes.

[1] https://cwiki.apache.org/confluence/x/uJqzDw
[2] https://lists.apache.org/thread/5cgrft73kgkzkgjozf9zfk0w2oj7rjym

Best,
Rui


Re: [DISCUSS] FLIP-395: Deprecate Global Aggregator Manager

2023-12-04 Thread Zhanghao Chen
Hi Benchao,

I think part of the reason is that a general global coordination mechanism is 
complex and hence subject to internal changes in the future. Instead of 
directly exposing the full mechanism to users, it might be better to expose 
some well-defined subset of the feature set to users.

I'm also ccing the email to Piotr and David for their suggestions on this.

Best,
Zhanghao Chen

From: Benchao Li 
Sent: Monday, November 27, 2023 13:03
To: dev@flink.apache.org 
Subject: Re: [DISCUSS] FLIP-395: Deprecate Global Aggregator Manager

+1 for the idea.

Currently OperatorCoordinator is still marked as @Internal, shouldn't
it be a public API already?

Besides, GlobalAggregatorManager supports coordination between
different operators, but OperatorCoordinator only supports
coordination within one operator. And CoordinatorStore introduced in
FLINK-24439 opens the door for multi operators. Again, should it also
be a public API too?

Weihua Hu  wrote on Mon, Nov 27, 2023 at 11:05:
>
> Thanks Zhanghao for driving this FLIP.
>
> +1 for this.
>
> Best,
> Weihua
>
>
> On Mon, Nov 20, 2023 at 5:49 PM Zhanghao Chen 
> wrote:
>
> > Hi all,
> >
> > I'd like to start a discussion of FLIP-395: Deprecate Global Aggregator
> > Manager [1].
> >
> > Global Aggregate Manager was introduced in [2] to support event time
> > synchronization across sources and more generally, coordination of parallel
> > tasks. AFAIK, this was only used in the Kinesis source for an early version
> > of watermark alignment. Operator Coordinator, introduced in FLIP-27,
> > provides a more powerful and elegant solution for that need and is part of
> > the new source API standard. FLIP-217 further provides a complete solution
> > for watermark alignment of source splits on top of the Operator Coordinator
> > mechanism. Furthermore, Global Aggregate Manager manages state in JobMaster
> > object, causing problems for adaptive parallelism changes [3].
> >
> > Therefore, I propose to deprecate the use of Global Aggregate Manager,
> > which can improve the maintainability of the Flink codebase without
> > compromising its functionality.
> >
> > Looking forward to your feedbacks, thanks.
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-395%3A+Deprecate+Global+Aggregator+Manager
> > [2] https://issues.apache.org/jira/browse/FLINK-10886
> > [3] https://issues.apache.org/jira/browse/FLINK-31245
> >
> > Best,
> > Zhanghao Chen
> >



--

Best,
Benchao Li


Re: [DISCUSS] FLIP-403: High Availability Services for OLAP Scenarios

2023-12-26 Thread Zhanghao Chen
Thanks for driving this effort, Yangze! The proposal overall LGTM. Apart from 
the throughput enhancement in the OLAP scenario, the separation of the leader 
election/discovery services from the metadata persistence services will also 
make the HA implementation clearer and easier to maintain. Just a minor 
comment on naming: would it be better to rename PersistentServices to 
PersistenceServices, as we usually put a noun before "Services"?

Best,
Zhanghao Chen

From: Yangze Guo 
Sent: Tuesday, December 19, 2023 17:33
To: dev 
Subject: [DISCUSS] FLIP-403: High Availability Services for OLAP Scenarios

Hi, there,

We would like to start a discussion thread on "FLIP-403: High
Availability Services for OLAP Scenarios"[1].

Currently, Flink's high availability service consists of two
mechanisms: leader election/retrieval services for JobManager and
persistent services for job metadata. However, these mechanisms are
set up in an "all or nothing" manner. In OLAP scenarios, we typically
only require leader election/retrieval services for JobManager
components since jobs usually do not have a restart strategy.
Additionally, the persistence of job states can negatively impact the
cluster's throughput, especially for short query jobs.

To address these issues, this FLIP proposes splitting the
HighAvailabilityServices into LeaderServices and PersistentServices,
and enable users to independently configure the high availability
strategies specifically related to jobs.

Please find more details in the FLIP wiki document [1]. Looking
forward to your feedback.

[1] 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-403+High+Availability+Services+for+OLAP+Scenarios

Best,
Yangze Guo


Re: [DISCUSS] FLIP-397: Add config options for administrator JVM options

2023-12-26 Thread Zhanghao Chen
Thanks everyone. I'll start voting after the New Year's holiday if there's no 
further comment.

Best,
Zhanghao Chen

From: Benchao Li 
Sent: Friday, December 22, 2023 21:18
To: dev@flink.apache.org 
Subject: Re: [DISCUSS] FLIP-397: Add config options for administrator JVM 
options

+1 from my side,

I also met some scenarios that I wanted to set some JVM options by
default for all Flink jobs before, such as
'-XX:-DontCompileHugeMethods', without it, some generated big methods
won't be optimized in JVM C2 compiler, leading to poor performance.

Zhanghao Chen  wrote on Mon, Nov 27, 2023 at 20:04:
>
> Hi devs,
>
> I'd like to start a discussion on FLIP-397: Add config options for 
> administrator JVM options [1].
>
> In production environments, users typically develop and operate their Flink 
> jobs through a managed platform. Users may need to add JVM options to their 
> Flink applications (e.g. to tune GC options). They typically use the 
> env.java.opts.x series of options to do so. Platform administrators also have 
> a set of JVM options to apply by default, e.g. to use JVM 17, enable GC 
> logging, or apply pretuned GC options, etc. Both use cases will need to set 
> the same series of options and will clobber one another. Similar issues have 
> been described in SPARK-23472 [2].
>
> Therefore, I propose adding a set of default JVM options for administrator 
> use that prepends the user-set extra JVM options.
>
> Looking forward to hearing from you.
>
> [1] 
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-397%3A+Add+config+options+for+administrator+JVM+options
> [2] https://issues.apache.org/jira/browse/SPARK-23472
>
> Best,
> Zhanghao Chen



--

Best,
Benchao Li


Re: [VOTE] FLIP-398: Improve Serialization Configuration And Usage In Flink

2023-12-26 Thread Zhanghao Chen
+1 (non-binding)

Best,
Zhanghao Chen

From: Yong Fang 
Sent: Wednesday, December 27, 2023 14:54
To: dev 
Subject: [VOTE] FLIP-398: Improve Serialization Configuration And Usage In Flink

Hi devs,

Thanks for all feedback about the FLIP-398: Improve Serialization
Configuration And Usage In Flink [1] which has been discussed in [2].

I'd like to start a vote for it. The vote will be open for at least 72
hours unless there is an objection or insufficient votes.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-398%3A+Improve+Serialization+Configuration+And+Usage+In+Flink
[2] https://lists.apache.org/thread/m67s4qfrh660lktpq7yqf9docvvf5o9l

Best,
Fang Yong


Re: [DISCUSS] FLIP-397: Add config options for administrator JVM options

2024-01-01 Thread Zhanghao Chen
Hi Xiangyu,

The proposed new options are targeted at experienced Flink platform 
administrators rather than ordinary end users, and a one-to-one mapping from 
each non-default option to its default-option variant might be easier for 
users to understand. Also, although JM and TM tend to use the same set of JVM 
args most of the time, there are cases where different sets of JVM args are 
preferable. So I am leaning towards the current design. WDYT?

Best,
Zhanghao Chen

From: xiangyu feng 
Sent: Friday, December 29, 2023 20:20
To: dev@flink.apache.org 
Subject: Re: [DISCUSS] FLIP-397: Add config options for administrator JVM 
options

Hi Zhanghao,

Thanks for driving this. +1 for the overall idea.

One minor question, do we need separate administrator JVM options for both
JobManager and TaskManager? Or just one administrator JVM option for all?

I'm afraid the 6 JVM options
(env.java.opts.all, env.java.default-opts.all, env.java.opts.jobmanager,
env.java.default-opts.jobmanager, env.java.opts.taskmanager, and
env.java.default-opts.taskmanager) may confuse users.

Regards,
Xiangyu


Yong Fang  wrote on Wed, Dec 27, 2023 at 15:36:

> +1 for this, we have met jobs that need to set GC policies different from
> the default ones to improve performance. Separating the default and
> user-set ones can help us better manage them.
>
> Best,
> Fang Yong
>
> On Fri, Dec 22, 2023 at 9:18 PM Benchao Li  wrote:
>
> > +1 from my side,
> >
> > I also met some scenarios that I wanted to set some JVM options by
> > default for all Flink jobs before, such as
> > '-XX:-DontCompileHugeMethods', without it, some generated big methods
> > won't be optimized in JVM C2 compiler, leading to poor performance.
> >
> > Zhanghao Chen  wrote on Mon, Nov 27, 2023 at 20:04:
> > >
> > > Hi devs,
> > >
> > > I'd like to start a discussion on FLIP-397: Add config options for
> > administrator JVM options [1].
> > >
> > > In production environments, users typically develop and operate their
> > Flink jobs through a managed platform. Users may need to add JVM options
> to
> > their Flink applications (e.g. to tune GC options). They typically use
> the
> > env.java.opts.x series of options to do so. Platform administrators also
> > have a set of JVM options to apply by default, e.g. to use JVM 17, enable
> > GC logging, or apply pretuned GC options, etc. Both use cases will need
> to
> > set the same series of options and will clobber one another. Similar
> issues
> > have been described in SPARK-23472 [2].
> > >
> > > Therefore, I propose adding a set of default JVM options for
> > administrator use that prepends the user-set extra JVM options.
> > >
> > > Looking forward to hearing from you.
> > >
> > > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-397%3A+Add+config+options+for+administrator+JVM+options
> > > [2] https://issues.apache.org/jira/browse/SPARK-23472
> > >
> > > Best,
> > > Zhanghao Chen
> >
> >
> >
> > --
> >
> > Best,
> > Benchao Li
> >
>


Re: [ANNOUNCE] New Apache Flink Committer - Alexander Fedulov

2024-01-02 Thread Zhanghao Chen
Congrats, Alex!

Best,
Zhanghao Chen

From: Maximilian Michels 
Sent: Tuesday, January 2, 2024 20:15
To: dev 
Cc: Alexander Fedulov 
Subject: [ANNOUNCE] New Apache Flink Committer - Alexander Fedulov

Happy New Year everyone,

I'd like to start the year off by announcing Alexander Fedulov as a
new Flink committer.

Alex has been active in the Flink community since 2019. He has
contributed more than 100 commits to Flink, its Kubernetes operator,
and various connectors [1][2].

Especially noteworthy are his contributions on deprecating and
migrating the old Source API functions and test harnesses, the
enhancement to flame graphs, the dynamic rescale time computation in
Flink Autoscaling, as well as all the small enhancements Alex has
contributed which make a huge difference.

Beyond code contributions, Alex has been an active community member
with his activity on the mailing lists [3][4], as well as various
talks and blog posts about Apache Flink [5][6].

Congratulations Alex! The Flink community is proud to have you.

Best,
The Flink PMC

[1] https://github.com/search?type=commits&q=author%3Aafedulov+org%3Aapache
[2] 
https://issues.apache.org/jira/browse/FLINK-28229?jql=status%20in%20(Resolved%2C%20Closed)%20AND%20assignee%20in%20(afedulov)%20ORDER%20BY%20resolved%20DESC%2C%20created%20DESC
[3] https://lists.apache.org/list?dev@flink.apache.org:lte=100M:Fedulov
[4] https://lists.apache.org/list?u...@flink.apache.org:lte=100M:Fedulov
[5] 
https://flink.apache.org/2020/01/15/advanced-flink-application-patterns-vol.1-case-study-of-a-fraud-detection-system/
[6] 
https://www.ververica.com/blog/presenting-our-streaming-concepts-introduction-to-flink-video-series


[DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for improved state compatibility on parallelism change

2024-01-02 Thread Zhanghao Chen
Dear Flink devs,

I'd like to start a discussion on FLIP 411: Chaining-agnostic Operator ID 
generation for improved state compatibility on parallelism change [1].

> > > Currently, when a user does not explicitly set operator UIDs, the chaining
behavior will still affect state compatibility, as the generation of the 
Operator ID is dependent on its chained output nodes. For example, a simple 
source->sink DAG with source and sink chained together is state incompatible 
with an otherwise identical DAG with source and sink unchained (either because 
the parallelisms of the two ops are changed to be unequal or chaining is 
disabled). This greatly limits the flexibility to perform 
chain-breaking/building for performance tuning.

The dependency on chained output nodes for Operator ID generation can be traced 
back to Flink 1.2. It is unclear at this point why chained output nodes are 
involved in the algorithm, but the following historical background might be 
related: prior to Flink 1.3, the Flink runtime took snapshots keyed by the 
operator ID of the first vertex in a chain, so it somewhat made sense to include 
chained output nodes in the algorithm, as chain-breaking/building was expected 
to break state compatibility anyway.

Given that operator-level state recovery within a chain has been supported 
since Flink 1.3, I propose to introduce StreamGraphHasherV3, which is agnostic 
of the chaining behavior of operators, so that users are free to tune the 
parallelism of individual operators without worrying about state 
incompatibility. We can make the V3 hasher an optional choice in Flink 1.19, 
and make it the default hasher in 2.0, where breaking backwards compatibility 
with old snapshots is acceptable.
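The effect can be illustrated with a toy model (this is not Flink's actual Murmur-based hashing; `v2_id`/`v3_id` are hypothetical stand-ins for the V2 behavior and the proposed V3 behavior):

```python
import hashlib

def hash_of(*parts):
    """Stable digest over a sequence of strings (toy helper)."""
    h = hashlib.sha256()
    for p in parts:
        h.update(p.encode())
    return h.hexdigest()[:16]

def v2_id(name, chained_outputs):
    # V2-like: the hash input includes chained output nodes, so
    # breaking or building a chain changes the operator ID.
    return hash_of(name, *chained_outputs)

def v3_id(name, chained_outputs):
    # V3-like: chaining-agnostic, so the ID survives chain changes.
    return hash_of(name)

# source->sink chained vs. unchained:
print(v2_id("source", ["sink"]) == v2_id("source", []))  # False: state no longer maps
print(v3_id("source", ["sink"]) == v3_id("source", []))  # True: state still maps
```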

Looking forward to your suggestions on it, thanks~

[1] 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-411%3A+Chaining-agnostic+Operator+ID+generation+for+improved+state+compatibility+on+parallelism+change

Best,
Zhanghao Chen


Re: [Discuss] FLIP-407: Improve Flink Client performance in interactive scenarios

2024-01-03 Thread Zhanghao Chen
Thanks for driving this effort on improving the interactive use experience of 
Flink. The proposal overall looks good to me.

Best,
Zhanghao Chen

From: xiangyu feng 
Sent: Tuesday, December 26, 2023 16:51
To: dev@flink.apache.org 
Subject: [Discuss] FLIP-407: Improve Flink Client performance in interactive 
scenarios

Hi devs,

I'm opening this thread to discuss FLIP-407: Improve Flink Client
performance in interactive scenarios. The POC test results and design doc
can be found at: FLIP-407
<https://cwiki.apache.org/confluence/display/FLINK/FLIP-407%3A+Improve+Flink+Client+performance+when+interacting+with+dedicated+Flink+Session+Clusters>.

Currently, the Flink Client is mainly designed for one-time interactions with
the Flink Cluster. All the resources (HTTP connections, threads, HA
services) and instances (ClusterDescriptor, ClusterClient, RestClient) are
created and recycled for each interaction. This works well when users do
not need to interact frequently with Flink Cluster and also saves resource
usage since resources are recycled immediately after each usage.

However, in OLAP or StreamingWarehouse scenarios, users might submit
interactive jobs to a dedicated Flink Session Cluster very often. In this
case, we find that short queries that finish in less than 1s on the Flink
Cluster can still have an E2E latency greater than 2s. Hence, we
propose this FLIP to improve the Flink Client performance in this scenario.
This could also improve the user experience when using session debug mode.

The major change in this FLIP is a newly introduced option
*'execution.interactive-client'*. When this option is enabled, the Flink
Client will reuse all the necessary resources to improve interactive
performance, including HA services, HTTP connections, threads, and all
kinds of instances related to a long-running Flink Cluster. The default
value of this option will be false, in which case the Flink Client will
behave as before.

Also, this FLIP proposes a configurable RetryStrategy for fetching results
from the client side to the Flink Cluster. In interactive scenarios, this can
save more than 15% of TM CPU usage without performance degradation.
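As a sketch of what a configurable retry strategy could look like (the function and parameter names here are assumptions for illustration, not the FLIP's actual API):

```python
import itertools

def retry_delays(initial_ms, max_ms, multiplier=2):
    """Exponential backoff capped at max_ms: short queries get their
    results with low latency, while long-running ones stop hammering
    the cluster with frequent polling requests."""
    delay = initial_ms
    while True:
        yield delay
        delay = min(delay * multiplier, max_ms)

delays = list(itertools.islice(retry_delays(10, 1000), 8))
print(delays)  # [10, 20, 40, 80, 160, 320, 640, 1000]
```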

Looking forward to your feedback, thanks.

Best regards,
Xiangyu


[VOTE] FLIP-397: Add config options for administrator JVM options

2024-01-03 Thread Zhanghao Chen
Hi everyone,

Thanks for all the feedback on FLIP-397 [1], which proposes to add a set of 
default JVM options for administrator use that prepends the user-set extra JVM 
options for easier platform-wide JVM pre-tuning. It has been discussed in [2].
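The layering the FLIP describes can be sketched as follows (a simplified model; the real assembly happens in Flink's startup scripts, and the flag values below are illustrative):

```python
def final_jvm_args(admin_default_opts, user_extra_opts):
    """Administrator defaults are prepended to the user-set extra
    options; for flags where the JVM applies last-one-wins semantics,
    the user's value still takes effect."""
    return admin_default_opts + user_extra_opts

# Platform-wide GC pre-tuning plus a user heap override:
args = final_jvm_args(["-XX:+UseG1GC", "-Xmx2g"], ["-Xmx4g"])
print(args)  # ['-XX:+UseG1GC', '-Xmx2g', '-Xmx4g'] -> user's 4g heap wins
```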

I'd like to start a vote. The vote will be open for at least 72 hours (until 
January 8th 12:00 GMT) unless there is an objection or insufficient votes.

[1] 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-397%3A+Add+config+options+for+administrator+JVM+options
[2] https://lists.apache.org/thread/cflonyrfd1ftmyrpztzj3ywckbq41jzg

Best,
Zhanghao Chen


Re: [DISCUSS] FLIP-402: Extend ZooKeeper Curator configurations

2024-01-03 Thread Zhanghao Chen
Thanks for driving this. I'm not familiar with the listed advanced Curator 
configs, but the previously added switch for disabling ensemble tracking [1] 
saved us when deploying Flink in a cloud environment where ZK is only accessible 
via URLs. That said, +1 for the overall idea; these configs may help 
users in certain scenarios sooner or later.

[1] https://issues.apache.org/jira/browse/FLINK-31780

Best,
Zhanghao Chen

From: Alex Nitavsky 
Sent: Thursday, December 14, 2023 21:20
To: dev@flink.apache.org 
Subject: [DISCUSS] FLIP-402: Extend ZooKeeper Curator configurations

Hi all,

I would like to start a discussion thread for *FLIP-402: Extend ZooKeeper
Curator configurations* [1]

* Problem statement *
Currently, Flink is missing several Apache Curator configuration options that
could be useful for Flink deployments with ZooKeeper as the HA provider.

* Proposed solution *
We have inspected all possible options for Apache Curator and proposed
those which could be valuable for Flink users:

- high-availability.zookeeper.client.authorization [2]
- high-availability.zookeeper.client.maxCloseWaitMs [3]
- high-availability.zookeeper.client.simulatedSessionExpirationPercent [4]

The proposed way is to reflect those properties into Flink configuration
options for Apache ZooKeeper.
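If adopted, configuring these in flink-conf.yaml might look like the following fragment (option names as listed above; the values are illustrative assumptions, not recommendations):

```yaml
# Hypothetical flink-conf.yaml fragment for the proposed options.
# ZooKeeper auth credentials (scheme:auth), illustrative value:
high-availability.zookeeper.client.authorization: digest:flink:secret
# Max time to wait for background operations to complete on close:
high-availability.zookeeper.client.maxCloseWaitMs: 2000
# Inject simulated session expirations at the given percentage:
high-availability.zookeeper.client.simulatedSessionExpirationPercent: 10
```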

Looking forward to your feedback and suggestions.

Kind regards
Oleksandr

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-402%3A+Extend+ZooKeeper+Curator+configurations
[2]
https://curator.apache.org/apidocs/org/apache/curator/framework/CuratorFrameworkFactory.Builder.html#authorization(java.lang.String,byte%5B%5D)
[3]
https://curator.apache.org/apidocs/org/apache/curator/framework/CuratorFrameworkFactory.Builder.html#maxCloseWaitMs(int)
[4]
https://curator.apache.org/apidocs/org/apache/curator/framework/CuratorFrameworkFactory.Builder.html#simulatedSessionExpirationPercent(int)


Re: FLIP-413: Enable unaligned checkpoints by default

2024-01-07 Thread Zhanghao Chen
Hi Piotr,

As a platform administrator who runs thousands of Flink jobs, I'd be against 
the idea of enabling unaligned cp by default for our jobs. It may help a 
significant portion of the users, but the subtle issues around unaligned CP for 
a few jobs will likely cause many more on-call pages and incidents. From my 
point of view, we'd better not enable it by default before removing all the 
limitations listed in 
https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/ops/state/checkpointing_under_backpressure/#limitations.

Best,
Zhanghao Chen

From: Piotr Nowojski 
Sent: Friday, January 5, 2024 21:41
To: dev 
Subject: FLIP-413: Enable unaligned checkpoints by default

Hi!

I would like to propose enabling unaligned checkpoints by default and
simultaneously increasing the aligned checkpoint timeout from 0ms to 5s. I
think this change is the right one for the majority of Flink users.

For more rationale please take a look into the short FLIP-413 [1].
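In configuration terms, the proposal amounts to flipping roughly the following defaults, which users can currently set manually (key names as in the Flink 1.18 configuration reference, to the best of my knowledge):

```yaml
# What FLIP-413 proposes as the new defaults (currently opt-in):
execution.checkpointing.unaligned: true
# Start aligned; switch to unaligned only after 5s of alignment:
execution.checkpointing.aligned-checkpoint-timeout: 5s
```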

What do you all think?

Best,
Piotrek

https://cwiki.apache.org/confluence/display/FLINK/FLIP-413%3A+Enable+unaligned+checkpoints+by+default


Re: [VOTE] FLIP-397: Add config options for administrator JVM options

2024-01-08 Thread Zhanghao Chen
Thank you all! Closing the vote. The result will be sent in a separate email.

Best,
Zhanghao Chen

From: Zhanghao Chen 
Sent: Thursday, January 4, 2024 10:29
To: dev 
Subject: [VOTE] FLIP-397: Add config options for administrator JVM options

Hi everyone,

Thanks for all the feedback on FLIP-397 [1], which proposes to add a set of 
default JVM options for administrator use that prepends the user-set extra JVM 
options for easier platform-wide JVM pre-tuning. It has been discussed in [2].

I'd like to start a vote. The vote will be open for at least 72 hours (until 
January 8th 12:00 GMT) unless there is an objection or insufficient votes.

[1] 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-397%3A+Add+config+options+for+administrator+JVM+options
[2] https://lists.apache.org/thread/cflonyrfd1ftmyrpztzj3ywckbq41jzg

Best,
Zhanghao Chen


[RESULT][VOTE] FLIP-397: Add config options for administrator JVM options

2024-01-08 Thread Zhanghao Chen
Hi everyone,

I'm happy to announce that FLIP-397: Add config options for administrator JVM 
options, has been accepted with 4 approving votes (3 binding) [1]:

 - Benchao Li (binding)
 - Rui Fan (binding)
 - Xiangyu Feng (non-binding)
 - Yong Fang (binding)

There were no disapproving votes.

Thanks again to everyone who participated in the discussion and voting.

[1] https://lists.apache.org/thread/5qlr0xl0oyc9dnvcjr0q39pcrzyx4ohb

Best,
Zhanghao Chen


Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for improved state compatibility on parallelism change

2024-01-09 Thread Zhanghao Chen
Hi David,

Thanks for the comments. AFAIK, unaligned checkpoints are disabled for 
pointwise connections according to [1]; let's wait for Piotr's confirmation. The 
issue itself is not directly related to this proposal either. If a user 
manually specifies UIDs for each of the chained operators and has unaligned 
checkpoints enabled, we will encounter the same issue if they decide to break 
the chain on a later restart and try to recover from a retained cp.

[1] 
https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/ops/state/checkpointing_under_backpressure/


Best,
Zhanghao Chen

From: David Morávek 
Sent: Wednesday, January 10, 2024 6:26
To: dev@flink.apache.org ; Piotr Nowojski 

Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for 
improved state compatibility on parallelism change

Hi Zhanghao,

Thanks for the FLIP. What you're proposing makes a lot of sense +1

Have you thought about how this works with unaligned checkpoints in case
you go from unchained to chained? I think it should be fine because this
scenario should only apply to forward/rebalance scenarios where we, as far
as I recall, force alignment anyway, so there should be no exchanges to
snapshot. It might just work, but something to double-check. Maybe @Piotr
Nowojski  could confirm it.

Best,
D.

On Wed, Jan 3, 2024 at 7:10 AM Zhanghao Chen 
wrote:

> Dear Flink devs,
>
> I'd like to start a discussion on FLIP 411: Chaining-agnostic Operator ID
> generation for improved state compatibility on parallelism change [1].
>
> Currently, when user does not explicitly set operator UIDs, the chaining
> behavior will still affect state compatibility, as the generation of the
> Operator ID is dependent on its chained output nodes. For example, a simple
> source->sink DAG with source and sink chained together is state
> incompatible with an otherwise identical DAG with source and sink unchained
> (either because the parallelisms of the two ops are changed to be unequal
> or chaining is disabled). This greatly limits the flexibility to perform
> chain-breaking/building for performance tuning.
>
> The dependency on chained output nodes for Operator ID generation can be
> traced back to Flink 1.2. It is unclear at this point on why chained output
> nodes are involved in the algorithm, but the following history background
> might be related: prior to Flink 1.3, Flink runtime takes the snapshots by
> the operator ID of the first vertex in a chain, so it somewhat makes sense
> to include chained output nodes into the algorithm as
> chain-breaking/building is expected to break state-compatibility anyway.
>
> Given that operator-level state recovery within a chain has long been
> supported since Flink 1.3, I propose to introduce StreamGraphHasherV3 that
> is agnostic of the chaining behavior of operators, so that users are free
> to tune the parallelism of individual operators without worrying about
> state incompatibility. We can make the V3 hasher an optional choice in
> Flink 1.19, and make it the default hasher in 2.0 for backwards
> compatibility.
>
> Looking forward to your suggestions on it, thanks~
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-411%3A+Chaining-agnostic+Operator+ID+generation+for+improved+state+compatibility+on+parallelism+change
>
> Best,
> Zhanghao Chen
>


Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for improved state compatibility on parallelism change

2024-01-10 Thread Zhanghao Chen
Hi Yu,

I haven't thought much about the compatibility design before. By the nature of 
the problem, it's impossible to make V3 compatible with V2; what we can do is 
better inform users when switching hashers, but I don't have any good ideas so 
far. Do you have any suggestions on this?

Best,
Zhanghao Chen

From: Yu Chen 
Sent: Thursday, January 11, 2024 13:52
To: dev@flink.apache.org 
Cc: Piotr Nowojski ; zhanghao.c...@outlook.com 

Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for 
improved state compatibility on parallelism change

Hi Zhanghao,

Thanks for driving this; it's really painful for us when we need to switch the 
config `pipeline.operator-chaining`.

But I have a concern: according to the FLIP description, modifying the 
`isChainable`-related code in `StreamGraphHasherV2` will change the generated 
operator IDs, which will leave users unable to recover from old state (old and 
new Operator IDs can't be mapped).
Therefore, switching the hasher strategy (V2->V3 or V3->V2) will lead to an 
incompatibility. Is there any relevant compatibility design being considered?

Best,
Yu Chen

On Jan 10, 2024, at 10:25, Zhanghao Chen  wrote:

Hi David,

Thanks for the comments. AFAIK, unaligned checkpoints are disabled for 
pointwise connections according to [1], let's wait Piotr for confirmation. The 
issue itself is not directly related to this proposal as well. If a user 
manually specifies UIDs for each of the chained operators and has unaligned 
checkpoints enabled, we will encounter the same issue if they decide to break 
the chain on a later restart and try to recover from a retained cp.

[1] 
https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/ops/state/checkpointing_under_backpressure/


Best,
Zhanghao Chen

From: David Morávek 
Sent: Wednesday, January 10, 2024 6:26
To: dev@flink.apache.org ; Piotr Nowojski 

Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for 
improved state compatibility on parallelism change

Hi Zhanghao,

Thanks for the FLIP. What you're proposing makes a lot of sense +1

Have you thought about how this works with unaligned checkpoints in case
you go from unchained to chained? I think it should be fine because this
scenario should only apply to forward/rebalance scenarios where we, as far
as I recall, force alignment anyway, so there should be no exchanges to
snapshot. It might just work, but something to double-check. Maybe @Piotr
Nowojski  could confirm it.

Best,
D.

On Wed, Jan 3, 2024 at 7:10 AM Zhanghao Chen 
wrote:

Dear Flink devs,

I'd like to start a discussion on FLIP 411: Chaining-agnostic Operator ID
generation for improved state compatibility on parallelism change [1].

Currently, when user does not explicitly set operator UIDs, the chaining
behavior will still affect state compatibility, as the generation of the
Operator ID is dependent on its chained output nodes. For example, a simple
source->sink DAG with source and sink chained together is state
incompatible with an otherwise identical DAG with source and sink unchained
(either because the parallelisms of the two ops are changed to be unequal
or chaining is disabled). This greatly limits the flexibility to perform
chain-breaking/building for performance tuning.

The dependency on chained output nodes for Operator ID generation can be
traced back to Flink 1.2. It is unclear at this point on why chained output
nodes are involved in the algorithm, but the following history background
might be related: prior to Flink 1.3, Flink runtime takes the snapshots by
the operator ID of the first vertex in a chain, so it somewhat makes sense
to include chained output nodes into the algorithm as
chain-breaking/building is expected to break state-compatibility anyway.

Given that operator-level state recovery within a chain has long been
supported since Flink 1.3, I propose to introduce StreamGraphHasherV3 that
is agnostic of the chaining behavior of operators, so that users are free
to tune the parallelism of individual operators without worrying about
state incompatibility. We can make the V3 hasher an optional choice in
Flink 1.19, and make it the default hasher in 2.0 for backwards
compatibility.

Looking forward to your suggestions on it, thanks~

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-411%3A+Chaining-agnostic+Operator+ID+generation+for+improved+state+compatibility+on+parallelism+change

Best,
Zhanghao Chen




Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for improved state compatibility on parallelism change

2024-01-11 Thread Zhanghao Chen
Thanks for the input, Piotr. It might still be possible to make it compatible 
with the old snapshots, following the direction of 
FLINK-5290<https://issues.apache.org/jira/browse/FLINK-5290> suggested by Yu. 
I'll discuss with Yu on more details.

Best,
Zhanghao Chen

From: Piotr Nowojski 
Sent: Friday, January 12, 2024 1:55
To: Yu Chen 
Cc: Zhanghao Chen ; dev@flink.apache.org 

Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for 
improved state compatibility on parallelism change

Hi,

Using unaligned checkpoints is orthogonal to this FLIP.

Yes, unaligned checkpoints are not supported for pointwise connections, so most 
of the cases go away anyway.
It is possible to switch from unchained to chained subtasks by removing a keyBy 
exchange, and this would be
a problem, but that's just one of the things that we claim that unaligned 
checkpoints do not support [1]. But as
I stated above, this is an orthogonal issue to this FLIP.

Regarding the proposal itself, generally speaking it makes sense to me as well. 
However I'm quite worried about
the compatibility and/or migration path. The:
> (v2.0) Make HasherV3 the default hasher, mark HasherV2 deprecated.

step would break the compatibility with Flink 1.xx snapshots. But as this is 
for v2.0, maybe that's not the end of
the world?

Best,
Piotrek

[1] 
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/checkpoints_vs_savepoints/#capabilities-and-limitations

On Thu, Jan 11, 2024 at 12:10, Yu Chen <yuchen.e...@gmail.com> wrote:
Hi Zhanghao,

Actually, Stefan has done similar compatibility work in the early 
FLINK-5290[1], where he introduced the legacyStreamGraphHashers list for hasher 
backward compatibility.

We have attempted to implement a similar feature in the internal version of 
FLINK and tried to include the new hasher as part of the 
legacyStreamGraphHashers,
which would ensure that the corresponding Operator State could be found at 
restore while ignoring the chaining condition (without changing the default 
hasher).
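The legacy-hasher fallback at restore can be sketched like this (names and signatures are hypothetical simplifications of the mechanism introduced in FLINK-5290):

```python
def assign_operator_state(operators, saved_state, hashers):
    """Try each hasher in order and map an operator's saved state by
    the first generated ID that appears in the snapshot."""
    mapping = {}
    for op in operators:
        for hasher in hashers:
            oid = hasher(op)
            if oid in saved_state:
                mapping[op] = saved_state[oid]
                break
    return mapping

v2 = lambda op: "v2-" + op   # stand-ins for StreamGraphHasherV2/V3
v3 = lambda op: "v3-" + op
snapshot = {"v2-source": "offsets", "v2-sink": "txn-ids"}
# V3 is the primary hasher, V2 sits in the legacy list:
print(assign_operator_state(["source", "sink"], snapshot, [v3, v2]))
# {'source': 'offsets', 'sink': 'txn-ids'}
```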

However, we have found that such a solution may lead to some unexpected 
situations in some cases, but I haven't had time to dig into the root cause 
recently.

If you're interested, I'd be happy to discuss it with you and try to solve the 
problem.

[1] https://issues.apache.org/jira/browse/FLINK-5290

Best,
Yu Chen



On Jan 11, 2024, at 15:07, Zhanghao Chen <zhanghao.c...@outlook.com> wrote:

Hi Yu,

I haven't thought too much about the compatibility design before. By the nature 
of the problem, it's impossible to make V3 compatible with V2, what we can do 
is to somewhat better inform users when switching the hasher, but I don't have 
any good idea so far. Do you have any suggestions on this?

Best,
Zhanghao Chen

From: Yu Chen <yuchen.e...@gmail.com>
Sent: Thursday, January 11, 2024 13:52
To: dev@flink.apache.org
Cc: Piotr Nowojski <piotr.nowoj...@gmail.com>; zhanghao.c...@outlook.com
Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for 
improved state compatibility on parallelism change

Hi Zhanghao,

Thanks for driving this, that’s really painful for us when we need to switch 
config `pipeline.operator-chaining`.

But I have a Concern, according to FLIP description, modifying `isChainable` 
related code in `StreamGraphHasherV2` will cause the generated operator id to 
be changed, which will result in the user unable to recover from the old state 
(old and new Operator IDs can't be mapped).
Therefore switching Hasher strategy (V2->V3 or V3->V2) will lead to an 
incompatibility, is there any relevant compatibility design considered?

Best,
Yu Chen

On Jan 10, 2024, at 10:25, Zhanghao Chen <zhanghao.c...@outlook.com> wrote:

Hi David,

Thanks for the comments. AFAIK, unaligned checkpoints are disabled for 
pointwise connections according to [1], let's wait Piotr for confirmation. The 
issue itself is not directly related to this proposal as well. If a user 
manually specifies UIDs for each of the chained operators and has unaligned 
checkpoints enabled, we will encounter the same issue if they decide to break 
the chain on a later restart and try to recover from a retained cp.

[1] 
https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/ops/state/checkpointing_under_backpressure/


Best,
Zhanghao Chen

From: David Morávek <d...@apache.org>
Sent: Wednesday, January 10, 2024 6:26
To: dev@flink.apache.org; Piotr Nowojski <piotr.nowoj...@gmail.com>
Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator 

Re: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager

2023-02-27 Thread Zhanghao Chen
Thanks for driving this topic. I think this FLIP could help clean up the 
codebase to make it easier to maintain. +1 on it.

Best,
Zhanghao Chen

From: Weihua Hu 
Sent: Monday, February 27, 2023 20:40
To: dev 
Subject: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager

Hi everyone,

I would like to begin a discussion on FLIP-298: Unifying the Implementation
of SlotManager[1]. There are currently two types of SlotManager in Flink:
DeclarativeSlotManager and FineGrainedSlotManager. FineGrainedSlotManager
should behave as DeclarativeSlotManager if the user does not configure the
slot request profile.

Therefore, this FLIP aims to unify the implementation of SlotManager in
order to reduce maintenance costs.

Looking forward to hearing from you.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager

Best,
Weihua


[DISCUSS] Deprecating GlobalAggregateManager

2023-02-27 Thread Zhanghao Chen
Hi dev,

I'd like to open discussion on deprecating Global Aggregate Manager in favor of 
Operator Coordinator.


  1.  Global Aggregate Manager is rarely used and can be replaced by Operator 
Coordinator. Global Aggregate Manager was introduced in [1] to support event-time 
synchronization across sources and, more generally, coordination of parallel 
tasks. AFAIK, this was only used in the Kinesis source [2] for an early version 
of watermark alignment. Operator Coordinator, introduced in [3], provides a 
more powerful and elegant solution for that need and is part of the new source 
API standard.
  2.  Global Aggregate Manager manages state in the JobMaster object, causing 
problems for adaptive parallelism changes. It maintains state (the 
accumulators field in JobMaster) in JM memory, and the accumulator state 
content is defined in user code. In my company, a user stored task parallelism 
in the accumulator, assuming task parallelism never changes. However, this 
assumption is broken when using the adaptive scheduler. See [4] for more details.
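For context, the mechanism being deprecated can be modeled roughly as follows (a toy Python model of the JM-side store; the real API is Java's `GlobalAggregateManager#updateGlobalAggregate(name, item, aggregateFunction)`):

```python
class ToyGlobalAggregateManager:
    """Toy model: named accumulators live in JobMaster memory, updated
    by parallel tasks through a user-defined aggregate function. Because
    the state sits in the JM object, it is not reset on rescaling."""
    def __init__(self):
        self._accumulators = {}

    def update_global_aggregate(self, name, item, agg):
        acc = self._accumulators.get(name, agg.create_accumulator())
        acc = agg.add(item, acc)
        self._accumulators[name] = acc
        return agg.get_result(acc)

class MaxWatermark:
    """User-defined aggregate: track the maximum reported watermark."""
    def create_accumulator(self): return float("-inf")
    def add(self, item, acc): return max(item, acc)
    def get_result(self, acc): return acc

gam = ToyGlobalAggregateManager()
# Two source subtasks report local watermarks; the JM tracks the max:
gam.update_global_aggregate("wm", 100, MaxWatermark())
print(gam.update_global_aggregate("wm", 42, MaxWatermark()))  # 100
```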

Therefore, I think we should deprecate the use of Global Aggregate Manager, 
which can improve the maintainability of the Flink codebase without 
compromising its functionality. Looking forward to your opinions on this.

[1] https://issues.apache.org/jira/browse/FLINK-10886
[2] 
https://github.com/apache/flink-connector-aws/blob/d0817fecdcaa53c4bf039761c2d1a16e8fb9f89b/flink-connector-kinesis/src/main/java/org/apache/flink/streaming/connectors/kinesis/util/JobManagerWatermarkTracker.java
[3] 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface#FLIP27:RefactorSourceInterface-SplitEnumerator
[4] FLINK-31245: Adaptive scheduler does not reset the state of 
GlobalAggregateManager when rescaling - 
https://issues.apache.org/jira/browse/FLINK-31245

Best,
Zhanghao Chen


Re: [VOTE] FLIP-291: Externalized Declarative Resource Management

2023-02-28 Thread Zhanghao Chen
Thanks for driving this. +1 (non-binding)

Best,
Zhanghao Chen

From: David Morávek 
Sent: Tuesday, February 28, 2023 21:46
To: dev 
Subject: [VOTE] FLIP-291: Externalized Declarative Resource Management

Hi Everyone,

I want to start the vote on FLIP-291: Externalized Declarative Resource
Management [1]. The FLIP was discussed in this thread [2].

The goal of the FLIP is to enable external declaration of the resource
requirements of a running job.

The vote will last for at least 72 hours (Friday, 3rd of March, 15:00 CET)
unless
there is an objection or insufficient votes.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
[2] https://lists.apache.org/thread/b8fnj127jsl5ljg6p4w3c4wvq30cnybh

Best,
D.


Re: [Vote] FLIP-298: Unifying the Implementation of SlotManager

2023-03-10 Thread Zhanghao Chen
Thanks Weihua. +1 (non-binding)

Best,
Zhanghao Chen

From: Weihua Hu 
Sent: Thursday, March 9, 2023 13:27
To: dev 
Subject: [Vote] FLIP-298: Unifying the Implementation of SlotManager

Hi Everyone,

I would like to start the vote on FLIP-298: Unifying the Implementation
of SlotManager [1]. The FLIP was discussed in this thread [2].

This FLIP aims to unify the implementation of SlotManager in
order to reduce maintenance costs.

The vote will last for at least 72 hours (03/14, 15:00 UTC+8)
unless there is an objection or insufficient votes. Thank you all.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager
[2] https://lists.apache.org/thread/ocssfxglpc8z7cto3k8p44mrjxwr67r9

Best,
Weihua


Re: [VOTE] FLIP-407: Improve Flink Client performance in interactive scenarios

2024-01-15 Thread Zhanghao Chen
+1 (non-binding)

Best,
Zhanghao Chen

From: xiangyu feng 
Sent: Thursday, January 11, 2024 19:44
To: dev@flink.apache.org 
Subject: [VOTE] FLIP-407: Improve Flink Client performance in interactive 
scenarios

Hi all,

I would like to start the vote for FLIP-407: Improve Flink Client
performance in interactive scenarios[1].
This FLIP was discussed in this thread [2].

The vote will be open for at least 72 hours unless there is an objection or
insufficient votes.

Regards,
Xiangyu

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-407%3A+Improve+Flink+Client+performance+in+interactive+scenarios
[2] https://lists.apache.org/thread/ccsv66ygffgqbv956bnknbpllj4t24kj


Re: [NOTICE] master branch cannot compile for now

2024-01-26 Thread Zhanghao Chen
Hi devs,

The issue has already been resolved via a separate PR [1]. Feel free to rebase 
master and proceed~

[1] https://github.com/apache/flink/pull/24200


Best,
Zhanghao Chen

From: Benchao Li 
Sent: Friday, January 26, 2024 11:39
To: dev 
Subject: [NOTICE] master branch cannot compile for now

Hi devs,

I merged FLINK-33263 [1] this morning (10:16 +8:00), and it was based on an
old commit that uses an older guava version, so currently the master
branch cannot compile.

Zhanghao has discovered this in FLINK-33264[2], and the hotfix commit
has been proposed in the same PR, hopefully we can merge it after CI
passes (it may take a few hours).

Sorry for the inconvenience.

[1] https://github.com/apache/flink/pull/24128
[2] https://github.com/apache/flink/pull/24133

--

Best,
Benchao Li


Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for improved state compatibility on parallelism change

2024-02-07 Thread Zhanghao Chen
After offline discussion with @Yu Chen <yuchen.e...@gmail.com>, I've updated 
the FLIP [1] to include a design that allows for a compatible hasher upgrade by 
adding StreamGraphHasherV2 to the legacy hasher list, which is actually a 
revival of the idea from FLINK-5290 [2], when StreamGraphHasherV2 was 
introduced in Flink 1.2. We're targeting making V3 the default hasher in Flink 
1.20, given that state compatibility is no longer an issue. Take a review when 
you have a chance, and I'd like to especially thank @Yu Chen 
<yuchen.e...@gmail.com> for the thorough offline discussion and code-debugging 
help that made this possible.

[1] 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-411%3A+Chaining-agnostic+Operator+ID+generation+for+improved+state+compatibility+on+parallelism+change
[2] https://issues.apache.org/jira/browse/FLINK-5290

Best,
Zhanghao Chen
____
From: Zhanghao Chen 
Sent: Friday, January 12, 2024 10:46
To: Piotr Nowojski ; Yu Chen 
Cc: dev@flink.apache.org 
Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for 
improved state compatibility on parallelism change

Thanks for the input, Piotr. It might still be possible to make it compatible 
with the old snapshots, following the direction of 
FLINK-5290<https://issues.apache.org/jira/browse/FLINK-5290> suggested by Yu. 
I'll discuss more details with Yu.

Best,
Zhanghao Chen

From: Piotr Nowojski 
Sent: Friday, January 12, 2024 1:55
To: Yu Chen 
Cc: Zhanghao Chen ; dev@flink.apache.org 

Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for 
improved state compatibility on parallelism change

Hi,

Using unaligned checkpoints is orthogonal to this FLIP.

Yes, unaligned checkpoints are not supported for pointwise connections, so most 
of the cases go away anyway.
It is possible to switch from unchained to chained subtasks by removing a keyBy 
exchange, and this would be
a problem, but that's just one of the things that we claim that unaligned 
checkpoints do not support [1]. But as
I stated above, this is an orthogonal issue to this FLIP.

Regarding the proposal itself, generally speaking it makes sense to me as well. 
However I'm quite worried about
the compatibility and/or migration path. The:
> (v2.0) Make HasherV3 the default hasher, mark HasherV2 deprecated.

step would break the compatibility with Flink 1.xx snapshots. But as this is 
for v2.0, maybe that's not the end of
the world?

Best,
Piotrek

[1] 
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/checkpoints_vs_savepoints/#capabilities-and-limitations

Thu, 11 Jan 2024 at 12:10, Yu Chen <mailto:yuchen.e...@gmail.com> wrote:
Hi Zhanghao,

Actually, Stefan did similar compatibility work in the early 
FLINK-5290 [1], where he introduced the legacyStreamGraphHashers list for hasher 
backward compatibility.

We attempted to implement a similar feature in our internal version of 
Flink by including the new hasher as part of the legacyStreamGraphHashers, 
which would ensure that the corresponding operator state could be found at 
restore time while ignoring the chaining condition (without changing the default 
hasher).

However, we found that such a solution may lead to some unexpected 
situations in certain cases, and I haven't had time to track down the root 
cause recently.

If you're interested, I'd be happy to discuss it with you and try to solve the 
problem.

[1] https://issues.apache.org/jira/browse/FLINK-5290

Best,
Yu Chen



On Jan 11, 2024, at 15:07, Zhanghao Chen 
<mailto:zhanghao.c...@outlook.com> wrote:

Hi Yu,

I haven't thought too much about the compatibility design before. By the nature 
of the problem, it's impossible to make V3 compatible with V2; what we can do 
is better inform users when switching the hasher, but I don't have 
any good ideas so far. Do you have any suggestions on this?

Best,
Zhanghao Chen

From: Yu Chen mailto:yuchen.e...@gmail.com>>
Sent: Thursday, January 11, 2024 13:52
To: dev@flink.apache.org<mailto:dev@flink.apache.org> 
mailto:dev@flink.apache.org>>
Cc: Piotr Nowojski mailto:piotr.nowoj...@gmail.com>>; 
zhanghao.c...@outlook.com<mailto:zhanghao.c...@outlook.com> 
mailto:zhanghao.c...@outlook.com>>
Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for 
improved state compatibility on parallelism change

Hi Zhanghao,

Thanks for driving this; it's really painful for us when we need to switch 
the config `pipeline.operator-chaining`.

But I have a concern: according to the FLIP description, modifying the 
`isChainable`-related code in `StreamGraphHasherV2` will cause the generated 
operator IDs to change, which will leave users unable to recover from old state 
(old and new Operator IDs can't be mapped

Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for improved state compatibility on parallelism change

2024-02-07 Thread Zhanghao Chen
Hi Chesnay,

AFAIK, there's no way to set UIDs for a SQL job; it'll be great if you can 
share how you allow UID setting for SQL jobs. We've explored providing a 
visualized DAG editor for SQL jobs that allows UID setting on our internal 
platform, but most users found it too complicated to use. Another possible way 
is to utilize SQL hints, but that's complicated as well. From our experience, 
many SQL users are not familiar with Flink; what they want is an experience 
similar to writing normal SQL in MySQL, without involving extra concepts 
like the DAG and the UID. In fact, some DataStream and PyFlink users also share 
the same concern.

On the other hand, some performance tuning is inevitable for long-running 
jobs in production, and parallelism tuning is among the most common techniques. 
FLIP-367 [1] and FLIP-146 [2] allow users to tune the parallelism of sources and 
sinks, and both were well received in their discussion threads. Users definitely 
don't want to lose state after parallelism tuning, which is highly risky at 
present.

Putting these together, I think the FLIP has high value in production. 
Through offline discussion, I learnt that multiple companies have developed or 
are trying to develop similar hasher changes in their internal distributions, 
including ByteDance, Xiaohongshu, and Bilibili. It'll be great if we can 
improve the SQL experience for all community users as well, WDYT?

Best,
Zhanghao Chen

From: Chesnay Schepler 
Sent: Thursday, February 8, 2024 2:01
To: dev@flink.apache.org ; Zhanghao Chen 
; Piotr Nowojski ; Yu Chen 

Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for 
improved state compatibility on parallelism change

The FLIP is a bit weird to be honest. It only applies in cases where
users haven't set uids, but that goes against best practices, and as far
as I'm told SQL also sets UIDs everywhere.

I'm wondering if this is really worth the effort.

On 07/02/2024 10:23, Zhanghao Chen wrote:
> After offline discussion with @Yu Chen<mailto:yuchen.e...@gmail.com>, I've 
> updated the FLIP [1] to include a design that allows for compatible hasher 
> upgrade by adding StreamGraphHasherV2 to the legacy hasher list, which is 
> actually a revival of the idea from FLINK-5290 [2] when StreamGraphHasherV2 
> was introduced in Flink 1.2. We're targeting to make V3 the default hasher in 
> Flink 1.20 given that state-compatibility is no longer an issue. Take a 
> review when you have a chance, and I'd like to especially thank @Yu 
> Chen<mailto:yuchen.e...@gmail.com> for the thorough offline discussion and 
> code debugging help to make this possible.
>
> [1] 
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-411%3A+Chaining-agnostic+Operator+ID+generation+for+improved+state+compatibility+on+parallelism+change
> [2] https://issues.apache.org/jira/browse/FLINK-5290
>
> Best,
> Zhanghao Chen
> 
> From: Zhanghao Chen 
> Sent: Friday, January 12, 2024 10:46
> To: Piotr Nowojski ; Yu Chen 
> Cc: dev@flink.apache.org 
> Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for 
> improved state compatibility on parallelism change
>
> Thanks for the input, Piotr. It might still be possible to make it compatible 
> with the old snapshots, following the direction of 
> FLINK-5290<https://issues.apache.org/jira/browse/FLINK-5290> suggested by Yu. 
> I'll discuss with Yu on more details.
>
> Best,
> Zhanghao Chen
> 
> From: Piotr Nowojski 
> Sent: Friday, January 12, 2024 1:55
> To: Yu Chen 
> Cc: Zhanghao Chen ; dev@flink.apache.org 
> 
> Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for 
> improved state compatibility on parallelism change
>
> Hi,
>
> Using unaligned checkpoints is orthogonal to this FLIP.
>
> Yes, unaligned checkpoints are not supported for pointwise connections, so 
> most of the cases go away anyway.
> It is possible to switch from unchained to chained subtasks by removing a 
> keyBy exchange, and this would be
> a problem, but that's just one of the things that we claim that unaligned 
> checkpoints do not support [1]. But as
> I stated above, this is an orthogonal issue to this FLIP.
>
> Regarding the proposal itself, generally speaking it makes sense to me as 
> well. However I'm quite worried about
> the compatibility and/or migration path. The:
>> (v2.0) Make HasherV3 the default hasher, mark HasherV2 deprecated.
> step would break the compatibility with Flink 1.xx snapshots. But as this is 
> for v2.0, maybe that's not the end of
> the world?
>
> Best,
> Piotrek
>
> [1] 
> https://nightlies.apach

Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for improved state compatibility on parallelism change

2024-02-08 Thread Zhanghao Chen
Hi Piotr,

Thanks for the comment. I agree that the compiled plan is the ultimate tool for 
Flink SQL if one wants to make any changes to the query later, and this FLIP 
indeed is not essential in that sense. However, the compiled plan is still too 
complicated for Flink newbies from my point of view. As I mentioned previously, 
our internal platform provides a visualized tool for editing the compiled plan, 
but most users still find it complex. Therefore, the FLIP can still benefit 
users with better usability, and the proposed changes are actually quite 
lightweight (just copying a new hasher with 2 lines deleted, plus extending the 
OperatorIdPair data structure) without much extra effort.

Best,
Zhanghao Chen

From: Piotr Nowojski 
Sent: Thursday, February 8, 2024 14:50
To: Zhanghao Chen 
Cc: Chesnay Schepler ; dev@flink.apache.org 
; Yu Chen 
Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for 
improved state compatibility on parallelism change

Hey

> AFAIK, there's no way to set UIDs for a SQL job,

AFAIK you can't set UIDs manually, but Flink SQL generates a compiled plan
of a query with embedded UIDs. As I understand it, using a compiled plan is
the preferred (only?) way for Flink SQL if one wants to make any changes to
the query later on or support Flink's runtime upgrades without losing the
state.

If that's the case, what would be the usefulness of this FLIP? Only for
DataStream API for users that didn't know that they should have manually
configured UIDs? But they have the workaround to actually post-factum add
the UIDs anyway, right? So maybe indeed Chesnay is right that this FLIP is
not that helpful/worth the extra effort?

Best,
Piotrek

Thu, 8 Feb 2024 at 03:55, Zhanghao Chen 
wrote:

> Hi Chesnay,
>
> AFAIK, there's no way to set UIDs for a SQL job, it'll be great if you can
> share how you allow UID setting for SQL jobs. We've explored providing a
> visualized DAG editor for SQL jobs that allows UID setting on our internal
> platform, but most users found it too complicated to use. Another
> possible way is to utilize SQL hints, but that's complicated as well. From
> our experience, many SQL users are not familiar with Flink, what they want
> is an experience similar to writing a normal SQL in MySQL, without
> involving much extra concepts like the DAG and the UID. In fact, some
> DataStream and PyFlink users also share the same concern.
>
> On the other hand, some performance-tuning is inevitable for a
> long-running jobs in production, and parallelism tuning is among the most
> common techniques. FLIP-367 [1] and FLIP-146 [2] allow user to tune the
> parallelism of source and sinks, and both are well-received in the
> discussion thread. Users definitely don't want to lose state after a
> parallelism tuning, which is highly risky at present.
>
> Putting these together, I think the FLIP has a high value in production.
> Through offline discussion, I learnt that multiple companies have developed
> or are trying to develop similar hasher changes in their internal distribution,
> including ByteDance, Xiaohongshu, and Bilibili. It'll be great if we can
> improve the SQL experience for all community users as well, WDYT?
>
> Best,
> Zhanghao Chen
> ------
> *From:* Chesnay Schepler 
> *Sent:* Thursday, February 8, 2024 2:01
> *To:* dev@flink.apache.org ; Zhanghao Chen <
> zhanghao.c...@outlook.com>; Piotr Nowojski ; Yu
> Chen 
> *Subject:* Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID
> generation for improved state compatibility on parallelism change
>
> The FLIP is a bit weird to be honest. It only applies in cases where
> users haven't set uids, but that goes against best-practices and as far
> as I'm told SQL also sets UIDs everywhere.
>
> I'm wondering if this is really worth the effort.
>
> On 07/02/2024 10:23, Zhanghao Chen wrote:
> > After offline discussion with @Yu Chen<mailto:yuchen.e...@gmail.com
> >, I've updated the FLIP [1] to include a design
> that allows for compatible hasher upgrade by adding StreamGraphHasherV2 to
> the legacy hasher list, which is actually a revival of the idea from
> FLINK-5290 [2] when StreamGraphHasherV2 was introduced in Flink 1.2. We're
> targeting to make V3 the default hasher in Flink 1.20 given that
> state-compatibility is no longer an issue. Take a review when you have a
> chance, and I'd like to especially thank @Yu Chen<
> mailto:yuchen.e...@gmail.com > for the thorough
> offline discussion and code debugging help to make this possible.
> >
> > [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-411%3A+Chaining-agnostic+Operator+ID+generation+for+improved+state+compatibility+on

Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for improved state compatibility on parallelism change

2024-02-08 Thread Zhanghao Chen
We're only concerned with parallelism tuning here (with the same Flink 
version). The plans will be compatible as long as the operator IDs stay the 
same. Currently, this only holds if we do not break/create a chain, and we want 
to make it hold when we break/create a chain as well. That's what the FLIP is 
all about.

The typical user story is that one has a job with a uniform parallelism in the 
first place, where the source is chained with an expensive operator. Later on, 
the job parallelism needs to be increased, but the source's can't due to limits 
like the Kafka partition number. The user then configures different 
parallelisms for the source and the remaining part of the job, which breaks a 
chain and leads to state incompatibility.
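As a toy, self-contained illustration of the mechanism (this is not Flink's actual StreamGraphHasher algorithm; the class, method, and node names below are made up), a hash that mixes in the chaining condition yields a different operator ID once a chain is broken, while a chaining-agnostic hash keeps the ID stable:

```java
import java.util.Objects;

public class ChainAgnosticIdDemo {
    // V2-style toy hash: mixes the chaining condition into the ID,
    // so breaking/creating a chain changes the operator ID.
    static int hashV2(String nodeDescription, boolean chainedWithUpstream) {
        return Objects.hash(nodeDescription, chainedWithUpstream);
    }

    // V3-style toy hash: deliberately ignores chaining, so the ID
    // survives a chain break (e.g. caused by a parallelism change).
    static int hashV3(String nodeDescription) {
        return Objects.hash(nodeDescription);
    }

    public static void main(String[] args) {
        int chained = hashV2("map-after-source", true);
        int unchained = hashV2("map-after-source", false);
        // Breaking the chain changed the V2-style ID, so old state
        // keyed by the previous ID can no longer be mapped.
        System.out.println("V2-style ID stable: " + (chained == unchained)); // false

        // The chaining-agnostic ID is unaffected by the chain break.
        System.out.println("V3-style ID stable: "
                + (hashV3("map-after-source") == hashV3("map-after-source"))); // true
    }
}
```

Since restoring from a snapshot looks up state by operator ID, keeping the ID stable across chaining changes is exactly what makes parallelism tuning state-compatible in this story.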

Best,
Zhanghao Chen

From: Chesnay Schepler 
Sent: Thursday, February 8, 2024 18:12
To: dev@flink.apache.org ; Martijn Visser 

Cc: Yu Chen 
Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for 
improved state compatibility on parallelism change

How exactly are you tuning SQL jobs without compiled plans while
ensuring that the resulting compiled plans are compatible? That's
explicitly not supported by Flink, hence why CompiledPlans exist.
If you change _anything_ the planner is free to generate a completely
different plan, where you have no guarantees that you can map the state
between one another.

On 08/02/2024 09:42, Martijn Visser wrote:
> Hi,
>
>> However, compiled plan is still too complicated for Flink newbies from my 
>> point of view.
> I don't think that the compiled plan was ever positioned to be a
> simple solution. If you want to have an easy approach, we have a
> declarative solution in place with SQL and/or the Table API imho.
>
> Best regards,
>
> Martijn
>
> On Thu, Feb 8, 2024 at 9:14 AM Zhanghao Chen  
> wrote:
>> Hi Piotr,
>>
>> Thanks for the comment. I agree that compiled plan is the ultimate tool for 
>> Flink SQL if one wants to make any changes to
>> query later, and this FLIP indeed is not essential in this sense. However, 
>> compiled plan is still too complicated for Flink newbies from my point of 
>> view. As I mentioned previously, our internal platform provides a visualized 
>> tool for editing the compiled plan but most users still find it complex. 
>> Therefore, the FLIP can still benefit users with better usability and the 
>> proposed changes are actually quite lightweight (just copying a new hasher 
>> with 2 lines deleted + extending the OperatorIdPair data structure) without 
>> much extra effort.
>>
>> Best,
>> Zhanghao Chen
>> 
>> From: Piotr Nowojski 
>> Sent: Thursday, February 8, 2024 14:50
>> To: Zhanghao Chen 
>> Cc: Chesnay Schepler ; dev@flink.apache.org 
>> ; Yu Chen 
>> Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation 
>> for improved state compatibility on parallelism change
>>
>> Hey
>>
>>> AFAIK, there's no way to set UIDs for a SQL job,
>> AFAIK you can't set UID manually, but  Flink SQL generates a compiled plan
>> of a query with embedded UIDs. As I understand it, using a compiled plan is
>> the preferred (only?) way for Flink SQL if one wants to make any changes to
>> query later on or support Flink's runtime upgrades, without losing the
>> state.
>>
>> If that's the case, what would be the usefulness of this FLIP? Only for
>> DataStream API for users that didn't know that they should have manually
>> configured UIDs? But they have the workaround to actually post-factum add
>> the UIDs anyway, right? So maybe indeed Chesnay is right that this FLIP is
>> not that helpful/worth the extra effort?
>>
>> Best,
>> Piotrek
>>
>> Thu, 8 Feb 2024 at 03:55, Zhanghao Chen 
>> wrote:
>>
>>> Hi Chesnay,
>>>
>>> AFAIK, there's no way to set UIDs for a SQL job, it'll be great if you can
>>> share how you allow UID setting for SQL jobs. We've explored providing a
>>> visualized DAG editor for SQL jobs that allows UID setting on our internal
>>> platform, but most users found it too complicated to use. Another
>>> possible way is to utilize SQL hints, but that's complicated as well. From
>>> our experience, many SQL users are not familiar with Flink, what they want
>>> is an experience similar to writing a normal SQL in MySQL, without
>>> involving much extra concepts like the DAG and the UID. In fact, some
>>> DataStream and PyFlink users also share the same concern.
>>>
>>> On the other hand, some performance-tuning is

Re: [ANNOUNCE] New Apache Flink Committer - Jiabao Sun

2024-02-19 Thread Zhanghao Chen
Congrats, Jiabao!

Best,
Zhanghao Chen

From: Qingsheng Ren 
Sent: Monday, February 19, 2024 17:53
To: dev ; jiabao...@apache.org 
Subject: [ANNOUNCE] New Apache Flink Committer - Jiabao Sun

Hi everyone,

On behalf of the PMC, I'm happy to announce Jiabao Sun as a new Flink
Committer.

Jiabao began contributing in August 2022 and has contributed 60+ commits
to the Flink main repo and various connectors. His most notable contribution
is being the core author and maintainer of the MongoDB connector, which is
fully functional in the DataStream and Table/SQL APIs. Jiabao is also the
author of FLIP-377 and the main contributor to the JUnit 5 migration in the
runtime and table planner modules.

Beyond his technical contributions, Jiabao is an active member of our
community, participating in the mailing list and consistently volunteering
for release verifications and code reviews with enthusiasm.

Please join me in congratulating Jiabao for becoming an Apache Flink
committer!

Best,
Qingsheng (on behalf of the Flink PMC)


Re: [VOTE] FLIP-314: Support Customized Job Lineage Listener

2024-02-28 Thread Zhanghao Chen
+1 (non-binding)

Best,
Zhanghao Chen

From: Yong Fang 
Sent: Wednesday, February 28, 2024 10:12
To: dev 
Subject: [VOTE] FLIP-314: Support Customized Job Lineage Listener

Hi devs,

I would like to restart a vote about FLIP-314: Support Customized Job
Lineage Listener[1].

Previously, we added lineage related interfaces in FLIP-314. Before the
interfaces were developed and merged into the master, @Maciej and
@Zhenqiu provided valuable suggestions for the interface from the
perspective of the lineage system. So we updated the interfaces of FLIP-314
and discussed them again in the discussion thread [2].

So I am here to initiate a new vote on FLIP-314. The vote will be open for
at least 72 hours unless there is an objection or insufficient votes.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-314%3A+Support+Customized+Job+Lineage+Listener
[2] https://lists.apache.org/thread/wopprvp3ww243mtw23nj59p57cghh7mc

Best,
Fang Yong


Re: [ANNOUNCE] Apache Paimon is graduated to Top Level Project

2024-03-28 Thread Zhanghao Chen
Congratulations!

Best,
Zhanghao Chen

From: Yu Li 
Sent: Thursday, March 28, 2024 15:55
To: d...@paimon.apache.org 
Cc: dev ; user 
Subject: Re: [ANNOUNCE] Apache Paimon is graduated to Top Level Project

CC the Flink user and dev mailing list.

Paimon originated within the Flink community, initially known as Flink
Table Store, and all our incubating mentors are members of the Flink
Project Management Committee. I am confident that the bonds of
enduring friendship and close collaboration will continue to unite the
two communities.

And congratulations all!

Best Regards,
Yu

On Wed, 27 Mar 2024 at 20:35, Guojun Li  wrote:
>
> Congratulations!
>
> Best,
> Guojun
>
> On Wed, Mar 27, 2024 at 5:24 PM wulin  wrote:
>
> > Congratulations~
> >
> > > On Mar 27, 2024, at 15:54, 王刚 wrote:
> > >
> > > Congratulations~
> > >
> > >> On Mar 26, 2024, at 10:25, Jingsong Li wrote:
> > >>
> > >> Hi Paimon community,
> > >>
> > >> I’m glad to announce that the ASF board has approved a resolution to
> > >> graduate Paimon into a full Top Level Project. Thanks to everyone for
> > >> your help to get to this point.
> > >>
> > >> I just created an issue to track the things we need to modify [2],
> > >> please comment on it if you feel that something is missing. You can
> > >> refer to apache documentation [1] too.
> > >>
> > >> And, we already completed the GitHub repo migration [3], please update
> > >> your local git repo to track the new repo [4].
> > >>
> > >> You can run the following command to complete the remote repo tracking
> > >> migration.
> > >>
> > >> git remote set-url origin https://github.com/apache/paimon.git
> > >>
> > >> If you have a different name, please change the 'origin' to your remote
> > name.
> > >>
> > >> Please join me in celebrating!
> > >>
> > >> [1]
> > https://incubator.apache.org/guides/transferring.html#life_after_graduation
> > >> [2] https://github.com/apache/paimon/issues/3091
> > >> [3] https://issues.apache.org/jira/browse/INFRA-25630
> > >> [4] https://github.com/apache/paimon
> > >>
> > >> Best,
> > >> Jingsong Lee
> >
> >


Re: [ANNOUNCE] New Apache Flink Committer - Zakelly Lan

2024-04-16 Thread Zhanghao Chen
Congrats, Zakelly!

Best,
Zhanghao Chen

From: Shawn Huang 
Sent: Wednesday, April 17, 2024 9:59
To: dev@flink.apache.org 
Subject: Re: [ANNOUNCE] New Apache Flink Committer - Zakelly Lan

Congratulations, Zakelly!

Best,
Shawn Huang


On Wed, Apr 17, 2024 at 00:16, Feng Jin wrote:

> Congratulations!
>
> Best,
> Feng
>
> On Tue, Apr 16, 2024 at 10:43 PM Ferenc Csaky 
> wrote:
>
> > Congratulations!
> >
> > Best,
> > Ferenc
> >
> >
> >
> >
> > On Tuesday, April 16th, 2024 at 16:28, Jeyhun Karimov <
> > je.kari...@gmail.com> wrote:
> >
> > >
> > >
> > > Congratulations Zakelly!
> > >
> > > Regards,
> > > Jeyhun
> > >
> > > On Tue, Apr 16, 2024 at 6:35 AM Feifan Wang zoltar9...@163.com wrote:
> > >
> > > > Congratulations, Zakelly!
> > > >
> > > > Best regards,
> > > >
> > > > Feifan Wang
> > > >
> > > > At 2024-04-15 10:50:06, "Yuan Mei" yuanmei.w...@gmail.com wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > On behalf of the PMC, I'm happy to let you know that Zakelly Lan
> has
> > > > > become
> > > > > a new Flink Committer!
> > > > >
> > > > > Zakelly has been continuously contributing to the Flink project
> since
> > > > > 2020,
> > > > > with a focus area on Checkpointing, State as well as frocksdb (the
> > default
> > > > > on-disk state db).
> > > > >
> > > > > He leads several FLIPs to improve checkpoints and state APIs,
> > including
> > > > > File Merging for Checkpoints and configuration/API reorganizations.
> > He is
> > > > > also one of the main contributors to the recent efforts of
> > "disaggregated
> > > > > state management for Flink 2.0" and drives the entire discussion in
> > the
> > > > > mailing thread, demonstrating outstanding technical depth and
> > breadth of
> > > > > knowledge.
> > > > >
> > > > > Beyond his technical contributions, Zakelly is passionate about
> > helping
> > > > > the
> > > > > community in numerous ways. He spent quite some time setting up the
> > Flink
> > > > > Speed Center and rebuilding the benchmark pipeline after the
> > original one
> > > > > was out of lease. He helps build frocksdb and tests for the
> upcoming
> > > > > frocksdb release (bump rocksdb from 6.20.3->8.10).
> > > > >
> > > > > Please join me in congratulating Zakelly for becoming an Apache
> Flink
> > > > > committer!
> > > > >
> > > > > Best,
> > > > > Yuan (on behalf of the Flink PMC)
> >
>


[DISCUSS] FLIP-460: Display source/sink I/O metrics on Flink Web UI

2024-05-28 Thread Zhanghao Chen
Hi all,

I'd like to start a discussion on FLIP-460: Display source/sink I/O metrics on 
Flink Web UI [1].

Currently, the numRecordsIn & numBytesIn metrics for sources and the 
numRecordsOut & numBytesOut metrics for sinks are always 0 on the Flink web 
dashboard. It is especially confusing for simple ETL jobs where there's a 
single chained operator with 0 input rate and 0 output rate. For years, Flink 
newbies have been asking "Why my job has zero consumption rate and zero 
production rate, is it actually working?"

Connectors implementing FLIP-33 [2] have already exposed these metrics on the 
operator level, this FLIP takes a further step to expose them on the job 
overview page on Flink Web UI.

Looking forward to everyone's feedback and suggestions. Thanks!

[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=309496355
[2] 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-33%3A+Standardize+Connector+Metrics

Best,
Zhanghao Chen


Re: [ANNOUNCE] New Apache Flink PMC Member - Weijie Guo

2024-06-04 Thread Zhanghao Chen
Congrats, Weijie!

Best,
Zhanghao Chen

From: Hang Ruan 
Sent: Tuesday, June 4, 2024 16:37
To: dev@flink.apache.org 
Subject: Re: [ANNOUNCE] New Apache Flink PMC Member - Weijie Guo

Congratulations Weijie!

Best,
Hang

On Tue, Jun 4, 2024 at 16:24, Yanfei Lei wrote:

> Congratulations!
>
> Best,
> Yanfei
>
> On Tue, Jun 4, 2024 at 16:20, Leonard Xu wrote:
> >
> > Congratulations!
> >
> > Best,
> > Leonard
> >
> > > On Jun 4, 2024, at 16:02, Yangze Guo wrote:
> > >
> > > Congratulations!
> > >
> > > Best,
> > > Yangze Guo
> > >
> > > On Tue, Jun 4, 2024 at 4:00 PM Weihua Hu 
> wrote:
> > >>
> > >> Congratulations, Weijie!
> > >>
> > >> Best,
> > >> Weihua
> > >>
> > >>
> > >> On Tue, Jun 4, 2024 at 3:03 PM Yuxin Tan 
> wrote:
> > >>
> > >>> Congratulations, Weijie!
> > >>>
> > >>> Best,
> > >>> Yuxin
> > >>>
> > >>>
> > >>> On Tue, Jun 4, 2024 at 14:57, Yuepeng Pan wrote:
> > >>>
> > >>>> Congratulations !
> > >>>>
> > >>>>
> > >>>> Best,
> > >>>> Yuepeng Pan
> > >>>>
> > >>>> At 2024-06-04 14:45:45, "Xintong Song" 
> wrote:
> > >>>>> Hi everyone,
> > >>>>>
> > >>>>> On behalf of the PMC, I'm very happy to announce that Weijie Guo
> has
> > >>>> joined
> > >>>>> the Flink PMC!
> > >>>>>
> > >>>>> Weijie has been an active member of the Apache Flink community for
> many
> > >>>>> years. He has made significant contributions in many components,
> > >>> including
> > >>>>> runtime, shuffle, sdk, connectors, etc. He has driven /
> participated in
> > >>>>> many FLIPs, authored and reviewed hundreds of PRs, been
> consistently
> > >>>> active
> > >>>>> on mailing lists, and also helped with release management of 1.20
> and
> > >>>>> several other bugfix releases.
> > >>>>>
> > >>>>> Congratulations and welcome Weijie!
> > >>>>>
> > >>>>> Best,
> > >>>>>
> > >>>>> Xintong (on behalf of the Flink PMC)
> > >>>>
> > >>>
> >
>


Re: [ANNOUNCE] New Apache Flink PMC Member - Fan Rui

2024-06-05 Thread Zhanghao Chen
Congrats, Rui!

Best,
Zhanghao Chen

From: Piotr Nowojski 
Sent: Wednesday, June 5, 2024 18:01
To: dev ; rui fan <1996fan...@gmail.com>
Subject: [ANNOUNCE] New Apache Flink PMC Member - Fan Rui

Hi everyone,

On behalf of the PMC, I'm very happy to announce another new Apache Flink
PMC Member - Fan Rui.

Rui has been active in the community since August 2019. During this time he
has contributed a lot of new features. Among others:
  - Decoupling Autoscaler from Kubernetes Operator, and supporting
Standalone Autoscaler
  - Improvements to checkpointing, flamegraphs, restart strategies,
watermark alignment, network shuffles
  - Optimizing the memory and CPU usage of large operators, greatly
reducing the risk and probability of TaskManager OOM

He reviewed a significant amount of PRs and has been active both on the
mailing lists and in Jira helping to both maintain and grow Apache Flink's
community. He is also our current Flink 1.20 release manager.

In the last 12 months, Rui has been the most active contributor in the
Flink Kubernetes Operator project, while being the 2nd most active Flink
contributor at the same time.

Please join me in welcoming and congratulating Fan Rui!

Best,
Piotrek (on behalf of the Flink PMC)


Re: Poor Load Balancing across TaskManagers for Multiple Kafka Sources

2024-06-05 Thread Zhanghao Chen
Hi Kevin,

The problem here is how to evenly distribute partitions from multiple 
Kafka topics to tasks, while FLIP-370 is only concerned with how to evenly 
distribute tasks to slots & TaskManagers, so FLIP-370 won't help here.

Best,
Zhanghao Chen

From: Kevin Lam 
Sent: Thursday, June 6, 2024 2:32
To: dev@flink.apache.org 
Subject: Poor Load Balancing across TaskManagers for Multiple Kafka Sources

Hey all,

I'm seeing an issue with poor load balancing across TaskManagers for Kafka
Sources using the Flink SQL API and wondering if FLIP-370 will help with
it, or if not, interested in any ideas the community has to mitigate the
issue.

The Kafka SplitEnumerator uses the following logic to assign split owners (code
pointer
<https://github.com/apache/flink-connector-kafka/blob/00c9c8c74121136a0c1710ac77f307dc53adae99/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/source/enumerator/KafkaSourceEnumerator.java#L469>
):

```
static int getSplitOwner(TopicPartition tp, int numReaders) {
    int startIndex = ((tp.topic().hashCode() * 31) & 0x7FFFFFFF) % numReaders;
    return (startIndex + tp.partition()) % numReaders;
}
```

However, this can result in an imbalanced distribution of Kafka partition
consumers across TaskManagers.

To illustrate, I created a pipeline that consumes from 2 Kafka topics, each
with 8 partitions, and just sinks them to a blackhole connector sink. For a
parallelism of 16 and 1 task slot per TaskManager, we'd ideally expect each
TaskManager to get its own Kafka partition, i.e., 16 partitions (8 from each
topic) split evenly across TaskManagers. However, due to the algorithm I
linked and how the startIndex is computed, I have observed a bunch of
TaskManagers with 2 partitions (one from each topic) and some TaskManagers
completely idle.
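The skew is easy to reproduce with a small offline simulation of the assignment function (the topic names here are hypothetical, the mask is the full 0x7FFFFFFF sign-bit mask from the linked source, and the exact skew depends on the topics' hash codes):

```java
import java.util.HashMap;
import java.util.Map;

public class SplitOwnerSkewDemo {
    // Mirrors the getSplitOwner logic quoted above, taking the topic name
    // and partition index directly instead of a TopicPartition.
    static int getSplitOwner(String topic, int partition, int numReaders) {
        int startIndex = ((topic.hashCode() * 31) & 0x7FFFFFFF) % numReaders;
        return (startIndex + partition) % numReaders;
    }

    public static void main(String[] args) {
        int numReaders = 16;
        Map<Integer, Integer> splitsPerReader = new HashMap<>();
        // Two topics with 8 partitions each: 16 splits for 16 readers.
        for (String topic : new String[] {"topic-a", "topic-b"}) {
            for (int p = 0; p < 8; p++) {
                splitsPerReader.merge(
                        getSplitOwner(topic, p, numReaders), 1, Integer::sum);
            }
        }
        // Each topic occupies 8 consecutive reader indices starting at its
        // startIndex; unless the two start indices differ by exactly 8
        // (mod 16), the ranges overlap: some readers own 2 splits, others none.
        System.out.println("readers with at least one split: "
                + splitsPerReader.size() + " of " + numReaders);
    }
}
```

This shows why the assignment is only balanced per topic: partitions of one topic land on consecutive readers, but nothing spreads different topics apart.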

I've also run an experiment with the same pipeline where I set the parallelism
such that each TaskManager gets exactly 1 partition, and compared it against
the case where each TaskManager gets exactly 2 partitions (one from each
topic). I ensured this was the case by setting an appropriate parallelism,
and ran the jobs on an application cluster. Since the partitions are fixed,
the extra parallelism, if any, isn't used. The case with exactly 1 partition
per TaskManager processes a fixed set of data 20% faster.

I was reading FLIP-370
<https://cwiki.apache.org/confluence/display/FLINK/FLIP-370%3A+Support+Balanced+Tasks+Scheduling>
and understand it will improve task scheduling in certain scenarios. Will
FLIP-370 help with this KafkaSource scenario? If not, are there any ideas to
improve the subtask scheduling for KafkaSources? Ideally we wouldn't need to
carefully consider the partition and resulting task distribution when
selecting our parallelism values.

Thanks for your help!


Re: [June 15 Feature Freeze][SUMMARY] Flink 1.20 Release Sync 11/06/2024

2024-06-11 Thread Zhanghao Chen
Hi Rui,

Thanks for the summary! A quick update here: FLIP-398 was decided not to go 
into 1.20, as we just found that adding dedicated serialization support for 
Maps, Sets, and Lists would break state compatibility. I will revert the 
relevant changes soon.

Best,
Zhanghao Chen

From: Rui Fan <1996fan...@gmail.com>
Sent: Wednesday, June 12, 2024 12:59
To: dev 
Subject: [June 15 Feature Freeze][SUMMARY] Flink 1.20 Release Sync 11/06/2024

Dear devs,

This is the sixth meeting for Flink 1.20 release[1] cycle.

I'd like to share the information synced in the meeting.

- Feature Freeze

It is worth noting that there are only 3 days left until the
feature freeze time (June 15, 2024, 00:00 CEST (UTC+2)),
and developers need to pay attention to it.

After checking with all contributors of 1.20 FLIPs, we don't need
to postpone the feature freeze time. Please reply to this email
if there are other valuable features that should be merged in 1.20, thanks.

- Features:

So far we've had 16 flips/features:
- 6 flips/features are done
- 8 flips/features are doing and release managers checked with
corresponding contributors
  - 7 of these flips/features can be completed before June 15, 2024, 00:00
CEST(UTC+2)
  - We were unable to contact the contributor of FLIP-436
- 2 flips/features won't make it into 1.20

- Blockers:

We don't have any blockers right now. Thanks to everyone who fixed blockers
before.

- Sync meeting[2]:

The next meeting is 18/06/2024 10am (UTC+2) and 4pm (UTC+8), please
feel free to join us.

Lastly, we encourage attendees to fill out the topics to be discussed at
the bottom of 1.20 wiki page[1] a day in advance, to make it easier for
everyone to understand the background of the topics, thanks!

[1] https://cwiki.apache.org/confluence/display/FLINK/1.20+Release
[2] https://meet.google.com/mtj-huez-apu

Best,
Robert, Weijie, Ufuk and Rui


Re: [DISCUSS] FLIP-462: Support Custom Data Distribution for Input Stream of Lookup Join

2024-06-11 Thread Zhanghao Chen
Thanks for driving this, Weijie. Usually, the data distribution of the external 
system is closely related to the keys, e.g., computing the bucket index by key 
hashcode % bucket num, so I'm not sure how much difference there is 
between partitioning by key and a custom partitioning strategy. Could you give 
a more concrete example in production of when a custom partitioning strategy 
will outperform partitioning by key? Since you've mentioned Paimon in the doc, 
maybe an example on Paimon.

Best,
Zhanghao Chen

From: weijie guo 
Sent: Friday, June 7, 2024 9:59
To: dev 
Subject: [DISCUSS] FLIP-462: Support Custom Data Distribution for Input Stream 
of Lookup Join

Hi devs,


I'd like to start a discussion about FLIP-462[1]: Support Custom Data
Distribution for Input Stream of Lookup Join.


Lookup Join is an important feature in Flink. It is typically used to
enrich a table with data that is queried from an external system.
If we interact with the external system for each incoming record, we
incur significant network I/O and RPC overhead.

Therefore, most connectors introduce caching to reduce the per-record
level query overhead. However, because the data distribution of Lookup
Join's input stream is arbitrary, the cache hit rate is sometimes
unsatisfactory.


We want to introduce a mechanism for the connector to tell the Flink
planner its desired input stream data distribution or partitioning
strategy. This can significantly reduce the amount of cached data and
improve performance of Lookup Join.


You can find more details in this FLIP[1]. Looking forward to hearing
from you, thanks!


Best regards,

Weijie


[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-462+Support+Custom+Data+Distribution+for+Input+Stream+of+Lookup+Join


Re: [DISCUSS] FLIP-462: Support Custom Data Distribution for Input Stream of Lookup Join

2024-06-11 Thread Zhanghao Chen
Thanks for the clarification, that makes sense. +1 for the proposal.

Best,
Zhanghao Chen

From: weijie guo 
Sent: Wednesday, June 12, 2024 14:20
To: dev@flink.apache.org 
Subject: Re: [DISCUSS] FLIP-462: Support Custom Data Distribution for Input 
Stream of Lookup Join

Hi Zhanghao,

Thanks for the reply!

> Could you give a more concrete example in production on when a custom
partitioning strategy will outperform partitioning by key

The key point here is that the partitioning logic cannot be fully expressed
with all or part of the join key. That is, even if we know which fields are
used to calculate buckets, we still have to face the following problems:

1. The mapping from the bucket field to the bucket id is not necessarily
done via hashcode, and even if it is, the hash algorithm may be different
from the one used in Flink. The planner can't know how to do this mapping.
2. In order to get the bucket id, we have to mod by the bucket number, but
the planner has no notion of the bucket number.
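A self-contained illustration of point 1, using hypothetical hash functions rather than the actual Flink or Paimon algorithms: even when the bucket field is known, keys that share a bucket in the external system can still be routed to different subtasks by Flink's own key hashing, so each subtask ends up caching rows from many buckets.

```java
public class BucketMismatchDemo {
    // External system: its own hash of the key, mod the bucket number.
    // (Hypothetical; real systems such as Paimon use their own algorithms.)
    static int externalBucket(String key, int numBuckets) {
        return (key.hashCode() & 0x7FFFFFFF) % numBuckets;
    }

    // Flink-style keyBy: a different hash over the same key. This stand-in
    // just perturbs hashCode; Flink actually uses murmur-based key groups.
    static int flinkSubtask(String key, int parallelism) {
        int h = key.hashCode();
        h ^= (h >>> 16);
        return (h & 0x7FFFFFFF) % parallelism;
    }

    public static void main(String[] args) {
        int buckets = 4, parallelism = 4;
        for (String key : new String[] {"user-1", "user-2", "user-3"}) {
            System.out.printf("%s -> external bucket %d, flink subtask %d%n",
                    key, externalBucket(key, buckets), flinkSubtask(key, parallelism));
        }
        // The two mappings generally disagree, so rows of one external
        // bucket are scattered across subtasks.
    }
}
```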



Best regards,

Weijie


Zhanghao Chen  于2024年6月12日周三 13:55写道:

> Thanks for driving this, Weijie. Usually, the data distribution of the
> external system is closely related to the keys, e.g. computing the bucket
> index by key hashcode % bucket num, so I'm not sure about how much
> difference there are between partitioning by key and a custom partitioning
> strategy. Could you give a more concrete example in production on when a
> custom partitioning strategy will outperform partitioning by key? Since
> you've mentioned Paimon in doc, maybe an example on Paimon.
>
> Best,
> Zhanghao Chen
> 
> From: weijie guo 
> Sent: Friday, June 7, 2024 9:59
> To: dev 
> Subject: [DISCUSS] FLIP-462: Support Custom Data Distribution for Input
> Stream of Lookup Join
>
> Hi devs,
>
>
> I'd like to start a discussion about FLIP-462[1]: Support Custom Data
> Distribution for Input Stream of Lookup Join.
>
>
> Lookup Join is an important feature in Flink, It is typically used to
> enrich a table with data that is queried from an external system.
> If we interact with the external systems for each incoming record, we
> incur significant network IO and RPC overhead.
>
> Therefore, most connectors introduce caching to reduce the per-record
> level query overhead. However, because the data distribution of Lookup
> Join's input stream is arbitrary, the cache hit rate is sometimes
> unsatisfactory.
>
>
> We want to introduce a mechanism for the connector to tell the Flink
> planner its desired input stream data distribution or partitioning
> strategy. This can significantly reduce the amount of cached data and
> improve performance of Lookup Join.
>
>
> You can find more details in this FLIP[1]. Looking forward to hearing
> from you, thanks!
>
>
> Best regards,
>
> Weijie
>
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-462+Support+Custom+Data+Distribution+for+Input+Stream+of+Lookup+Join
>


Re: [VOTE] FLIP-462: Support Custom Data Distribution for Input Stream of Lookup Join

2024-06-16 Thread Zhanghao Chen
+1 (unbinding)

Best,
Zhanghao Chen

From: weijie guo 
Sent: Monday, June 17, 2024 10:13
To: dev 
Subject: [VOTE] FLIP-462: Support Custom Data Distribution for Input Stream of 
Lookup Join

Hi everyone,


Thanks for all the feedback about the FLIP-462: Support Custom Data
Distribution for Input Stream of Lookup Join [1]. The discussion
thread is here [2].


The vote will be open for at least 72 hours unless there is an
objection or insufficient votes.


Best,

Weijie



[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-462+Support+Custom+Data+Distribution+for+Input+Stream+of+Lookup+Join


[2] https://lists.apache.org/thread/kds2zrcdmykrz5lmn0hf9m4phdl60nfb


Re: [VOTE] FLIP-461: Synchronize rescaling with checkpoint creation to minimize reprocessing for the AdaptiveScheduler

2024-06-17 Thread Zhanghao Chen
+1 (non-binding)

Best,
Zhanghao Chen

From: Matthias Pohl 
Sent: Monday, June 17, 2024 16:24
To: dev@flink.apache.org 
Subject: [VOTE] FLIP-461: Synchronize rescaling with checkpoint creation to 
minimize reprocessing for the AdaptiveScheduler

Hi everyone,
the discussion in [1] about FLIP-461 [2] is kind of concluded. I am
starting a vote on this one here.

The vote will be open for at least 72 hours (i.e. until June 20, 2024;
8:30am UTC) unless there are any objections. The FLIP will be considered
accepted if 3 binding votes (from active committers according to the Flink
bylaws [3]) are gathered by the community.

Best,
Matthias

[1] https://lists.apache.org/thread/nnkonmsv8xlk0go2sgtwnphkhrr5oc3y
[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler
[3]
https://cwiki.apache.org/confluence/display/FLINK/Flink+Bylaws#FlinkBylaws-Approvals


Re: [DISCUSS] FLIP-460: Display source/sink I/O metrics on Flink Web UI

2024-07-15 Thread Zhanghao Chen
Thanks, all. As the discussion has concluded, I'll start voting in a 
separate thread.

From: Leonard Xu 
Sent: Monday, July 15, 2024 11:22
To: dev 
Subject: Re: [DISCUSS] FLIP-460: Display source/sink I/O metrics on Flink Web UI

Thanks to Zhanghao for kicking off the thread.

>  It is especially confusing for simple ETL jobs where there's a single 
> chained operator with 0 input rate and 0 output rate.


This case has confused Flink users for a long time, +1 for the FLIP.


Best,
Leonard



[VOTE] FLIP-460: Display source/sink I/O metrics on Flink Web UI

2024-07-15 Thread Zhanghao Chen
Hi everyone,


Thanks for all the feedback about the FLIP-460: Display source/sink I/O metrics 
on Flink Web UI [1]. The discussion
thread is here [2]. I'd like to start a vote on it.

The vote will be open for at least 72 hours unless there is an objection or 
insufficient votes.

[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=309496355
[2] https://lists.apache.org/thread/sy271nhd2jr1r942f29xbvbgq7fsd841

Best,
Zhanghao Chen


[RESULT][VOTE] FLIP-460: Display source/sink I/O metrics on Flink Web UI

2024-07-18 Thread Zhanghao Chen
Hi everyone,

I'm delighted to announce that FLIP-460 [1] has been accepted.

There were 13 votes in favor:
- Yong Fang (binding)
- Robert Metzger (binding)
- Leonard Xu (binding)
- Becket Qin (binding)
- Ahmed Hamdy (non-binding)
- Aleksandr Pilipenko (non-binding)
- Xintong Song (binding)
- Rui Fan (binding)
- Zakelly Lan (binding)
- Yanquan Lv (non-binding)
- Hang Ruan (binding)
- Xiqian Yu (non-binding)
- Yuepeng Pan (non-binding)

There were no votes against.

[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=309496355

Best,
Zhanghao Chen


Re: [VOTE] FLIP-468: Introducing StreamGraph-Based Job Submission

2024-07-26 Thread Zhanghao Chen
Thanks for driving it. +1 (non-binding)

Best,
Zhanghao Chen

From: Junrui Lee 
Sent: Friday, July 26, 2024 11:02
To: dev@flink.apache.org 
Subject: [VOTE] FLIP-468: Introducing StreamGraph-Based Job Submission

Hi everyone,

Thanks for all the feedback about FLIP-468: Introducing StreamGraph-Based
Job Submission [1]. The discussion thread can be found here [2].

The vote will be open for at least 72 hours unless there are any objections
or insufficient votes.

Best,

Junrui

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-468%3A+Introducing+StreamGraph-Based+Job+Submission

[2] https://lists.apache.org/thread/l65mst4prx09rmloydn64z1w1zp8coyz


Re: [ANNOUNCE] New Apache Flink Committer - Junrui Li

2024-11-05 Thread Zhanghao Chen
Congrats, Junrui!


Best,
Zhanghao Chen

From: Zhu Zhu 
Sent: Tuesday, November 5, 2024 19:59
To: dev ; Junrui Lee 
Subject: [ANNOUNCE] New Apache Flink Committer - Junrui Li

Hi everyone,

On behalf of the PMC, I'm happy to announce that Junrui Li has become a
new Flink Committer!

Junrui has been an active contributor to the Apache Flink project for two
years. He has been the driver and major developer of 8 FLIPs and contributed
100+ commits with tens of thousands of lines of code.

His contributions mainly focus on enhancing Flink batch execution
capabilities, including enabling parallelism inference by default(FLIP-283),
supporting progress recovery after JM failover(FLIP-383), and supporting
adaptive optimization of logical execution plan (FLIP-468/469). Furthermore,
Junrui did a lot of work to improve Flink's configuration layer, addressing
technical debt and enhancing its user-friendliness. He is also active in
mailing lists, participating in discussions and answering user questions.

Please join me in congratulating Junrui Li for becoming an Apache Flink
committer.

Best,
Zhu (on behalf of the Flink PMC)


Re: [DISCUSS] Is it a bug that the AdaptiveScheduler does not prioritize releasing TaskManagers during downscaling in Application mode?

2025-02-04 Thread Zhanghao Chen
Thanks for the effort, Yuepeng! The approach LGTM.

Best,
Zhanghao Chen

From: Yuepeng Pan 
Sent: Saturday, January 25, 2025 21:36
To: dev@flink.apache.org 
Subject: Re: [DISCUSS] Is it a bug that the AdaptiveScheduler does not 
prioritize releasing TaskManagers during downscaling in Application mode?

Hi all,

This email hasn't received further responses over the past few days,
and as such, we currently lack sufficient feedback to draw a definitive 
conclusion.
I'd like to proceed by merging both
approaches as much as possible to reach a final consensus.
I'll move forward with this plan to address
the issue and update the PR[1] accordingly:

> - It's agreed to optimize/fix this issue in the 1.x LTS versions.
> - The primary goal of this optimization/fix is to minimize the number of 
> TaskManagers used in application mode only.
> - The optimized logic will be set as the default logic, and the original 
> logic will be retained when explicitly enabled through a parameter.

Please rest assured, I am not rushing ahead recklessly.
If there is anything inappropriate about this conclusion/measure or the 
reasoning supporting it,
I will promptly halt and revise the action.

> a. The default behavior of this new parameter (the optimized logic) aligns
> with the expected behavior of most application deployment mode users
> who would have this behavior enabled by default.
> Therefore, it doesn't add complexity in terms of configuration for them.
> b. However, this would require a small number of users who want to keep the 
> original behavior to actively configure this setting.
> So, this still gives users the flexibility to choose.
> c. Since this issue is only fixed by adopting this solution in version 1.x 
> LTS with application deployment mode,
> this parameter doesn't have a plan for forward compatibility, and a new 
> parameter would also be acceptable to me.

Thank you all very much for your attention and help.

Best,
Yuepeng Pan

[1] https://github.com/apache/flink/pull/25218


On 2025/01/20 01:56:36 Yuepeng Pan wrote:
> Hi, Maximilian, Rui, Matthias:
> Thanks for the response, which gives me a general understanding of your 
> proposed approach and its implementation outline.
>
> Hi, All:
> Thank you all very much for the discussion and suggestions.
>
> Based on the discussions we have received so far,
> we have reached a preliminary consensus:
>
> - When the job runs in an application cluster, the default behavior
> of the AdaptiveScheduler not actively releasing TaskManager resources
> during downscaling could be considered a
> bug (at least from certain perspectives, this is the case).
> - We should fix it in flink 1.x.
>
> However, there's still no consensus in the discussion on how to fix this 
> issue under the following conditions:
> - Flink 1.x series versions and Application deployment mode (it does not 
> apply to session cluster mode)
>
> Strategy list:
>
> 1). Adding this behavior while being guarded by a feature flag/configuration 
> parameter in the 1.x LTS version.
> (@Matthias If my understanding is incorrect, please correct me, thanks! )
> a. This enables the option for users to revert to the original behavior
>eg. when ignoring idle resource occupation and focusing only
>on the resource waiting time during rescaling, this can achieve some 
> positive impacts.
> b. Introducing new parameters increases the complexity for users,
>as Maximilian mentioned, we already have many parameters.
>
> 2). Set the behavior as the default without introducing new parameters in the 
> 1.x LTS version.
> a. Avoid introducing new parameters and reduce complexity for users.
> b. This disables the option for users to revert to the original behavior.
>
> We have to seek some trade-offs/change between the two options above in order 
> to make a choice and reach a consensus on the conclusion.
>
> Although Option-1) increases the complexity for users to use, I prefer to 
> Option-1) due to the following reasons if we could set the default behavior 
> in Option - 1) to the new behavior:
> a. This new parameter aligns with the expected behavior for most 
> application deployment mode users,
>who would have this behavior enabled by default.
>Therefore, it doesn't add complexity in terms of configuration for 
> them.
> b. However, this would require users who want to keep the original 
> behavior to actively configure this setting.
>So, this still gives users the flexibility to choose.
> c. Since this issue is only fixed by adopting this solution in version 
> 1.x LTS with application deployment mode,
> this parameter doesn't have a plan for forward compatibility, and a new 
> parameter would also be acceptable to me.

Re: [ANNOUNCE] New Apache Flink PMC Member - Zakelly Lan

2025-04-01 Thread Zhanghao Chen
Congrats, Zakelly!

Best,
Zhanghao Chen

From: Yuan Mei 
Sent: Tuesday, April 1, 2025 15:51
To: dev 
Subject: [ANNOUNCE] New Apache Flink PMC Member - Zakelly Lan

On behalf of the PMC, I'm very happy to announce a new Apache Flink PMC Member
- Zakelly Lan.

Zakelly has been a dedicated and invaluable contributor to the Flink
community for over five years. He became a committer early 2024 and is one
of our most active contributors in the last year.

Zakelly's expertise is in Flink state management and checkpointing. He has
developed several key features and FLIPs in this area. Notably, he has been
the driving force for the "Disaggregated State Management in Flink 2.0"
project, making it to a successful completion. Disaggregated State
Management is one of the most complicated features and important
improvements in 2.0.

He contributed and reviewed a significant number of PRs and has been active
both on the mailing lists and in Jira helping to both maintain and grow
Apache Flink's community.
In addition, Zakelly is the sole maintainer of the Flink benchmark
environment and helps monitor performance regressions. Zakelly is also
continuously helping to expand the Flink community. At FFA 2024 Asia, his
presentation on "Introducing ForSt DB: The Disaggregated State Store for
Flink 2.0" was well-received and highlighted his deep technical expertise.

Please join me in welcoming and congratulating Zakelly!

Best,
Yuan Mei (on behalf of the Flink PMC)


Re: [DISCUSS] FlinkSQL support enableObjectReuse to optimize CopyingChainingOutput

2025-04-23 Thread Zhanghao Chen
Simply setting config pipeline.object-reuse=true should work for that.

Best,
Zhanghao Chen

From: Winterchill <809025...@qq.com.INVALID>
Sent: Wednesday, April 23, 2025 20:12
To: dev 
Subject: [DISCUSS] FlinkSQL support enableObjectReuse to optimize 
CopyingChainingOutput

During our analysis of FlinkSQL, we found that the FlinkSQL data stream heavily 
uses `CopyingChainingOutput` for operator-to-operator data transmission.


In the process, we observed that some operators, such as `WatermarkAssigner`, 
do not necessarily require the deep copy logic of `CopyingChainingOutput`. 
Additionally, we noticed that FlinkSQL does not support the `enableObjectReuse` 
parameter in `DataStream`.


Based on this, we have the following questions:
1. Will FlinkSQL support `enableObjectReuse` in the future?
2. We would like try to modify FlinkSQL's internal logic to eliminate 
unnecessary deep copies. How can we do this safely? Do you have any suggestions?


Re: [DISCUSS] FlinkSQL support enableObjectReuse to optimize CopyingChainingOutput

2025-04-24 Thread Zhanghao Chen
You will get into trouble if some operator caches data locally and a downstream 
operator in the same operator chain modifies the data in place. AFAIK, all Flink SQL 
operators handle this properly (deep-copying when caching is needed), but you should 
be careful with UDFs and custom connectors.

Example: suppose we have two chained operators A->B with object reuse enabled, where 
A caches the data in a local hash map and B modifies data in place. Now, when a 
single row arrives, Op A first caches it in the map, the same object is 
directly passed to Op B, and Op B modifies it in place. Note that the entry 
cached in Op A is also updated, as the object is reused.
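A minimal, framework-free sketch of this hazard (plain Java, no Flink APIs): the map stands in for Op A's local cache, and the field assignment for Op B's in-place mutation of the reused record.

```java
import java.util.HashMap;
import java.util.Map;

public class ObjectReuseHazard {
    static class Row { long value; Row(long v) { this.value = v; } }

    static long cachedValueAfterMutation() {
        Map<String, Row> opACache = new HashMap<>();
        Row reused = new Row(1L);         // the single reused record object

        opACache.put("key", reused);      // Op A caches by reference (no copy)
        reused.value = 42L;               // Op B mutates the same object in place

        return opACache.get("key").value; // the cached entry changed too
    }

    public static void main(String[] args) {
        System.out.println(cachedValueAfterMutation()); // prints 42, not 1
    }
}
```

With object reuse disabled, Flink would hand Op B a deep copy instead, and the cached entry would keep its original value.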

Best,
Zhanghao Chen

From: Winterchill <809025...@qq.com.INVALID>
Sent: Thursday, April 24, 2025 19:01
To: dev 
Subject: Re: [DISCUSS] FlinkSQL support enableObjectReuse to optimize 
CopyingChainingOutput

Thanks. I read about it in the docs and have some questions:


https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/execution/execution_configuration/


The Flink documentation indicates that enabling `enableObjectReuse` may cause 
bugs, but the description is not detailed enough.
In what situations would it lead to bugs? Can we use this param in a production 
environment?



---Original---
From: "Zhanghao Chen"

Re: [VOTE] FLIP-506: Support Reuse Multiple Table Sinks in Planner

2025-02-23 Thread Zhanghao Chen
+1 (non-binding)

Thanks for driving this. It's a nice usability improvement for performing 
partial updates on data lakes.


Best,
Zhanghao Chen

From: xiangyu feng 
Sent: Sunday, February 23, 2025 10:44
To: dev@flink.apache.org 
Subject: [VOTE] FLIP-506: Support Reuse Multiple Table Sinks in Planner

Hi all,

I would like to start the vote for FLIP-506: Support Reuse Multiple Table
Sinks in Planner[1].
This FLIP was discussed in this thread [2].

The vote will be open for at least 72 hours unless there is an objection or
insufficient votes.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-506%3A+Support+Reuse+Multiple+Table+Sinks+in+Planner
[2] https://lists.apache.org/thread/r1wo9sf3d1725fhwzrttvv56k4rc782m

Regards,
Xiangyu Feng


Re: [DISCUSSION] Upgrade to Kryo 5 for Flink 2.0

2025-02-25 Thread Zhanghao Chen
Hi Gyula,

Thanks for bringing this up! Definitely +1 for upgrading Kryo in Flink 2.0. As 
a side note, it might be useful to introduce customizable generic serializer 
support like Spark, where you can switch to your own serializer via the 
"spark.serializer" [1] option. Users starting new applications could then 
introduce their own serialization stack to resolve the Java compatibility issue 
in this case, or for other performance reasons.

[1] https://spark.apache.org/docs/latest/configuration.html


Best,
Zhanghao Chen

From: Gyula Fóra 
Sent: Friday, February 21, 2025 14:04
To: dev 
Subject: [DISCUSSION] Upgrade to Kryo 5 for Flink 2.0

Hey all!

I would like to rekindle this discussion as it seems that it has stalled
several times in the past and we are nearing the point in time where the
decision has to be made with regards to 2.0. (we are already a bit late but
nevermind)

There has been numerous requests and efforts to upgrade Kryo to better
support newer Java versions and Java native types. I think we can all agree
that this change is inevitable one way or another.

The latest JIRA for this seems to be:
https://issues.apache.org/jira/browse/FLINK-3154

There is even an open PR that accomplishes this (currently in a state
incompatible way) but based on the discussion it seems that with some extra
complexity compatibility can even be preserved by having both the old and
new Kryo versions active at the same time.

The main question here is whether state compatibility is important for 2.0
with this regard or we want to bite the bullet and make this upgrade once
and for all.

Cheers,
Gyula


[jira] [Created] (FLINK-31936) Support setting scale up max factor

2023-04-25 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-31936:
-

 Summary: Support setting scale up max factor
 Key: FLINK-31936
 URL: https://issues.apache.org/jira/browse/FLINK-31936
 Project: Flink
  Issue Type: Improvement
  Components: Autoscaler
Reporter: Zhanghao Chen


Currently, only the scale-down max factor is configurable. We should also add 
a config for the scale-up max factor. In many cases, a job's performance won't 
improve after scaling up due to external bottlenecks. Although detecting 
ineffective scale-ups can block further scaling, it already hurts if we scale 
too much in a single step, which may even overwhelm external services.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-31991) Update Autoscaler doc to reflect the changes brought by the new source scaling logic

2023-05-03 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-31991:
-

 Summary: Update Autoscaler doc to reflect the changes brought by 
the new source scaling logic
 Key: FLINK-31991
 URL: https://issues.apache.org/jira/browse/FLINK-31991
 Project: Flink
  Issue Type: Improvement
  Components: Autoscaler
Reporter: Zhanghao Chen
 Attachments: image-2023-05-03-20-16-33-704.png

The current statements on job requirements are outdated:

- All sources must use the new Source API (most common connectors already do)
- Source scaling requires sources to expose the standardized connector 
metrics for accessing backlog information (source scaling can be disabled)

The Autoscaler doc needs to be updated to reflect the changes brought by the 
new source scaling logic (FLINK-31326: Disabled source scaling breaks 
downstream scaling if source busyTimeMsPerSecond is 0).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32037) The edge on Web UI is wrong after parallelism changes via parallelism overrides or AdaptiveScheduler

2023-05-08 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-32037:
-

 Summary: The edge on Web UI is wrong after parallelism changes via 
parallelism overrides or AdaptiveScheduler 
 Key: FLINK-32037
 URL: https://issues.apache.org/jira/browse/FLINK-32037
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Coordination, Runtime / REST
Affects Versions: 1.17.0
Reporter: Zhanghao Chen


*Background*

After FLINK-30213, in case of parallelism changes to the JobGraph, as done via 
the AdaptiveScheduler or through providing JobVertexId overrides in 
PipelineOptions#PARALLELISM_OVERRIDES, when the consumer parallelism doesn't 
match the local parallelism, the original ForwardPartitioner will be replaced 
with the RebalancePartitioner.

*Problem*

Although the actual partitioner changes underneath, the ship strategy seen on 
the Web UI is still FORWARD.

This is because the fix patch applies when we initialize the StreamTask, and 
the job graph is not touched. The Web UI uses the JSON plan generated from the 
job graph for display, and the ship strategy is obtained via 
JobEdge#getShipStrategyName.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32124) Add option to enable partition alignment for sources

2023-05-17 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-32124:
-

 Summary: Add option to enable partition alignment for sources
 Key: FLINK-32124
 URL: https://issues.apache.org/jira/browse/FLINK-32124
 Project: Flink
  Issue Type: Improvement
  Components: Autoscaler
Reporter: Zhanghao Chen


Currently, the autoscaler does not consider balancing partitions among source 
tasks. In our production env, partition skew has proven to be a severe problem 
for many jobs. Especially in a job topology with all forward or rescale shuffles, 
partition skew on the source side can further lead to data imbalance in later 
operators.

We should add an option to enable partition alignment for sources, but keep it 
disabled by default, as it has a side effect: the partition count usually has 
few divisors, and enabling alignment will greatly limit our scaling choices. 
Also, if data among partitions is imbalanced in the first place, partition 
alignment won't help either (this is not a common case inside our company, 
though).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32127) Source busy time is inaccurate in many cases

2023-05-17 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-32127:
-

 Summary: Source busy time is inaccurate in many cases
 Key: FLINK-32127
 URL: https://issues.apache.org/jira/browse/FLINK-32127
 Project: Flink
  Issue Type: Improvement
  Components: Autoscaler
Reporter: Zhanghao Chen


We found that source busy time is inaccurate in many cases. The reason is that 
sources are usually multi-threaded (Kafka and RocketMQ, for example): there is a 
fetcher thread fetching data from the data source and a consumer thread 
deserializing data, with a blocking queue in between. A source is considered 
 # *idle* if the consumer is blocked by fetching data from the queue
 # *backpressured* if the consumer is blocked by writing data to downstream 
operators
 # *busy* otherwise

However, this means that if the bottleneck is on the fetcher side, the consumer 
will often be blocked by fetching data from the queue, and the source idle time 
will be high, even though the source is actually busy and consuming a lot of 
CPU. In some of our jobs, the source max busy time is only ~600 ms while the 
source is actually reaching its limit.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32820) ParameterTool is mistakenly marked as deprecated

2023-08-09 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-32820:
-

 Summary: ParameterTool is mistakenly marked as deprecated
 Key: FLINK-32820
 URL: https://issues.apache.org/jira/browse/FLINK-32820
 Project: Flink
  Issue Type: Improvement
  Components: API / DataSet, API / DataStream
Affects Versions: 1.18.0
Reporter: Zhanghao Chen


ParameterTool and AbstractParameterTool in package flink-java are mistakenly 
marked as deprecated in FLINK-32558 (Properly deprecate DataSet API). They are 
widely used for handling application parameters and are also listed in the 
Flink user doc: 
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/application_parameters/.
 Also, they are not directly related to the DataSet API.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32821) Streaming examples failed to execute due to error in packaging

2023-08-09 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-32821:
-

 Summary: Streaming examples failed to execute due to error in 
packaging
 Key: FLINK-32821
 URL: https://issues.apache.org/jira/browse/FLINK-32821
 Project: Flink
  Issue Type: Improvement
  Components: Examples
Affects Versions: 1.18.0
Reporter: Zhanghao Chen


5 out of the 7 streaming examples failed to run:
 * Iteration, SessionWindowing, SocketWindowWordCount, and WindowJoin failed to 
run due to java.lang.NoClassDefFoundError: 
org/apache/flink/streaming/examples/utils/ParameterTool
 * TopSpeedWindowing failed to run due to: Caused by: 
java.lang.ClassNotFoundException: 
org.apache.flink.connector.datagen.source.GeneratorFunction

The NoClassDefFoundError with ParameterTool was introduced by FLINK-32558 
(Properly deprecate DataSet API), and we'd better resolve FLINK-32820 
(ParameterTool is mistakenly marked as deprecated) first before we come to a 
fix for this problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32822) Add connector option to control whether to enable auto-commit of offsets when checkpoints is enabled

2023-08-09 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-32822:
-

 Summary: Add connector option to control whether to enable 
auto-commit of offsets when checkpoints is enabled
 Key: FLINK-32822
 URL: https://issues.apache.org/jira/browse/FLINK-32822
 Project: Flink
  Issue Type: Improvement
  Components: Connectors / Kafka
Reporter: Zhanghao Chen


When checkpointing is enabled, the Flink Kafka connector commits the current 
consuming offsets when checkpoints are *completed*, although the Kafka source 
does *NOT* rely on committed offsets for fault tolerance. When the checkpoint 
interval is long, the lag curve behaves in a zig-zag way: the lag keeps 
increasing and suddenly drops on a completed checkpoint. This has led to 
confusion for users, as in 
[https://stackoverflow.com/questions/76419633/flink-kafka-source-commit-offset-to-error-offset-suddenly-increase-or-decrease]
 and may also affect external monitoring for setting up alarms (you'll have to 
set a high threshold due to the non-realtime commit of offsets) and autoscaling 
(the algorithm needs extra effort to distinguish whether the backlog is 
actually growing or the checkpoint simply has not completed yet).

Therefore, I think it is worthwhile to add an option to enable auto-commit of 
offsets even when checkpointing is enabled. For the DataStream API, this means 
adding a configuration method. For the Table API, this means adding a new 
connector option that wires to the DataStream API configuration underneath.
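As a toy illustration (plain Python, not Flink code; all names are made up), the zig-zag shape of the externally observed lag follows directly from committing offsets only on completed checkpoints:

```python
def committed_offset_lag(produced_per_step, checkpoint_every, steps):
    """Simulate the lag computed from *committed* offsets when the consumer
    keeps up with the producer in real time, but offsets are only committed
    on completed checkpoints."""
    consumed = committed = 0
    lags = []
    for step in range(1, steps + 1):
        consumed += produced_per_step
        if step % checkpoint_every == 0:  # checkpoint completes -> commit offset
            committed = consumed
        lags.append(consumed - committed)  # lag seen by external monitoring
    return lags

# With a checkpoint every 5 steps the apparent lag ramps up and snaps back:
print(committed_offset_lag(10, 5, 10))  # [10, 20, 30, 40, 0, 10, 20, 30, 40, 0]
```

Even though the real consumer backlog here is constantly zero, monitoring built on committed offsets sees a sawtooth whose peak scales with the checkpoint interval.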

 





[jira] [Created] (FLINK-32868) Document the need to backport FLINK-30213 for using the autoscaler with older Flink versions

2023-08-14 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-32868:
-

 Summary: Document the need to backport FLINK-30213 for using the 
autoscaler with older Flink versions
 Key: FLINK-32868
 URL: https://issues.apache.org/jira/browse/FLINK-32868
 Project: Flink
  Issue Type: Improvement
  Components: Autoscaler
Reporter: Zhanghao Chen


The current autoscaler doc states the job requirements as follows:

Job requirements:
 * The autoscaler currently only works with the latest [Flink 
1.17|https://hub.docker.com/_/flink] or after backporting the following fixes 
to your 1.15/1.16 Flink image
 ** [Job vertex parallelism 
overrides|https://github.com/apache/flink/commit/23ce2281a0bb4047c64def9af7ddd5f19d88e2a9]
 (must have)
 ** [Support timespan for busyTime 
metrics|https://github.com/apache/flink/commit/a7fdab8b23cddf568fa32ee7eb804d7c3eb23a35]
 (good to have)

However, https://issues.apache.org/jira/browse/FLINK-30213 is also crucial and 
needs to be backported to 1.15/1.16 to enable autoscaling. We should add it to 
the doc as well and mark it as a must-have.





[jira] [Created] (FLINK-32872) Add option to control the default partitioner when the parallelism of upstream and downstream operator does not match

2023-08-15 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-32872:
-

 Summary: Add option to control the default partitioner when the 
parallelism of upstream and downstream operator does not match
 Key: FLINK-32872
 URL: https://issues.apache.org/jira/browse/FLINK-32872
 Project: Flink
  Issue Type: New Feature
  Components: Runtime / Configuration
Affects Versions: 1.17.0
Reporter: Zhanghao Chen


*Problem*

Currently, when no partitioner is specified, the FORWARD partitioner is used if 
the parallelism of the upstream and downstream operators matches, and the 
REBALANCE partitioner is used otherwise. However, this behavior is not 
configurable and can be undesirable in certain cases:
 # The REBALANCE partitioner creates an all-to-all connection between upstream 
and downstream operators and consumes a lot of extra CPU and memory resources 
when the parallelism is high in pipelining mode; the RESCALE partitioner is 
preferable in this case.
 # For Flink SQL jobs, users cannot specify the partitioner directly so far, 
and for DataStream jobs, users may not want to explicitly set the partitioner 
everywhere.

*Proposal*

Add an option to control the default partitioner when the parallelism of the 
upstream and downstream operators does not match. The option can be named 
"pipeline.default-partitioner-with-unmatched-parallelism" with REBALANCE as the 
default value.
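The resource concern behind point 1 can be seen with a back-of-envelope channel count (illustrative Python, not Flink code):

```python
def rebalance_channels(upstream, downstream):
    # REBALANCE builds an all-to-all connection pattern:
    # every upstream subtask talks to every downstream subtask.
    return upstream * downstream

def rescale_channels(upstream, downstream):
    # RESCALE connects each upstream subtask only to a consecutive group of
    # downstream subtasks, so the channel count grows linearly, not quadratically.
    return max(upstream, downstream)

print(rebalance_channels(500, 500))  # 250000 channels
print(rescale_channels(500, 500))    # 500 channels
```

At high parallelism the quadratic channel count of REBALANCE translates into many more network buffers and threads, which is why RESCALE is attractive as a default in pipelining mode.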





[jira] [Created] (FLINK-32980) Support env.java.opts.all & env.java.opts.cli config for starting Session clusters

2023-08-28 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-32980:
-

 Summary: Support env.java.opts.all & env.java.opts.cli config for 
starting Session clusters
 Key: FLINK-32980
 URL: https://issues.apache.org/jira/browse/FLINK-32980
 Project: Flink
  Issue Type: Improvement
  Components: Command Line Client, Deployment / Scripts
Affects Versions: 1.18.0
Reporter: Zhanghao Chen
 Fix For: 1.19.0


*Problem*

The following configs are supposed to be supported:
|h5. env.java.opts.all|(none)|String|Java options to start the JVM of all Flink 
processes with.|
|h5. env.java.opts.client|(none)|String|Java options to start the JVM of the 
Flink Client with.|

However, the two configs do not take effect when starting Flink session 
clusters using kubernetes-session.sh and yarn-session.sh. This can lead to 
problems in complex production environments. For example, in my company, some 
nodes are IPv6-only, and the connection between the Flink client and the 
K8s/YARN control plane goes through a domain name whose backend is on an 
IPv4/v6 dual stack; the JVM arg -Djava.net.preferIPv6Addresses=true needs to be 
set to make Java prefer IPv6 addresses over IPv4 ones, otherwise the K8s/YARN 
control plane is inaccessible.

 

*Proposal*

The fix is straightforward: apply the following change to the session scripts:

`
# Add Client-specific JVM options
FLINK_ENV_JAVA_OPTS="${FLINK_ENV_JAVA_OPTS} ${FLINK_ENV_JAVA_OPTS_CLI}"

exec $JAVA_RUN $JVM_ARGS $FLINK_ENV_JAVA_OPTS xxx
`





[jira] [Created] (FLINK-32983) Support setting env.java.opts.all & env.java.opts.cli configs via dynamic properties on the CLI side

2023-08-28 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-32983:
-

 Summary: Support setting env.java.opts.all & env.java.opts.cli 
configs via dynamic properties on the CLI side
 Key: FLINK-32983
 URL: https://issues.apache.org/jira/browse/FLINK-32983
 Project: Flink
  Issue Type: Improvement
  Components: Command Line Client, Deployment / Scripts
Affects Versions: 1.18.0
Reporter: Zhanghao Chen
 Fix For: 1.19.0


*Problem*

The following configs are supposed to be supported:
|h5. env.java.opts.all|(none)|String|Java options to start the JVM of all Flink 
processes with.|
|h5. env.java.opts.client|(none)|String|Java options to start the JVM of the 
Flink Client with.|

However, the two configs only take effect on the client side when they are set 
in the flink-conf files. In other words, configs set via -D or -yD on the CLI 
will not take effect, which is counter-intuitive and makes configuration less 
flexible.

 

*Proposal*

Add logic in config.sh to parse configs set via -D or -yD and give them higher 
precedence than configs set in flink-conf.yaml for env.java.opts.all & 
env.java.opts.client.
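The intended precedence can be sketched as a simple merge (illustrative Python; the real change would live in the shell scripts):

```python
def effective_java_opts(flink_conf, dynamic_props):
    """Dynamic -D/-yD properties override flink-conf.yaml values for the
    env.java.opts.* keys, as this ticket proposes."""
    merged = dict(flink_conf)
    merged.update(dynamic_props)  # CLI-supplied values win on key collision
    return {k: v for k, v in merged.items() if k.startswith("env.java.opts.")}

conf = {"env.java.opts.all": "-Xms1g", "parallelism.default": "4"}
cli = {"env.java.opts.all": "-Xms2g"}
print(effective_java_opts(conf, cli))  # {'env.java.opts.all': '-Xms2g'}
```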





[jira] [Created] (FLINK-33123) Wrong dynamic replacement of partitioner from FORWARD to REBALANCE for autoscaler and adaptive scheduler

2023-09-20 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-33123:
-

 Summary: Wrong dynamic replacement of partitioner from FORWARD to 
REBALANCE for autoscaler and adaptive scheduler
 Key: FLINK-33123
 URL: https://issues.apache.org/jira/browse/FLINK-33123
 Project: Flink
  Issue Type: Bug
  Components: Autoscaler, Runtime / Coordination
Affects Versions: 1.17.0, 1.18.0
Reporter: Zhanghao Chen


*Background*

https://issues.apache.org/jira/browse/FLINK-30213 reported that the edge is 
wrong when the parallelism is changed for a vertex with a FORWARD edge. This 
path is exercised by both the autoscaler and the adaptive scheduler, which can 
change the vertex parallelism dynamically. A fix was applied to dynamically 
replace the partitioner from FORWARD to REBALANCE on task deployment in 
{{StreamTask}}:

    private static void replaceForwardPartitionerIfConsumerParallelismDoesNotMatch(
            Environment environment, NonChainedOutput streamOutput, int outputIndex) {
        if (streamOutput.getPartitioner() instanceof ForwardPartitioner
                && environment.getWriter(outputIndex).getNumberOfSubpartitions()
                        != environment.getTaskInfo().getNumberOfParallelSubtasks()) {
            LOG.debug(
                    "Replacing forward partitioner with rebalance for {}",
                    environment.getTaskInfo().getTaskNameWithSubtasks());
            streamOutput.setPartitioner(new RebalancePartitioner<>());
        }
    }

*Problem*

Unfortunately, the fix is still buggy in two aspects:
 # The connections between upstream and downstream tasks are determined by the 
distribution type of the partitioner when generating the execution graph on the 
JM side. When the edge is FORWARD, the distribution type is POINTWISE, and 
Flink tries to evenly distribute subpartitions to all downstream tasks. To 
change it to REBALANCE, the distribution type has to be changed to ALL_TO_ALL 
to make all-to-all connections between upstream and downstream tasks. However, 
the fix does not change the distribution type, so the network connections are 
set up in the wrong way.
 # The FORWARD partitioner is replaced if 
environment.getWriter(outputIndex).getNumberOfSubpartitions() does not equal 
the task parallelism. However, the number of subpartitions here equals the 
number of downstream tasks of this particular task, which is also determined by 
the distribution type of the partitioner when generating the execution graph on 
the JM side. When ceil(downstream task parallelism / upstream task parallelism) 
= upstream task parallelism, the number of subpartitions equals the task 
parallelism. In fact, for a normal job with a FORWARD edge and no autoscaling 
action at all, you will find that the partitioner is changed to REBALANCE 
internally, as the number of subpartitions always equals 1 in this case.
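The misfire described in point 2 can be checked with a small calculation (illustrative Python mirroring the condition in the fix, under the POINTWISE-distribution assumption):

```python
import math

def pointwise_subpartitions(upstream, downstream):
    # Under POINTWISE distribution, each upstream task is assigned roughly
    # downstream/upstream consumers, hence that many subpartitions.
    return math.ceil(downstream / upstream)

def replaces_forward(upstream, downstream):
    # The fix replaces FORWARD whenever the subpartition count
    # differs from the task parallelism.
    return pointwise_subpartitions(upstream, downstream) != upstream

# A plain FORWARD edge (equal parallelism, 1 subpartition per task) misfires:
print(replaces_forward(4, 4))   # True  (wrongly replaced; should stay FORWARD)
# ceil(4/2) == 2 == upstream parallelism, so no replacement although needed:
print(replaces_forward(2, 4))   # False (missed; should become REBALANCE)
```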

 





[jira] [Created] (FLINK-33146) FLIP-363: Unify the Representation of TaskManager Location in REST API and Web UI

2023-09-24 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-33146:
-

 Summary: FLIP-363: Unify the Representation of TaskManager 
Location in REST API and Web UI
 Key: FLINK-33146
 URL: https://issues.apache.org/jira/browse/FLINK-33146
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / REST, Runtime / Web Frontend
Affects Versions: 1.18.0
Reporter: Zhanghao Chen
 Fix For: 1.19.0


Umbrella ticket for [FLIP-363: Unify the Representation of TaskManager Location 
in REST API and Web 
UI|https://cwiki.apache.org/confluence/display/FLINK/FLIP-363%3A+Unify+the+Representation+of+TaskManager+Location+in+REST+API+and+Web+UI].
 This is a continuation of [FLINK-25371] Include data port as part of the host 
info for subtask detail panel on Web UI - ASF JIRA (apache.org).

 





[jira] [Created] (FLINK-33147) Introduce a new "endpoint" field in REST API to represent TaskManager endpoint (host + port) and deprecate the "host" field

2023-09-24 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-33147:
-

 Summary: Introduce a new "endpoint" field in REST API to represent 
TaskManager endpoint (host + port) and deprecate the "host" field
 Key: FLINK-33147
 URL: https://issues.apache.org/jira/browse/FLINK-33147
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / REST
Affects Versions: 1.18.0
    Reporter: Zhanghao Chen
 Fix For: 1.19.0








[jira] [Created] (FLINK-33148) Update Web UI to adopt the new "endpoint" field in REST API

2023-09-24 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-33148:
-

 Summary: Update Web UI to adopt the new "endpoint" field in REST 
API
 Key: FLINK-33148
 URL: https://issues.apache.org/jira/browse/FLINK-33148
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Web Frontend
Affects Versions: 1.18.0
Reporter: Zhanghao Chen
 Fix For: 1.19.0








[jira] [Created] (FLINK-33166) Support setting root logger level by config

2023-09-27 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-33166:
-

 Summary: Support setting root logger level by config
 Key: FLINK-33166
 URL: https://issues.apache.org/jira/browse/FLINK-33166
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Configuration
Reporter: Zhanghao Chen


Users currently cannot change the logging level by config and have to manually 
modify the cumbersome logger configuration file. We'd better provide a shortcut 
and support setting the root logger level by config.

There are already a number of configs for logger settings, such as 
{{env.log.dir}} for the logging dir and {{env.log.max}} for the max number of 
old log files to keep. We can name the new config {{env.log.level}}.





[jira] [Created] (FLINK-33204) Add description for missing options in the all jobmanager/taskmanager options section in document

2023-10-07 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-33204:
-

 Summary: Add description for missing options in the all 
jobmanager/taskmanager options section in document
 Key: FLINK-33204
 URL: https://issues.apache.org/jira/browse/FLINK-33204
 Project: Flink
  Issue Type: Technical Debt
  Components: Runtime / Configuration
Affects Versions: 1.17.0, 1.18.0
Reporter: Zhanghao Chen
 Fix For: 1.19.0


There are 4 options which are excluded from the all jobmanager/taskmanager 
options section in the configuration document:
 # taskmanager.bind-host
 # taskmanager.rpc.bind-port
 # jobmanager.bind-host
 # jobmanager.rpc.bind-port

We should add them to the document under the all jobmanager/taskmanager 
options section for completeness.





[jira] [Created] (FLINK-33205) Replace Akka with Pekko in the description of "pekko.ssl.enabled"

2023-10-07 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-33205:
-

 Summary: Replace Akka with Pekko in the description of 
"pekko.ssl.enabled"
 Key: FLINK-33205
 URL: https://issues.apache.org/jira/browse/FLINK-33205
 Project: Flink
  Issue Type: Technical Debt
  Components: Runtime / Configuration
Affects Versions: 1.18.0
Reporter: Zhanghao Chen
 Fix For: 1.19.0








[jira] [Created] (FLINK-33221) Add config options for administrator JVM options

2023-10-09 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-33221:
-

 Summary: Add config options for administrator JVM options
 Key: FLINK-33221
 URL: https://issues.apache.org/jira/browse/FLINK-33221
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Configuration
Reporter: Zhanghao Chen


We encountered issues similar to those described in SPARK-23472. Users may need 
to add JVM options to their Flink applications (e.g., to tune GC options). They 
typically use the {{env.java.opts.x}} series of options to do so. We also have 
a set of administrator JVM options to apply by default, e.g., to enable GC logs 
and tune GC options. Both use cases need to set the same series of options and 
will clobber one another.

In the past, we generated and prepended the administrator JVM options in the 
Java code that generates the start command for the JM/TM. However, this has 
proven difficult to maintain.

Therefore, I propose adding a set of default JVM options for administrator use 
that is prepended to the user-set extra JVM options. We can mark the existing 
{{env.java.opts.x}} series as user-set extra JVM options and add a new set of 
{{env.java.opts.x.default}} options for administrator use.
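The intended layering can be sketched as follows (illustrative Python; the option names follow the proposal and are not final):

```python
def build_jvm_opts(admin_default_opts, user_extra_opts):
    """Administrator defaults (env.java.opts.x.default) are prepended, so the
    user-set extras (env.java.opts.x) come later on the command line and win
    for flags where the JVM honors the last occurrence (e.g. -Xmx)."""
    return " ".join(part for part in (admin_default_opts, user_extra_opts) if part)

print(build_jvm_opts("-verbose:gc -Xmx4g", "-Xmx8g"))
# -verbose:gc -Xmx4g -Xmx8g  (user's -Xmx8g takes effect)
```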





[jira] [Created] (FLINK-33236) Remove the unused high-availability.zookeeper.path.running-registry option

2023-10-10 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-33236:
-

 Summary: Remove the unused 
high-availability.zookeeper.path.running-registry option
 Key: FLINK-33236
 URL: https://issues.apache.org/jira/browse/FLINK-33236
 Project: Flink
  Issue Type: Technical Debt
  Components: Runtime / Configuration
Affects Versions: 1.18.0
Reporter: Zhanghao Chen
 Fix For: 1.19.0


The running registry subcomponent of Flink HA was removed in 
[FLINK-25430|https://issues.apache.org/jira/browse/FLINK-25430], and the 
"high-availability.zookeeper.path.running-registry" option has been of no use 
since then. We should remove the option and regenerate the config doc to drop 
the relevant descriptions and avoid user confusion.





[jira] [Created] (FLINK-33240) Generate docs for deprecated options as well

2023-10-11 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-33240:
-

 Summary: Generate docs for deprecated options as well
 Key: FLINK-33240
 URL: https://issues.apache.org/jira/browse/FLINK-33240
 Project: Flink
  Issue Type: Improvement
  Components: Documentation
Reporter: Zhanghao Chen
 Fix For: 1.19.0


Currently, Flink skips doc generation for deprecated options (see 
{{ConfigOptionsDocGenerator#shouldBeDocumented}}). As a result, deprecated 
options can no longer be found in the documentation for newer versions of 
Flink. This might confuse users upgrading from an older version, who have to 
either carefully read the release notes or check the source code for upgrade 
guidance on deprecated options. I suggest generating docs for deprecated 
options as well, and we should scan through the code to make sure proper 
upgrade guidance is provided for the deprecated options.





[jira] [Created] (FLINK-33261) FLIP-367: Support Setting Parallelism for Table/SQL Sources

2023-10-12 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-33261:
-

 Summary: FLIP-367: Support Setting Parallelism for Table/SQL 
Sources
 Key: FLINK-33261
 URL: https://issues.apache.org/jira/browse/FLINK-33261
 Project: Flink
  Issue Type: New Feature
  Components: Table SQL / API
Reporter: Zhanghao Chen


Umbrella issue for [FLIP-367: Support Setting Parallelism for Table/SQL Sources 
- Apache Flink - Apache Software 
Foundation|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=263429150].





[jira] [Created] (FLINK-33262) Extend source provider interfaces with the new parallelism provider interface

2023-10-12 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-33262:
-

 Summary: Extend source provider interfaces with the new 
parallelism provider interface
 Key: FLINK-33262
 URL: https://issues.apache.org/jira/browse/FLINK-33262
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / API
Reporter: Zhanghao Chen








[jira] [Created] (FLINK-33263) Implement ParallelismProvider for sources in Blink planner

2023-10-12 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-33263:
-

 Summary: Implement ParallelismProvider for sources in Blink planner
 Key: FLINK-33263
 URL: https://issues.apache.org/jira/browse/FLINK-33263
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Planner
Reporter: Zhanghao Chen








[jira] [Created] (FLINK-33264) Support source parallelism setting for DataGen connector

2023-10-12 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-33264:
-

 Summary: Support source parallelism setting for DataGen connector
 Key: FLINK-33264
 URL: https://issues.apache.org/jira/browse/FLINK-33264
 Project: Flink
  Issue Type: Sub-task
Reporter: Zhanghao Chen








[jira] [Created] (FLINK-33265) Support source parallelism setting for Kafka connector

2023-10-12 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-33265:
-

 Summary: Support source parallelism setting for Kafka connector
 Key: FLINK-33265
 URL: https://issues.apache.org/jira/browse/FLINK-33265
 Project: Flink
  Issue Type: Sub-task
  Components: Connectors / Kafka
Reporter: Zhanghao Chen








[jira] [Created] (FLINK-33681) Display source/sink numRecordsIn/Out & numBytesIn/Out on UI

2023-11-28 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-33681:
-

 Summary: Display source/sink numRecordsIn/Out & numBytesIn/Out on 
UI
 Key: FLINK-33681
 URL: https://issues.apache.org/jira/browse/FLINK-33681
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Metrics
Reporter: Zhanghao Chen
 Attachments: image-2023-11-29-13-26-15-176.png

Currently, the numRecordsIn & numBytesIn metrics for sources and the 
numRecordsOut & numBytesOut metrics for sinks are always 0 on the Flink web 
dashboard.

[FLINK-11576|https://issues.apache.org/jira/browse/FLINK-11576] brings us these 
metrics on the operator level, but it does not integrate them on the task 
level. On the other hand, the summary metrics on the job overview page are 
based on the task-level I/O metrics. As a result, even though new connectors 
supporting FLIP-33 metrics report operator-level I/O metrics, we still cannot 
see the metrics on the dashboard.

This ticket serves as an umbrella issue for integrating the standard 
source/sink I/O metrics with the corresponding task I/O metrics.

!image-2023-11-29-13-26-15-176.png|width=590,height=252!





[jira] [Created] (FLINK-33682) Reuse source operator input records/bytes metrics for SourceOperatorStreamTask

2023-11-28 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-33682:
-

 Summary: Reuse source operator input records/bytes metrics for 
SourceOperatorStreamTask
 Key: FLINK-33682
 URL: https://issues.apache.org/jira/browse/FLINK-33682
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Metrics
Reporter: Zhanghao Chen


For SourceOperatorStreamTask, the source operator is the head operator that 
takes input. We can directly reuse the source operator's input records/bytes 
metrics for it.





[jira] [Created] (FLINK-33891) Remove the obsolete SingleJobGraphStore

2023-12-19 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-33891:
-

 Summary: Remove the obsolete SingleJobGraphStore
 Key: FLINK-33891
 URL: https://issues.apache.org/jira/browse/FLINK-33891
 Project: Flink
  Issue Type: Technical Debt
  Components: Runtime / Coordination
Reporter: Zhanghao Chen


SingleJobGraphStore was introduced a long time ago in FLIP-6. It is only used 
in a test case, DefaultDispatcherRunnerITCase#
leaderChange_withBlockingJobManagerTermination_doesNotAffectNewLeader. We can 
replace it with TestingJobGraphStore there and then safely remove the class.





[jira] [Created] (FLINK-33940) Update the auto-derivation rule of max parallelism for enlarged upscaling space

2023-12-25 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-33940:
-

 Summary: Update the auto-derivation rule of max parallelism for 
enlarged upscaling space
 Key: FLINK-33940
 URL: https://issues.apache.org/jira/browse/FLINK-33940
 Project: Flink
  Issue Type: Improvement
  Components: API / Core
Reporter: Zhanghao Chen


*Background*

The choice of the max parallelism of a stateful operator is important, as it 
limits the upper bound of the operator's parallelism, while setting it too 
large adds extra overhead. Currently, the max parallelism of an operator is 
either fixed to a value specified via the API core / pipeline option or 
auto-derived with the following rule:

`min(max(roundUpToPowerOfTwo(operatorParallelism * 1.5), 128), 32767)`

*Problem*

Recently, the elasticity of Flink jobs has become more and more valued by 
users. The current auto-derivation rule was introduced a long time ago and only 
allows the operator parallelism to be roughly doubled, which is not desirable 
for elasticity. Setting a max parallelism manually may not be desirable either: 
users may not have sufficient expertise to select a good max-parallelism value.

*Proposal*

Update the auto-derivation rule of max parallelism to derive a larger max 
parallelism for a better out-of-the-box elasticity experience. A candidate is 
as follows:

`min(max(roundUpToPowerOfTwo(operatorParallelism * 5), 1024), 32767)`
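The two rules can be compared numerically (illustrative Python; roundUpToPowerOfTwo is assumed to round up to the next power of two, as in Flink's MathUtils):

```python
def round_up_to_power_of_two(x):
    # smallest power of two >= x
    return 1 if x <= 1 else 1 << (x - 1).bit_length()

def max_parallelism_current(p):
    # current rule: min(max(roundUpToPowerOfTwo(p * 1.5), 128), 32767)
    return min(max(round_up_to_power_of_two(int(p * 1.5)), 128), 32767)

def max_parallelism_proposed(p):
    # candidate rule: min(max(roundUpToPowerOfTwo(p * 5), 1024), 32767)
    return min(max(round_up_to_power_of_two(p * 5), 1024), 32767)

for p in (100, 1000):
    print(p, max_parallelism_current(p), max_parallelism_proposed(p))
# 100 256 1024
# 1000 2048 8192
```

For an operator running at parallelism 100, the current rule caps upscaling at 256, while the candidate rule leaves room up to 1024.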

Looking forward to your opinions on this.





[jira] [Created] (FLINK-33962) Chaining-agnostic OperatorID generation for improved state compatibility on parallelism change

2024-01-01 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-33962:
-

 Summary: Chaining-agnostic OperatorID generation for improved 
state compatibility on parallelism change
 Key: FLINK-33962
 URL: https://issues.apache.org/jira/browse/FLINK-33962
 Project: Flink
  Issue Type: Improvement
  Components: API / Core
Reporter: Zhanghao Chen


*Background*

Flink restores operator state from snapshots based on matching operatorIDs. 
Since Flink 1.2, {{StreamGraphHasherV2}} has been used for operatorID 
generation when no user-set uid exists. The generated OperatorID is 
deterministic with respect to:
 * node-local properties (the traversal ID in the BFS over the DAG)
 * chained output nodes
 * input node hashes

*Problem*

The chaining behavior affects state compatibility, as the OperatorID of an 
operator depends on its chained output nodes. For example, a simple 
source->sink DAG with the source and sink chained together is 
state-incompatible with an otherwise identical DAG where the source and sink 
are unchained (either because the parallelisms of the two ops are changed to be 
unequal or because chaining is disabled). This greatly limits the flexibility 
to perform chain-breaking/joining for performance tuning.

*Proposal*

Introduce a {{StreamGraphHasherV3}} that is agnostic to the chaining behavior 
of operators, which effectively just removes L227-235 of 
[flink/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/graph/StreamGraphHasherV2.java
 at master · apache/flink 
(github.com)|https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/graph/StreamGraphHasherV2.java].

This will not hurt the determinism of ID generation across job submissions as 
long as the stream graph topology doesn't change, and since new versions of 
Flink have already adopted pure operator-level state recovery, it will not 
break state recovery across job submissions as long as both submissions use the 
same hasher.

This will, however, break cross-version state compatibility. So we can 
introduce a new option to enable HasherV3 in v1.19 and consider making it the 
default hasher in v2.0.
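A toy hash (illustrative Python, not the actual hasher) shows how mixing chained outputs into the hash makes the ID chaining-dependent, and how omitting them (the proposed V3 behavior) keeps it stable:

```python
import hashlib

def operator_id(traverse_id, input_ids, chained_outputs, include_chaining):
    """Toy analogue of the OperatorID hash: node-local info and input hashes
    always participate; chained output nodes participate only in V2 mode."""
    h = hashlib.sha256(str(traverse_id).encode())
    if include_chaining:          # V2: chained output nodes affect the hash
        for node in chained_outputs:
            h.update(node.encode())
    for input_id in input_ids:    # input hashes always participate
        h.update(input_id.encode())
    return h.hexdigest()

# V2-style: chaining source->sink vs. not chaining yields different source IDs
assert operator_id(0, [], ["sink"], True) != operator_id(0, [], [], True)
# V3-style: the ID is independent of the chaining decision
assert operator_id(0, [], ["sink"], False) == operator_id(0, [], [], False)
```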

Looking forward to suggestions on this.





[jira] [Created] (FLINK-33977) Adaptive scheduler may not minimize the number of TMs during downscaling

2024-01-03 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-33977:
-

 Summary: Adaptive scheduler may not minimize the number of TMs 
during downscaling
 Key: FLINK-33977
 URL: https://issues.apache.org/jira/browse/FLINK-33977
 Project: Flink
  Issue Type: Improvement
Affects Versions: 1.18.0
Reporter: Zhanghao Chen


Adaptive Scheduler uses SlotAssigner to assign free slots to slot sharing 
groups. Currently, there are two implementations of SlotAssigner available: the 
DefaultSlotAssigner, which treats all slots and slot sharing groups equally, 
and the StateLocalitySlotAssigner, which assigns slots based on the number of 
local key groups to utilize local state recovery. The scheduler uses the 
DefaultSlotAssigner when no key group assignment info is available and the 
StateLocalitySlotAssigner otherwise.

However, neither SlotAssigner aims at minimizing the number of TMs, which may 
produce suboptimal slot assignments under Application Mode. For example, when a 
job with 8 slot sharing groups and 2 TMs (4 slots each) is downscaled through 
the FLIP-291 API to 4 slot sharing groups, the cluster may still have 2 TMs, 
one with 1 free slot and the other with 3 free slots. For end users, this 
implies an ineffective downscaling, as the total cluster resources are not 
reduced.

We should take minimizing the number of TMs into consideration as well. A 
possible approach is to enhance the StateLocalitySlotAssigner: when the number 
of free slots exceeds the need, sort all TMs by a score summing the allocation 
scores of all slots on each TM, remove slots from the excess TMs with the 
lowest scores, and proceed with the remaining slot assignment.
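The proposed enhancement could look roughly like this (an illustrative sketch under the ticket's assumptions; all names are made up):

```python
def select_tms_to_release(tm_slot_scores, slots_needed):
    """Score each TM by the sum of its slots' allocation scores and release
    whole TMs with the lowest aggregate scores, as long as enough slots remain
    for the job."""
    excess = sum(len(s) for s in tm_slot_scores.values()) - slots_needed
    released = []
    # consider TMs from the lowest aggregate score upward
    for tm, scores in sorted(tm_slot_scores.items(), key=lambda kv: sum(kv[1])):
        if len(scores) <= excess:  # releasing this TM still leaves enough slots
            released.append(tm)
            excess -= len(scores)
    return released

# Two 4-slot TMs, job now needs only 4 slots: release the low-score TM entirely
print(select_tms_to_release({"tm-1": [5, 7, 1, 2], "tm-2": [0, 1, 0, 0]}, 4))
# ['tm-2']
```

In the 8-slot-to-4-slot example above, this would free one TM completely instead of leaving both TMs partially occupied.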




