Re: [ANNOUNCE] New Apache Flink PMC Member - Qingsheng Ren
Congratulations, Qingsheng! Best, Zhanghao Chen

From: Shammon FY Sent: Sunday, April 23, 2023 17:22 To: dev@flink.apache.org Subject: Re: [ANNOUNCE] New Apache Flink PMC Member - Qingsheng Ren

Congratulations, Qingsheng! Best, Shammon FY

On Sun, Apr 23, 2023 at 4:40 PM Weihua Hu wrote:
> Congratulations, Qingsheng!
> Best, Weihua
>
> On Sun, Apr 23, 2023 at 3:53 PM Yun Tang wrote:
> > Congratulations, Qingsheng!
> > Best, Yun Tang
> >
> > From: weijie guo Sent: Sunday, April 23, 2023 14:50 To: dev@flink.apache.org Subject: Re: [ANNOUNCE] New Apache Flink PMC Member - Qingsheng Ren
> >
> > Congratulations, Qingsheng!
> > Best regards, Weijie
> >
> > On Sun, Apr 23, 2023, at 14:29, Geng Biao wrote:
> > > Congrats, Qingsheng!
> > > Best, Biao Geng
> > >
> > > Get Outlook for iOS<https://aka.ms/o0ukef>
> > >
> > > From: Wencong Liu Sent: Sunday, April 23, 2023 11:06:39 AM To: dev@flink.apache.org Subject: Re: [ANNOUNCE] New Apache Flink PMC Member - Qingsheng Ren
> > >
> > > Congratulations, Qingsheng!
> > > Best, Wencong Liu
> > >
> > > At 2023-04-21 19:47:52, "Jark Wu" wrote:
> > > > Hi everyone,
> > > >
> > > > We are thrilled to announce that Leonard Xu has joined the Flink PMC!
> > > >
> > > > Leonard has been an active member of the Apache Flink community for many years and became a committer in Nov 2021. He has been involved in various areas of the project, from code contributions to community building. His contributions are mainly focused on Flink SQL and connectors, especially leading the flink-cdc-connectors project to receive 3.8K+ GitHub stars. He authored 150+ PRs, reviewed 250+ PRs, and drove several FLIPs (e.g., FLIP-132, FLIP-162). He has participated in plenty of discussions on the dev mailing list and answered questions in 500+ threads on the user/user-zh mailing lists. Besides that, he is community minded, such as being the release manager of 1.17, verifying releases, managing release syncs, etc.
> > > >
> > > > Congratulations and welcome Leonard!
> > > >
> > > > Best, Jark (on behalf of the Flink PMC)
> > >
> > > At 2023-04-21 19:50:02, "Jark Wu" wrote:
> > > > Hi everyone,
> > > >
> > > > We are thrilled to announce that Qingsheng Ren has joined the Flink PMC!
> > > >
> > > > Qingsheng has been contributing to Apache Flink for a long time. He is the core contributor and maintainer of the Kafka connector and flink-cdc-connectors, bringing users stability and ease of use in both projects. He drove discussions and implementations in FLIP-221, FLIP-288, and the connector testing framework. He is continuously helping with the expansion of the Flink community and has given several talks about Flink connectors at many conferences, such as Flink Forward Global and Flink Forward Asia. Besides that, he is willing to help a lot with community work, such as being the release manager for both 1.17 and 1.18, verifying releases, and answering questions on the mailing list.
> > > >
> > > > Congratulations and welcome Qingsheng!
> > > >
> > > > Best, Jark (on behalf of the Flink PMC)
Re: [ANNOUNCE] New Apache Flink PMC Member - Leonard Xu
Congratulations, Leonard! Best, Zhanghao Chen

From: Shammon FY Sent: Sunday, April 23, 2023 17:22 To: dev@flink.apache.org Subject: Re: [ANNOUNCE] New Apache Flink PMC Member - Leonard Xu

Congratulations, Leonard! Best, Shammon FY

On Sun, Apr 23, 2023 at 5:07 PM Xianxun Ye wrote:
> Congratulations, Leonard!
> Best regards, Xianxun
>
> On Apr 23, 2023, at 09:10, Lincoln Lee wrote:
> > Congratulations, Leonard!
Re: Permission issue : Flink Code Push
Hi Archit,

The procedure is:
1. Fork the flink repo
2. Implement your change and push it to your fork
3. Create a pull request from your fork
4. Wait for reviews

You can refer to https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request-from-a-fork for more details.

Best, Zhanghao Chen

From: Archit Goyal Sent: Wednesday, May 3, 2023 1:50 To: dev@flink.apache.org Subject: Permission issue : Flink Code Push

Hi, I am new to the Flink community and am working on FLINK-12869<https://issues.apache.org/jira/browse/FLINK-12869>. I have a PR to be uploaded but I am unable to push; I get a 403 error:

argoyal@argoyal flink % git push --set-upstream origin YarnAcl
remote: Permission to apache/flink.git denied to architgyl.
fatal: unable to access 'https://github.com/apache/flink.git/': The requested URL returned error: 403

Looks like it is a permission issue. Can someone please grant me access so that I can raise a PR?

Thanks, Archit Goyal
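As a concrete sketch of those steps (`<your-username>` and the branch name `YarnAcl` are placeholders for your own fork and feature branch; step 1 is done once in the GitHub web UI):

```shell
# 1. Fork apache/flink in the GitHub web UI, then clone YOUR fork
git clone https://github.com/<your-username>/flink.git
cd flink

# Optionally keep a reference to the upstream repo for syncing
# (it stays read-only for non-committers)
git remote add upstream https://github.com/apache/flink.git

# 2. Implement your change on a feature branch and push it to your fork
git checkout -b YarnAcl
# ... edit files, commit ...
git push --set-upstream origin YarnAcl

# 3. Open a pull request on GitHub from <your-username>:YarnAcl to apache:master
```

The 403 in the quoted mail comes from `origin` pointing at `apache/flink`, which only committers can push to; pushing to your own fork's `origin` avoids it.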
[VOTE] FLIP-367: Support Setting Parallelism for Table/SQL Sources
Hi All, Thanks for all the feedback on FLIP-367: Support Setting Parallelism for Table/SQL Sources [1][2]. I'd like to start a vote for FLIP-367. The vote will be open until Oct 12th, 12:00 PM GMT, unless there is an objection or insufficient votes. [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=263429150 [2] https://lists.apache.org/thread/gtpswl42jzv0c9o3clwqskpllnw8rh87 Best, Zhanghao Chen
Re: [DISCUSS] FLIP-375: Built-in cross-platform powerful java profiler on taskmanagers
Hi Yun and Yu, Thanks for driving this. This would definitely help users identify performance bottlenecks, especially for cases where the bottleneck lies in the system stack (e.g. GC), and a big +1 for the downloadable flame graph to ease sharing. I'm wondering if we could add this for the JobManager as well. In the OLAP scenario, and sometimes in the streaming scenario (when there are heavy operations during execution plan generation or in operator coordinators), the JM can become a bottleneck as well. Best, Zhanghao Chen From: Yu Chen Sent: Monday, October 9, 2023 17:24 To: dev@flink.apache.org Subject: [DISCUSS] FLIP-375: Built-in cross-platform powerful java profiler on taskmanagers Hi all, Yun Tang and I are opening this thread to discuss our proposal to integrate async-profiler's capabilities for profiling taskmanagers (e.g., generating flame graphs) in the Flink Web UI [1]. Currently, Flink provides thread dumps and operator-level flame graphs by sampling task threads. The results generated this way miss the relevant stacks of Java threads and system calls. The async-profiler [2] is a low-overhead sampling profiler for Java, but the steps to use it in a production environment are cumbersome and suffer from permission and security risks. Therefore, we propose adding REST APIs to provide the capability to invoke async-profiler on multiple platforms through JNI, which can be easily operated from the Web UI. This enhancement will improve the efficiency and experience of Flink users in identifying performance bottlenecks. Please refer to the FLIP document for more details about the proposed design and implementation. We welcome any feedback and opinions on this proposal.
[1] FLIP-375: Built-in cross-platform powerful java profiler on taskmanagers - Apache Flink - Apache Software Foundation <https://cwiki.apache.org/confluence/display/FLINK/FLIP-375%3A+Built-in+cross-platform+powerful+java+profiler+on+taskmanagers> [2] GitHub - async-profiler/async-profiler: Sampling CPU and HEAP profiler for Java featuring AsyncGetCallTrace + perf_events <https://github.com/async-profiler/async-profiler> Best regards, Yun Tang and Yu Chen
[DISCUSS][FLINK-33240] Document deprecated options as well
Hi Flink users and developers, Currently, Flink won't generate documentation for deprecated options. This might confuse users when upgrading from an older version of Flink: they have to either carefully read the release notes or check the source code for upgrade guidance on deprecated options. I propose to document deprecated options as well, with a "(deprecated)" tag placed at the beginning of the option description to highlight the deprecation status [1]. Looking forward to your feedback on it. [1] https://issues.apache.org/jira/browse/FLINK-33240 Best, Zhanghao Chen
[RESULT][VOTE] FLIP-367: Support Setting Parallelism for Table/SQL Sources
Hi devs, Thanks everyone for your review and the votes! I am happy to announce that FLIP-367: Support Setting Parallelism for Table/SQL Sources [1] has been accepted. There are 10 binding votes and 5 non-binding votes [2]:

- Benchao Li (binding)
- Shammon FY (binding)
- Weihua Hu (binding)
- Yun Tang (binding)
- Yangze Guo (binding)
- Feng Jing
- xiangyu feng
- Ahmed Hamd
- Jing Ge (binding)
- Leonard Xu (binding)
- Lincoln Lee (binding)
- Sergey Nuyanzin (binding)
- Martijn Visser (binding)
- Samrat Deb
- Jane Chan

There is no disapproving vote. [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=263429150 [2] https://lists.apache.org/thread/bzlcw28jn0xx1hh45q0ry8wnxf0xoptg Best, Zhanghao Chen
Re: [VOTE] FLIP-366: Support standard YAML for FLINK configuration
+1 (non-binding) Best, Zhanghao Chen From: Junrui Lee Sent: Friday, October 13, 2023 10:12 To: dev@flink.apache.org Subject: [VOTE] FLIP-366: Support standard YAML for FLINK configuration Hi all, Thank you to everyone for the feedback on FLIP-366[1]: Support standard YAML for FLINK configuration in the discussion thread [2]. I would like to start a vote for it. The vote will be open for at least 72 hours (excluding weekends, unless there is an objection or an insufficient number of votes). Thanks, Junrui [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-366%3A+Support+standard+YAML+for+FLINK+configuration [2]https://lists.apache.org/thread/qfhcm7h8r5xkv38rtxwkghkrcxg0q7k5
Re: [VOTE] FLIP-375: Built-in cross-platform powerful java profiler
+1 (non-binding) Best, Zhanghao Chen From: Yu Chen Sent: Tuesday, October 17, 2023 15:51 To: dev@flink.apache.org Cc: myas...@live.com Subject: [VOTE] FLIP-375: Built-in cross-platform powerful java profiler Hi all, Thank you to everyone for the feedback and detailed comments on FLIP-375[1]. Based on the discussion thread [2], I think we are ready to take a vote to contribute this to Flink. I'd like to start a vote for it. The vote will be open for at least 72 hours (excluding weekends, unless there is an objection or an insufficient number of votes). Thanks, Yu Chen [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-375%3A+Built-in+cross-platform+powerful+java+profiler [2] https://lists.apache.org/thread/tp5vqgspsdko66dr6vm7cgtod9k2pct7
Re: [DISCUSS][FLINK-33240] Document deprecated options as well
Hi Alexander, I haven't done a complete analysis yet, but through a simple code search, roughly 35 options would be added with this change. Also note that some old options defined in the ConfigConstants class won't be added here, as the flink-docs generator won't discover these constant-based options. Best, Zhanghao Chen From: Alexander Fedulov Sent: Tuesday, October 31, 2023 18:12 To: dev@flink.apache.org Cc: u...@flink.apache.org Subject: Re: [DISCUSS][FLINK-33240] Document deprecated options as well Hi Zhanghao, Thanks for the proposition. In general +1, this sounds like a good idea as long as it is clear that the usage of these settings is discouraged. Just one minor concern - the configuration page is already very long, do you have a rough estimate of how many more options would be added with this change? Best, Alexander Fedulov On Mon, 30 Oct 2023 at 18:24, Matthias Pohl wrote: Thanks for your proposal, Zhanghao Chen. I think it adds more transparency to the configuration documentation. +1 from my side on the proposal On Wed, Oct 11, 2023 at 2:09 PM Zhanghao Chen wrote: > Hi Flink users and developers, > > Currently, Flink won't generate doc for the deprecated options. This might > confuse users when upgrading from an older version of Flink: they have to > either carefully read the release notes or check the source code for > upgrade guidance on deprecated options. > > I propose to document deprecated options as well, with a "(deprecated)" > tag placed at the beginning of the option description to highlight the > deprecation status [1]. > > Looking forward to your feedbacks on it. > > [1] https://issues.apache.org/jira/browse/FLINK-33240 > > Best, > Zhanghao Chen >
Re: [DISCUSS][FLINK-33240] Document deprecated options as well
Hi Samrat and Ruan, Thanks for the suggestion. I'm actually in favor of adding the deprecated options in the same section as the non-deprecated ones. This would make it easier for users to find the descriptions of the replacement options. It would be a different story for options deprecated because the related API/module is entirely deprecated, e.g. the DataSet API. In that case, users would not search for a replacement for an individual option but rather need to migrate to a new API, and it would be better to move these options to a separate section. WDYT? Best, Zhanghao Chen From: Samrat Deb Sent: Wednesday, November 1, 2023 15:31 To: dev@flink.apache.org Cc: u...@flink.apache.org Subject: Re: [DISCUSS][FLINK-33240] Document deprecated options as well Thanks for the proposal, +1 for adding a deprecated identifier. [Thought] Can we have a separate section / page for deprecated configs? WDYT? Bests, Samrat On Tue, 31 Oct 2023 at 3:44 PM, Alexander Fedulov <alexander.fedu...@gmail.com> wrote: > Hi Zhanghao, > > Thanks for the proposition. > In general +1, this sounds like a good idea as long as it is clear that the > usage of these settings is discouraged. > Just one minor concern - the configuration page is already very long, do > you have a rough estimate of how many more options would be added with this > change? > > Best, > Alexander Fedulov > > On Mon, 30 Oct 2023 at 18:24, Matthias Pohl wrote: > > > Thanks for your proposal, Zhanghao Chen. I think it adds more > transparency > > to the configuration documentation. > > > > +1 from my side on the proposal > > > > On Wed, Oct 11, 2023 at 2:09 PM Zhanghao Chen > > > wrote: > > > > > Hi Flink users and developers, > > > > > > Currently, Flink won't generate doc for the deprecated options. This > > might > > > confuse users when upgrading from an older version of Flink: they have to > > > either carefully read the release notes or check the source code for > > > upgrade guidance on deprecated options.
> > > > > > I propose to document deprecated options as well, with a "(deprecated)" > > > tag placed at the beginning of the option description to highlight the > > > deprecation status [1]. > > > > > > Looking forward to your feedbacks on it. > > > > > > [1] https://issues.apache.org/jira/browse/FLINK-33240 > > > > > > Best, > > > Zhanghao Chen > > > > > >
Re: [VOTE] FLIP-376: Add DISTRIBUTED BY clause
+1 (non-binding) Best, Zhanghao Chen From: Timo Walther Sent: Monday, November 6, 2023 19:38 To: dev Subject: [VOTE] FLIP-376: Add DISTRIBUTED BY clause Hi everyone, I'd like to start a vote on FLIP-376: Add DISTRIBUTED BY clause[1] which has been discussed in this thread [2]. The vote will be open for at least 72 hours unless there is an objection or not enough votes. [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-376%3A+Add+DISTRIBUTED+BY+clause [2] https://lists.apache.org/thread/z1p4f38x9ohv29y4nlq8g1wdxwfjwxd1 Cheers, Timo
[DISCUSS] FLIP-395: Deprecate Global Aggregator Manager
Hi all, I'd like to start a discussion of FLIP-395: Deprecate Global Aggregator Manager [1]. The Global Aggregate Manager was introduced in [2] to support event time synchronization across sources and, more generally, coordination of parallel tasks. AFAIK, this was only used in the Kinesis source for an early version of watermark alignment. The Operator Coordinator, introduced in FLIP-27, provides a more powerful and elegant solution for that need and is part of the new source API standard. FLIP-217 further provides a complete solution for watermark alignment of source splits on top of the Operator Coordinator mechanism. Furthermore, the Global Aggregate Manager manages state in the JobMaster object, causing problems for adaptive parallelism changes [3]. Therefore, I propose to deprecate the use of the Global Aggregate Manager, which can improve the maintainability of the Flink codebase without compromising its functionality. Looking forward to your feedback, thanks. [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-395%3A+Deprecate+Global+Aggregator+Manager [2] https://issues.apache.org/jira/browse/FLINK-10886 [3] https://issues.apache.org/jira/browse/FLINK-31245 Best, Zhanghao Chen
[DISCUSS] FLIP-397: Add config options for administrator JVM options
Hi devs, I'd like to start a discussion on FLIP-397: Add config options for administrator JVM options [1]. In production environments, users typically develop and operate their Flink jobs through a managed platform. Users may need to add JVM options to their Flink applications (e.g. to tune GC options), and they typically use the env.java.opts.x series of options to do so. Platform administrators also have a set of JVM options to apply by default, e.g. to use JVM 17, enable GC logging, or apply pretuned GC options. Both use cases need to set the same series of options and will clobber one another. Similar issues have been described in SPARK-23472 [2]. Therefore, I propose adding a set of default JVM options for administrator use that are prepended to the user-set extra JVM options. Looking forward to hearing from you. [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-397%3A+Add+config+options+for+administrator+JVM+options [2] https://issues.apache.org/jira/browse/SPARK-23472 Best, Zhanghao Chen
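A minimal sketch of how the two layers would combine in the Flink configuration (the `env.java.default-opts.*` key names are the ones discussed in this thread; the values are made-up examples, not recommendations):

```yaml
# flink-conf.yaml -- hypothetical values for illustration

# Set once by the platform administrator, applied to all jobs by default:
env.java.default-opts.all: "-XX:+UseG1GC -Xlog:gc*:file=/var/log/flink/gc.log"

# Set per job by the user; since the defaults are PREPENDED, these appear
# later on the JVM command line and win on conflicting flags:
env.java.opts.all: "-XX:MaxGCPauseMillis=100"
```

The resulting JVM command line would carry the administrator defaults first, followed by the user-set options, so a user can still override an individual flag without clobbering the rest of the administrator's settings.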
Re: [DISCUSS] Contribute Flink Doris Connector to the Flink community
Sounds great. Thanks for driving this. Best, Zhanghao Chen From: wudi <676366...@qq.com.INVALID> Sent: Sunday, November 26, 2023 18:22 To: dev@flink.apache.org Subject: [DISCUSS] Contribute Flink Doris Connector to the Flink community Hi all, At present, Flink connectors have been decoupled from Flink's repository [1]. Meanwhile, the Flink-Doris-Connector [3] has been maintained in the Apache Doris [2] community. I think the Flink Doris Connector can be migrated to the Flink community, because it is part of the Flink connectors and migrating it can also expand the ecosystem of Flink connectors. I volunteer to move this forward if I can. [1] https://cwiki.apache.org/confluence/display/FLINK/Externalized+Connector+development [2] https://doris.apache.org/ [3] https://github.com/apache/doris-flink-connector -- Brs, di.wu
Re: [VOTE] FLIP-391: Deprecate RuntimeContext#getExecutionConfig
+1 (non-binding) Best, Zhanghao Chen From: Junrui Lee Sent: Tuesday, November 28, 2023 12:34 To: dev@flink.apache.org Subject: [VOTE] FLIP-391: Deprecate RuntimeContext#getExecutionConfig Hi everyone, Thank you to everyone for the feedback on FLIP-391: Deprecate RuntimeContext#getExecutionConfig[1] which has been discussed in this thread [2]. I would like to start a vote for it. The vote will be open for at least 72 hours unless there is an objection or not enough votes. [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=278465937 [2]https://lists.apache.org/thread/cv4x3372g5txw1j20r5l1vwlqrvjoqq5
Re: [VOTE] FLIP-364: Improve the restart-strategy
+1 (non-binding) Best, Zhanghao Chen From: Rui Fan <1996fan...@gmail.com> Sent: Monday, November 13, 2023 11:01 To: dev Subject: [VOTE] FLIP-364: Improve the restart-strategy Hi everyone, Thank you to everyone for the feedback on FLIP-364: Improve the restart-strategy[1] which has been discussed in this thread [2]. I would like to start a vote for it. The vote will be open for at least 72 hours unless there is an objection or not enough votes. [1] https://cwiki.apache.org/confluence/x/uJqzDw [2] https://lists.apache.org/thread/5cgrft73kgkzkgjozf9zfk0w2oj7rjym Best, Rui
Re: [DISCUSS] FLIP-395: Deprecate Global Aggregator Manager
Hi Benchao, I think part of the reason is that a general global coordination mechanism is complex and hence subject to some internals changes in the future. Instead of directly exposing the full mechanism to users, it might be better to expose some well-defined subset of the feature set to users. I'm also ccing the email to Piotr and David for their suggestions on this. Best, Zhanghao Chen From: Benchao Li Sent: Monday, November 27, 2023 13:03 To: dev@flink.apache.org Subject: Re: [DISCUSS] FLIP-395: Deprecate Global Aggregator Manager +1 for the idea. Currently OperatorCoordinator is still marked as @Internal, shouldn't it be a public API already? Besides, GlobalAggregatorManager supports coordination between different operators, but OperatorCoordinator only supports coordination within one operator. And CoordinatorStore introduced in FLINK-24439 opens the door for multi operators. Again, should it also be a public API too? Weihua Hu 于2023年11月27日周一 11:05写道: > > Thanks Zhanghao for driving this FLIP. > > +1 for this. > > Best, > Weihua > > > On Mon, Nov 20, 2023 at 5:49 PM Zhanghao Chen > wrote: > > > Hi all, > > > > I'd like to start a discussion of FLIP-395: Deprecate Global Aggregator > > Manager [1]. > > > > Global Aggregate Manager was introduced in [2] to support event time > > synchronization across sources and more generally, coordination of parallel > > tasks. AFAIK, this was only used in the Kinesis source for an early version > > of watermark alignment. Operator Coordinator, introduced in FLIP-27, > > provides a more powerful and elegant solution for that need and is part of > > the new source API standard. FLIP-217 further provides a complete solution > > for watermark alignment of source splits on top of the Operator Coordinator > > mechanism. Furthermore, Global Aggregate Manager manages state in JobMaster > > object, causing problems for adaptive parallelism changes [3]. 
> > > > Therefore, I propose to deprecate the use of Global Aggregate Manager, > > which can improve the maintainability of the Flink codebase without > > compromising its functionality. > > > > Looking forward to your feedbacks, thanks. > > > > [1] > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-395%3A+Deprecate+Global+Aggregator+Manager > > [2] https://issues.apache.org/jira/browse/FLINK-10886 > > [3] https://issues.apache.org/jira/browse/FLINK-31245 > > > > Best, > > Zhanghao Chen > > -- Best, Benchao Li
Re: [DISCUSS] FLIP-403: High Availability Services for OLAP Scenarios
Thanks for driving this effort, Yangze! The proposal overall LGTM. Apart from the throughput enhancement in the OLAP scenario, the separation of the leader election/discovery services from the metadata persistence services will also make the HA implementation clearer and easier to maintain. Just a minor comment on naming: would it be better to rename PersistentServices to PersistenceServices, as we usually put a noun before Services? Best, Zhanghao Chen From: Yangze Guo Sent: Tuesday, December 19, 2023 17:33 To: dev Subject: [DISCUSS] FLIP-403: High Availability Services for OLAP Scenarios Hi there, We would like to start a discussion thread on "FLIP-403: High Availability Services for OLAP Scenarios" [1]. Currently, Flink's high availability service consists of two mechanisms: leader election/retrieval services for the JobManager and persistent services for job metadata. However, these mechanisms are set up in an "all or nothing" manner. In OLAP scenarios, we typically only require leader election/retrieval services for JobManager components, since jobs usually do not have a restart strategy. Additionally, the persistence of job states can negatively impact the cluster's throughput, especially for short query jobs. To address these issues, this FLIP proposes splitting the HighAvailabilityServices into LeaderServices and PersistentServices, and enabling users to independently configure the high availability strategies specifically related to jobs. Please find more details in the FLIP wiki document [1]. Looking forward to your feedback. [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-403+High+Availability+Services+for+OLAP+Scenarios Best, Yangze Guo
Re: [DISCUSS] FLIP-397: Add config options for administrator JVM options
Thanks everyone. I'll start voting after the New Year's holiday if there's no further comment. Best, Zhanghao Chen From: Benchao Li Sent: Friday, December 22, 2023 21:18 To: dev@flink.apache.org Subject: Re: [DISCUSS] FLIP-397: Add config options for administrator JVM options +1 from my side, I also met some scenarios that I wanted to set some JVM options by default for all Flink jobs before, such as '-XX:-DontCompileHugeMethods', without it, some generated big methods won't be optimized in JVM C2 compiler, leading to poor performance. Zhanghao Chen 于2023年11月27日周一 20:04写道: > > Hi devs, > > I'd like to start a discussion on FLIP-397: Add config options for > administrator JVM options [1]. > > In production environments, users typically develop and operate their Flink > jobs through a managed platform. Users may need to add JVM options to their > Flink applications (e.g. to tune GC options). They typically use the > env.java.opts.x series of options to do so. Platform administrators also have > a set of JVM options to apply by default, e.g. to use JVM 17, enable GC > logging, or apply pretuned GC options, etc. Both use cases will need to set > the same series of options and will clobber one another. Similar issues have > been described in SPARK-23472 [2]. > > Therefore, I propose adding a set of default JVM options for administrator > use that prepends the user-set extra JVM options. > > Looking forward to hearing from you. > > [1] > https://cwiki.apache.org/confluence/display/FLINK/FLIP-397%3A+Add+config+options+for+administrator+JVM+options > [2] https://issues.apache.org/jira/browse/SPARK-23472 > > Best, > Zhanghao Chen -- Best, Benchao Li
Re: [VOTE] FLIP-398: Improve Serialization Configuration And Usage In Flink
+1 (non-binding) Best, Zhanghao Chen From: Yong Fang Sent: Wednesday, December 27, 2023 14:54 To: dev Subject: [VOTE] FLIP-398: Improve Serialization Configuration And Usage In Flink Hi devs, Thanks for all feedback about the FLIP-398: Improve Serialization Configuration And Usage In Flink [1] which has been discussed in [2]. I'd like to start a vote for it. The vote will be open for at least 72 hours unless there is an objection or insufficient votes. [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-398%3A+Improve+Serialization+Configuration+And+Usage+In+Flink [2] https://lists.apache.org/thread/m67s4qfrh660lktpq7yqf9docvvf5o9l Best, Fang Yong
Re: [DISCUSS] FLIP-397: Add config options for administrator JVM options
Hi Xiangyu, The proposed new options are targeted at experienced Flink platform administrators rather than normal end users, and a one-to-one mapping from each non-default option to its default-option variant might be easier for users to understand. Also, although the JM and TM tend to use the same set of JVM args most of the time, there are cases where different sets of JVM args are preferable. So I am leaning towards the current design, WDYT? Best, Zhanghao Chen From: xiangyu feng Sent: Friday, December 29, 2023 20:20 To: dev@flink.apache.org Subject: Re: [DISCUSS] FLIP-397: Add config options for administrator JVM options Hi Zhanghao, Thanks for driving this. +1 for the overall idea. One minor question: do we need separate administrator JVM options for both JobManager and TaskManager, or just one administrator JVM option for all? I'm afraid that 6 JVM options (env.java.opts.all, env.java.default-opts.all, env.java.opts.jobmanager, env.java.default-opts.jobmanager, env.java.opts.taskmanager, env.java.default-opts.taskmanager) may confuse users. Regards, Xiangyu Yong Fang wrote on Wed, Dec 27, 2023, 15:36: > +1 for this, we have met jobs that need to set GC policies different from > the default ones to improve performance. Separating the default and > user-set ones can help us better manage them. > > Best, > Fang Yong > > On Fri, Dec 22, 2023 at 9:18 PM Benchao Li wrote: > > +1 from my side, > > I also met some scenarios where I wanted to set some JVM options by > > default for all Flink jobs, such as '-XX:-DontCompileHugeMethods'; without it, some generated big methods > > won't be optimized by the JVM C2 compiler, leading to poor performance. > > Zhanghao Chen wrote on Mon, Nov 27, 2023, 20:04: > > > Hi devs, > > > > > > I'd like to start a discussion on FLIP-397: Add config options for > > administrator JVM options [1]. > > > > > > In production environments, users typically develop and operate their > > Flink jobs through a managed platform.
Users may need to add JVM options > to > > their Flink applications (e.g. to tune GC options). They typically use > the > > env.java.opts.x series of options to do so. Platform administrators also > > have a set of JVM options to apply by default, e.g. to use JVM 17, enable > > GC logging, or apply pretuned GC options, etc. Both use cases will need > to > > set the same series of options and will clobber one another. Similar > issues > > have been described in SPARK-23472 [2]. > > > > > > Therefore, I propose adding a set of default JVM options for > > administrator use that prepends the user-set extra JVM options. > > > > > > Looking forward to hearing from you. > > > > > > [1] > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-397%3A+Add+config+options+for+administrator+JVM+options > > > [2] https://issues.apache.org/jira/browse/SPARK-23472 > > > > > > Best, > > > Zhanghao Chen > > > > > > > > -- > > > > Best, > > Benchao Li > > >
Re: [ANNOUNCE] New Apache Flink Committer - Alexander Fedulov
Congrats, Alex! Best, Zhanghao Chen From: Maximilian Michels Sent: Tuesday, January 2, 2024 20:15 To: dev Cc: Alexander Fedulov Subject: [ANNOUNCE] New Apache Flink Committer - Alexander Fedulov Happy New Year everyone, I'd like to start the year off by announcing Alexander Fedulov as a new Flink committer. Alex has been active in the Flink community since 2019. He has contributed more than 100 commits to Flink, its Kubernetes operator, and various connectors [1][2]. Especially noteworthy are his contributions on deprecating and migrating the old Source API functions and test harnesses, the enhancement to flame graphs, the dynamic rescale time computation in Flink Autoscaling, as well as all the small enhancements Alex has contributed which make a huge difference. Beyond code contributions, Alex has been an active community member with his activity on the mailing lists [3][4], as well as various talks and blog posts about Apache Flink [5][6]. Congratulations Alex! The Flink community is proud to have you. Best, The Flink PMC [1] https://github.com/search?type=commits&q=author%3Aafedulov+org%3Aapache [2] https://issues.apache.org/jira/browse/FLINK-28229?jql=status%20in%20(Resolved%2C%20Closed)%20AND%20assignee%20in%20(afedulov)%20ORDER%20BY%20resolved%20DESC%2C%20created%20DESC [3] https://lists.apache.org/list?dev@flink.apache.org:lte=100M:Fedulov [4] https://lists.apache.org/list?u...@flink.apache.org:lte=100M:Fedulov [5] https://flink.apache.org/2020/01/15/advanced-flink-application-patterns-vol.1-case-study-of-a-fraud-detection-system/ [6] https://www.ververica.com/blog/presenting-our-streaming-concepts-introduction-to-flink-video-series
[DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for improved state compatibility on parallelism change
Dear Flink devs, I'd like to start a discussion on FLIP 411: Chaining-agnostic Operator ID generation for improved state compatibility on parallelism change [1]. Currently, when users do not explicitly set operator UIDs, the chaining behavior will still affect state compatibility, as the generation of an Operator ID depends on its chained output nodes. For example, a simple source->sink DAG with source and sink chained together is state incompatible with an otherwise identical DAG with source and sink unchained (either because the parallelisms of the two ops are changed to be unequal or chaining is disabled). This greatly limits the flexibility to perform chain-breaking/building for performance tuning. The dependency on chained output nodes for Operator ID generation can be traced back to Flink 1.2. It is unclear at this point why chained output nodes are involved in the algorithm, but the following historical background might be related: prior to Flink 1.3, the Flink runtime took snapshots keyed by the operator ID of the first vertex in a chain, so it somewhat made sense to include chained output nodes in the algorithm, as chain-breaking/building was expected to break state compatibility anyway. Given that operator-level state recovery within a chain has been supported since Flink 1.3, I propose to introduce StreamGraphHasherV3, which is agnostic of the chaining behavior of operators, so that users are free to tune the parallelism of individual operators without worrying about state incompatibility. For backwards compatibility, we can make the V3 hasher an optional choice in Flink 1.19 and the default hasher in 2.0. Looking forward to your suggestions on it, thanks~ [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-411%3A+Chaining-agnostic+Operator+ID+generation+for+improved+state+compatibility+on+parallelism+change Best, Zhanghao Chen
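To illustrate the core problem with a deliberately simplified, hypothetical hashing scheme (this is NOT Flink's actual algorithm), any hash that mixes the chained outputs of a node into its input necessarily changes when the chain is broken, even though the node itself is unchanged:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

public class ChainHashSketch {
    // Toy stand-in for an operator-ID hasher: mixes the node's own
    // description with the descriptions of its chained output nodes.
    static String operatorId(String node, String... chainedOutputs) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(node.getBytes(StandardCharsets.UTF_8));
            for (String out : chainedOutputs) {
                // The chaining-dependent part: present only when chained.
                md.update(out.getBytes(StandardCharsets.UTF_8));
            }
            return HexFormat.of().formatHex(md.digest()).substring(0, 16);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Same source operator; only the chaining to the sink differs.
        String chained = operatorId("source", "sink");
        String unchained = operatorId("source");
        System.out.println(chained + " vs " + unchained);
        if (chained.equals(unchained)) {
            throw new AssertionError("expected breaking the chain to change the ID");
        }
        // A chaining-agnostic hasher (the V3 proposed here) would simply
        // drop the chained-output input, keeping the ID stable in both plans.
    }
}
```

Since state is addressed by operator ID on restore, the ID change above is exactly what makes the two otherwise identical plans state incompatible.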
Re: [Discuss] FLIP-407: Improve Flink Client performance in interactive scenarios
Thanks for driving this effort on improving the interactive use experience of Flink. The proposal overall looks good to me. Best, Zhanghao Chen From: xiangyu feng Sent: Tuesday, December 26, 2023 16:51 To: dev@flink.apache.org Subject: [Discuss] FLIP-407: Improve Flink Client performance in interactive scenarios Hi devs, I'm opening this thread to discuss FLIP-407: Improve Flink Client performance in interactive scenarios. The POC test results and design doc can be found at: FLIP-407 <https://cwiki.apache.org/confluence/display/FLINK/FLIP-407%3A+Improve+Flink+Client+performance+when+interacting+with+dedicated+Flink+Session+Clusters> . Currently, Flink Client is mainly designed for one-time interaction with the Flink Cluster. All the resources (HTTP connections, threads, HA services) and instances (ClusterDescriptor, ClusterClient, RestClient) are created and recycled for each interaction. This works well when users do not need to interact frequently with the Flink Cluster and also saves resource usage, since resources are recycled immediately after each use. However, in OLAP or StreamingWarehouse scenarios, users might submit interactive jobs to a dedicated Flink Session Cluster very often. In this case, we find that short queries that finish in less than 1s on the Flink Cluster still have an E2E latency greater than 2s. Hence, we propose this FLIP to improve Flink Client performance in this scenario. This could also improve the user experience when using session debug mode. The major change in this FLIP is a newly introduced option *'execution.interactive-client'*. When this option is enabled, Flink Client will reuse all the necessary resources to improve interactive performance, including: HA services, HTTP connections, threads, and all kinds of instances related to a long-running Flink Cluster. The default value of this option will be false, in which case Flink Client will behave as before. 
Also, this FLIP proposes a configurable RetryStrategy for the client when fetching results from the Flink Cluster. In interactive scenarios, this can save more than 15% of TM CPU usage without performance degradation. Looking forward to your feedback, thanks. Best regards, Xiangyu
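The retry behavior proposed above could look roughly like this stdlib-only sketch (the class and parameter names are illustrative, not the FLIP's actual API): an exponential backoff between result-fetch attempts lets sub-second queries finish quickly while long-running queries stop hammering the cluster with polls, which is where the TM CPU saving would come from.

```java
import java.time.Duration;

public class FetchRetrySketch {
    // Illustrative exponential-backoff strategy for result polling:
    // start fast so short queries keep low E2E latency, then back off
    // (up to a cap) so long queries don't burn CPU on both sides.
    static Duration backoff(int attempt, Duration initial, Duration max, double multiplier) {
        double millis = initial.toMillis() * Math.pow(multiplier, attempt);
        return Duration.ofMillis(Math.min((long) millis, max.toMillis()));
    }

    public static void main(String[] args) {
        Duration initial = Duration.ofMillis(20);
        Duration max = Duration.ofSeconds(1);
        for (int attempt = 0; attempt < 8; attempt++) {
            System.out.println("attempt " + attempt + " -> wait "
                    + backoff(attempt, initial, max, 2.0).toMillis() + "ms");
        }
    }
}
```

Making the initial delay, cap, and multiplier configurable is one plausible shape for the "configurable RetryStrategy" mentioned in the proposal.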
[VOTE] FLIP-397: Add config options for administrator JVM options
Hi everyone, Thanks for all the feedback on FLIP-397 [1], which proposes to add a set of default JVM options for administrator use that prepends the user-set extra JVM options for easier platform-wide JVM pre-tuning. It has been discussed in [2]. I'd like to start a vote. The vote will be open for at least 72 hours (until January 8th 12:00 GMT) unless there is an objection or insufficient votes. [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-397%3A+Add+config+options+for+administrator+JVM+options [2] https://lists.apache.org/thread/cflonyrfd1ftmyrpztzj3ywckbq41jzg Best, Zhanghao Chen
Re: [DISCUSS] FLIP-402: Extend ZooKeeper Curator configurations
Thanks for driving this. I'm not familiar with the listed advanced Curator configs, but the previously added switch for disabling ensemble tracking [1] saved us when deploying Flink in a cloud environment where ZK can only be accessed via URLs. That being said, +1 for the overall idea; these configs may help users in certain scenarios sooner or later. [1] https://issues.apache.org/jira/browse/FLINK-31780 Best, Zhanghao Chen From: Alex Nitavsky Sent: Thursday, December 14, 2023 21:20 To: dev@flink.apache.org Subject: [DISCUSS] FLIP-402: Extend ZooKeeper Curator configurations Hi all, I would like to start a discussion thread for: *FLIP-402: Extend ZooKeeper Curator configurations* [1] *Problem statement* Currently Flink is missing several Apache Curator configurations which could be useful for Flink deployments with ZooKeeper as the HA provider. *Proposed solution* We have inspected all possible options for Apache Curator and propose those which could be valuable for Flink users: - high-availability.zookeeper.client.authorization [2] - high-availability.zookeeper.client.maxCloseWaitMs [3] - high-availability.zookeeper.client.simulatedSessionExpirationPercent [4] The proposed way is to reflect those properties into Flink configuration options for Apache ZooKeeper. Looking forward to your feedback and suggestions. Kind regards Oleksandr [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-402%3A+Extend+ZooKeeper+Curator+configurations [2] https://curator.apache.org/apidocs/org/apache/curator/framework/CuratorFrameworkFactory.Builder.html#authorization(java.lang.String,byte%5B%5D) [3] https://curator.apache.org/apidocs/org/apache/curator/framework/CuratorFrameworkFactory.Builder.html#maxCloseWaitMs(int) [4] https://curator.apache.org/apidocs/org/apache/curator/framework/CuratorFrameworkFactory.Builder.html#simulatedSessionExpirationPercent(int)
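If adopted, the three options proposed above would presumably be set in flink-conf.yaml alongside the existing ZooKeeper HA options. The keys below are the ones quoted in the proposal; the values are purely illustrative placeholders:

```yaml
# Pass ZK auth info to Curator (scheme:auth form is illustrative):
high-availability.zookeeper.client.authorization: digest:flink:secret

# Max time Curator waits for background operations on close:
high-availability.zookeeper.client.maxCloseWaitMs: 2000

# Curator's simulated session expiration aggressiveness (0-100):
high-availability.zookeeper.client.simulatedSessionExpirationPercent: 100
```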
Re: FLIP-413: Enable unaligned checkpoints by default
Hi Piotr, As a platform administrator who runs thousands of Flink jobs, I'd be against the idea of enabling unaligned cp by default for our jobs. It may help a significant portion of users, but the subtle issues around unaligned CP for a few jobs will probably trigger a lot more on-calls and incidents. From my point of view, we'd better not enable it by default before removing all the limitations listed in https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/ops/state/checkpointing_under_backpressure/#limitations. Best, Zhanghao Chen From: Piotr Nowojski Sent: Friday, January 5, 2024 21:41 To: dev Subject: FLIP-413: Enable unaligned checkpoints by default Hi! I would like to propose enabling unaligned checkpoints by default and simultaneously increasing the aligned checkpoint timeout from 0ms to 5s. I think this change is the right one to make for the majority of Flink users. For more rationale please take a look at the short FLIP-413 [1]. What do you all think? Best, Piotrek https://cwiki.apache.org/confluence/display/FLINK/FLIP-413%3A+Enable+unaligned+checkpoints+by+default
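For reference, what the proposal would make the default corresponds (to the best of my knowledge, as of the Flink 1.18 docs; double-check the keys for your version) to this explicit configuration today:

```yaml
# Enable unaligned checkpoints, but only fall back to them after a
# barrier has been stuck in alignment for 5s (the proposed timeout):
execution.checkpointing.unaligned: true
execution.checkpointing.aligned-checkpoint-timeout: 5s
```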
Re: [VOTE] FLIP-397: Add config options for administrator JVM options
Thank you all! Closing the vote. The result will be sent in a separate email. Best, Zhanghao Chen From: Zhanghao Chen Sent: Thursday, January 4, 2024 10:29 To: dev Subject: [VOTE] FLIP-397: Add config options for administrator JVM options Hi everyone, Thanks for all the feedbacks on FLIP-397 [1], which proposes to add a set of default JVM options for administrator use that prepends the user-set extra JVM options for easier platform-wide JVM pre-tuning. It has been discussed in [2]. I'd like to start a vote. The vote will be open for at least 72 hours (until January 8th 12:00 GMT) unless there is an objection or insufficient votes. [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-397%3A+Add+config+options+for+administrator+JVM+options [2] https://lists.apache.org/thread/cflonyrfd1ftmyrpztzj3ywckbq41jzg Best, Zhanghao Chen
[RESULT][VOTE] FLIP-397: Add config options for administrator JVM options
Hi everyone, I'm happy to announce that FLIP-397: Add config options for administrator JVM options, has been accepted with 4 approving votes (3 binding) [1]: - Benchao Li (binding) - Rui Fan (binding) - Xiangyu Feng (non-binding) - Yong Fang (binding) There're no disapproving votes. Thanks again to everyone who participated in the discussion and voting. [1] https://lists.apache.org/thread/5qlr0xl0oyc9dnvcjr0q39pcrzyx4ohb Best, Zhanghao Chen
Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for improved state compatibility on parallelism change
Hi David, Thanks for the comments. AFAIK, unaligned checkpoints are disabled for pointwise connections according to [1]; let's wait for Piotr's confirmation. The issue itself is not directly related to this proposal, either. If a user manually specifies UIDs for each of the chained operators and has unaligned checkpoints enabled, we will encounter the same issue if they decide to break the chain on a later restart and try to recover from a retained cp. [1] https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/ops/state/checkpointing_under_backpressure/ Best, Zhanghao Chen From: David Morávek Sent: Wednesday, January 10, 2024 6:26 To: dev@flink.apache.org ; Piotr Nowojski Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for improved state compatibility on parallelism change Hi Zhanghao, Thanks for the FLIP. What you're proposing makes a lot of sense +1 Have you thought about how this works with unaligned checkpoints in case you go from unchained to chained? I think it should be fine because this scenario should only apply to forward/rebalance scenarios where we, as far as I recall, force alignment anyway, so there should be no exchanges to snapshot. It might just work, but something to double-check. Maybe @Piotr Nowojski could confirm it. Best, D. On Wed, Jan 3, 2024 at 7:10 AM Zhanghao Chen wrote: > Dear Flink devs, > > I'd like to start a discussion on FLIP 411: Chaining-agnostic Operator ID > generation for improved state compatibility on parallelism change [1]. > > Currently, when user does not explicitly set operator UIDs, the chaining > behavior will still affect state compatibility, as the generation of the > Operator ID is dependent on its chained output nodes. For example, a simple > source->sink DAG with source and sink chained together is state > incompatible with an otherwise identical DAG with source and sink unchained > (either because the parallelisms of the two ops are changed to be unequal > or chaining is disabled). 
This greatly limits the flexibility to perform > chain-breaking/building for performance tuning. > > The dependency on chained output nodes for Operator ID generation can be > traced back to Flink 1.2. It is unclear at this point on why chained output > nodes are involved in the algorithm, but the following history background > might be related: prior to Flink 1.3, Flink runtime takes the snapshots by > the operator ID of the first vertex in a chain, so it somewhat makes sense > to include chained output nodes into the algorithm as > chain-breaking/building is expected to break state-compatibility anyway. > > Given that operator-level state recovery within a chain has long been > supported since Flink 1.3, I propose to introduce StreamGraphHasherV3 that > is agnostic of the chaining behavior of operators, so that users are free > to tune the parallelism of individual operators without worrying about > state incompatibility. We can make the V3 hasher an optional choice in > Flink 1.19, and make it the default hasher in 2.0 for backwards > compatibility. > > Looking forward to your suggestions on it, thanks~ > > [1] > https://cwiki.apache.org/confluence/display/FLINK/FLIP-411%3A+Chaining-agnostic+Operator+ID+generation+for+improved+state+compatibility+on+parallelism+change > > Best, > Zhanghao Chen >
Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for improved state compatibility on parallelism change
Hi Yu, I haven't thought too much about the compatibility design before. By the nature of the problem, it's impossible to make V3 compatible with V2; what we can do is somewhat better inform users when switching the hasher, but I don't have any good idea so far. Do you have any suggestions on this? Best, Zhanghao Chen From: Yu Chen Sent: Thursday, January 11, 2024 13:52 To: dev@flink.apache.org Cc: Piotr Nowojski ; zhanghao.c...@outlook.com Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for improved state compatibility on parallelism change Hi Zhanghao, Thanks for driving this, that’s really painful for us when we need to switch the config `pipeline.operator-chaining`. But I have a concern: according to the FLIP description, modifying the `isChainable`-related code in `StreamGraphHasherV2` will cause the generated operator IDs to change, which will result in users being unable to recover from the old state (old and new Operator IDs can't be mapped). Therefore switching the hasher strategy (V2->V3 or V3->V2) will lead to an incompatibility; is there any relevant compatibility design considered? Best, Yu Chen On Jan 10, 2024, 10:25, Zhanghao Chen wrote: Hi David, Thanks for the comments. AFAIK, unaligned checkpoints are disabled for pointwise connections according to [1], let's wait Piotr for confirmation. The issue itself is not directly related to this proposal as well. If a user manually specifies UIDs for each of the chained operators and has unaligned checkpoints enabled, we will encounter the same issue if they decide to break the chain on a later restart and try to recover from a retained cp. 
[1] https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/ops/state/checkpointing_under_backpressure/ Best, Zhanghao Chen From: David Morávek Sent: Wednesday, January 10, 2024 6:26 To: dev@flink.apache.org ; Piotr Nowojski Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for improved state compatibility on parallelism change Hi Zhanghao, Thanks for the FLIP. What you're proposing makes a lot of sense +1 Have you thought about how this works with unaligned checkpoints in case you go from unchained to chained? I think it should be fine because this scenario should only apply to forward/rebalance scenarios where we, as far as I recall, force alignment anyway, so there should be no exchanges to snapshot. It might just work, but something to double-check. Maybe @Piotr Nowojski could confirm it. Best, D. On Wed, Jan 3, 2024 at 7:10 AM Zhanghao Chen wrote: Dear Flink devs, I'd like to start a discussion on FLIP 411: Chaining-agnostic Operator ID generation for improved state compatibility on parallelism change [1]. Currently, when user does not explicitly set operator UIDs, the chaining behavior will still affect state compatibility, as the generation of the Operator ID is dependent on its chained output nodes. For example, a simple source->sink DAG with source and sink chained together is state incompatible with an otherwise identical DAG with source and sink unchained (either because the parallelisms of the two ops are changed to be unequal or chaining is disabled). This greatly limits the flexibility to perform chain-breaking/building for performance tuning. The dependency on chained output nodes for Operator ID generation can be traced back to Flink 1.2. 
It is unclear at this point on why chained output nodes are involved in the algorithm, but the following history background might be related: prior to Flink 1.3, Flink runtime takes the snapshots by the operator ID of the first vertex in a chain, so it somewhat makes sense to include chained output nodes into the algorithm as chain-breaking/building is expected to break state-compatibility anyway. Given that operator-level state recovery within a chain has long been supported since Flink 1.3, I propose to introduce StreamGraphHasherV3 that is agnostic of the chaining behavior of operators, so that users are free to tune the parallelism of individual operators without worrying about state incompatibility. We can make the V3 hasher an optional choice in Flink 1.19, and make it the default hasher in 2.0 for backwards compatibility. Looking forward to your suggestions on it, thanks~ [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-411%3A+Chaining-agnostic+Operator+ID+generation+for+improved+state+compatibility+on+parallelism+change Best, Zhanghao Chen
Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for improved state compatibility on parallelism change
Thanks for the input, Piotr. It might still be possible to make it compatible with the old snapshots, following the direction of FLINK-5290<https://issues.apache.org/jira/browse/FLINK-5290> suggested by Yu. I'll discuss more details with Yu. Best, Zhanghao Chen From: Piotr Nowojski Sent: Friday, January 12, 2024 1:55 To: Yu Chen Cc: Zhanghao Chen ; dev@flink.apache.org Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for improved state compatibility on parallelism change Hi, Using unaligned checkpoints is orthogonal to this FLIP. Yes, unaligned checkpoints are not supported for pointwise connections, so most of the cases go away anyway. It is possible to switch from unchained to chained subtasks by removing a keyBy exchange, and this would be a problem, but that's just one of the things that we claim that unaligned checkpoints do not support [1]. But as I stated above, this is an orthogonal issue to this FLIP. Regarding the proposal itself, generally speaking it makes sense to me as well. However I'm quite worried about the compatibility and/or migration path. The: > (v2.0) Make HasherV3 the default hasher, mark HasherV2 deprecated. step would break the compatibility with Flink 1.xx snapshots. But as this is for v2.0, maybe that's not the end of the world? Best, Piotrek [1] https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/checkpoints_vs_savepoints/#capabilities-and-limitations On Thu, Jan 11, 2024 at 12:10, Yu Chen <yuchen.e...@gmail.com> wrote: Hi Zhanghao, Actually, Stefan has done similar compatibility work in the early FLINK-5290[1], where he introduced the legacyStreamGraphHashers list for hasher backward compatibility. 
We have attempted to implement a similar feature in the internal version of Flink and tried to include the new hasher as part of the legacyStreamGraphHashers, which would ensure that the corresponding Operator State can be found at restore time while ignoring the chaining condition (without changing the default hasher). However, we have found that such a solution may lead to some unexpected situations in some cases, and I have not had time to find the root cause recently. If you're interested, I'd be happy to discuss it with you and try to solve the problem. [1] https://issues.apache.org/jira/browse/FLINK-5290 Best, Yu Chen On Jan 11, 2024, 15:07, Zhanghao Chen <zhanghao.c...@outlook.com> wrote: Hi Yu, I haven't thought too much about the compatibility design before. By the nature of the problem, it's impossible to make V3 compatible with V2, what we can do is to somewhat better inform users when switching the hasher, but I don't have any good idea so far. Do you have any suggestions on this? Best, Zhanghao Chen From: Yu Chen <yuchen.e...@gmail.com> Sent: Thursday, January 11, 2024 13:52 To: dev@flink.apache.org Cc: Piotr Nowojski <piotr.nowoj...@gmail.com>; zhanghao.c...@outlook.com Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for improved state compatibility on parallelism change Hi Zhanghao, Thanks for driving this, that’s really painful for us when we need to switch config `pipeline.operator-chaining`. But I have a concern: according to the FLIP description, modifying `isChainable` related code in `StreamGraphHasherV2` will cause the generated operator id to be changed, which will result in the user unable to recover from the old state (old and new Operator IDs can't be mapped). 
Therefore switching the hasher strategy (V2->V3 or V3->V2) will lead to an incompatibility; is there any relevant compatibility design considered? Best, Yu Chen On Jan 10, 2024, 10:25, Zhanghao Chen <zhanghao.c...@outlook.com> wrote: Hi David, Thanks for the comments. AFAIK, unaligned checkpoints are disabled for pointwise connections according to [1], let's wait Piotr for confirmation. The issue itself is not directly related to this proposal as well. If a user manually specifies UIDs for each of the chained operators and has unaligned checkpoints enabled, we will encounter the same issue if they decide to break the chain on a later restart and try to recover from a retained cp. [1] https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/ops/state/checkpointing_under_backpressure/ Best, Zhanghao Chen From: David Morávek <d...@apache.org> Sent: Wednesday, January 10, 2024 6:26 To: dev@flink.apache.org ; Piotr Nowojski <piotr.nowoj...@gmail.com> Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator
Re: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager
Thanks for driving this topic. I think this FLIP could help clean up the codebase to make it easier to maintain. +1 on it. Best, Zhanghao Chen From: Weihua Hu Sent: Monday, February 27, 2023 20:40 To: dev Subject: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager Hi everyone, I would like to begin a discussion on FLIP-298: Unifying the Implementation of SlotManager[1]. There are currently two types of SlotManager in Flink: DeclarativeSlotManager and FineGrainedSlotManager. FineGrainedSlotManager should behave as DeclarativeSlotManager if the user does not configure the slot request profile. Therefore, this FLIP aims to unify the implementation of SlotManager in order to reduce maintenance costs. Looking forward to hearing from you. [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager Best, Weihua
[DISCUSS] Deprecating GlobalAggregateManager
Hi dev, I'd like to open a discussion on deprecating the Global Aggregate Manager in favor of the Operator Coordinator. 1. Global Aggregate Manager is rarely used and can be replaced by Operator Coordinator. Global Aggregate Manager was introduced in [1] to support event time synchronization across sources and, more generally, coordination of parallel tasks. AFAIK, this was only used in the Kinesis source [2] for an early version of watermark alignment. Operator Coordinator, introduced in [3], provides a more powerful and elegant solution for that need and is part of the new source API standard. 2. Global Aggregate Manager manages state in the JobMaster object, causing problems for adaptive parallelism changes. It maintains state (the accumulators field in JobMaster) in JM memory. The accumulator state content is defined in user code. In my company, a user stores task parallelism in the accumulator, assuming task parallelism never changes. However, this assumption is broken when using the adaptive scheduler. See [4] for more details. Therefore, I think we should deprecate the use of Global Aggregate Manager, which can improve the maintainability of the Flink codebase without compromising its functionality. Looking forward to your opinions on this. [1] https://issues.apache.org/jira/browse/FLINK-10886 [2] https://github.com/apache/flink-connector-aws/blob/d0817fecdcaa53c4bf039761c2d1a16e8fb9f89b/flink-connector-kinesis/src/main/java/org/apache/flink/streaming/connectors/kinesis/util/JobManagerWatermarkTracker.java [3] https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface#FLIP27:RefactorSourceInterface-SplitEnumerator [4] https://issues.apache.org/jira/browse/FLINK-31245 Best, Zhanghao Chen
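The rescaling pitfall in point 2 can be sketched with plain Java (a toy stand-in for the JM-side accumulator map, not actual Flink code): state keyed on an assumed-fixed parallelism silently goes stale when the adaptive scheduler rescales, because the JM-side map outlives the tasks and is never reset.

```java
import java.util.HashMap;
import java.util.Map;

public class AccumulatorPitfall {
    // Toy stand-in for the JM-side accumulator state of the
    // GlobalAggregateManager: lives in JM memory, outlives the tasks.
    static final Map<String, Integer> accumulators = new HashMap<>();

    // Each subtask registers itself and (as in the report above) stores
    // the job's parallelism, assuming it never changes.
    static void registerSubtasks(int parallelism) {
        for (int i = 0; i < parallelism; i++) {
            accumulators.put("subtask-" + i, parallelism);
        }
    }

    public static void main(String[] args) {
        registerSubtasks(4); // job starts with parallelism 4
        registerSubtasks(2); // adaptive scheduler rescales to 2
        // The map was never reset: entries for subtasks 2 and 3 linger,
        // and "subtask-0" now disagrees with them about the parallelism.
        System.out.println(accumulators);
        if (accumulators.size() != 4) throw new AssertionError();
        if (accumulators.get("subtask-3") != 4) throw new AssertionError(); // stale
        if (accumulators.get("subtask-0") != 2) throw new AssertionError();
    }
}
```

An OperatorCoordinator, by contrast, is scoped to the job graph and participates in checkpointing, so its state can be managed consistently across restarts and rescalings.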
Re: [VOTE] FLIP-291: Externalized Declarative Resource Management
Thanks for driving this. +1 (non-binding) Best, Zhanghao Chen From: David Morávek Sent: Tuesday, February 28, 2023 21:46 To: dev Subject: [VOTE] FLIP-291: Externalized Declarative Resource Management Hi Everyone, I want to start the vote on FLIP-291: Externalized Declarative Resource Management [1]. The FLIP was discussed in this thread [2]. The goal of the FLIP is to enable external declaration of the resource requirements of a running job. The vote will last for at least 72 hours (Friday, 3rd of March, 15:00 CET) unless there is an objection or insufficient votes. [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management [2] https://lists.apache.org/thread/b8fnj127jsl5ljg6p4w3c4wvq30cnybh Best, D.
Re: [Vote] FLIP-298: Unifying the Implementation of SlotManager
Thanks Weihua. +1 (non-binding) Best, Zhanghao Chen From: Weihua Hu Sent: Thursday, March 9, 2023 13:27 To: dev Subject: [Vote] FLIP-298: Unifying the Implementation of SlotManager Hi Everyone, I would like to start the vote on FLIP-298: Unifying the Implementation of SlotManager [1]. The FLIP was discussed in this thread [2]. This FLIP aims to unify the implementation of SlotManager in order to reduce maintenance costs. The vote will last for at least 72 hours (03/14, 15:00 UTC+8) unless there is an objection or insufficient votes. Thank you all. [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager [2]https://lists.apache.org/thread/ocssfxglpc8z7cto3k8p44mrjxwr67r9 Best, Weihua
Re: [VOTE] FLIP-407: Improve Flink Client performance in interactive scenarios
+1 (non-binding) Best, Zhanghao Chen From: xiangyu feng Sent: Thursday, January 11, 2024 19:44 To: dev@flink.apache.org Subject: [VOTE] FLIP-407: Improve Flink Client performance in interactive scenarios Hi all, I would like to start the vote for FLIP-407: Improve Flink Client performance in interactive scenarios[1]. This FLIP was discussed in this thread [2]. The vote will be open for at least 72 hours unless there is an objection or insufficient votes. Regards, Xiangyu [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-407%3A+Improve+Flink+Client+performance+in+interactive+scenarios [2] https://lists.apache.org/thread/ccsv66ygffgqbv956bnknbpllj4t24kj
Re: [NOTICE] master branch cannot compile for now
Hi devs, The issue has already been resolved via a separate PR [1]. Feel free to rebase master and proceed~ [1] https://github.com/apache/flink/pull/24200 Best, Zhanghao Chen From: Benchao Li Sent: Friday, January 26, 2024 11:39 To: dev Subject: [NOTICE] master branch cannot compile for now Hi devs, I merged FLINK-33263 [1] this morning (10:16 +8:00), and it was based on an old commit which used an older guava version, so currently the master branch cannot compile. Zhanghao discovered this in FLINK-33264 [2], and a hotfix commit has been proposed in the same PR; hopefully we can merge it after CI passes (it may take a few hours). Sorry for the inconvenience. [1] https://github.com/apache/flink/pull/24128 [2] https://github.com/apache/flink/pull/24133 -- Best, Benchao Li
Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for improved state compatibility on parallelism change
After offline discussion with @Yu Chen, I've updated the FLIP [1] to include a design that allows for a compatible hasher upgrade by adding StreamGraphHasherV2 to the legacy hasher list, which is actually a revival of the idea from FLINK-5290 [2] when StreamGraphHasherV2 was introduced in Flink 1.2. We're targeting to make V3 the default hasher in Flink 1.20, given that state compatibility is no longer an issue. Please take a look when you have a chance, and I'd like to especially thank @Yu Chen for the thorough offline discussion and code-debugging help that made this possible. [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-411%3A+Chaining-agnostic+Operator+ID+generation+for+improved+state+compatibility+on+parallelism+change [2] https://issues.apache.org/jira/browse/FLINK-5290 Best, Zhanghao Chen ____ From: Zhanghao Chen Sent: Friday, January 12, 2024 10:46 To: Piotr Nowojski ; Yu Chen Cc: dev@flink.apache.org Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for improved state compatibility on parallelism change Thanks for the input, Piotr. It might still be possible to make it compatible with the old snapshots, following the direction of FLINK-5290<https://issues.apache.org/jira/browse/FLINK-5290> suggested by Yu. I'll discuss with Yu on more details. Best, Zhanghao Chen From: Piotr Nowojski Sent: Friday, January 12, 2024 1:55 To: Yu Chen Cc: Zhanghao Chen ; dev@flink.apache.org Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for improved state compatibility on parallelism change Hi, Using unaligned checkpoints is orthogonal to this FLIP. Yes, unaligned checkpoints are not supported for pointwise connections, so most of the cases go away anyway. 
It is possible to switch from unchained to chained subtasks by removing a keyBy exchange, and this would be a problem, but that's just one of the things that we claim that unaligned checkpoints do not support [1]. But as I stated above, this is an orthogonal issue to this FLIP. Regarding the proposal itself, generally speaking it makes sense to me as well. However I'm quite worried about the compatibility and/or migration path. The: > (v2.0) Make HasherV3 the default hasher, mark HasherV2 deprecated. step would break the compatibility with Flink 1.xx snapshots. But as this is for v2.0, maybe that's not the end of the world? Best, Piotrek [1] https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/checkpoints_vs_savepoints/#capabilities-and-limitations On Thu, Jan 11, 2024 at 12:10, Yu Chen <yuchen.e...@gmail.com> wrote: Hi Zhanghao, Actually, Stefan has done similar compatibility work in the early FLINK-5290[1], where he introduced the legacyStreamGraphHashers list for hasher backward compatibility. We have attempted to implement a similar feature in the internal version of FLINK and tried to include the new hasher as part of the legacyStreamGraphHashers, which would ensure that the corresponding Operator State could be found at restore while ignoring the chaining condition(without changing the default hasher). However, we have found that such a solution may lead to some unexpected situations in some cases. While I have no time to find out the root cause recently. If you're interested, I'd be happy to discuss it with you and try to solve the problem. [1] https://issues.apache.org/jira/browse/FLINK-5290 Best, Yu Chen On Jan 11, 2024, 15:07, Zhanghao Chen <zhanghao.c...@outlook.com> wrote: Hi Yu, I haven't thought too much about the compatibility design before. By the nature of the problem, it's impossible to make V3 compatible with V2, what we can do is to somewhat better inform users when switching the hasher, but I don't have any good idea so far. 
Do you have any suggestions on this? Best, Zhanghao Chen From: Yu Chen <yuchen.e...@gmail.com> Sent: Thursday, January 11, 2024 13:52 To: dev@flink.apache.org <dev@flink.apache.org> Cc: Piotr Nowojski <piotr.nowoj...@gmail.com>; zhanghao.c...@outlook.com <zhanghao.c...@outlook.com> Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for improved state compatibility on parallelism change Hi Zhanghao, Thanks for driving this; switching the config `pipeline.operator-chaining` is really painful for us. But I have a concern: according to the FLIP description, modifying the `isChainable`-related code in `StreamGraphHasherV2` will cause the generated operator IDs to change, which will leave users unable to recover from the old state (old and new Operator IDs can't be mapped
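The concern raised here, that operator IDs change once the chaining condition changes, can be illustrated with a toy hash in the spirit of StreamGraphHasherV2. This is a hypothetical simplification, not Flink's actual algorithm: the only point it demonstrates is that mixing the chaining condition into the hash makes the same operator's ID differ before and after a chain breaks.

```python
import hashlib

def operator_id(node_index: int, chained_output_indexes) -> str:
    # Toy version of hasher-v2-style behavior (illustrative only): the hash
    # covers the node and which of its outputs are chainable, so the same
    # operator gets a different ID once chaining changes.
    h = hashlib.sha256(str(node_index).encode())
    for idx in chained_output_indexes:
        h.update(str(idx).encode())
    return h.hexdigest()[:16]

# Source (node 0) chained with an expensive map (node 1):
id_chained = operator_id(0, chained_output_indexes=[1])
# The map's parallelism is raised, breaking the chain:
id_unchained = operator_id(0, chained_output_indexes=[])
# The two IDs differ, so state stored under id_chained can no longer be
# matched at restore, which is exactly the incompatibility discussed here.
```

A chaining-agnostic hasher, as proposed in the FLIP, would simply leave the chained outputs out of the hash input.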
Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for improved state compatibility on parallelism change
Hi Chesnay, AFAIK, there's no way to set UIDs for a SQL job; it'll be great if you can share how you allow UID setting for SQL jobs. We've explored providing a visualized DAG editor for SQL jobs that allows UID setting on our internal platform, but most users found it too complicated to use. Another possible way is to utilize SQL hints, but that's complicated as well. From our experience, many SQL users are not familiar with Flink; what they want is an experience similar to writing normal SQL in MySQL, without involving extra concepts like the DAG and the UID. In fact, some DataStream and PyFlink users also share the same concern. On the other hand, some performance tuning is inevitable for long-running jobs in production, and parallelism tuning is among the most common techniques. FLIP-367 [1] and FLIP-146 [2] allow users to tune the parallelism of sources and sinks, and both were well-received in their discussion threads. Users definitely don't want to lose state after parallelism tuning, which is highly risky at present. Putting these together, I think the FLIP has high value in production. Through offline discussion, I learnt that multiple companies have developed or are trying to develop similar hasher changes in their internal distributions, including ByteDance, Xiaohongshu, and Bilibili. It'll be great if we can improve the SQL experience for all community users as well, WDYT? Best, Zhanghao Chen From: Chesnay Schepler Sent: Thursday, February 8, 2024 2:01 To: dev@flink.apache.org ; Zhanghao Chen ; Piotr Nowojski ; Yu Chen Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for improved state compatibility on parallelism change The FLIP is a bit weird to be honest. It only applies in cases where users haven't set uids, but that goes against best-practices and as far as I'm told SQL also sets UIDs everywhere. I'm wondering if this is really worth the effort. 
On 07/02/2024 10:23, Zhanghao Chen wrote: > After offline discussion with @Yu Chen<mailto:yuchen.e...@gmail.com>, I've > updated the FLIP [1] to include a design that allows for compatible hasher > upgrade by adding StreamGraphHasherV2 to the legacy hasher list, which is > actually a revival of the idea from FLINK-5290 [2] when StreamGraphHasherV2 > was introduced in Flink 1.2. We're targeting to make V3 the default hasher in > Flink 1.20 given that state-compatibility is no longer an issue. Take a > review when you have a chance, and I'd like to especially thank @Yu > Chen<mailto:yuchen.e...@gmail.com> for the thorough offline discussion and > code debugging help to make this possible. > > [1] > https://cwiki.apache.org/confluence/display/FLINK/FLIP-411%3A+Chaining-agnostic+Operator+ID+generation+for+improved+state+compatibility+on+parallelism+change > [2] https://issues.apache.org/jira/browse/FLINK-5290 > > Best, > Zhanghao Chen > > From: Zhanghao Chen > Sent: Friday, January 12, 2024 10:46 > To: Piotr Nowojski ; Yu Chen > Cc: dev@flink.apache.org > Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for > improved state compatibility on parallelism change > > Thanks for the input, Piotr. It might still be possible to make it compatible > with the old snapshots, following the direction of > FLINK-5290<https://issues.apache.org/jira/browse/FLINK-5290> suggested by Yu. > I'll discuss the details with Yu. > > Best, > Zhanghao Chen > > From: Piotr Nowojski > Sent: Friday, January 12, 2024 1:55 > To: Yu Chen > Cc: Zhanghao Chen ; dev@flink.apache.org > > Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for > improved state compatibility on parallelism change > > Hi, > > Using unaligned checkpoints is orthogonal to this FLIP. > > Yes, unaligned checkpoints are not supported for pointwise connections, so > most of the cases go away anyway. 
> It is possible to switch from unchained to chained subtasks by removing a > keyBy exchange, and this would be > a problem, but that's just one of the things that we claim that unaligned > checkpoints do not support [1]. But as > I stated above, this is an orthogonal issue to this FLIP. > > Regarding the proposal itself, generally speaking it makes sense to me as > well. However I'm quite worried about > the compatibility and/or migration path. The: >> (v2.0) Make HasherV3 the default hasher, mark HasherV2 deprecated. > step would break the compatibility with Flink 1.xx snapshots. But as this is > for v2.0, maybe that's not the end of > the world? > > Best, > Piotrek > > [1] > https://nightlies.apach
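The legacy-hasher fallback discussed in this thread (keeping older hashers around so state written under old operator IDs can still be matched at restore) can be sketched roughly as follows. This is a simplified illustration with made-up hasher functions and names, not Flink's actual implementation:

```python
# Sketch of the legacy-hasher idea from FLINK-5290 (illustrative only):
# each node gets a primary ID from the default hasher plus alternative IDs
# from legacy hashers, and restored state may match any of them.

def generate_id_pairs(nodes, default_hasher, legacy_hashers):
    """Map each node to (primary_id, [legacy_ids...])."""
    primary = {n: default_hasher(n) for n in nodes}
    legacy = [{n: h(n) for n in nodes} for h in legacy_hashers]
    return {n: (primary[n], [alt[n] for alt in legacy]) for n in nodes}

def match_restored_state(saved_ids, id_pairs):
    """At restore, a node claims saved state if its primary ID or any
    legacy ID is found in the snapshot."""
    mapping = {}
    for node, (primary, legacy_ids) in id_pairs.items():
        for candidate in [primary] + legacy_ids:
            if candidate in saved_ids:
                mapping[node] = candidate
                break
    return mapping

# A snapshot written with the old (v2) hasher stays restorable after v3
# becomes the default, because v2 remains in the legacy list:
nodes = ["source", "map"]
pairs = generate_id_pairs(nodes,
                          default_hasher=lambda n: f"v3-{n}",
                          legacy_hashers=[lambda n: f"v2-{n}"])
restored = match_restored_state({"v2-source", "v2-map"}, pairs)
```

This is also why the migration question matters: once the legacy list is dropped (e.g. in a 2.0 release), snapshots whose IDs only exist under the old hasher can no longer be matched.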
Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for improved state compatibility on parallelism change
Hi Piotr, Thanks for the comment. I agree that the compiled plan is the ultimate tool for Flink SQL if one wants to make any changes to the query later, and this FLIP indeed is not essential in this sense. However, the compiled plan is still too complicated for Flink newbies from my point of view. As I mentioned previously, our internal platform provides a visualized tool for editing the compiled plan, but most users still find it complex. Therefore, the FLIP can still benefit users with better usability, and the proposed changes are actually quite lightweight (just copying a new hasher with 2 lines deleted + extending the OperatorIdPair data structure) without much extra effort. Best, Zhanghao Chen From: Piotr Nowojski Sent: Thursday, February 8, 2024 14:50 To: Zhanghao Chen Cc: Chesnay Schepler ; dev@flink.apache.org ; Yu Chen Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for improved state compatibility on parallelism change Hey > AFAIK, there's no way to set UIDs for a SQL job, AFAIK you can't set UID manually, but Flink SQL generates a compiled plan of a query with embedded UIDs. As I understand it, using a compiled plan is the preferred (only?) way for Flink SQL if one wants to make any changes to the query later on or support Flink's runtime upgrades, without losing the state. If that's the case, what would be the usefulness of this FLIP? Only for DataStream API users that didn't know that they should have manually configured UIDs? But they have the workaround to actually post factum add the UIDs anyway, right? So maybe indeed Chesnay is right that this FLIP is not that helpful/worth the extra effort? Best, Piotrek On Thu, Feb 8, 2024 at 03:55, Zhanghao Chen wrote: > Hi Chesnay, > > AFAIK, there's no way to set UIDs for a SQL job, it'll be great if you can > share how you allow UID setting for SQL jobs. 
We've explored providing a > visualized DAG editor for SQL jobs that allows UID setting on our internal > platform, but most users found it too complicated to use. Another > possible way is to utilize SQL hints, but that's complicated as well. From > our experience, many SQL users are not familiar with Flink, what they want > is an experience similar to writing normal SQL in MySQL, without > involving extra concepts like the DAG and the UID. In fact, some > DataStream and PyFlink users also share the same concern. > > On the other hand, some performance tuning is inevitable for > long-running jobs in production, and parallelism tuning is among the most > common techniques. FLIP-367 [1] and FLIP-146 [2] allow users to tune the > parallelism of sources and sinks, and both were well-received in their > discussion threads. Users definitely don't want to lose state after > parallelism tuning, which is highly risky at present. > > Putting these together, I think the FLIP has high value in production. > Through offline discussion, I learnt that multiple companies have developed > or are trying to develop similar hasher changes in their internal distributions, > including ByteDance, Xiaohongshu, and Bilibili. It'll be great if we can > improve the SQL experience for all community users as well, WDYT? > > Best, > Zhanghao Chen > ------ > *From:* Chesnay Schepler > *Sent:* Thursday, February 8, 2024 2:01 > *To:* dev@flink.apache.org ; Zhanghao Chen < > zhanghao.c...@outlook.com>; Piotr Nowojski ; Yu > Chen > *Subject:* Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID > generation for improved state compatibility on parallelism change > > The FLIP is a bit weird to be honest. It only applies in cases where > users haven't set uids, but that goes against best-practices and as far > as I'm told SQL also sets UIDs everywhere. > > I'm wondering if this is really worth the effort. 
> > On 07/02/2024 10:23, Zhanghao Chen wrote: > > After offline discussion with @Yu Chen<mailto:yuchen.e...@gmail.com > >, I've updated the FLIP [1] to include a design > that allows for compatible hasher upgrade by adding StreamGraphHasherV2 to > the legacy hasher list, which is actually a revival of the idea from > FLINK-5290 [2] when StreamGraphHasherV2 was introduced in Flink 1.2. We're > targeting to make V3 the default hasher in Flink 1.20 given that > state-compatibility is no longer an issue. Take a review when you have a > chance, and I'd like to especially thank @Yu Chen< > mailto:yuchen.e...@gmail.com > for the thorough > offline discussion and code debugging help to make this possible. > > > > [1] > https://cwiki.apache.org/confluence/display/FLINK/FLIP-411%3A+Chaining-agnostic+Operator+ID+generation+for+improved+state+compatibility+on
Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for improved state compatibility on parallelism change
We're only concerned with parallelism tuning here (with the same Flink version). The plans will be compatible as long as the operator IDs stay the same. Currently, this only holds if we do not break/create a chain, and we want to make it hold when we break/create a chain as well. That's what the FLIP is all about. The typical user story is that one has a job with a uniform parallelism in the first place. The source is chained with an expensive operator. Later on, the job parallelism needs to be increased, but the source's can't due to limits like the Kafka partition count. The user then configures a different parallelism for the source than for the remaining part of the job, which breaks a chain and leads to state incompatibility. Best, Zhanghao Chen From: Chesnay Schepler Sent: Thursday, February 8, 2024 18:12 To: dev@flink.apache.org ; Martijn Visser Cc: Yu Chen Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for improved state compatibility on parallelism change How exactly are you tuning SQL jobs without compiled plans while ensuring that the resulting compiled plans are compatible? That's explicitly not supported by Flink, hence why CompiledPlans exist. If you change _anything_, the planner is free to generate a completely different plan, where you have no guarantees that you can map the state between one another. On 08/02/2024 09:42, Martijn Visser wrote: > Hi, > >> However, compiled plan is still too complicated for Flink newbies from my >> point of view. > I don't think that the compiled plan was ever positioned to be a > simple solution. If you want to have an easy approach, we have a > declarative solution in place with SQL and/or the Table API imho. > > Best regards, > > Martijn > > On Thu, Feb 8, 2024 at 9:14 AM Zhanghao Chen > wrote: >> Hi Piotr, >> >> Thanks for the comment. 
I agree that the compiled plan is the ultimate tool for >> Flink SQL if one wants to make any changes to >> the query later, and this FLIP indeed is not essential in this sense. However, >> the compiled plan is still too complicated for Flink newbies from my point of >> view. As I mentioned previously, our internal platform provides a visualized >> tool for editing the compiled plan, but most users still find it complex. >> Therefore, the FLIP can still benefit users with better usability, and the >> proposed changes are actually quite lightweight (just copying a new hasher >> with 2 lines deleted + extending the OperatorIdPair data structure) without >> much extra effort. >> >> Best, >> Zhanghao Chen >> >> From: Piotr Nowojski >> Sent: Thursday, February 8, 2024 14:50 >> To: Zhanghao Chen >> Cc: Chesnay Schepler ; dev@flink.apache.org >> ; Yu Chen >> Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation >> for improved state compatibility on parallelism change >> >> Hey >> >>> AFAIK, there's no way to set UIDs for a SQL job, >> AFAIK you can't set UID manually, but Flink SQL generates a compiled plan >> of a query with embedded UIDs. As I understand it, using a compiled plan is >> the preferred (only?) way for Flink SQL if one wants to make any changes to >> the query later on or support Flink's runtime upgrades, without losing the >> state. >> >> If that's the case, what would be the usefulness of this FLIP? Only for >> DataStream API users that didn't know that they should have manually >> configured UIDs? But they have the workaround to actually post factum add >> the UIDs anyway, right? So maybe indeed Chesnay is right that this FLIP is >> not that helpful/worth the extra effort? >> >> Best, >> Piotrek >> >> On Thu, Feb 8, 2024 at 03:55, Zhanghao Chen >> wrote: >> >>> Hi Chesnay, >>> >>> AFAIK, there's no way to set UIDs for a SQL job, it'll be great if you can >>> share how you allow UID setting for SQL jobs. 
We've explored providing a >>> visualized DAG editor for SQL jobs that allows UID setting on our internal >>> platform, but most users found it too complicated to use. Another >>> possible way is to utilize SQL hints, but that's complicated as well. From >>> our experience, many SQL users are not familiar with Flink, what they want >>> is an experience similar to writing normal SQL in MySQL, without >>> involving extra concepts like the DAG and the UID. In fact, some >>> DataStream and PyFlink users also share the same concern. >>> >>> On the other hand, some performance-tuning is
Re: [ANNOUNCE] New Apache Flink Committer - Jiabao Sun
Congrats, Jiabao! Best, Zhanghao Chen From: Qingsheng Ren Sent: Monday, February 19, 2024 17:53 To: dev ; jiabao...@apache.org Subject: [ANNOUNCE] New Apache Flink Committer - Jiabao Sun Hi everyone, On behalf of the PMC, I'm happy to announce Jiabao Sun as a new Flink Committer. Jiabao began contributing in August 2022 and has contributed 60+ commits to the Flink main repo and various connectors. His most notable contribution is being the core author and maintainer of the MongoDB connector, which is fully functional in the DataStream and Table/SQL APIs. Jiabao is also the author of FLIP-377 and the main contributor to the JUnit 5 migration in the runtime and table planner modules. Beyond his technical contributions, Jiabao is an active member of our community, participating in the mailing list and consistently volunteering for release verifications and code reviews with enthusiasm. Please join me in congratulating Jiabao for becoming an Apache Flink committer! Best, Qingsheng (on behalf of the Flink PMC)
Re: [VOTE] FLIP-314: Support Customized Job Lineage Listener
+1 (non-binding) Best, Zhanghao Chen From: Yong Fang Sent: Wednesday, February 28, 2024 10:12 To: dev Subject: [VOTE] FLIP-314: Support Customized Job Lineage Listener Hi devs, I would like to restart a vote about FLIP-314: Support Customized Job Lineage Listener[1]. Previously, we added lineage related interfaces in FLIP-314. Before the interfaces were developed and merged into the master, @Maciej and @Zhenqiu provided valuable suggestions for the interface from the perspective of the lineage system. So we updated the interfaces of FLIP-314 and discussed them again in the discussion thread [2]. So I am here to initiate a new vote on FLIP-314, the vote will be open for at least 72 hours unless there is an objection or insufficient votes [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-314%3A+Support+Customized+Job+Lineage+Listener [2] https://lists.apache.org/thread/wopprvp3ww243mtw23nj59p57cghh7mc Best, Fang Yong
Re: [ANNOUNCE] Apache Paimon is graduated to Top Level Project
Congratulations! Best, Zhanghao Chen From: Yu Li Sent: Thursday, March 28, 2024 15:55 To: d...@paimon.apache.org Cc: dev ; user Subject: Re: [ANNOUNCE] Apache Paimon is graduated to Top Level Project CC the Flink user and dev mailing list. Paimon originated within the Flink community, initially known as Flink Table Store, and all our incubating mentors are members of the Flink Project Management Committee. I am confident that the bonds of enduring friendship and close collaboration will continue to unite the two communities. And congratulations all! Best Regards, Yu On Wed, 27 Mar 2024 at 20:35, Guojun Li wrote: > > Congratulations! > > Best, > Guojun > > On Wed, Mar 27, 2024 at 5:24 PM wulin wrote: > > > Congratulations~ > > > > > 2024年3月27日 15:54,王刚 写道: > > > > > > Congratulations~ > > > > > >> 2024年3月26日 10:25,Jingsong Li 写道: > > >> > > >> Hi Paimon community, > > >> > > >> I’m glad to announce that the ASF board has approved a resolution to > > >> graduate Paimon into a full Top Level Project. Thanks to everyone for > > >> your help to get to this point. > > >> > > >> I just created an issue to track the things we need to modify [2], > > >> please comment on it if you feel that something is missing. You can > > >> refer to apache documentation [1] too. > > >> > > >> And, we already completed the GitHub repo migration [3], please update > > >> your local git repo to track the new repo [4]. > > >> > > >> You can run the following command to complete the remote repo tracking > > >> migration. > > >> > > >> git remote set-url origin https://github.com/apache/paimon.git > > >> > > >> If you have a different name, please change the 'origin' to your remote > > name. > > >> > > >> Please join me in celebrating! 
> > >> > > >> [1] > > https://incubator.apache.org/guides/transferring.html#life_after_graduation > > >> [2] https://github.com/apache/paimon/issues/3091 > > >> [3] https://issues.apache.org/jira/browse/INFRA-25630 > > >> [4] https://github.com/apache/paimon > > >> > > >> Best, > > >> Jingsong Lee > > > >
Re: [ANNOUNCE] New Apache Flink Committer - Zakelly Lan
Congrats, Zakelly! Best, Zhanghao Chen From: Shawn Huang Sent: Wednesday, April 17, 2024 9:59 To: dev@flink.apache.org Subject: Re: [ANNOUNCE] New Apache Flink Committer - Zakelly Lan Congratulations, Zakelly! Best, Shawn Huang Feng Jin 于2024年4月17日周三 00:16写道: > Congratulations! > > Best, > Feng > > On Tue, Apr 16, 2024 at 10:43 PM Ferenc Csaky > wrote: > > > Congratulations! > > > > Best, > > Ferenc > > > > > > > > > > On Tuesday, April 16th, 2024 at 16:28, Jeyhun Karimov < > > je.kari...@gmail.com> wrote: > > > > > > > > > > > Congratulations Zakelly! > > > > > > Regards, > > > Jeyhun > > > > > > On Tue, Apr 16, 2024 at 6:35 AM Feifan Wang zoltar9...@163.com wrote: > > > > > > > Congratulations, Zakelly!—— > > > > > > > > Best regards, > > > > > > > > Feifan Wang > > > > > > > > At 2024-04-15 10:50:06, "Yuan Mei" yuanmei.w...@gmail.com wrote: > > > > > > > > > Hi everyone, > > > > > > > > > > On behalf of the PMC, I'm happy to let you know that Zakelly Lan > has > > > > > become > > > > > a new Flink Committer! > > > > > > > > > > Zakelly has been continuously contributing to the Flink project > since > > > > > 2020, > > > > > with a focus area on Checkpointing, State as well as frocksdb (the > > default > > > > > on-disk state db). > > > > > > > > > > He leads several FLIPs to improve checkpoints and state APIs, > > including > > > > > File Merging for Checkpoints and configuration/API reorganizations. > > He is > > > > > also one of the main contributors to the recent efforts of > > "disaggregated > > > > > state management for Flink 2.0" and drives the entire discussion in > > the > > > > > mailing thread, demonstrating outstanding technical depth and > > breadth of > > > > > knowledge. > > > > > > > > > > Beyond his technical contributions, Zakelly is passionate about > > helping > > > > > the > > > > > community in numerous ways. 
He spent quite some time setting up the > > Flink > > > > > Speed Center and rebuilding the benchmark pipeline after the > > original one > > > > > was out of lease. He helps build frocksdb and tests for the > upcoming > > > > > frocksdb release (bump rocksdb from 6.20.3->8.10). > > > > > > > > > > Please join me in congratulating Zakelly for becoming an Apache > Flink > > > > > committer! > > > > > > > > > > Best, > > > > > Yuan (on behalf of the Flink PMC) > > >
[DISCUSS] FLIP-460: Display source/sink I/O metrics on Flink Web UI
Hi all, I'd like to start a discussion on FLIP-460: Display source/sink I/O metrics on Flink Web UI [1]. Currently, the numRecordsIn & numBytesIn metrics for sources and the numRecordsOut & numBytesOut metrics for sinks are always 0 on the Flink web dashboard. It is especially confusing for simple ETL jobs where there's a single chained operator with 0 input rate and 0 output rate. For years, Flink newbies have been asking "Why does my job have a zero consumption rate and a zero production rate, is it actually working?" Connectors implementing FLIP-33 [2] have already exposed these metrics on the operator level; this FLIP takes a further step to expose them on the job overview page on the Flink Web UI. Looking forward to everyone's feedback and suggestions. Thanks! [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=309496355 [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-33%3A+Standardize+Connector+Metrics Best, Zhanghao Chen
Re: [ANNOUNCE] New Apache Flink PMC Member - Weijie Guo
Congrats, Weijie! Best, Zhanghao Chen From: Hang Ruan Sent: Tuesday, June 4, 2024 16:37 To: dev@flink.apache.org Subject: Re: [ANNOUNCE] New Apache Flink PMC Member - Weijie Guo Congratulations Weijie! Best, Hang Yanfei Lei 于2024年6月4日周二 16:24写道: > Congratulations! > > Best, > Yanfei > > Leonard Xu 于2024年6月4日周二 16:20写道: > > > > Congratulations! > > > > Best, > > Leonard > > > > > 2024年6月4日 下午4:02,Yangze Guo 写道: > > > > > > Congratulations! > > > > > > Best, > > > Yangze Guo > > > > > > On Tue, Jun 4, 2024 at 4:00 PM Weihua Hu > wrote: > > >> > > >> Congratulations, Weijie! > > >> > > >> Best, > > >> Weihua > > >> > > >> > > >> On Tue, Jun 4, 2024 at 3:03 PM Yuxin Tan > wrote: > > >> > > >>> Congratulations, Weijie! > > >>> > > >>> Best, > > >>> Yuxin > > >>> > > >>> > > >>> Yuepeng Pan 于2024年6月4日周二 14:57写道: > > >>> > > >>>> Congratulations ! > > >>>> > > >>>> > > >>>> Best, > > >>>> Yuepeng Pan > > >>>> > > >>>> At 2024-06-04 14:45:45, "Xintong Song" > wrote: > > >>>>> Hi everyone, > > >>>>> > > >>>>> On behalf of the PMC, I'm very happy to announce that Weijie Guo > has > > >>>> joined > > >>>>> the Flink PMC! > > >>>>> > > >>>>> Weijie has been an active member of the Apache Flink community for > many > > >>>>> years. He has made significant contributions in many components, > > >>> including > > >>>>> runtime, shuffle, sdk, connectors, etc. He has driven / > participated in > > >>>>> many FLIPs, authored and reviewed hundreds of PRs, been > consistently > > >>>> active > > >>>>> on mailing lists, and also helped with release management of 1.20 > and > > >>>>> several other bugfix releases. > > >>>>> > > >>>>> Congratulations and welcome Weijie! > > >>>>> > > >>>>> Best, > > >>>>> > > >>>>> Xintong (on behalf of the Flink PMC) > > >>>> > > >>> > > >
Re: [ANNOUNCE] New Apache Flink PMC Member - Fan Rui
Congrats, Rui! Best, Zhanghao Chen From: Piotr Nowojski Sent: Wednesday, June 5, 2024 18:01 To: dev ; rui fan <1996fan...@gmail.com> Subject: [ANNOUNCE] New Apache Flink PMC Member - Fan Rui Hi everyone, On behalf of the PMC, I'm very happy to announce another new Apache Flink PMC Member - Fan Rui. Rui has been active in the community since August 2019. During this time he has contributed a lot of new features. Among others: - Decoupling the Autoscaler from the Kubernetes Operator, and supporting the Standalone Autoscaler - Improvements to checkpointing, flamegraphs, restart strategies, watermark alignment, network shuffles - Optimizing the memory and CPU usage of large operators, greatly reducing the risk and probability of TaskManager OOM He reviewed a significant number of PRs and has been active both on the mailing lists and in Jira, helping to both maintain and grow Apache Flink's community. He is also our current Flink 1.20 release manager. In the last 12 months, Rui has been the most active contributor in the Flink Kubernetes Operator project, while being the 2nd most active Flink contributor at the same time. Please join me in welcoming and congratulating Fan Rui! Best, Piotrek (on behalf of the Flink PMC)
Re: Poor Load Balancing across TaskManagers for Multiple Kafka Sources
Hi Kevin, The problem here is about how to evenly distribute partitions from multiple Kafka topics to tasks, while FLIP-370 is only concerned with how to evenly distribute tasks to slots & taskmanagers, so FLIP-370 won't help here. Best, Zhanghao Chen From: Kevin Lam Sent: Thursday, June 6, 2024 2:32 To: dev@flink.apache.org Subject: Poor Load Balancing across TaskManagers for Multiple Kafka Sources Hey all, I'm seeing an issue with poor load balancing across TaskManagers for Kafka sources using the Flink SQL API, and wondering if FLIP-370 will help with it, or if not, interested in any ideas the community has to mitigate the issue. The Kafka SplitEnumerator uses the following logic to assign split owners (code pointer <https://github.com/apache/flink-connector-kafka/blob/00c9c8c74121136a0c1710ac77f307dc53adae99/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/source/enumerator/KafkaSourceEnumerator.java#L469>):

```
static int getSplitOwner(TopicPartition tp, int numReaders) {
    int startIndex = ((tp.topic().hashCode() * 31) & 0x7FFFFFFF) % numReaders;
    return (startIndex + tp.partition()) % numReaders;
}
```

However, this can result in an imbalanced distribution of Kafka partition consumers across TaskManagers. To illustrate, I created a pipeline that consumes from 2 Kafka topics, each with 8 partitions, and just sinks them to a blackhole connector sink. For a parallelism of 16 and 1 task slot per TaskManager, we'd ideally expect each TaskManager to get its own Kafka partition, i.e. 16 partitions (8 from each topic) split evenly across TaskManagers. However, due to the algorithm I linked and how the startIndex is computed, I have observed a bunch of TaskManagers with 2 partitions (one from each topic), and some TaskManagers completely idle. 
I've also run an experiment with the same pipeline where I set parallelism such that each task manager gets exactly 1 partition, and compared it against when each task manager gets exactly 2 partitions (one from each topic). I ensured this was the case by setting an appropriate parallelism, and ran the jobs on an application cluster. Since the partitions are fixed, the extra parallelism, if any, isn't used. The case where there is exactly 1 partition per TaskManager processes a fixed set of data 20% faster. I was reading FLIP-370 <https://cwiki.apache.org/confluence/display/FLINK/FLIP-370%3A+Support+Balanced+Tasks+Scheduling> and understand it will improve task scheduling in certain scenarios. Will FLIP-370 help with this KafkaSource scenario? If not, any ideas to improve the subtask scheduling for KafkaSources? Ideally we wouldn't need to carefully consider the partition + resulting task distribution when selecting our parallelism values. Thanks for your help!
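The imbalance described above can be reproduced with a small simulation of the split-owner formula. This is a Python port of the Java snippet (writing the sign mask out as 0x7FFFFFFF, as in the linked source); the topic hash values below are made up for illustration, since real ones come from String.hashCode():

```python
from collections import Counter

def get_split_owner(topic_hash: int, partition: int, num_readers: int) -> int:
    # Port of the getSplitOwner logic: the start index depends only on the
    # topic, and partitions fan out round-robin from that offset.
    start_index = ((topic_hash * 31) & 0x7FFFFFFF) % num_readers
    return (start_index + partition) % num_readers

# 2 topics x 8 partitions onto 16 readers: ideally one partition per reader.
owners = [get_split_owner(h, p, 16) for h in (101, 102) for p in range(8)]
load = Counter(owners)
busy, idle = len(load), 16 - len(load)
# With these hashes the two topics' start indexes land at 11 and 10, so
# their round-robin ranges overlap heavily: 7 readers carry 2 partitions
# each, 2 readers carry 1, and 7 readers sit completely idle.
```

Only when the two start indexes happen to be exactly 8 apart (half the reader count) would the assignment come out perfectly balanced, which explains why the observed distribution depends on the topic names.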
Re: [June 15 Feature Freeze][SUMMARY] Flink 1.20 Release Sync 11/06/2024
Hi Rui, Thanks for the summary! A quick update here: FLIP-398 was decided not to go into 1.20, as it was just found that the effort to add dedicated serialization support for Maps, Sets and Lists will break state compatibility. I will revert the relevant changes soon. Best, Zhanghao Chen From: Rui Fan <1996fan...@gmail.com> Sent: Wednesday, June 12, 2024 12:59 To: dev Subject: [June 15 Feature Freeze][SUMMARY] Flink 1.20 Release Sync 11/06/2024 Dear devs, This is the sixth meeting of the Flink 1.20 release cycle [1]. I'd like to share the information synced in the meeting. - Feature Freeze It is worth noting that there are only 3 days left until the feature freeze time (June 15, 2024, 00:00 CEST (UTC+2)), and developers need to pay attention to the feature freeze time. After checking with all contributors of 1.20 FLIPs, we don't need to postpone the feature freeze time. Please reply to this email if other features are valuable and would be better merged in 1.20, thanks. - Features: So far we've had 16 flips/features: - 6 flips/features are done - 8 flips/features are in progress, and the release managers have checked with the corresponding contributors - 7 of these flips/features can be completed before June 15, 2024, 00:00 CEST (UTC+2) - We were unable to contact the contributor of FLIP-436 - 2 flips/features won't make it into 1.20 - Blockers: We don't have any blocker right now; thanks to everyone who fixed blockers before. - Sync meeting [2]: The next meeting is 18/06/2024 10am (UTC+2) and 4pm (UTC+8), please feel free to join us. Lastly, we encourage attendees to fill out the topics to be discussed at the bottom of the 1.20 wiki page [1] a day in advance, to make it easier for everyone to understand the background of the topics, thanks! [1] https://cwiki.apache.org/confluence/display/FLINK/1.20+Release [2] https://meet.google.com/mtj-huez-apu Best, Robert, Weijie, Ufuk and Rui
Re: [DISCUSS] FLIP-462: Support Custom Data Distribution for Input Stream of Lookup Join
Thanks for driving this, Weijie. Usually, the data distribution of the external system is closely related to the keys, e.g. computing the bucket index by key hashcode % bucket num, so I'm not sure how much difference there is between partitioning by key and a custom partitioning strategy. Could you give a more concrete production example of when a custom partitioning strategy will outperform partitioning by key? Since you've mentioned Paimon in the doc, maybe an example on Paimon. Best, Zhanghao Chen From: weijie guo Sent: Friday, June 7, 2024 9:59 To: dev Subject: [DISCUSS] FLIP-462: Support Custom Data Distribution for Input Stream of Lookup Join Hi devs, I'd like to start a discussion about FLIP-462 [1]: Support Custom Data Distribution for Input Stream of Lookup Join. Lookup Join is an important feature in Flink. It is typically used to enrich a table with data that is queried from an external system. If we interact with the external systems for each incoming record, we incur significant network IO and RPC overhead. Therefore, most connectors introduce caching to reduce the per-record level query overhead. However, because the data distribution of Lookup Join's input stream is arbitrary, the cache hit rate is sometimes unsatisfactory. We want to introduce a mechanism for the connector to tell the Flink planner its desired input stream data distribution or partitioning strategy. This can significantly reduce the amount of cached data and improve the performance of Lookup Join. You can find more details in this FLIP [1]. Looking forward to hearing from you, thanks! Best regards, Weijie [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-462+Support+Custom+Data+Distribution+for+Input+Stream+of+Lookup+Join
Re: [DISCUSS] FLIP-462: Support Custom Data Distribution for Input Stream of Lookup Join
Thanks for the clarification, that makes sense. +1 for the proposal. Best, Zhanghao Chen From: weijie guo Sent: Wednesday, June 12, 2024 14:20 To: dev@flink.apache.org Subject: Re: [DISCUSS] FLIP-462: Support Custom Data Distribution for Input Stream of Lookup Join Hi Zhanghao, Thanks for the reply! > Could you give a more concrete example in production on when a custom partitioning strategy will outperform partitioning by key The key point here is that the partitioning logic cannot be fully expressed with all or part of the join key. That is, even if we know which fields are used to calculate buckets, we still have to face the following problems: 1. The mapping from the bucket field to the bucket id is not necessarily done via hashcode, and even if it is, the hash algorithm may differ from the one used in Flink. The planner can't know how to do this mapping. 2. In order to get the bucket id, we have to mod by the bucket number, but the planner has no notion of the bucket number. Best regards, Weijie Zhanghao Chen wrote on Wed, Jun 12, 2024 at 13:55: > Thanks for driving this, Weijie. Usually, the data distribution of the > external system is closely related to the keys, e.g. computing the bucket > index by key hashcode % bucket num, so I'm not sure about how much > difference there are between partitioning by key and a custom partitioning > strategy. Could you give a more concrete example in production on when a > custom partitioning strategy will outperform partitioning by key? Since > you've mentioned Paimon in doc, maybe an example on Paimon. > > Best, > Zhanghao Chen > > From: weijie guo > Sent: Friday, June 7, 2024 9:59 > To: dev > Subject: [DISCUSS] FLIP-462: Support Custom Data Distribution for Input > Stream of Lookup Join > > Hi devs, > > > I'd like to start a discussion about FLIP-462[1]: Support Custom Data > Distribution for Input Stream of Lookup Join. 
> > > Lookup Join is an important feature in Flink, It is typically used to > enrich a table with data that is queried from an external system. > If we interact with the external systems for each incoming record, we > incur significant network IO and RPC overhead. > > Therefore, most connectors introduce caching to reduce the per-record > level query overhead. However, because the data distribution of Lookup > Join's input stream is arbitrary, the cache hit rate is sometimes > unsatisfactory. > > > We want to introduce a mechanism for the connector to tell the Flink > planner its desired input stream data distribution or partitioning > strategy. This can significantly reduce the amount of cached data and > improve performance of Lookup Join. > > > You can find more details in this FLIP[1]. Looking forward to hearing > from you, thanks! > > > Best regards, > > Weijie > > > [1] > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-462+Support+Custom+Data+Distribution+for+Input+Stream+of+Lookup+Join >
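To make Weijie's point concrete, here is a small self-contained sketch (the hash function and all names are illustrative, not Paimon's or Flink's actual algorithms): an external table computes bucketId = customHash(key) % numBuckets, and since the planner knows neither the hash function nor the bucket count, plain partition-by-key lands the same key on a different subtask most of the time.

```java
public class BucketMismatchDemo {

    // The external system's bucketing: an FNV-1a style hash, deliberately
    // different from Java's hashCode() (the choice of hash is illustrative).
    static int externalBucket(String key, int numBuckets) {
        int h = 0x811c9dc5;
        for (char c : key.toCharArray()) {
            h = (h ^ c) * 0x01000193;
        }
        return Math.floorMod(h, numBuckets);
    }

    // What generic "partition by key" amounts to: key hash modulo parallelism.
    static int flinkStylePartition(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    // Count keys whose external bucket disagrees with the key-hash partition.
    static int countMismatches(int numKeys, int numBuckets, int parallelism) {
        int mismatches = 0;
        for (int i = 0; i < numKeys; i++) {
            String key = "key-" + i;
            if (externalBucket(key, numBuckets) != flinkStylePartition(key, parallelism)) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        // Even with parallelism == bucket count, the two assignments rarely
        // agree, so each lookup subtask still sees keys from many external
        // buckets and must cache far more than one bucket's worth of data.
        System.out.println("disagreeing keys out of 1000: "
                + countMismatches(1000, 16, 16));
    }
}
```

This is why exposing the connector's own partitioning strategy to the planner, rather than partitioning by key, is needed to make the cache effective.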
Re: [VOTE] FLIP-462: Support Custom Data Distribution for Input Stream of Lookup Join
+1 (unbinding) Best, Zhanghao Chen From: weijie guo Sent: Monday, June 17, 2024 10:13 To: dev Subject: [VOTE] FLIP-462: Support Custom Data Distribution for Input Stream of Lookup Join Hi everyone, Thanks for all the feedback about the FLIP-462: Support Custom Data Distribution for Input Stream of Lookup Join [1]. The discussion thread is here [2]. The vote will be open for at least 72 hours unless there is an objection or insufficient votes. Best, Weijie [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-462+Support+Custom+Data+Distribution+for+Input+Stream+of+Lookup+Join [2] https://lists.apache.org/thread/kds2zrcdmykrz5lmn0hf9m4phdl60nfb
Re: [VOTE] FLIP-461: Synchronize rescaling with checkpoint creation to minimize reprocessing for the AdaptiveScheduler
+1 (non-binding) Best, Zhanghao Chen From: Matthias Pohl Sent: Monday, June 17, 2024 16:24 To: dev@flink.apache.org Subject: [VOTE] FLIP-461: Synchronize rescaling with checkpoint creation to minimize reprocessing for the AdaptiveScheduler Hi everyone, the discussion in [1] about FLIP-461 [2] is kind of concluded. I am starting a vote on this one here. The vote will be open for at least 72 hours (i.e. until June 20, 2024; 8:30am UTC) unless there are any objections. The FLIP will be considered accepted if 3 binding votes (from active committers according to the Flink bylaws [3]) are gathered by the community. Best, Matthias [1] https://lists.apache.org/thread/nnkonmsv8xlk0go2sgtwnphkhrr5oc3y [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler [3] https://cwiki.apache.org/confluence/display/FLINK/Flink+Bylaws#FlinkBylaws-Approvals
Re: [DISCUSS] FLIP-460: Display source/sink I/O metrics on Flink Web UI
Thanks, all. As the discussion has concluded, I'll start voting in a separate thread. From: Leonard Xu Sent: Monday, July 15, 2024 11:22 To: dev Subject: Re: [DISCUSS] FLIP-460: Display source/sink I/O metrics on Flink Web UI Thanks to Zhanghao for kicking off the thread. > It is especially confusing for simple ETL jobs where there's a single > chained operator with 0 input rate and 0 output rate. This case has confused Flink users for a long time, +1 for the FLIP. Best, Leonard
[VOTE] FLIP-460: Display source/sink I/O metrics on Flink Web UI
Hi everyone, Thanks for all the feedback about the FLIP-460: Display source/sink I/O metrics on Flink Web UI [1]. The discussion thread is here [2]. I'd like to start a vote on it. The vote will be open for at least 72 hours unless there is an objection or insufficient votes. [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=309496355 [2] https://lists.apache.org/thread/sy271nhd2jr1r942f29xbvbgq7fsd841 Best, Zhanghao Chen
[RESULT][VOTE] FLIP-460: Display source/sink I/O metrics on Flink Web UI
Hi everyone, I'm delighted to announce that FLIP-460 [1] has been accepted. There were 13 votes in favor: - Yong Fang (binding) - Robert Metzger (binding) - Leonard Xu (binding) - Becket Qin (binding) - Ahmed Hamdy (non-binding) - Aleksandr Pilipenko (non-binding) - Xintong Song (binding) - Rui Fan (binding) - Zakelly Lan (binding) - Yanquan Lv (non-binding) - Hang Ruan (binding) - Xiqian Yu (non-binding) - Yuepeng Pan (non-binding) There were no votes against. [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=309496355 Best, Zhanghao Chen
Re: [VOTE] FLIP-468: Introducing StreamGraph-Based Job Submission
Thanks for driving it. +1 (non-binding) Best, Zhanghao Chen From: Junrui Lee Sent: Friday, July 26, 2024 11:02 To: dev@flink.apache.org Subject: [VOTE] FLIP-468: Introducing StreamGraph-Based Job Submission Hi everyone, Thanks for all the feedback about FLIP-468: Introducing StreamGraph-Based Job Submission [1]. The discussion thread can be found here [2]. The vote will be open for at least 72 hours unless there are any objections or insufficient votes. Best, Junrui [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-468%3A+Introducing+StreamGraph-Based+Job+Submission [2] https://lists.apache.org/thread/l65mst4prx09rmloydn64z1w1zp8coyz
Re: [ANNOUNCE] New Apache Flink Committer - Junrui Li
Congrats, Junrui! Best, Zhanghao Chen From: Zhu Zhu Sent: Tuesday, November 5, 2024 19:59 To: dev ; Junrui Lee Subject: [ANNOUNCE] New Apache Flink Committer - Junrui Li Hi everyone, On behalf of the PMC, I'm happy to announce that Junrui Li has become a new Flink Committer! Junrui has been an active contributor to the Apache Flink project for two years. He has been the driver and major developer of 8 FLIPs and has contributed 100+ commits comprising tens of thousands of lines of code. His contributions mainly focus on enhancing Flink batch execution capabilities, including enabling parallelism inference by default (FLIP-283), supporting progress recovery after JM failover (FLIP-383), and supporting adaptive optimization of the logical execution plan (FLIP-468/469). Furthermore, Junrui did a lot of work to improve Flink's configuration layer, addressing technical debt and enhancing its user-friendliness. He is also active on the mailing lists, participating in discussions and answering user questions. Please join me in congratulating Junrui Li for becoming an Apache Flink committer. Best, Zhu (on behalf of the Flink PMC)
Re: [DISCUSS] Is it a bug that the AdaptiveScheduler does not prioritize releasing TaskManagers during downscaling in Application mode?
Thanks for the effort, Yuepeng! The approach LGTM. Best, Zhanghao Chen From: Yuepeng Pan Sent: Saturday, January 25, 2025 21:36 To: dev@flink.apache.org Subject: Re: [DISCUSS] Is it a bug that the AdaptiveScheduler does not prioritize releasing TaskManagers during downscaling in Application mode? Hi all, This email hasn't received further responses over the past few days, and as such, we currently lack sufficient feedback to draw a definitive conclusion. I'd like to proceed by merging both approaches as much as possible to reach the final consensus. I'll move forward with this plan to address the issue and update the PR[1] accordingly: > - It's agreed to optimize/fix this issue in the 1.x LTS versions. > - The primary goal of this optimization/fix is to minimize the number of > TaskManagers used in application mode only. > - The optimized logic will be set as the default logic, and the original > logic will be retained when explicitly enabled through a parameter. Please rest assured, I am not rushing ahead recklessly. If there is anything inappropriate about this conclusion/measure or the reasoning supporting it, I will promptly halt and revise the action. > a. The default behavior of this new parameter (the optimized logic) aligns > with the expected behavior of most application deployment mode users > who would have this behavior enabled by default. > Therefore, it doesn't add complexity in terms of configuration for them. > b. However, this would require a small number of users who want to keep the > original behavior to actively configure this setting. > So, this still gives users the flexibility to choose. > c. Since this issue is only fixed by adopting this solution in version 1.x > LTS with application deployment mode, > this parameter doesn't have a plan for forward compatibility, and a new > parameter would also be acceptable to me. Thank you all very much for your attention and help. 
Best, Yuepeng Pan [1] https://github.com/apache/flink/pull/25218 On 2025/01/20 01:56:36 Yuepeng Pan wrote: > Hi, Maximilian, Rui, Matthias: > Thanks for the response, which gives me a general understanding of your > proposed approach and its implementation outline. > > Hi, All: > Thank you all very much for the discussion and suggestions. > > Based on the discussions we have received so far, > we have reached a preliminary consensus: > > - When the job runs in an application cluster, the default behavior > of the AdaptiveScheduler not actively releasing TaskManager resources > during downscaling could be considered a > bug (at least from certain perspectives, this is the case). > - We should fix it in Flink 1.x. > > However, there's still no consensus in the discussion on how to fix this > issue under the following conditions: > - Flink 1.x series versions and Application deployment mode (it does not apply > to session cluster mode.) > > Strategy list: > > 1). Adding this behavior while being guarded by a feature flag/configuration > parameter in the 1.x LTS version. > (@Matthias If my understanding is incorrect, please correct me, thanks! ) > a. This enables the option for users to revert to the original behavior >e.g. when ignoring idle resource occupation and focusing only >on the resource waiting time during rescaling, this can achieve some > positive impacts. > b. Introducing new parameters increases the complexity for users, >as Maximilian mentioned, we already have many parameters. > > 2). Set the behavior as the default without introducing new parameters in the > 1.x LTS version. > a. Avoid introducing new parameters and reduce complexity for users. > b. This disables the option for users to revert to the original behavior. > > We have to seek some trade-offs/change between the two options above in order > to make a choice and reach a consensus on the conclusion. 
> > Although Option-1) increases the complexity for users to use, I prefer > Option-1) due to the following reasons if we could set the default behavior > in Option-1) to the new behavior: > a. This new parameter aligns with the expected behavior for most > application deployment mode users, >who would have this behavior enabled by default. >Therefore, it doesn't add complexity in terms of configuration for > them. > b. However, this would require users who want to keep the original > behavior to actively configure this setting. >So, this still gives users the flexibility to choose. > c. Since this issue is only fixed by adopting this solution in version > 1.x LTS with application deployment mode, > this parameter doesn't have a plan for forward compatibility, and a new > parameter would also be acceptable to me.
Re: [ANNOUNCE] New Apache Flink PMC Member - Zakelly Lan
Congrats, Zakelly! Best, Zhanghao Chen From: Yuan Mei Sent: Tuesday, April 1, 2025 15:51 To: dev Subject: [ANNOUNCE] New Apache Flink PMC Member - Zakelly Lan On behalf of the PMC, I'm very happy to announce a new Apache Flink PMC Member - Zakelly Lan. Zakelly has been a dedicated and invaluable contributor to the Flink community for over five years. He became a committer in early 2024 and is one of our most active contributors in the last year. Zakelly's expertise is in Flink state management and checkpointing. He has developed several key features and FLIPs in this area. Notably, he has been the driving force for the "Disaggregated State Management in Flink 2.0" project, carrying it to a successful completion. Disaggregated State Management is one of the most complicated features and important improvements in 2.0. He contributed and reviewed a significant number of PRs and has been active both on the mailing lists and in Jira, helping to both maintain and grow Apache Flink's community. In addition, Zakelly is the sole maintainer of the Flink benchmark environment and helps monitor performance regressions. Zakelly is also continuously helping to expand the Flink community. At FFA 2024 Asia, his presentation on "Introducing ForSt DB: The Disaggregated State Store for Flink 2.0" was well-received and highlighted his deep technical expertise. Please join me in welcoming and congratulating Zakelly! Best, Yuan Mei (on behalf of the Flink PMC)
Re: [DISCUSS] FlinkSQL support enableObjectReuse to optimize CopyingChainingOutput
Simply setting the config pipeline.object-reuse=true should work for that. Best, Zhanghao Chen From: Winterchill <809025...@qq.com.INVALID> Sent: Wednesday, April 23, 2025 20:12 To: dev Subject: [DISCUSS] FlinkSQL support enableObjectReuse to optimize CopyingChainingOutput During our analysis of FlinkSQL, we found that the FlinkSQL data stream heavily uses `CopyingChainingOutput` for operator-to-operator data transmission. In the process, we observed that some operators, such as `WatermarkAssigner`, do not necessarily require the deep copy logic of `CopyingChainingOutput`. Additionally, we noticed that FlinkSQL does not support the `enableObjectReuse` parameter in `DataStream`. Based on this, we have the following questions: 1. Will FlinkSQL support `enableObjectReuse` in the future? 2. We would like to try to modify FlinkSQL's internal logic to eliminate unnecessary deep copies. How can we do this safely? Do you have any suggestions?
Re: [DISCUSS] FlinkSQL support enableObjectReuse to optimize CopyingChainingOutput
You will get into trouble if some Op caches the data locally and its downstream Op in the same operator chain modifies the data in-place. AFAIK, all Flink SQL operators handle it properly (deep copy when caching is needed), but you should be careful with UDFs and custom connectors. Example: suppose we have two chained operators A->B with object reuse enabled, A caches the data in a local hash map, and B modifies data in-place. Now when a single row arrives, Op A first caches it in the map, the same object is directly passed to Op B, and Op B modifies it in-place. Note that the entry cached in Op A is also updated, as the object is reused. Best, Zhanghao Chen From: Winterchill <809025...@qq.com.INVALID> Sent: Thursday, April 24, 2025 19:01 To: dev Subject: Re: [DISCUSS] FlinkSQL support enableObjectReuse to optimize CopyingChainingOutput Thanks. I read about it in the docs and have some questions: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/execution/execution_configuration/ The Flink documentation indicates that enabling `enableObjectReuse` may cause bugs, but the description is not detailed enough. In what situations would it lead to bugs? Can we use this param in a production environment?
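The A->B scenario above can be reproduced outside Flink with a few lines of plain Java (class and field names are made up for illustration): operator A caches a reference to the incoming record, operator B mutates the same instance in place, and the cached entry silently changes unless A made a defensive copy.

```java
import java.util.HashMap;
import java.util.Map;

public class ObjectReuseHazard {

    // A mutable "row", standing in for a reused record object.
    static class Row {
        String value;
        Row(String value) { this.value = value; }
    }

    // Simulates the chain: A caches the record, then B mutates it in place.
    // Returns what A's cache holds afterwards.
    public static String cachedAfterChain(boolean deepCopyWhenCaching) {
        Map<String, Row> cacheOfA = new HashMap<>();
        Row record = new Row("original");

        // Operator A: cache the record. A deep copy here avoids the bug.
        cacheOfA.put("k", deepCopyWhenCaching ? new Row(record.value) : record);

        // Operator B (same chain, same object instance): in-place update.
        record.value = "mutated-by-B";

        return cacheOfA.get("k").value;
    }

    public static void main(String[] args) {
        // Without a copy, A's cached entry is corrupted by B's mutation.
        System.out.println("without copy: " + cachedAfterChain(false)); // mutated-by-B
        System.out.println("with copy:    " + cachedAfterChain(true));  // original
    }
}
```

With object reuse disabled, Flink's CopyingChainingOutput performs the equivalent of the deep copy for you at every chained hand-over, which is exactly the overhead the original question is trying to avoid.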
Re: [VOTE] FLIP-506: Support Reuse Multiple Table Sinks in Planner
+1 (non-binding) Thanks for driving this. It's a nice usability improvement for performing partial-updates on datalakes. Best, Zhanghao Chen From: xiangyu feng Sent: Sunday, February 23, 2025 10:44 To: dev@flink.apache.org Subject: [VOTE] FLIP-506: Support Reuse Multiple Table Sinks in Planner Hi all, I would like to start the vote for FLIP-506: Support Reuse Multiple Table Sinks in Planner[1]. This FLIP was discussed in this thread [2]. The vote will be open for at least 72 hours unless there is an objection or insufficient votes. [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-506%3A+Support+Reuse+Multiple+Table+Sinks+in+Planner [2] https://lists.apache.org/thread/r1wo9sf3d1725fhwzrttvv56k4rc782m Regards, Xiangyu Feng
Re: [DISCUSSION] Upgrade to Kryo 5 for Flink 2.0
Hi Gyula, Thanks for bringing this up! Definitely +1 for upgrading Kryo in Flink 2.0. As a side note, it might be useful to introduce customizable generic serializer support like Spark, where you can switch to your own serializer via the "spark.serializer" [1] option. Users starting new applications could then introduce their own serialization stack to resolve the Java compatibility issue or to address other performance concerns. [1] https://spark.apache.org/docs/latest/configuration.html Best, Zhanghao Chen From: Gyula Fóra Sent: Friday, February 21, 2025 14:04 To: dev Subject: [DISCUSSION] Upgrade to Kryo 5 for Flink 2.0 Hey all! I would like to rekindle this discussion as it seems that it has stalled several times in the past and we are nearing the point in time where the decision has to be made with regards to 2.0. (we are already a bit late but nevermind) There have been numerous requests and efforts to upgrade Kryo to better support newer Java versions and Java native types. I think we can all agree that this change is inevitable one way or another. The latest JIRA for this seems to be: https://issues.apache.org/jira/browse/FLINK-3154 There is even an open PR that accomplishes this (currently in a state-incompatible way), but based on the discussion it seems that with some extra complexity compatibility can even be preserved by having both the old and new Kryo versions active at the same time. The main question here is whether state compatibility is important for 2.0 in this regard, or whether we want to bite the bullet and make this upgrade once and for all. Cheers, Gyula
[jira] [Created] (FLINK-31936) Support setting scale up max factor
Zhanghao Chen created FLINK-31936: - Summary: Support setting scale up max factor Key: FLINK-31936 URL: https://issues.apache.org/jira/browse/FLINK-31936 Project: Flink Issue Type: Improvement Components: Autoscaler Reporter: Zhanghao Chen Currently, only the scale-down max factor can be configured. We should add a config for the scale-up max factor as well. In many cases, a job's performance won't improve after scaling up due to external bottlenecks. Although detecting an ineffective scale-up would block further scaling, it already hurts if we scale too much in a single step, which may even overwhelm external services. -- This message was sent by Atlassian Jira (v8.20.10#820010)
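A minimal sketch of the arithmetic behind the proposed cap (the method and option semantics below are assumptions mirroring the existing scale-down max factor, not the autoscaler's actual code): a single decision may raise parallelism by at most a configured factor, leaving the next evaluation to check whether throughput actually improved.

```java
public class ScaleUpCapDemo {

    // Cap the target parallelism so one step grows by at most
    // (1 + maxScaleUpFactor) times the current parallelism.
    static int cappedTarget(int current, int desired, double maxScaleUpFactor) {
        int cap = (int) Math.ceil(current * (1 + maxScaleUpFactor));
        return Math.min(desired, cap);
    }

    public static void main(String[] args) {
        // Backlog suggests jumping 10 -> 40; with a 100% max step we stop
        // at 20 and re-evaluate, instead of overwhelming external services.
        System.out.println(cappedTarget(10, 40, 1.0)); // 20
    }
}
```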
[jira] [Created] (FLINK-31991) Update Autoscaler doc to reflect the changes brought by the new source scaling logic
Zhanghao Chen created FLINK-31991: - Summary: Update Autoscaler doc to reflect the changes brought by the new source scaling logic Key: FLINK-31991 URL: https://issues.apache.org/jira/browse/FLINK-31991 Project: Flink Issue Type: Improvement Components: Autoscaler Reporter: Zhanghao Chen Attachments: image-2023-05-03-20-16-33-704.png The current statements on job requirements are outdated: "- All sources must use the new Source API (most common connectors already do)" and "- Source scaling requires sources to expose the standardized connector metrics for accessing backlog information (source scaling can be disabled)". The Autoscaler doc needs to be updated to reflect the changes brought by the new source scaling logic (FLINK-31326: Disabled source scaling breaks downstream scaling if source busyTimeMsPerSecond is 0). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-32037) The edge on Web UI is wrong after parallelism changes via parallelism overrides or AdaptiveScheduler
Zhanghao Chen created FLINK-32037: - Summary: The edge on Web UI is wrong after parallelism changes via parallelism overrides or AdaptiveScheduler Key: FLINK-32037 URL: https://issues.apache.org/jira/browse/FLINK-32037 Project: Flink Issue Type: Improvement Components: Runtime / Coordination, Runtime / REST Affects Versions: 1.17.0 Reporter: Zhanghao Chen *Background* After FLINK-30213, in case of parallelism changes to the JobGraph, as done via the AdaptiveScheduler or through providing JobVertexId overrides in PipelineOptions#PARALLELISM_OVERRIDES, when the consumer parallelism doesn't match the local parallelism, the original ForwardPartitioner will be replaced with the RebalancePartitioner. *Problem* Although the actual partitioner changes underneath, the ship strategy shown on the Web UI is still FORWARD. This is because the fix patch applies when the StreamTask is initialized, and the job graph is not touched. The Web UI uses the JSON plan generated from the job graph for display, and the ship strategy is obtained from JobEdge#getShipStrategyName. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-32124) Add option to enable partition alignment for sources
Zhanghao Chen created FLINK-32124: - Summary: Add option to enable partition alignment for sources Key: FLINK-32124 URL: https://issues.apache.org/jira/browse/FLINK-32124 Project: Flink Issue Type: Improvement Components: Autoscaler Reporter: Zhanghao Chen Currently, the autoscaler does not consider balancing partitions among source tasks. In our production env, partition skew has proven to be a severe problem for many jobs. Especially in a job topology with only forward or rescale shuffles, partition skew on the source side can further lead to data imbalance in later operators. We should add an option to enable partition alignment for sources, but keep it disabled by default, as it has a side effect: the partition count usually has few divisors, so enabling alignment would greatly limit the scaling choices. Also, if data among partitions is imbalanced in the first place, partition alignment won't help either (this is not a common case inside our company though). -- This message was sent by Atlassian Jira (v8.20.10#820010)
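The "few divisors" side effect can be illustrated with a short sketch (illustrative code, not the autoscaler's implementation): with alignment enabled, only parallelisms that evenly divide the partition count keep every subtask at the same partition load, which drastically shrinks the set of usable parallelisms.

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionAlignmentDemo {

    // Parallelisms up to maxParallelism that split `partitions` evenly,
    // i.e. every source subtask is assigned the same number of partitions.
    static List<Integer> alignedParallelisms(int partitions, int maxParallelism) {
        List<Integer> result = new ArrayList<>();
        for (int p = 1; p <= maxParallelism; p++) {
            if (partitions % p == 0) {
                result.add(p);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A 32-partition topic leaves only 6 aligned choices out of 32,
        // which is why the option should stay disabled by default.
        System.out.println(alignedParallelisms(32, 32)); // [1, 2, 4, 8, 16, 32]
    }
}
```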
[jira] [Created] (FLINK-32127) Source busy time is inaccurate in many cases
Zhanghao Chen created FLINK-32127: - Summary: Source busy time is inaccurate in many cases Key: FLINK-32127 URL: https://issues.apache.org/jira/browse/FLINK-32127 Project: Flink Issue Type: Improvement Components: Autoscaler Reporter: Zhanghao Chen We found that source busy time is inaccurate in many cases. The reason is that sources are usually multi-threaded (Kafka and RocketMQ, for example): there is a fetcher thread fetching data from the data source, and a consumer thread deserializing data, with a blocking queue in between. A source is considered 1. *idle* if the consumer is blocked by fetching data from the queue, 2. *backpressured* if the consumer is blocked by writing data to downstream operators, 3. *busy* otherwise. However, this means that if the bottleneck is on the fetcher side, the consumer will often be blocked fetching data from the queue, so the source idle time will be high, even though the source is in fact busy and consumes a lot of CPU. In some of our jobs, the source max busy time is only ~600 ms while the source is actually reaching its limit. -- This message was sent by Atlassian Jira (v8.20.10#820010)
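A rough model of the measurement bias (simplified, with illustrative numbers): the consumer thread's time blocked on the hand-over queue is reported as idle, so the reported busy fraction is bounded by the consumer's view even when the fetcher thread is the saturated component.

```java
public class SourceBusyTimeModel {

    // Consumer-side busy fraction when the fetcher can only deliver
    // `fetchRate` records/s and the consumer could process `consumeRate`.
    // The consumer only counts as busy while records are available,
    // so fetcher saturation shows up as consumer "idle" time.
    static double reportedBusyFraction(double fetchRate, double consumeRate) {
        return Math.min(fetchRate, consumeRate) / consumeRate;
    }

    public static void main(String[] args) {
        // Fetcher caps at 600 rec/s, consumer could do 1000 rec/s: the source
        // reports only ~60% busy while the task is actually at its limit.
        System.out.println(reportedBusyFraction(600, 1000)); // 0.6
    }
}
```

This matches the "~600 ms busy per second" symptom described above: the missing 400 ms is fetcher time misreported as idle.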
[jira] [Created] (FLINK-32820) ParameterTool is mistakenly marked as deprecated
Zhanghao Chen created FLINK-32820: - Summary: ParameterTool is mistakenly marked as deprecated Key: FLINK-32820 URL: https://issues.apache.org/jira/browse/FLINK-32820 Project: Flink Issue Type: Improvement Components: API / DataSet, API / DataStream Affects Versions: 1.18.0 Reporter: Zhanghao Chen ParameterTool and AbstractParameterTool in package flink-java are mistakenly marked as deprecated in FLINK-32558 (Properly deprecate DataSet API). They are widely used for handling application parameters and are also listed in the Flink user doc: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/application_parameters/. Also, they are not directly related to the DataSet API. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-32821) Streaming examples failed to execute due to error in packaging
Zhanghao Chen created FLINK-32821: - Summary: Streaming examples failed to execute due to error in packaging Key: FLINK-32821 URL: https://issues.apache.org/jira/browse/FLINK-32821 Project: Flink Issue Type: Improvement Components: Examples Affects Versions: 1.18.0 Reporter: Zhanghao Chen 5 out of the 7 streaming examples failed to run: * Iteration & SessionWindowing & SocketWindowWordCount & WindowJoin failed to run due to java.lang.NoClassDefFoundError: org/apache/flink/streaming/examples/utils/ParameterTool * TopSpeedWindowing failed to run due to: Caused by: java.lang.ClassNotFoundException: org.apache.flink.connector.datagen.source.GeneratorFunction The NoClassDefFoundError with ParameterTool was introduced by FLINK-32558 (Properly deprecate DataSet API), and we'd better resolve FLINK-32820 (ParameterTool is mistakenly marked as deprecated) first before we come to a fix for this problem. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-32822) Add connector option to control whether to enable auto-commit of offsets when checkpoints is enabled
Zhanghao Chen created FLINK-32822: - Summary: Add connector option to control whether to enable auto-commit of offsets when checkpoints is enabled Key: FLINK-32822 URL: https://issues.apache.org/jira/browse/FLINK-32822 Project: Flink Issue Type: Improvement Components: Connectors / Kafka Reporter: Zhanghao Chen When checkpointing is enabled, the Flink Kafka connector commits the current consuming offset when checkpoints are *completed*, although the Kafka source does *NOT* rely on committed offsets for fault tolerance. When the checkpoint interval is long, the lag curve will behave in a zig-zag way: the lag will keep increasing and suddenly drop on a completed checkpoint. It has led to some confusion for users, as in https://stackoverflow.com/questions/76419633/flink-kafka-source-commit-offset-to-error-offset-suddenly-increase-or-decrease and may also affect external monitoring for setting up alarms (you'll have to set up a high threshold due to the non-realtime commit of offsets) and autoscaling (the algorithm would need to pay extra effort to distinguish whether the backlog is actually growing or just because the checkpoint is not completed yet). Therefore, I think it is worthwhile to add an option to enable auto-commit of offsets when checkpointing is enabled. For the DataStream API, it will be adding a configuration method. For the Table API, it will be adding a new connector option which wires to the DataStream API configuration underneath. -- This message was sent by Atlassian Jira (v8.20.10#820010)
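As a hedged sketch of a workaround available before such an option lands (the property keys are the standard Kafka consumer settings plus the Kafka connector's commit.offsets.on.checkpoint flag; whether this fits a given setup should be verified against the connector docs), the KafkaSource builder accepts raw consumer properties, so periodic auto-commit can be decoupled from checkpoints:

```java
import java.util.Properties;

public class KafkaCommitConfigSketch {

    // Properties to pass to the Kafka source so that offsets are committed
    // on the consumer's own schedule instead of on checkpoint completion.
    static Properties periodicCommitProps() {
        Properties props = new Properties();
        // Disable the checkpoint-coupled commit (connector-level flag)...
        props.setProperty("commit.offsets.on.checkpoint", "false");
        // ...and let the Kafka consumer commit periodically instead,
        // keeping the externally observed lag curve smooth.
        props.setProperty("enable.auto.commit", "true");
        props.setProperty("auto.commit.interval.ms", "5000");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(periodicCommitProps());
    }
}
```

Note that committed offsets remain purely informational for monitoring either way, since the Kafka source recovers from checkpointed state, not from the committed offsets.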
[jira] [Created] (FLINK-32868) Document the need to backport FLINK-30213 for using autoscaler with older version Flinks
Zhanghao Chen created FLINK-32868: - Summary: Document the need to backport FLINK-30213 for using autoscaler with older version Flinks Key: FLINK-32868 URL: https://issues.apache.org/jira/browse/FLINK-32868 Project: Flink Issue Type: Improvement Components: Autoscaler Reporter: Zhanghao Chen The current Autoscaler doc states the following job requirements: * The autoscaler currently only works with the latest Flink 1.17 (https://hub.docker.com/_/flink) or after backporting the following fixes to your 1.15/1.16 Flink image ** Job vertex parallelism overrides (https://github.com/apache/flink/commit/23ce2281a0bb4047c64def9af7ddd5f19d88e2a9) (must have) ** Support timespan for busyTime metrics (https://github.com/apache/flink/commit/a7fdab8b23cddf568fa32ee7eb804d7c3eb23a35) (good to have) However, https://issues.apache.org/jira/browse/FLINK-30213 is also crucial and needs to be backported to 1.15/1.16 to enable autoscaling. We should add it to the doc as well, marked as must-have. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-32872) Add option to control the default partitioner when the parallelism of upstream and downstream operator does not match
Zhanghao Chen created FLINK-32872: - Summary: Add option to control the default partitioner when the parallelism of upstream and downstream operator does not match Key: FLINK-32872 URL: https://issues.apache.org/jira/browse/FLINK-32872 Project: Flink Issue Type: New Feature Components: Runtime / Configuration Affects Versions: 1.17.0 Reporter: Zhanghao Chen *Problem* Currently, when no partitioner is specified, the FORWARD partitioner is used if the parallelism of the upstream and downstream operators matches, and the REBALANCE partitioner is used otherwise. However, this behavior is not configurable and can be undesirable in certain cases: 1. The REBALANCE partitioner will create an all-to-all connection between upstream and downstream operators and consume a lot of extra CPU and memory resources when the parallelism is high in pipelining mode; the RESCALE partitioner is desirable in this case. 2. For Flink SQL jobs, users cannot specify the partitioner directly so far. And for DataStream jobs, users may not want to explicitly set the partitioner everywhere. *Proposal* Add an option to control the default partitioner when the parallelism of upstream and downstream operators does not match. The option can have the name "pipeline.default-partitioner-with-unmatched-parallelism" with REBALANCE as the default value. -- This message was sent by Atlassian Jira (v8.20.10#820010)
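The resource argument in point 1 can be quantified with a toy calculation (illustrative only; RESCALE's exact channel count depends on how the two parallelisms divide, assumed here to divide evenly): REBALANCE wires every upstream subtask to every downstream subtask, while RESCALE only wires each upstream subtask to a local subset.

```java
public class PartitionerFanoutDemo {

    // REBALANCE: an all-to-all connection, upstream x downstream channels.
    static int rebalanceChannels(int upstream, int downstream) {
        return upstream * downstream;
    }

    // RESCALE: pointwise wiring; when one parallelism divides the other,
    // the total channel count is just the larger of the two.
    static int rescaleChannels(int upstream, int downstream) {
        return Math.max(upstream, downstream);
    }

    public static void main(String[] args) {
        // 1000 -> 500 subtasks: 500,000 channels vs 1,000 - each channel
        // carries network buffer memory and connection overhead.
        System.out.println(rebalanceChannels(1000, 500)); // 500000
        System.out.println(rescaleChannels(1000, 500));   // 1000
    }
}
```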
[jira] [Created] (FLINK-32980) Support env.java.opts.all & env.java.opts.cli config for starting Session clusters
Zhanghao Chen created FLINK-32980: - Summary: Support env.java.opts.all & env.java.opts.cli config for starting Session clusters Key: FLINK-32980 URL: https://issues.apache.org/jira/browse/FLINK-32980 Project: Flink Issue Type: Improvement Components: Command Line Client, Deployment / Scripts Affects Versions: 1.18.0 Reporter: Zhanghao Chen Fix For: 1.19.0 *Problem* The following configs are supposed to be supported: env.java.opts.all ((none), String): Java options to start the JVM of all Flink processes with. env.java.opts.client ((none), String): Java options to start the JVM of the Flink Client with. However, the two configs do not take effect for starting Flink session clusters using kubernetes-session.sh and yarn-session.sh. This can lead to problems in complex production envs. For example, in my company, some nodes are IPv6-only, and the connection between the Flink client and the K8s/YARN control plane is via a domain name whose backend is on an IPv4/v6 dual stack; the JVM arg -Djava.net.preferIPv6Addresses=true needs to be set to make Java connect to IPv6 addresses in favor of IPv4 ones, otherwise the K8s/YARN control plane is inaccessible. *Proposal* The fix is straightforward: simply apply the following changes to the session scripts: ` # Add Client-specific JVM options FLINK_ENV_JAVA_OPTS="${FLINK_ENV_JAVA_OPTS} ${FLINK_ENV_JAVA_OPTS_CLI}" exec $JAVA_RUN $JVM_ARGS $FLINK_ENV_JAVA_OPTS xxx ` -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-32983) Support setting env.java.opts.all & env.java.opts.cli configs via dynamic properties on the CLI side
Zhanghao Chen created FLINK-32983: - Summary: Support setting env.java.opts.all & env.java.opts.cli configs via dynamic properties on the CLI side Key: FLINK-32983 URL: https://issues.apache.org/jira/browse/FLINK-32983 Project: Flink Issue Type: Improvement Components: Command Line Client, Deployment / Scripts Affects Versions: 1.18.0 Reporter: Zhanghao Chen Fix For: 1.19.0 *Problem* The following configs are supposed to be supported: |h5. env.java.opts.all|(none)|String|Java options to start the JVM of all Flink processes with.| |h5. env.java.opts.client|(none)|String|Java options to start the JVM of the Flink Client with.| However, the two configs only take effect on the client side when they are set in the flink-conf file. In other words, configs set via -D or -yD on the CLI do not take effect, which is counter-intuitive and makes configuration less flexible. *Proposal* Add logic to parse configs set via -D or -yD in config.sh and give them higher precedence than the values set in flink-conf.yaml for env.java.opts.all & env.java.opts.client. -- This message was sent by Atlassian Jira (v8.20.10#820010)
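A minimal sketch of the proposed precedence, under the assumption that CLI-provided dynamic properties should simply shadow same-named keys from flink-conf.yaml. Class and method names here are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

class CliConfigPrecedence {
    static Map<String, String> effectiveConfig(
            Map<String, String> flinkConf, Map<String, String> dynamicProperties) {
        Map<String, String> merged = new HashMap<>(flinkConf); // start from flink-conf.yaml
        merged.putAll(dynamicProperties); // CLI -D/-yD values win on conflict
        return merged;
    }
}
```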
[jira] [Created] (FLINK-33123) Wrong dynamic replacement of partitioner from FORWARD to REBALANCE for autoscaler and adaptive scheduler
Zhanghao Chen created FLINK-33123: - Summary: Wrong dynamic replacement of partitioner from FORWARD to REBALANCE for autoscaler and adaptive scheduler Key: FLINK-33123 URL: https://issues.apache.org/jira/browse/FLINK-33123 Project: Flink Issue Type: Bug Components: Autoscaler, Runtime / Coordination Affects Versions: 1.17.0, 1.18.0 Reporter: Zhanghao Chen *Background* https://issues.apache.org/jira/browse/FLINK-30213 reported that the edge is wrong when the parallelism is changed for a vertex with a FORWARD edge, which affects both the autoscaler and the adaptive scheduler, where vertex parallelism can be changed dynamically. A fix was applied to dynamically replace the FORWARD partitioner with REBALANCE on task deployment in {{StreamTask}} (shown here as a diff):

```
 private static void replaceForwardPartitionerIfConsumerParallelismDoesNotMatch(
-        Environment environment, NonChainedOutput streamOutput) {
+        Environment environment, NonChainedOutput streamOutput, int outputIndex) {
     if (streamOutput.getPartitioner() instanceof ForwardPartitioner
-            && streamOutput.getConsumerParallelism()
+            && environment.getWriter(outputIndex).getNumberOfSubpartitions()
                     != environment.getTaskInfo().getNumberOfParallelSubtasks()) {
         LOG.debug(
                 "Replacing forward partitioner with rebalance for {}",
                 environment.getTaskInfo().getTaskNameWithSubtasks());
         streamOutput.setPartitioner(new RebalancePartitioner<>());
     }
 }
```

*Problem* Unfortunately, the fix is still buggy in two respects: # The connections between upstream and downstream tasks are determined by the distribution type of the partitioner when the execution graph is generated on the JM side. When the edge is FORWARD, the distribution type is POINTWISE, and Flink tries to distribute subpartitions evenly across the downstream tasks. To change the partitioner to REBALANCE, the distribution type would have to be changed to ALL_TO_ALL so that all-to-all connections are made between upstream and downstream tasks.
However, the fix did not change the distribution type, which causes the network connections to be set up incorrectly. # The FORWARD partitioner is replaced whenever environment.getWriter(outputIndex).getNumberOfSubpartitions() does not equal the task parallelism. However, the number of subpartitions here equals the number of downstream tasks connected to this particular task, which is also determined by the distribution type of the partitioner when the execution graph is generated on the JM side. Only when ceil(downstream task parallelism / upstream task parallelism) equals the upstream task parallelism does the number of subpartitions equal the task parallelism. In fact, for a normal job with a FORWARD edge and no autoscaling action at all, you will find that the partitioner is silently changed to REBALANCE, because the number of subpartitions always equals 1 in this case. -- This message was sent by Atlassian Jira (v8.20.10#820010)
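The second point can be illustrated with a small sketch (hypothetical names, not Flink code): with POINTWISE distribution, each upstream task is wired to roughly ceil(downstream/upstream) downstream tasks, so the subpartition count rarely matches the task parallelism, and the check misfires.

```java
class ForwardCheckSketch {
    // With POINTWISE distribution, downstream tasks are split roughly evenly
    // across upstream tasks, so each upstream task writes to about
    // ceil(downstream / upstream) subpartitions.
    static int subpartitionsPerUpstreamTask(int upstreamParallelism, int downstreamParallelism) {
        return (downstreamParallelism + upstreamParallelism - 1) / upstreamParallelism;
    }

    // My reading of the check in the fix: replace FORWARD whenever the
    // writer's subpartition count differs from the task parallelism.
    static boolean wouldReplaceForward(int upstreamParallelism, int downstreamParallelism) {
        return subpartitionsPerUpstreamTask(upstreamParallelism, downstreamParallelism)
                != upstreamParallelism;
    }
}
```

For a plain 4-to-4 FORWARD edge, each task has exactly 1 subpartition, so the check fires even though no rescaling happened.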
[jira] [Created] (FLINK-33146) FLIP-363: Unify the Representation of TaskManager Location in REST API and Web UI
Zhanghao Chen created FLINK-33146: - Summary: FLIP-363: Unify the Representation of TaskManager Location in REST API and Web UI Key: FLINK-33146 URL: https://issues.apache.org/jira/browse/FLINK-33146 Project: Flink Issue Type: Improvement Components: Runtime / REST, Runtime / Web Frontend Affects Versions: 1.18.0 Reporter: Zhanghao Chen Fix For: 1.19.0 Umbrella ticket for [FLIP-363: Unify the Representation of TaskManager Location in REST API and Web UI|https://cwiki.apache.org/confluence/display/FLINK/FLIP-363%3A+Unify+the+Representation+of+TaskManager+Location+in+REST+API+and+Web+UI]. This is a continuation of [FLINK-25371|https://issues.apache.org/jira/browse/FLINK-25371] (include data port as part of the host info for the subtask detail panel on the Web UI). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-33147) Introduce a new "endpoint" field in REST API to represent TaskManager endpoint (host + port) and deprecate the "host" field
Zhanghao Chen created FLINK-33147: - Summary: Introduce a new "endpoint" field in REST API to represent TaskManager endpoint (host + port) and deprecate the "host" field Key: FLINK-33147 URL: https://issues.apache.org/jira/browse/FLINK-33147 Project: Flink Issue Type: Sub-task Components: Runtime / REST Affects Versions: 1.18.0 Reporter: Zhanghao Chen Fix For: 1.19.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-33148) Update Web UI to adopt the new "endpoint" field in REST API
Zhanghao Chen created FLINK-33148: - Summary: Update Web UI to adopt the new "endpoint" field in REST API Key: FLINK-33148 URL: https://issues.apache.org/jira/browse/FLINK-33148 Project: Flink Issue Type: Sub-task Components: Runtime / Web Frontend Affects Versions: 1.18.0 Reporter: Zhanghao Chen Fix For: 1.19.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-33166) Support setting root logger level by config
Zhanghao Chen created FLINK-33166: - Summary: Support setting root logger level by config Key: FLINK-33166 URL: https://issues.apache.org/jira/browse/FLINK-33166 Project: Flink Issue Type: Improvement Components: Runtime / Configuration Reporter: Zhanghao Chen Users currently cannot change the logging level via configuration and have to manually edit the cumbersome logger configuration file. We should provide a shortcut and support setting the root logger level by config. There are already a number of configs for logger settings, such as {{env.log.dir}} for the logging directory and {{env.log.max}} for the maximum number of old log files to keep. We can name the new config {{env.log.level}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-33204) Add description for missing options in the all jobmanager/taskmanager options section in document
Zhanghao Chen created FLINK-33204: - Summary: Add description for missing options in the all jobmanager/taskmanager options section in document Key: FLINK-33204 URL: https://issues.apache.org/jira/browse/FLINK-33204 Project: Flink Issue Type: Technical Debt Components: Runtime / Configuration Affects Versions: 1.17.0, 1.18.0 Reporter: Zhanghao Chen Fix For: 1.19.0 There are 4 options which are excluded from the all jobmanager/taskmanager options section in the configuration document: # taskmanager.bind-host # taskmanager.rpc.bind-port # jobmanager.bind-host # jobmanager.rpc.bind-port We should add them to the document under the all jobmanager/taskmanager options section for completeness. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-33205) Replace Akka with Pekko in the description of "pekko.ssl.enabled"
Zhanghao Chen created FLINK-33205: - Summary: Replace Akka with Pekko in the description of "pekko.ssl.enabled" Key: FLINK-33205 URL: https://issues.apache.org/jira/browse/FLINK-33205 Project: Flink Issue Type: Technical Debt Components: Runtime / Configuration Affects Versions: 1.18.0 Reporter: Zhanghao Chen Fix For: 1.19.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-33221) Add config options for administrator JVM options
Zhanghao Chen created FLINK-33221: - Summary: Add config options for administrator JVM options Key: FLINK-33221 URL: https://issues.apache.org/jira/browse/FLINK-33221 Project: Flink Issue Type: Improvement Components: Runtime / Configuration Reporter: Zhanghao Chen We encounter issues similar to those described in SPARK-23472. Users may need to add JVM options to their Flink applications (e.g. to tune GC options) and typically use the {{env.java.opts.x}} series of options to do so. We also have a set of administrator JVM options to apply by default, e.g. to enable GC logging and tune GC options. Both use cases need to set the same series of options and will clobber one another. In the past, we generated and prepended the administrator JVM options in the Java code that generates the start command for the JM/TM. However, this has proven difficult to maintain. Therefore, I propose to also add a set of default JVM options for administrator use that is prepended to the user-set extra JVM options. We can mark the existing {{env.java.opts.x}} series as user-set extra JVM options and add a new set of {{env.java.opts.x.default}} options for administrator use. -- This message was sent by Atlassian Jira (v8.20.10#820010)
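The intended layering can be sketched as below (hypothetical helper, not the actual script logic): administrator defaults come first on the command line, so user-set extra options appear later and, for most JVM flags where the last occurrence wins (e.g. -Xmx), take effect.

```java
class JvmOptsLayering {
    // Administrator defaults first, user extras last; for flags like -Xmx
    // the JVM honors the last occurrence, so user settings take effect.
    static String buildJvmOpts(String adminDefaults, String userExtraOpts) {
        return (adminDefaults + " " + userExtraOpts).trim();
    }
}
```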
[jira] [Created] (FLINK-33236) Remove the unused high-availability.zookeeper.path.running-registry option
Zhanghao Chen created FLINK-33236: - Summary: Remove the unused high-availability.zookeeper.path.running-registry option Key: FLINK-33236 URL: https://issues.apache.org/jira/browse/FLINK-33236 Project: Flink Issue Type: Technical Debt Components: Runtime / Configuration Affects Versions: 1.18.0 Reporter: Zhanghao Chen Fix For: 1.19.0 The running registry subcomponent of Flink HA was removed in [FLINK-25430|https://issues.apache.org/jira/browse/FLINK-25430], and the "high-availability.zookeeper.path.running-registry" option has been of no use since then. We should remove the option and regenerate the config docs to remove the relevant descriptions and avoid confusing users. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-33240) Generate docs for deprecated options as well
Zhanghao Chen created FLINK-33240: - Summary: Generate docs for deprecated options as well Key: FLINK-33240 URL: https://issues.apache.org/jira/browse/FLINK-33240 Project: Flink Issue Type: Improvement Components: Documentation Reporter: Zhanghao Chen Fix For: 1.19.0 Currently, Flink skips doc generation for deprecated options (see {{ConfigOptionsDocGenerator#shouldBeDocumented}}). As a result, deprecated options can no longer be found in the documentation of new Flink versions. This might confuse users upgrading from an older version of Flink, who then have to either carefully read the release notes or check the source code for upgrade guidance on deprecated options. I suggest generating docs for deprecated options as well, and we should scan through the code to make sure that proper upgrade guidance is provided for the deprecated options. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-33261) FLIP-367: Support Setting Parallelism for Table/SQL Sources
Zhanghao Chen created FLINK-33261: - Summary: FLIP-367: Support Setting Parallelism for Table/SQL Sources Key: FLINK-33261 URL: https://issues.apache.org/jira/browse/FLINK-33261 Project: Flink Issue Type: New Feature Components: Table SQL / API Reporter: Zhanghao Chen Umbrella issue for [FLIP-367: Support Setting Parallelism for Table/SQL Sources|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=263429150]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-33262) Extend source provider interfaces with the new parallelism provider interface
Zhanghao Chen created FLINK-33262: - Summary: Extend source provider interfaces with the new parallelism provider interface Key: FLINK-33262 URL: https://issues.apache.org/jira/browse/FLINK-33262 Project: Flink Issue Type: Sub-task Components: Table SQL / API Reporter: Zhanghao Chen -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-33263) Implement ParallelismProvider for sources in Blink planner
Zhanghao Chen created FLINK-33263: - Summary: Implement ParallelismProvider for sources in Blink planner Key: FLINK-33263 URL: https://issues.apache.org/jira/browse/FLINK-33263 Project: Flink Issue Type: Sub-task Components: Table SQL / Planner Reporter: Zhanghao Chen -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-33264) Support source parallelism setting for DataGen connector
Zhanghao Chen created FLINK-33264: - Summary: Support source parallelism setting for DataGen connector Key: FLINK-33264 URL: https://issues.apache.org/jira/browse/FLINK-33264 Project: Flink Issue Type: Sub-task Reporter: Zhanghao Chen -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-33265) Support source parallelism setting for Kafka connector
Zhanghao Chen created FLINK-33265: - Summary: Support source parallelism setting for Kafka connector Key: FLINK-33265 URL: https://issues.apache.org/jira/browse/FLINK-33265 Project: Flink Issue Type: Sub-task Components: Connectors / Kafka Reporter: Zhanghao Chen -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-33681) Display source/sink numRecordsIn/Out & numBytesIn/Out on UI
Zhanghao Chen created FLINK-33681: - Summary: Display source/sink numRecordsIn/Out & numBytesIn/Out on UI Key: FLINK-33681 URL: https://issues.apache.org/jira/browse/FLINK-33681 Project: Flink Issue Type: Improvement Components: Runtime / Metrics Reporter: Zhanghao Chen Attachments: image-2023-11-29-13-26-15-176.png Currently, the numRecordsIn & numBytesIn metrics for sources and the numRecordsOut & numBytesOut metrics for sinks are always 0 on the Flink web dashboard. [FLINK-11576|https://issues.apache.org/jira/browse/FLINK-11576] brought us these metrics on the operator level, but it did not integrate them on the task level. On the other hand, the summary metrics on the job overview page are based on the task-level I/O metrics. As a result, even though new connectors supporting FLIP-33 metrics report operator-level I/O metrics, we still cannot see those metrics on the dashboard. This ticket serves as an umbrella issue to integrate standard source/sink I/O metrics with the corresponding task I/O metrics. !image-2023-11-29-13-26-15-176.png|width=590,height=252! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-33682) Reuse source operator input records/bytes metrics for SourceOperatorStreamTask
Zhanghao Chen created FLINK-33682: - Summary: Reuse source operator input records/bytes metrics for SourceOperatorStreamTask Key: FLINK-33682 URL: https://issues.apache.org/jira/browse/FLINK-33682 Project: Flink Issue Type: Sub-task Components: Runtime / Metrics Reporter: Zhanghao Chen For SourceOperatorStreamTask, the source operator is the head operator that takes input. We can directly reuse the source operator's input records/bytes metrics for it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-33891) Remove the obsolete SingleJobGraphStore
Zhanghao Chen created FLINK-33891: - Summary: Remove the obsolete SingleJobGraphStore Key: FLINK-33891 URL: https://issues.apache.org/jira/browse/FLINK-33891 Project: Flink Issue Type: Technical Debt Components: Runtime / Coordination Reporter: Zhanghao Chen SingleJobGraphStore was introduced a long time ago in FLIP-6. It is only used in one test case, DefaultDispatcherRunnerITCase#leaderChange_withBlockingJobManagerTermination_doesNotAffectNewLeader. We can replace it with TestingJobGraphStore there and then safely remove the class. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-33940) Update the auto-derivation rule of max parallelism for enlarged upscaling space
Zhanghao Chen created FLINK-33940: - Summary: Update the auto-derivation rule of max parallelism for enlarged upscaling space Key: FLINK-33940 URL: https://issues.apache.org/jira/browse/FLINK-33940 Project: Flink Issue Type: Improvement Components: API / Core Reporter: Zhanghao Chen *Background* The choice of the max parallelism of a stateful operator is important: it limits the upper bound of the operator's parallelism, while setting it too large adds extra overhead. Currently, the max parallelism of an operator is either fixed to a value specified via the API core / pipeline option or auto-derived with the following rule: `min(max(roundUpToPowerOfTwo(operatorParallelism * 1.5), 128), 32767)` *Problem* Recently, the elasticity of Flink jobs has become more and more valued by users. The current auto-derivation rule was introduced a long time ago and only allows the operator parallelism to be roughly doubled, which is not desirable for elasticity. Setting a max parallelism manually may not be desirable either: users may not have sufficient expertise to select a good max-parallelism value. *Proposal* Update the auto-derivation rule of max parallelism to derive a larger max parallelism for a better out-of-the-box elasticity experience. A candidate is: `min(max(roundUpToPowerOfTwo(operatorParallelism * 5), 1024), 32767)` Looking forward to your opinions on this. -- This message was sent by Atlassian Jira (v8.20.10#820010)
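Both rules can be written down concretely. The sketch below is my reading of the quoted formulas, with an illustrative roundUpToPowerOfTwo helper; Flink's internal rounding may differ in details.

```java
class MaxParallelismRules {
    static int roundUpToPowerOfTwo(int x) {
        int v = 1;
        while (v < x) {
            v <<= 1;
        }
        return v;
    }

    // Current rule: min(max(roundUpToPowerOfTwo(p * 1.5), 128), 32767)
    static int currentDerivedMax(int operatorParallelism) {
        return Math.min(
                Math.max(roundUpToPowerOfTwo((int) Math.ceil(operatorParallelism * 1.5)), 128),
                32767);
    }

    // Candidate rule: min(max(roundUpToPowerOfTwo(p * 5), 1024), 32767)
    static int proposedDerivedMax(int operatorParallelism) {
        return Math.min(Math.max(roundUpToPowerOfTwo(operatorParallelism * 5), 1024), 32767);
    }
}
```

For an operator with parallelism 100, the current rule yields 256 (barely 2.5x headroom), while the candidate rule yields 1024.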
[jira] [Created] (FLINK-33962) Chaining-agnostic OperatorID generation for improved state compatibility on parallelism change
Zhanghao Chen created FLINK-33962: - Summary: Chaining-agnostic OperatorID generation for improved state compatibility on parallelism change Key: FLINK-33962 URL: https://issues.apache.org/jira/browse/FLINK-33962 Project: Flink Issue Type: Improvement Components: API / Core Reporter: Zhanghao Chen *Background* Flink restores operator state from snapshots by matching OperatorIDs. Since Flink 1.2, {{StreamGraphHasherV2}} has been used for OperatorID generation when no user-set uid exists. The generated OperatorID is deterministic with respect to: * node-local properties (the traverse ID in the BFS over the DAG) * chained output nodes * input node hashes *Problem* The chaining behavior affects state compatibility, as the OperatorID of an operator depends on its chained output nodes. For example, a simple source->sink DAG with source and sink chained together is state-incompatible with an otherwise identical DAG with source and sink unchained (either because the parallelisms of the two operators were changed to be unequal or because chaining was disabled). This greatly limits the flexibility to break or join chains for performance tuning. *Proposal* Introduce a {{StreamGraphHasherV3}} that is agnostic to the chaining behavior of operators, which effectively just removes L227-235 of [StreamGraphHasherV2.java|https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/graph/StreamGraphHasherV2.java]. This does not hurt the determinism of ID generation across job submissions as long as the stream graph topology doesn't change, and since new versions of Flink have already adopted pure operator-level state recovery, it does not break state recovery across job submissions as long as both submissions use the same hasher. It does, however, break cross-version state compatibility.
So we can introduce a new option to enable using HasherV3 in v1.19 and consider making it the default hasher in v2.0. Looking forward to suggestions on this. -- This message was sent by Atlassian Jira (v8.20.10#820010)
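A chain-agnostic hash could look like the following sketch, which digests only the node-local traverse id and the input hashes, and deliberately omits chained outputs. This is an illustration of the idea, not the actual StreamGraphHasher implementation; all names are invented.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

class ChainAgnosticHasher {
    // The digest covers only node-local properties (the BFS traverse id) and
    // the input nodes' hashes; chained outputs are deliberately excluded, so
    // breaking or joining a chain leaves the generated ID unchanged.
    static byte[] hashNode(int traverseId, List<byte[]> inputHashes) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(ByteBuffer.allocate(4).putInt(traverseId).array());
            for (byte[] input : inputHashes) {
                md.update(input); // input order is part of the hash
            }
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }
}
```

The hash stays deterministic across submissions as long as the topology (traverse order and input wiring) is unchanged, which is the property the proposal relies on.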
[jira] [Created] (FLINK-33977) Adaptive scheduler may not minimize the number of TMs during downscaling
Zhanghao Chen created FLINK-33977: - Summary: Adaptive scheduler may not minimize the number of TMs during downscaling Key: FLINK-33977 URL: https://issues.apache.org/jira/browse/FLINK-33977 Project: Flink Issue Type: Improvement Affects Versions: 1.18.0 Reporter: Zhanghao Chen The adaptive scheduler uses a SlotAssigner to assign free slots to slot sharing groups. Currently, two implementations of SlotAssigner are available: the DefaultSlotAssigner, which treats all slots and slot sharing groups equally, and the StateLocalitySlotAssigner, which assigns slots based on the number of local key groups to exploit local state recovery. The scheduler uses the DefaultSlotAssigner when no key group assignment info is available and the StateLocalitySlotAssigner otherwise. However, neither SlotAssigner aims to minimize the number of TMs, which may produce suboptimal slot assignments in Application Mode. For example, when a job with 8 slot sharing groups and 2 TMs (4 slots each) is downscaled through the FLIP-291 API to 4 slot sharing groups, the cluster may still have 2 TMs, one with 1 free slot and the other with 3 free slots. For end users, this implies an ineffective downscaling, as the total cluster resources are not reduced. We should take minimizing the number of TMs into consideration as well. A possible approach is to enhance the StateLocalitySlotAssigner: when the number of free slots exceeds the need, sort all TMs by a score summed from the allocation scores of all slots on each TM, remove slots from the excess TMs with the lowest scores, and proceed with the remaining slot assignment. -- This message was sent by Atlassian Jira (v8.20.10#820010)
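The proposed enhancement can be sketched as follows (hypothetical types and names, not the actual SlotAssigner API): score each TM, then drain slots from the lowest-scoring TMs first so that whole TMs become empty and can be released.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

class TmMinimizingAssigner {
    // freeSlotsPerTm: number of free slots on each TM;
    // tmScore: per-TM sum of its slots' allocation scores.
    // Returns the TMs to drain, lowest score first, until at least
    // excessSlots free slots have been taken away.
    static List<String> tmsToDrain(
            Map<String, Integer> freeSlotsPerTm, Map<String, Double> tmScore, int excessSlots) {
        List<String> order = new ArrayList<>(freeSlotsPerTm.keySet());
        order.sort(Comparator.comparingDouble(tmScore::get));
        List<String> drained = new ArrayList<>();
        int removed = 0;
        for (String tm : order) {
            if (removed >= excessSlots) {
                break;
            }
            drained.add(tm);
            removed += freeSlotsPerTm.get(tm);
        }
        return drained;
    }
}
```

Draining whole low-scoring TMs first is what lets the cluster actually shrink, instead of leaving partially free TMs behind as in the 8-to-4 example above.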