Re: Re: [VOTE] FLIP-389: Annotate SingleThreadFetcherManager as PublicEvolving

2024-01-19 Thread Ron liu
+1(binding)

Best,
Ron

Xuyang wrote on Fri, Jan 19, 2024 at 14:00:

> +1 (non-binding)
>
>
> --
>
> Best!
> Xuyang
>
>
>
>
>
> On 2024-01-19 10:16:23, "Qingsheng Ren" wrote:
> >+1 (binding)
> >
> >Thanks for the work, Hongshun!
> >
> >Best,
> >Qingsheng
> >
> >On Tue, Jan 16, 2024 at 11:21 AM Leonard Xu  wrote:
> >
> >> Thanks Hongshun for driving this !
> >>
> >> +1(binding)
> >>
> >> Best,
> >> Leonard
> >>
> >> > On Jan 3, 2024 at 8:04 PM, Hongshun Wang wrote:
> >> >
> >> > Dear Flink Developers,
> >> >
> >> > Thank you for providing feedback on FLIP-389: Annotate
> >> > SingleThreadFetcherManager as PublicEvolving[1] on the discussion
> >> > thread[2]. The goal of the FLIP is as follows:
> >> >
> >> >   - To expose the SplitFetcherManager / SingleThreadFetcheManager as
> >> >   Public, allowing connector developers to easily create their own
> >> threading
> >> >   models in the SourceReaderBase by implementing addSplits(),
> >> removeSplits(),
> >> >   maybeShutdownFinishedFetchers() and other functions.
> >> >   - To hide the element queue from the connector developers and
> simplify
> >> >   the SourceReaderBase to consist of only SplitFetcherManager and
> >> >   RecordEmitter as major components.
> >> >
> >> >
> >> > Any additional questions regarding this FLIP? Looking forward to
> hearing
> >> > from you.
> >> >
> >> >
> >> > Thanks,
> >> > Hongshun Wang
> >> >
> >> >
> >> > [1]
> >> >
> >>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=278465498
> >> >
> >> > [2] https://lists.apache.org/thread/b8f509878ptwl3kmmgg95tv8sb1j5987
> >>
> >>
>
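
[Editor's note] To make the FLIP's extension points concrete for readers of the archive, here is a small, self-contained sketch in plain Java. It is NOT Flink's actual SplitFetcherManager: the method names (addSplits, removeSplits, maybeShutdownFinishedFetchers) mirror the FLIP, but the types and bodies are simplified illustrative stand-ins, so it compiles without any Flink dependency.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a source split (in Flink this would be SourceSplit).
record Split(String id) {}

// Illustrative model of a fetcher manager: it owns fetchers, assigns splits
// to them, and hides the element queue from the connector developer.
class SimpleFetcherManager {
    private final Map<String, List<Split>> fetcherAssignments = new ConcurrentHashMap<>();

    // Single-threaded model: route every split to one fetcher. A custom
    // manager could override this to shard splits across many fetchers.
    public void addSplits(List<Split> splits) {
        fetcherAssignments
                .computeIfAbsent("fetcher-0", k -> new ArrayList<>())
                .addAll(splits);
    }

    public void removeSplits(List<Split> splits) {
        fetcherAssignments.values().forEach(assigned -> assigned.removeAll(splits));
    }

    // Drop fetchers that have no splits left; returns true when none remain.
    public boolean maybeShutdownFinishedFetchers() {
        fetcherAssignments.entrySet().removeIf(e -> e.getValue().isEmpty());
        return fetcherAssignments.isEmpty();
    }

    public int numFetchers() {
        return fetcherAssignments.size();
    }
}

class Main {
    public static void main(String[] args) {
        SimpleFetcherManager manager = new SimpleFetcherManager();
        List<Split> splits = List.of(new Split("a"), new Split("b"));
        manager.addSplits(splits);
        System.out.println(manager.numFetchers()); // 1
        manager.removeSplits(splits);
        System.out.println(manager.maybeShutdownFinishedFetchers()); // true
    }
}
```

A connector developer implementing a custom threading model would, per the FLIP, subclass the real SplitFetcherManager and override these same hooks.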


Re: Temporal join on rolling aggregate

2024-02-25 Thread Ron liu
+1,
But I think this should be a more general requirement: support for
declaring watermarks in a query, so that a watermark can be declared on
any type of source, such as a table or a view. Similar to what Databricks
provides [1]; this needs a FLIP.

[1]
https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-qry-select-watermark.html
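
[Editor's note] As a concrete illustration, the sketch below mimics the shape of the Databricks WATERMARK clause from [1] applied to a Flink-style windowed query over a view. This is hypothetical syntax only: neither the keywords nor their placement exist in Flink today, and a FLIP would define the actual grammar.

```sql
-- Hypothetical syntax, modeled on the Databricks WATERMARK clause [1];
-- not supported by Flink at the time of this thread.
SELECT window_start, window_end, COUNT(*) AS cnt
FROM TABLE(
    TUMBLE(
        TABLE my_view WATERMARK ts DELAY OF INTERVAL '5' MINUTES,
        DESCRIPTOR(ts),
        INTERVAL '10' MINUTES))
GROUP BY window_start, window_end;
```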

Best,
Ron


Re: [ANNOUNCE] New Apache Flink Committer - Jiabao Sun

2024-02-25 Thread Ron liu
Congratulations, Jiabao!

Best,
Ron

Yun Tang wrote on Fri, Feb 23, 2024 at 19:59:

> Congratulations, Jiabao!
>
> Best
> Yun Tang
> 
> From: Weihua Hu 
> Sent: Thursday, February 22, 2024 17:29
> To: dev@flink.apache.org 
> Subject: Re: [ANNOUNCE] New Apache Flink Committer - Jiabao Sun
>
> Congratulations, Jiabao!
>
> Best,
> Weihua
>
>
> On Thu, Feb 22, 2024 at 10:34 AM Jingsong Li 
> wrote:
>
> > Congratulations! Well deserved!
> >
> > On Wed, Feb 21, 2024 at 4:36 PM Yuepeng Pan 
> wrote:
> > >
> > > Congratulations~ :)
> > >
> > > Best,
> > > Yuepeng Pan
> > >
> > > On 2024-02-21 09:52:17, "Hongshun Wang" wrote:
> > > >Congratulations, Jiabao :)
> > > >
> > > >Best,
> > > >Hongshun
> > > >
> > > >Congratulations Jiabao!
> > > >
> > > >Best regards,
> > > >
> > > >Weijie
> > > >
> > > >On Tue, Feb 20, 2024 at 2:19 PM Runkang He  wrote:
> > > >
> > > >> Congratulations Jiabao!
> > > >>
> > > >> Best,
> > > >> Runkang He
> > > >>
> > > >> Jane Chan wrote on Tue, Feb 20, 2024 at 14:18:
> > > >>
> > > >> > Congrats, Jiabao!
> > > >> >
> > > >> > Best,
> > > >> > Jane
> > > >> >
> > > >> > On Tue, Feb 20, 2024 at 10:32 AM Paul Lam 
> > wrote:
> > > >> >
> > > >> > > Congrats, Jiabao!
> > > >> > >
> > > >> > > Best,
> > > >> > > Paul Lam
> > > >> > >
> > > >> > > > On Feb 20, 2024 at 10:29, Zakelly Lan wrote:
> > > >> > > >
> > > >> > > >> Congrats! Jiabao!
> > > >> > >
> > > >> > >
> > > >> >
> > > >>
> >
>


[DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-03-11 Thread Ron liu
Hi, Dev


Lincoln Lee and I would like to start a discussion about FLIP-435:
Introduce a New Dynamic Table for Simplifying Data Pipelines.


This FLIP is designed to simplify the development of data processing
pipelines. With Dynamic Tables, which combine uniform SQL statements
with a declared freshness, users can define batch and streaming
transformations to data in the same way, accelerate ETL pipeline
development, and have task scheduling managed automatically.
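
[Editor's note] For readers skimming the archive, the FLIP's core idea can be sketched with a short DDL example. The statement shape below follows the proposal as described in FLIP-435; the exact keywords may differ from the final design, and the table and column names are illustrative only.

```sql
-- Sketch of the proposed statement (shape per FLIP-435; details may have
-- changed in later revisions of the design).
CREATE DYNAMIC TABLE dwd_orders
FRESHNESS = INTERVAL '3' MINUTE
AS SELECT o.order_id, o.user_id, p.amount
   FROM orders AS o
   JOIN payments AS p ON o.order_id = p.order_id;
```

The declared FRESHNESS drives whether the background refresh runs as a continuous streaming job or as periodically scheduled batch jobs.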


For more details, see FLIP-435 [1]. Looking forward to your feedback.


[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-435%3A+Introduce+a+New+Dynamic+Table+for+Simplifying+Data+Pipelines

Best,

Lincoln & Ron


Re: [VOTE] Release 1.19.0, release candidate #2

2024-03-11 Thread Ron liu
+1 (non binding)

quickly verified:
- verified that source distribution does not contain binaries
- verified checksums
- built source code successfully


Best,
Ron

Jeyhun Karimov wrote on Tue, Mar 12, 2024 at 01:00:

> +1 (non binding)
>
> - verified that source distribution does not contain binaries
> - verified signatures and checksums
> - built source code successfully
>
> Regards,
> Jeyhun
>
>
> On Mon, Mar 11, 2024 at 3:08 PM Samrat Deb  wrote:
>
> > +1 (non binding)
> >
> > - verified signatures and checksums
> > - ASF headers are present in all expected files
> > - No unexpected binary files found in the source
> > - Build successful locally
> > - tested basic word count example
> >
> >
> >
> >
> > Bests,
> > Samrat
> >
> > On Mon, 11 Mar 2024 at 7:33 PM, Ahmed Hamdy 
> wrote:
> >
> > > Hi Lincoln
> > > +1 (non-binding) from me
> > >
> > > - Verified Checksums & Signatures
> > > - Verified Source dists don't contain binaries
> > > - Built source successfully
> > > - reviewed web PR
> > >
> > >
> > > Best Regards
> > > Ahmed Hamdy
> > >
> > >
> > > On Mon, 11 Mar 2024 at 15:18, Lincoln Lee 
> > wrote:
> > >
> > > > Hi Robin,
> > > >
> > > > Thanks for helping verifying the release note[1], FLINK-14879 should
> > not
> > > > have been included, after confirming this
> > > > I moved all unresolved non-blocker issues left over from 1.19.0 to
> > 1.20.0
> > > > and reconfigured the release note [1].
> > > >
> > > > Best,
> > > > Lincoln Lee
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353282
> > > >
> > > >
> > > > Robin Moffatt wrote on Mon, Mar 11, 2024 at 19:36:
> > > >
> > > > > Looking at the release notes [1] it lists `DESCRIBE DATABASE`
> > > > (FLINK-14879)
> > > > > and `DESCRIBE CATALOG` (FLINK-14690).
> > > > > When I try these in 1.19 RC2 the behaviour is as in 1.18.1, i.e. it
> > is
> > > > not
> > > > > supported:
> > > > >
> > > > > ```
> > > > > [INFO] Execute statement succeed.
> > > > >
> > > > > Flink SQL> show catalogs;
> > > > > +-----------------+
> > > > > |    catalog name |
> > > > > +-----------------+
> > > > > |           c_new |
> > > > > | default_catalog |
> > > > > +-----------------+
> > > > > 2 rows in set
> > > > >
> > > > > Flink SQL> DESCRIBE CATALOG c_new;
> > > > > [ERROR] Could not execute SQL statement. Reason:
> > > > > org.apache.calcite.sql.validate.SqlValidatorException: Column
> 'c_new'
> > > not
> > > > > found in any table
> > > > >
> > > > > Flink SQL> show databases;
> > > > > +------------------+
> > > > > |    database name |
> > > > > +------------------+
> > > > > | default_database |
> > > > > +------------------+
> > > > > 1 row in set
> > > > >
> > > > > Flink SQL> DESCRIBE DATABASE default_database;
> > > > > [ERROR] Could not execute SQL statement. Reason:
> > > > > org.apache.calcite.sql.validate.SqlValidatorException: Column
> > > > > 'default_database' not found in
> > > > > any table
> > > > > ```
> > > > >
> > > > > Is this an error in the release notes, or my mistake in
> interpreting
> > > > them?
> > > > >
> > > > > thanks, Robin.
> > > > >
> > > > >
> > > > > [1]
> > > > >
> > > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353282
> > > > >
> > > > > On Thu, 7 Mar 2024 at 10:01, Lincoln Lee 
> > > wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > Please review and vote on the release candidate #2 for the
> version
> > > > > 1.19.0,
> > > > > > as follows:
> > > > > > [ ] +1, Approve the release
> > > > > > [ ] -1, Do not approve the release (please provide specific
> > comments)
> > > > > >
> > > > > > The complete staging area is available for your review, which
> > > includes:
> > > > > >
> > > > > > * JIRA release notes [1], and the pull request adding release
> note
> > > for
> > > > > > users [2]
> > > > > > * the official Apache source release and binary convenience
> > releases
> > > to
> > > > > be
> > > > > > deployed to dist.apache.org [3], which are signed with the key
> > with
> > > > > > fingerprint E57D30ABEE75CA06  [4],
> > > > > > * all artifacts to be deployed to the Maven Central Repository
> [5],
> > > > > > * source code tag "release-1.19.0-rc2" [6],
> > > > > > * website pull request listing the new release and adding
> > > announcement
> > > > > blog
> > > > > > post [7].
> > > > > >
> > > > > > The vote will be open for at least 72 hours. It is adopted by
> > > majority
> > > > > > approval, with at least 3 PMC affirmative votes.
> > > > > >
> > > > > > [1]
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353282
> > > > > > [2] https://github.com/apache/flink/pull/24394
> > > > > > [3]
> https://dist.apache.org/repos/dist/dev/flink/flink-1.19.0-rc2/
> > > > > > [4] https://dist.apache.org/repos/dist/release/flink/KEYS
> > > > > > [5]
> > > > >
> > https://repository.apache.org/content/repo

Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-03-13 Thread Ron liu
Yes, I have some other questions.
> 1. How to define query of dynamic table?
> Use flink sql or introducing new syntax?
> If use flink sql, how to handle the difference in SQL between streaming and
> batch processing?
> For example, a query including window aggregate based on processing time?
> or a query including global order by?
>
> 2. Whether modify the query of dynamic table is allowed?
> Or we could only refresh a dynamic table based on initial query?
>
> 3. How to use dynamic table?
> The dynamic table seems to be similar with materialized view.  Will we do
> something like materialized view rewriting during the optimization?
>
> Best,
> Jing Zhang
>
>
> Timo Walther wrote on Wed, Mar 13, 2024 at 01:24:
>
> > Hi Lincoln & Ron,
> >
> > thanks for proposing this FLIP. I think a design similar to what you
> > propose has been in the heads of many people, however, I'm wondering how
> > this will fit into the bigger picture.
> >
> > I haven't deeply reviewed the FLIP yet, but would like to ask some
> > initial questions:
> >
> > Flink has introduced the concept of Dynamic Tables many years ago. How
> > does the term "Dynamic Table" fit into Flink's regular tables and also
> > how does it relate to Table API?
> >
> > I fear that adding the DYNAMIC TABLE keyword could cause confusion for
> > users, because a term for regular CREATE TABLE (that can be "kind of
> > dynamic" as well and is backed by a changelog) is then missing. Also
> > given that we call our connectors for those tables, DynamicTableSource
> > and DynamicTableSink.
> >
> > In general, I find it contradicting that a TABLE can be "paused" or
> > "resumed". From an English language perspective, this does sound
> > incorrect. In my opinion (without much research yet), a continuous
> > updating trigger should rather be modelled as a CREATE MATERIALIZED VIEW
> > (which users are familiar with?) or a new concept such as a CREATE TASK
> > (that can be paused and resumed?).
> >
> > How do you envision re-adding the functionality of a statement set, that
> > fans out to multiple tables? This is a very important use case for data
> > pipelines.
> >
> > Since the early days of Flink SQL, we were discussing `SELECT STREAM *
> > FROM T EMIT 5 MINUTES`. Your proposal seems to rephrase STREAM and EMIT,
> > into other keywords DYNAMIC TABLE and FRESHNESS. But the core
> > functionality is still there. I'm wondering if we should widen the scope
> > (maybe not part of this FLIP but a new FLIP) to follow the standard more
> > closely. Making `SELECT * FROM t` bounded by default and use new syntax
> > for the dynamic behavior. Flink 2.0 would be the perfect time for this,
> > however, it would require careful discussions. What do you think?
> >
> > Regards,
> > Timo
> >
> >
> > On 11.03.24 08:23, Ron liu wrote:
> > > Hi, Dev
> > >
> > >
> > > Lincoln Lee and I would like to start a discussion about FLIP-435:
> > > Introduce a  New Dynamic Table for Simplifying Data Pipelines.
> > >
> > >
> > > This FLIP is designed to simplify the development of data processing
> > > pipelines. With Dynamic Tables with uniform SQL statements and
> > > freshness, users can define batch and streaming transformations to
> > > data in the same way, accelerate ETL pipeline development, and manage
> > > task scheduling automatically.
> > >
> > >
> > > For more details, see FLIP-435 [1]. Looking forward to your feedback.
> > >
> > >
> > > [1]
> > >
> > >
> > > Best,
> > >
> > > Lincoln & Ron
> > >
> >
> >
>


Re: [ANNOUNCE] Apache Flink 1.19.0 released

2024-03-18 Thread Ron liu
Congratulations

Best,
Ron

Yanfei Lei wrote on Mon, Mar 18, 2024 at 20:01:

> Congrats, thanks for the great work!
>
> Sergey Nuyanzin wrote on Mon, Mar 18, 2024 at 19:30:
> >
> > Congratulations, thanks release managers and everyone involved for the
> great work!
> >
> > On Mon, Mar 18, 2024 at 12:15 PM Benchao Li 
> wrote:
> >>
> >> Congratulations! And thanks to all release managers and everyone
> >> involved in this release!
> >>
> >> Yubin Li wrote on Mon, Mar 18, 2024 at 18:11:
> >> >
> >> > Congratulations!
> >> >
> >> > Thanks to release managers and everyone involved.
> >> >
> >> > On Mon, Mar 18, 2024 at 5:55 PM Hangxiang Yu 
> wrote:
> >> > >
> >> > > Congratulations!
> >> > > Thanks release managers and all involved!
> >> > >
> >> > > On Mon, Mar 18, 2024 at 5:23 PM Hang Ruan 
> wrote:
> >> > >
> >> > > > Congratulations!
> >> > > >
> >> > > > Best,
> >> > > > Hang
> >> > > >
> >> > > > Paul Lam wrote on Mon, Mar 18, 2024 at 17:18:
> >> > > >
> >> > > > > Congrats! Thanks to everyone involved!
> >> > > > >
> >> > > > > Best,
> >> > > > > Paul Lam
> >> > > > >
> >> > > > > > 2024年3月18日 16:37,Samrat Deb  写道:
> >> > > > > >
> >> > > > > > Congratulations !
> >> > > > > >
> >> > > > > > On Mon, 18 Mar 2024 at 2:07 PM, Jingsong Li <
> jingsongl...@gmail.com>
> >> > > > > wrote:
> >> > > > > >
> >> > > > > >> Congratulations!
> >> > > > > >>
> >> > > > > >> On Mon, Mar 18, 2024 at 4:30 PM Rui Fan <
> 1996fan...@gmail.com> wrote:
> >> > > > > >>>
> >> > > > > >>> Congratulations, thanks for the great work!
> >> > > > > >>>
> >> > > > > >>> Best,
> >> > > > > >>> Rui
> >> > > > > >>>
> >> > > > > >>> On Mon, Mar 18, 2024 at 4:26 PM Lincoln Lee <
> lincoln.8...@gmail.com>
> >> > > > > >> wrote:
> >> > > > > 
> >> > > > >  The Apache Flink community is very happy to announce the
> release of
> >> > > > > >> Apache Flink 1.19.0, which is the first release for the
> Apache Flink
> >> > > > > 1.19
> >> > > > > >> series.
> >> > > > > 
> >> > > > >  Apache Flink® is an open-source stream processing
> framework for
> >> > > > > >> distributed, high-performing, always-available, and accurate
> data
> >> > > > > streaming
> >> > > > > >> applications.
> >> > > > > 
> >> > > > >  The release is available for download at:
> >> > > > >  https://flink.apache.org/downloads.html
> >> > > > > 
> >> > > > >  Please check out the release blog post for an overview of
> the
> >> > > > > >> improvements for this release:
> >> > > > > 
> >> > > > > >>
> >> > > > >
> >> > > >
> https://flink.apache.org/2024/03/18/announcing-the-release-of-apache-flink-1.19/
> >> > > > > 
> >> > > > >  The full release notes are available in Jira:
> >> > > > > 
> >> > > > > >>
> >> > > > >
> >> > > >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353282
> >> > > > > 
> >> > > > >  We would like to thank all contributors of the Apache Flink
> >> > > > community
> >> > > > > >> who made this release possible!
> >> > > > > 
> >> > > > > 
> >> > > > >  Best,
> >> > > > >  Yun, Jing, Martijn and Lincoln
> >> > > > > >>
> >> > > > >
> >> > > > >
> >> > > >
> >> > >
> >> > >
> >> > > --
> >> > > Best,
> >> > > Hangxiang.
> >>
> >>
> >>
> >> --
> >>
> >> Best,
> >> Benchao Li
> >
> >
> >
> > --
> > Best regards,
> > Sergey
>
>
>
> --
> Best,
> Yanfei
>


Re: [ANNOUNCE] Donation Flink CDC into Apache Flink has Completed

2024-03-20 Thread Ron liu
Congratulations!

Best,
Ron

Jark Wu wrote on Thu, Mar 21, 2024 at 10:46:

> Congratulations and welcome!
>
> Best,
> Jark
>
> On Thu, 21 Mar 2024 at 10:35, Rui Fan <1996fan...@gmail.com> wrote:
>
> > Congratulations!
> >
> > Best,
> > Rui
> >
> > On Thu, Mar 21, 2024 at 10:25 AM Hang Ruan 
> wrote:
> >
> > > Congratulations!
> > >
> > > Best,
> > > Hang
> > >
> > > Lincoln Lee wrote on Thu, Mar 21, 2024 at 09:54:
> > >
> > >>
> > >> Congrats, thanks for the great work!
> > >>
> > >>
> > >> Best,
> > >> Lincoln Lee
> > >>
> > >>
> > >>> Peter Huang wrote on Wed, Mar 20, 2024 at 22:48:
> > >>
> > >>> Congratulations
> > >>>
> > >>>
> > >>> Best Regards
> > >>> Peter Huang
> > >>>
> > >>> On Wed, Mar 20, 2024 at 6:56 AM Huajie Wang 
> > wrote:
> > >>>
> > 
> >  Congratulations
> > 
> > 
> > 
> >  Best,
> >  Huajie Wang
> > 
> > 
> > 
> >  Leonard Xu wrote on Wed, Mar 20, 2024 at 21:36:
> > 
> > > Hi devs and users,
> > >
> > > We are thrilled to announce that the donation of Flink CDC as a
> > > sub-project of Apache Flink has completed. We invite you to explore
> > the new
> > > resources available:
> > >
> > > - GitHub Repository: https://github.com/apache/flink-cdc
> > > - Flink CDC Documentation:
> > > https://nightlies.apache.org/flink/flink-cdc-docs-stable
> > >
> > > After Flink community accepted this donation[1], we have completed
> > > software copyright signing, code repo migration, code cleanup,
> > website
> > > migration, CI migration and github issues migration etc.
> > > Here I am particularly grateful to Hang Ruan, Zhongqaing Gong,
> > > Qingsheng Ren, Jiabao Sun, LvYanquan, loserwang1024 and other
> > contributors
> > > for their contributions and help during this process!
> > >
> > >
> > > For all previous contributors: The contribution process has
> slightly
> > > changed to align with the main Flink project. To report bugs or
> > suggest new
> > > features, please open tickets
> > > Apache Jira (https://issues.apache.org/jira).  Note that we will
> no
> > > longer accept GitHub issues for these purposes.
> > >
> > >
> > > Welcome to explore the new repository and documentation. Your
> > feedback
> > > and contributions are invaluable as we continue to improve Flink
> CDC.
> > >
> > > Thanks everyone for your support and happy exploring Flink CDC!
> > >
> > > Best,
> > > Leonard
> > > [1]
> https://lists.apache.org/thread/cw29fhsp99243yfo95xrkw82s5s418ob
> > >
> > >
> >
>


Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-03-22 Thread Ron liu
should consider the lineage in the catalog or somewhere else.
> > > >>>
> > > >>>
> > > >>> [1]
> > > >>>
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-134%3A+Batch+execution+for+the+DataStream+API
> > > >>> [2] https://docs.snowflake.com/en/user-guide/dynamic-tables-about
> > > >>> [3]
> https://docs.snowflake.com/en/user-guide/dynamic-tables-refresh
> > > >>>
> > > >>> Best
> > > >>> Yun Tang
> > > >>>
> > > >>>
> > > >>> 
> > > >>> From: Lincoln Lee 
> > > >>> Sent: Thursday, March 14, 2024 14:35
> > > >>> To: dev@flink.apache.org 
> > > >>> Subject: Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for
> > > >>> Simplifying Data Pipelines
> > > >>>
> > > >>> Hi Jing,
> > > >>>
> > > >>> Thanks for your attention to this flip! I'll try to answer the
> > > following
> > > >>> questions.
> > > >>>
> > > >>>> 1. How to define query of dynamic table?
> > > >>>> Use flink sql or introducing new syntax?
> > > >>>> If use flink sql, how to handle the difference in SQL between
> > > >>>> streaming and batch processing?
> > > >>>> For example, a query including window aggregate based on
> > > >>>> processing time? or a query including global order by?
> > > >>>
> > > >>> Similar to `CREATE TABLE AS query`, here the `query` also uses
> > > >>> Flink SQL and doesn't introduce a totally new syntax.
> > > >>> We will not change the status with respect to the difference in
> > > >>> functionality of Flink SQL itself on streaming and batch. For
> > > >>> example, the proctime window agg on streaming and the global sort
> > > >>> on batch that you mentioned do not, in fact, work properly in the
> > > >>> other mode, so when the user modifies the refresh mode of a
> > > >>> dynamic table to one that is not supported, we will throw an
> > > >>> exception.
> > > >>>
> > > >>>> 2. Whether modify the query of dynamic table is allowed?
> > > >>>> Or we could only refresh a dynamic table based on the initial
> > > >>>> query?
> > > >>>
> > > >>> Yes, in the current design, the query definition of the dynamic
> > > >>> table is not allowed to be modified, and you can only refresh the
> > > >>> data based on the initial definition.
> > > >>>
> > > >>>> 3. How to use dynamic table?
> > > >>>> The dynamic table seems to be similar to the materialized view.
> > > >>>> Will we do something like materialized view rewriting during the
> > > >>>> optimization?
> > > >>>
> > > >>> It's true that dynamic table and materialized view are similar in
> > > >>> some ways, but as Ron explains there are differences. In terms of
> > > >>> optimization, automated materialization discovery similar to that
> > > >>> supported by Calcite is also a potential possibility, perhaps with
> > > >>> the addition of automated rewriting in the future.
> > > >>>
> > > >>>
> > > >>>
> > > >>> Best,
> > > >>> Lincoln Lee
> > > >>>
> > > >>>
> > > >>> Ron liu wrote on Thu, Mar 14, 2024 at 14:01:
> > > >>>
> > > >>>> Hi, Timo
> > > >>>>
> 

Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-03-24 Thread Ron liu
Hi, Ahmed

Thanks for your feedback.

Regarding your question:

> I want to iterate on Timo's comments regarding the confusion between
"Dynamic Table" and current Flink "Table". Should the refactoring of the
system happen in 2.0, should we rename it in this Flip ( as the suggestions
in the thread ) and address the holistic changes in a separate Flip for 2.0?

Lincoln proposed a new concept in reply to Timo: Derived Table, which is a
combination of Dynamic Table + Continuous Query. Using Derived Table would
not conflict with existing concepts. What do you think?

> I feel confused with how it is further with other components, the
examples provided feel like a standalone ETL job, could you provide in the
FLIP an example where the table is further used in subsequent queries
(specially in batch mode).

Thanks for your suggestion. I added a section on how to use a Dynamic Table
to the FLIP's user story: a Dynamic Table can be referenced by downstream
Dynamic Tables and can also serve OLAP queries.

Best,
Ron

Ron liu wrote on Sat, Mar 23, 2024 at 10:35:

> Hi, Feng
>
> Thanks for your feedback.
>
> > Although currently we restrict users from modifying the query, I wonder
> if
> we can provide a better way to help users rebuild it without affecting
> downstream OLAP queries.
>
> Considering the problem of data consistency, in the first step we strictly
> limit the semantics and do not support modifying the query. This is a good
> point; one of my ideas is to introduce a syntax similar to SWAP [1], which
> supports exchanging two Dynamic Tables.
>
> > From the documentation, the definitions SQL and job information are
> stored in the Catalog. Does this mean that if a system needs to adapt to
> Dynamic Tables, it also needs to store Flink's job information in the
> corresponding system?
> For example, does MySQL's Catalog need to store flink job information as
> well?
>
> Yes, currently we need to rely on Catalog to store refresh job information.
>
> > Users still need to consider how much memory is being used, how large
> the concurrency is, which type of state backend is being used, and may need
> to set TTL expiration.
>
> Similar to the current practice, job parameters can be set via the Flink
> conf or SET commands
>
> > When we submit a refresh command, can we help users detect if there are
> any
> running jobs and automatically stop them before executing the refresh
> command? Then wait for it to complete before restarting the background
> streaming job?
>
> Purely from a technical implementation point of view, your proposal is
> doable, but it would be more costly. Also, I think data consistency is the
> user's responsibility, just as it is for a Regular Table today; the
> behavior stays consistent, and no additional guarantees are made at the
> engine level.
>
> Best,
> Ron
>
>
> Ahmed Hamdy wrote on Fri, Mar 22, 2024 at 23:50:
>
>> Hi Ron,
>> Sorry for joining the discussion late, thanks for the effort.
>>
>> I think the base idea is great, however I have a couple of comments:
>> - I want to iterate on Timo's comments regarding the confusion between
>> "Dynamic Table" and current Flink "Table". Should the refactoring of the
>> system happen in 2.0, should we rename it in this Flip ( as the
>> suggestions
>> in the thread ) and address the holistic changes in a separate Flip for
>> 2.0?
>> - I feel confused with how it is further with other components, the
>> examples provided feel like a standalone ETL job, could you provide in the
>> FLIP an example where the table is further used in subsequent queries
>> (specially in batch mode).
>> - I really like the standard of keeping the unified batch and streaming
>> approach
>> Best Regards
>> Ahmed Hamdy
>>
>>
>> On Fri, 22 Mar 2024 at 12:07, Lincoln Lee  wrote:
>>
>> > Hi Timo,
>> >
>> > Thanks for your thoughtful inputs!
>> >
>> > Yes, expanding the MATERIALIZED VIEW(MV) could achieve the same
>> function,
>> > but our primary concern is that by using a view, we might limit future
>> > opportunities
>> > to optimize queries through automatic materialization rewriting [1],
>> > leveraging
>> > the support for MV by physical storage. This is because we would be
>> > breaking
>> > the intuitive semantics of a materialized view (a materialized view
>> > represents
>> > the result of a query) by allowing data modifications, thus losing the
>> > potential
>> > for such optimizations.
>> >
>> > With these considerati

Re: Support minibatch for TopNFunction

2024-03-25 Thread Ron liu
Hi, Roman

Thanks for your proposal, I intuitively feel that this optimization would
be very useful to reduce the amount of message amplification for TopN
operators. After briefly looking at your google docs, I have the following
questions:

1. Could you describe in detail how the record amplification of the TopN
operator is solved, similar to Minibatch Join [1]? From the figure in the
current Motivation section, I cannot understand how it works.
2. TopN currently has multiple implementations, including
AppendOnlyFirstNFunction, AppendOnlyTopNFunction, FastTop1Function,
RetractableTopNFunction, and UpdatableTopNFunction. Could you elaborate on
which of these the minibatch optimization applies to?
3. Could you provide the PoC code?
4. Finally, we need a formal FLIP document on the wiki [2].

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/tuning/#minibatch-regular-joins
[2]
https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals

Best,
Ron

Roman Boyko wrote on Sun, Mar 24, 2024 at 01:14:

> Hi Flink Community,
>
> I tried to describe my idea about minibatch for TopNFunction in this doc -
>
> https://docs.google.com/document/d/1YPHwxKfiGSUOUOa6bc68fIJHO_UojTwZEC29VVEa-Uk/edit?usp=sharing
>
> Looking forward to your feedback, thank you
>
> On Tue, 19 Mar 2024 at 12:24, Roman Boyko  wrote:
>
> > Hello Flink Community,
> >
> > The same problem with record amplification as described in FLIP-415:
> Introduce
> > a new join operator to support minibatch[1] exists for most of
> > implementations of AbstractTopNFunction. Especially when the rank is
> > provided to output. For example, when calculating Top100 with rank
> output,
> > every input record might produce 100 -U records and 100 +U records.
> >
> > According to my POC (which is similar to FLIP-415) the record
> > amplification could be significantly reduced by using input or output
> > buffer.
> >
> > What do you think if we implement such optimization for TopNFunctions?
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-415
> > %3A+Introduce+a+new+join+operator+to+support+minibatch
> >
> > --
> > Best regards,
> > Roman Boyko
> > e.: ro.v.bo...@gmail.com
> > m.: +79059592443
> > telegram: @rboyko
> >
>
>
> --
> Best regards,
> Roman Boyko
> e.: ro.v.bo...@gmail.com
> m.: +79059592443
> telegram: @rboyko
>
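
[Editor's note] The amplification Roman describes can be put into a back-of-envelope model. The sketch below is illustrative Java, not Flink code: it only counts messages, assuming each accepted input makes the naive operator retract and re-emit all N ranked rows (as in the Top100 example above), while a mini-batch buffer folds updates and emits the N rows once per flushed batch.

```java
class TopNAmplification {
    // Naive rank output: each accepted input record retracts and re-emits
    // all n ranked rows, i.e. n "-U" plus n "+U" messages per input.
    static int naiveMessages(int inputs, int n) {
        return inputs * 2 * n;
    }

    // With a mini-batch buffer of size b, intermediate rank shifts are
    // folded and the n ranked rows are emitted once per flushed batch.
    static int minibatchMessages(int inputs, int n, int b) {
        int batches = (inputs + b - 1) / b; // ceil(inputs / b)
        return batches * 2 * n;
    }

    public static void main(String[] args) {
        // 1000 inputs into a Top100 with rank output:
        System.out.println(naiveMessages(1000, 100));         // 200000
        // Same inputs with a buffer of 50 records per batch:
        System.out.println(minibatchMessages(1000, 100, 50)); // 4000
    }
}
```

The real savings depend on how many ranks each input actually shifts, but the model shows why buffering is attractive for large N.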


Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-03-25 Thread Ron liu
Hi, Timo

Thanks for your quick response, and your suggestion.

Yes, this discussion has turned into confirming whether it's a special
table or a special MV.

1. The key problem with MVs is that they don't support modification, so I
prefer it to be a special table. Although the periodic refresh behavior is
more characteristic of an MV, since we are already a special table,
supporting periodic refresh behavior is quite natural, similar to Snowflake
dynamic tables.

2. Regarding the keyword UPDATING: since the current Regular Table is
already a Dynamic Table, which implies support for updates through a
Continuous Query, I think adding the keyword UPDATING is redundant. In
addition, UPDATING does not reflect the Continuous Query part and cannot
express our goal of simplifying the data pipeline through Dynamic Table +
Continuous Query.

3. From the perspective of the SQL standard definition, I can understand
your concerns about Derived Table, but is it possible to make a slight
adjustment to meet our needs? Additionally, as Lincoln mentioned, the
Google Looker platform has introduced Persistent Derived Table, and there
are precedents in the industry; could Derived Table be a candidate?

Of course, we look forward to any better suggestions.

Best,
Ron



Timo Walther wrote on Mon, Mar 25, 2024 at 18:49:

> After thinking about this more, this discussion boils down to whether
> this is a special table or a special materialized view. In both cases,
> we would need to add a special keyword:
>
> Either
>
> CREATE UPDATING TABLE
>
> or
>
> CREATE UPDATING MATERIALIZED VIEW
>
> I still feel that the periodic refreshing behavior is closer to a MV. If
> we add a special keyword to MV, the optimizer would know that the data
> cannot be used for query optimizations.
>
> I will ask more people for their opinion.
>
> Regards,
> Timo
>
>
> On 25.03.24 10:45, Timo Walther wrote:
> > Hi Ron and Lincoln,
> >
> > thanks for the quick response and the very insightful discussion.
> >
> >  > we might limit future opportunities to optimize queries
> >  > through automatic materialization rewriting by allowing data
> >  > modifications, thus losing the potential for such optimizations.
> >
> > This argument makes a lot of sense to me. Due to the updates, the system
> > is not in full control of the persisted data. However, the system is
> > still in full control of the job that powers the refresh. So if the
> > system manages all updating pipelines, it could still leverage automatic
> > materialization rewriting but without leveraging the data at rest (only
> > the data in flight).
> >
> >  > we are considering another candidate, Derived Table, the term 'derive'
> >  > suggests a query, and 'table' retains modifiability. This approach
> >  > would not disrupt our current concept of a dynamic table
> >
> > I did some research on this term. The SQL standard uses the term
> > "derived table" extensively (defined in section 4.17.3). Thus, a lot of
> > vendors adopt this for simply referring to a table within a subclause:
> >
> > https://dev.mysql.com/doc/refman/8.0/en/derived-tables.html
> >
> >
> https://infocenter.sybase.com/help/topic/com.sybase.infocenter.dc32300.1600/doc/html/san1390612291252.html
> >
> >
> https://www.c-sharpcorner.com/article/derived-tables-vs-common-table-expressions/
> >
> >
> https://stackoverflow.com/questions/26529804/what-are-the-derived-tables-in-my-explain-statement
> >
> > https://www.sqlservercentral.com/articles/sql-derived-tables
> >
> > Esp. the latter example is interesting, SQL Server allows things like
> > this on derived tables:
> >
> > UPDATE T SET Name='Timo' FROM (SELECT * FROM Product) AS T
> >
> > SELECT * FROM Product;
> >
> > Btw also Snowflake's dynamic table state:
> >
> >  > Because the content of a dynamic table is fully determined
> >  > by the given query, the content cannot be changed by using DML.
> >  > You don’t insert, update, or delete the rows in a dynamic table.
> >
> > So a new term makes a lot of sense.
> >
> > How about using `UPDATING`?
> >
> > CREATE UPDATING TABLE
> >
> > This reflects that modifications can be made and from an
> > English-language perspective you can PAUSE or RESUME the UPDATING.
> > Thus, a user can define UPDATING interval and mode?
> >
> > Looking forward to your thoughts.
> >
> > Regards,
> > Timo
> >
> >
> > On 25.03.24 07:09, Ron liu wrote:
> >> Hi, Ahmed
> >>
> >> Thanks for your feedback.
> >&

Re: Re: [ANNOUNCE] Apache Paimon is graduated to Top Level Project

2024-04-01 Thread Ron liu
Congratulations!

Best,
Ron

Jeyhun Karimov  于2024年4月1日周一 18:12写道:

> Congratulations!
>
> Regards,
> Jeyhun
>
> On Mon, Apr 1, 2024 at 7:43 AM Guowei Ma  wrote:
>
> > Congratulations!
> > Best,
> > Guowei
> >
> >
> > On Mon, Apr 1, 2024 at 11:15 AM Feng Jin  wrote:
> >
> > > Congratulations!
> > >
> > > Best,
> > > Feng Jin
> > >
> > > On Mon, Apr 1, 2024 at 10:51 AM weijie guo 
> > > wrote:
> > >
> > >> Congratulations!
> > >>
> > >> Best regards,
> > >>
> > >> Weijie
> > >>
> > >>
> > >> Hang Ruan  于2024年4月1日周一 09:49写道:
> > >>
> > >> > Congratulations!
> > >> >
> > >> > Best,
> > >> > Hang
> > >> >
> > >> > Lincoln Lee  于2024年3月31日周日 00:10写道:
> > >> >
> > >> > > Congratulations!
> > >> > >
> > >> > > Best,
> > >> > > Lincoln Lee
> > >> > >
> > >> > >
> > >> > > Jark Wu  于2024年3月30日周六 22:13写道:
> > >> > >
> > >> > > > Congratulations!
> > >> > > >
> > >> > > > Best,
> > >> > > > Jark
> > >> > > >
> > >> > > > On Fri, 29 Mar 2024 at 12:08, Yun Tang 
> wrote:
> > >> > > >
> > >> > > > > Congratulations to all Paimon guys!
> > >> > > > >
> > >> > > > > Glad to see a Flink sub-project has been graduated to an
> Apache
> > >> > > top-level
> > >> > > > > project.
> > >> > > > >
> > >> > > > > Best
> > >> > > > > Yun Tang
> > >> > > > >
> > >> > > > > 
> > >> > > > > From: Hangxiang Yu 
> > >> > > > > Sent: Friday, March 29, 2024 10:32
> > >> > > > > To: dev@flink.apache.org 
> > >> > > > > Subject: Re: Re: [ANNOUNCE] Apache Paimon is graduated to Top
> > >> Level
> > >> > > > Project
> > >> > > > >
> > >> > > > > Congratulations!
> > >> > > > >
> > >> > > > > On Fri, Mar 29, 2024 at 10:27 AM Benchao Li <
> > libenc...@apache.org
> > >> >
> > >> > > > wrote:
> > >> > > > >
> > >> > > > > > Congratulations!
> > >> > > > > >
> > >> > > > > > Zakelly Lan  于2024年3月29日周五 10:25写道:
> > >> > > > > > >
> > >> > > > > > > Congratulations!
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > Best,
> > >> > > > > > > Zakelly
> > >> > > > > > >
> > >> > > > > > > On Thu, Mar 28, 2024 at 10:13 PM Jing Ge
> > >> > >  > >> > > > >
> > >> > > > > > wrote:
> > >> > > > > > >
> > >> > > > > > > > Congrats!
> > >> > > > > > > >
> > >> > > > > > > > Best regards,
> > >> > > > > > > > Jing
> > >> > > > > > > >
> > >> > > > > > > > On Thu, Mar 28, 2024 at 1:27 PM Feifan Wang <
> > >> > zoltar9...@163.com>
> > >> > > > > > wrote:
> > >> > > > > > > >
> > >> > > > > > > > > Congratulations!——
> > >> > > > > > > > >
> > >> > > > > > > > > Best regards,
> > >> > > > > > > > >
> > >> > > > > > > > > Feifan Wang
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > > At 2024-03-28 20:02:43, "Yanfei Lei" <
> > fredia...@gmail.com
> > >> >
> > >> > > > wrote:
> > >> > > > > > > > > >Congratulations!
> > >> > > > > > > > > >
> > >> > > > > > > > > >Best,
> > >> > > > > > > > > >Yanfei
> > >> > > > > > > > > >
> > >> > > > > > > > > >Zhanghao Chen 
> > 于2024年3月28日周四
> > >> > > > 19:59写道:
> > >> > > > > > > > > >>
> > >> > > > > > > > > >> Congratulations!
> > >> > > > > > > > > >>
> > >> > > > > > > > > >> Best,
> > >> > > > > > > > > >> Zhanghao Chen
> > >> > > > > > > > > >> 
> > >> > > > > > > > > >> From: Yu Li 
> > >> > > > > > > > > >> Sent: Thursday, March 28, 2024 15:55
> > >> > > > > > > > > >> To: d...@paimon.apache.org 
> > >> > > > > > > > > >> Cc: dev ; user <
> > >> > u...@flink.apache.org
> > >> > > >
> > >> > > > > > > > > >> Subject: Re: [ANNOUNCE] Apache Paimon is graduated
> to
> > >> Top
> > >> > > > Level
> > >> > > > > > > > Project
> > >> > > > > > > > > >>
> > >> > > > > > > > > >> CC the Flink user and dev mailing list.
> > >> > > > > > > > > >>
> > >> > > > > > > > > >> Paimon originated within the Flink community,
> > initially
> > >> > > known
> > >> > > > as
> > >> > > > > > Flink
> > >> > > > > > > > > >> Table Store, and all our incubating mentors are
> > >> members of
> > >> > > the
> > >> > > > > > Flink
> > >> > > > > > > > > >> Project Management Committee. I am confident that
> the
> > >> > bonds
> > >> > > of
> > >> > > > > > > > > >> enduring friendship and close collaboration will
> > >> continue
> > >> > to
> > >> > > > > > unite the
> > >> > > > > > > > > >> two communities.
> > >> > > > > > > > > >>
> > >> > > > > > > > > >> And congratulations all!
> > >> > > > > > > > > >>
> > >> > > > > > > > > >> Best Regards,
> > >> > > > > > > > > >> Yu
> > >> > > > > > > > > >>
> > >> > > > > > > > > >> On Wed, 27 Mar 2024 at 20:35, Guojun Li <
> > >> > > > > gjli.schna...@gmail.com>
> > >> > > > > > > > > wrote:
> > >> > > > > > > > > >> >
> > >> > > > > > > > > >> > Congratulations!
> > >> > > > > > > > > >> >
> > >> > > > > > > > > >> > Best,
> > >> > > > > > > > > >> > Guojun
> > >> > > > > > > > > >> >
> > >> > > > > > > > > >> > On Wed, Mar 27, 2024 at 5:24 PM wulin <
> > >> > > ouyangwu...@163.com>
> > >> > > > > > wrote:
> > >> > > > > > > > > >> >
> > >> > > > > > > 

Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-04-01 Thread Ron liu
Hi, Venkata krishnan

Thank you for your involvement and suggestions; we hope the design
goals of this FLIP will be helpful to your business.

>>> 1. In the proposed FLIP, given the example for the dynamic table, do the
data sources always come from a single lake storage such as Paimon or does
the same proposal solve for 2 disparate storage systems like Kafka and
Iceberg where Kafka events are ETLed to Iceberg similar to Paimon?
Basically the lambda architecture that is mentioned in the FLIP as well.
I'm wondering if it is possible to switch b/w sources based on the
execution mode, for eg: if it is backfill operation, switch to a data lake
storage system like Iceberg, otherwise an event streaming system like Kafka.

Dynamic table is a design abstraction at the framework level and is not
tied to the physical implementation of the connector. If a connector
supports a combination of Kafka and lake storage, this works fine.

>>> 2. What happens in the context of a bootstrap (batch) + nearline update
(streaming) case that are stateful applications? What I mean by that is,
will the state from the batch application be transferred to the nearline
application after the bootstrap execution is complete?

I think this is another orthogonal thing, something that FLIP-327 tries to
address, not directly related to Dynamic Table.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-327%3A+Support+switching+from+batch+to+stream+mode+to+improve+throughput+when+processing+backlog+data

Best,
Ron

Venkatakrishnan Sowrirajan  于2024年3月30日周六 07:06写道:

> Ron and Lincoln,
>
> Great proposal and interesting discussion for adding support for dynamic
> tables within Flink.
>
> At LinkedIn, we are also trying to solve compute/storage convergence for
> similar problems discussed as part of this FLIP, specifically periodic
> backfill, bootstrap + nearline update use cases using single implementation
> of business logic (single script).
>
> Few clarifying questions:
>
> 1. In the proposed FLIP, given the example for the dynamic table, do the
> data sources always come from a single lake storage such as Paimon or does
> the same proposal solve for 2 disparate storage systems like Kafka and
> Iceberg where Kafka events are ETLed to Iceberg similar to Paimon?
> Basically the lambda architecture that is mentioned in the FLIP as well.
> I'm wondering if it is possible to switch b/w sources based on the
> execution mode, for eg: if it is backfill operation, switch to a data lake
> storage system like Iceberg, otherwise an event streaming system like
> Kafka.
> 2. What happens in the context of a bootstrap (batch) + nearline update
> (streaming) case that are stateful applications? What I mean by that is,
> will the state from the batch application be transferred to the nearline
> application after the bootstrap execution is complete?
>
> Regards
> Venkata krishnan
>
>
> On Mon, Mar 25, 2024 at 8:03 PM Ron liu  wrote:
>
> > Hi, Timo
> >
> > Thanks for your quick response, and your suggestion.
> >
> > Yes, this discussion has turned into confirming whether it's a special
> > table or a special MV.
> >
> > 1. The key problem with MVs is that they don't support modification, so I
> > prefer it to be a special table. Although the periodic refresh behavior
> is
> > more characteristic of an MV, since we are already a special table,
> > supporting periodic refresh behavior is quite natural, similar to
> Snowflake
> > dynamic tables.
> >
> > 2. Regarding the keyword UPDATING, since the current Regular Table is a
> > Dynamic Table, which implies support for updating through Continuous
> Query,
> > I think it is redundant to add the keyword UPDATING. In addition,
> UPDATING
> > can not reflect the Continuous Query part, can not express the purpose we
> > want to simplify the data pipeline through Dynamic Table + Continuous
> > Query.
> >
> > 3. From the perspective of the SQL standard definition, I can understand
> > your concerns about Derived Table, but is it possible to make a slight
> > adjustment to meet our needs? Additionally, as Lincoln mentioned, the
> > Google Looker platform has introduced Persistent Derived Table, and there
> > are precedents in the industry; could Derived Table be a candidate?
> >
> > Of course, look forward to your better suggestions.
> >
> > Best,
> > Ron
> >
> >
> >
> > Timo Walther  于2024年3月25日周一 18:49写道:
> >
> > > After thinking about this more, this discussion boils down to whether
> > > this is a special table or a special materialized view. In both cases,
> > > we would need to add a special keyword:
> > >
> > &

Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-04-02 Thread Ron liu
Hi, dev

After offline discussion with Becket Qin, Lincoln Lee and Jark Wu,  we have
improved some parts of the FLIP.

1. Add a Full Refresh Mode section to clarify the semantics of full refresh
mode.
2. Add a Future Improvement section explaining why the query statement does
not support references to temporary views, and possible solutions.
3. The Future Improvement section explains a possible future solution for
dynamic tables to support modifying the query statement, to meet the common
field-level schema evolution requirements of the lakehouse.
4. The Refresh section emphasizes that the Refresh command and the
background refresh job can be executed in parallel, with no restrictions at
the framework level.
5. Convert RefreshHandler into a plug-in interface to support various
workflow schedulers.
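
To make item 4 concrete, here is a hypothetical sketch of how a manual
refresh could coexist with the background refresh job. The DYNAMIC TABLE
keyword and the REFRESH clause below are placeholders, not syntax fixed by
the FLIP:

```sql
-- Hypothetical syntax: creating the table also starts a background refresh
-- job that honors the declared freshness.
CREATE DYNAMIC TABLE dwd_orders
FRESHNESS = INTERVAL '30' MINUTE
AS SELECT * FROM ods_orders;

-- Per item 4, a manual refresh (e.g. backfilling a historical partition) may
-- run in parallel with the background job; the framework imposes no
-- restriction.
ALTER DYNAMIC TABLE dwd_orders REFRESH PARTITION (ds = '2024-04-01');
```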

Best,
Ron

Ron liu  于2024年4月2日周二 10:28写道:

> Hi, Venkata krishnan
>
> Thank you for your involvement and suggestions, and hope that the design
> goals of this FLIP will be helpful to your business.
>
> >>> 1. In the proposed FLIP, given the example for the dynamic table, do
> the
> data sources always come from a single lake storage such as Paimon or does
> the same proposal solve for 2 disparate storage systems like Kafka and
> Iceberg where Kafka events are ETLed to Iceberg similar to Paimon?
> Basically the lambda architecture that is mentioned in the FLIP as well.
> I'm wondering if it is possible to switch b/w sources based on the
> execution mode, for eg: if it is backfill operation, switch to a data lake
> storage system like Iceberg, otherwise an event streaming system like
> Kafka.
>
> Dynamic table is a design abstraction at the framework level and is not
> tied to the physical implementation of the connector. If a connector
> supports a combination of Kafka and lake storage, this works fine.
>
> >>> 2. What happens in the context of a bootstrap (batch) + nearline update
> (streaming) case that are stateful applications? What I mean by that is,
> will the state from the batch application be transferred to the nearline
> application after the bootstrap execution is complete?
>
> I think this is another orthogonal thing, something that FLIP-327 tries to
> address, not directly related to Dynamic Table.
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-327%3A+Support+switching+from+batch+to+stream+mode+to+improve+throughput+when+processing+backlog+data
>
> Best,
> Ron
>
> Venkatakrishnan Sowrirajan  于2024年3月30日周六 07:06写道:
>
>> Ron and Lincoln,
>>
>> Great proposal and interesting discussion for adding support for dynamic
>> tables within Flink.
>>
>> At LinkedIn, we are also trying to solve compute/storage convergence for
>> similar problems discussed as part of this FLIP, specifically periodic
>> backfill, bootstrap + nearline update use cases using single
>> implementation
>> of business logic (single script).
>>
>> Few clarifying questions:
>>
>> 1. In the proposed FLIP, given the example for the dynamic table, do the
>> data sources always come from a single lake storage such as Paimon or does
>> the same proposal solve for 2 disparate storage systems like Kafka and
>> Iceberg where Kafka events are ETLed to Iceberg similar to Paimon?
>> Basically the lambda architecture that is mentioned in the FLIP as well.
>> I'm wondering if it is possible to switch b/w sources based on the
>> execution mode, for eg: if it is backfill operation, switch to a data lake
>> storage system like Iceberg, otherwise an event streaming system like
>> Kafka.
>> 2. What happens in the context of a bootstrap (batch) + nearline update
>> (streaming) case that are stateful applications? What I mean by that is,
>> will the state from the batch application be transferred to the nearline
>> application after the bootstrap execution is complete?
>>
>> Regards
>> Venkata krishnan
>>
>>
>> On Mon, Mar 25, 2024 at 8:03 PM Ron liu  wrote:
>>
>> > Hi, Timo
>> >
>> > Thanks for your quick response, and your suggestion.
>> >
>> > Yes, this discussion has turned into confirming whether it's a special
>> > table or a special MV.
>> >
>> > 1. The key problem with MVs is that they don't support modification, so
>> I
>> > prefer it to be a special table. Although the periodic refresh behavior
>> is
>> > more characteristic of an MV, since we are already a special table,
>> > supporting periodic refresh behavior is quite natural, similar to
>> Snowflake
>> > dynamic tables.
>> >
>> > 2. Regarding the keyword UPDATING, since the current Regular Table is a

Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-04-07 Thread Ron liu
> > Databricks has CREATE STREAMING TABLE [1] which relates with this
> > proposal.
> >
> > I do have concerns about using CREATE DYNAMIC TABLE, specifically about
> > confusing the users who are familiar with Snowflake's approach where you
> > can't change the content via DML statements, while that is something that
> > would work in this proposal. Naming is hard of course, but I would
> probably
> > prefer something like CREATE CONTINUOUS TABLE, CREATE REFRESH TABLE or
> > CREATE LIVE TABLE.
> >
> > Best regards,
> >
> > Martijn
> >
> > [1]
> >
> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-streaming-table.html
> >
> > On Wed, Apr 3, 2024 at 5:19 AM Ron liu  wrote:
> >
> > > Hi, dev
> > >
> > > After offline discussion with Becket Qin, Lincoln Lee and Jark Wu, we
> have
> > > improved some parts of the FLIP.
> > >
> > > 1. Add Full Refresh Mode section to clarify the semantics of full
> refresh
> > > mode.
> > > 2. Add Future Improvement section explaining why query statement does
> not
> > > support references to temporary view and possible solutions.
> > > 3. The Future Improvement section explains a possible future solution
> for
> > > dynamic table to support the modification of query statements to meet
> the
> > > common field-level schema evolution requirements of the lakehouse.
> > > 4. The Refresh section emphasizes that the Refresh command and the
> > > background refresh job can be executed in parallel, with no
> restrictions at
> > > the framework level.
> > > 5. Convert RefreshHandler into a plug-in interface to support various
> > > workflow schedulers.
> > >
> > > Best,
> > > Ron
> > >
> > > Ron liu  于2024年4月2日周二 10:28写道:
> > >
> > > > > Hi, Venkata krishnan
> > > > >
> > > > > Thank you for your involvement and suggestions, and hope that the
> design
> > > > > goals of this FLIP will be helpful to your business.
> > > > >
> > > > > > > > >>> 1. In the proposed FLIP, given the example for the
> dynamic table, do
> > > > > the
> > > > > data sources always come from a single lake storage such as Paimon
> or
> > > does
> > > > > the same proposal solve for 2 disparate storage systems like Kafka
> and
> > > > > Iceberg where Kafka events are ETLed to Iceberg similar to Paimon?
> > > > > Basically the lambda architecture that is mentioned in the FLIP as
> well.
> > > > > I'm wondering if it is possible to switch b/w sources based on the
> > > > > execution mode, for eg: if it is backfill operation, switch to a
> data
> > > lake
> > > > > storage system like Iceberg, otherwise an event streaming system
> like
> > > > > Kafka.
> > > > >
> > > > > Dynamic table is a design abstraction at the framework level and
> is not
> > > > > tied to the physical implementation of the connector. If a
> connector
> > > > > supports a combination of Kafka and lake storage, this works fine.
> > > > >
> > > > > > > > >>> 2. What happens in the context of a bootstrap (batch) +
> nearline
> > > update
> > > > > (streaming) case that are stateful applications? What I mean by
> that is,
> > > > > will the state from the batch application be transferred to the
> nearline
> > > > > application after the bootstrap execution is complete?
> > > > >
> > > > > I think this is another orthogonal thing, something that FLIP-327
> tries
> > > to
> > > > > address, not directly related to Dynamic Table.
> > > > >
> > > > > [1]
> > > > >
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-327%3A+Support+switching+from+batch+to+stream+mode+to+improve+throughput+when+processing+backlog+data
> > > > >
> > > > > Best,
> > > > > Ron
> > > > >
> > > > > Venkatakrishnan Sowrirajan  于2024年3月30日周六
> 07:06写道:
> > > > >
> > > > > >> Ron and Lincoln,
> > > > > >>
> > > > > >> Great proposal and interesting discussion for adding support
> for dynamic
> > > > > >> tables within Flink.
> > > > > >>
> &

Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-04-07 Thread Ron liu
Hi, Dev

This is a summary letter. After several rounds of discussion, there is a
strong consensus about the FLIP proposal and the issues it aims to address.
The current point of disagreement is the naming of the new concept. I have
summarized the candidates as follows:

1. Derived Table (Inspired by Google Looker)
    - Pros: Google Looker has introduced this concept, which is designed
for building Looker's automated modeling, aligning with our purpose for the
stream-batch automatic pipeline.

    - Cons: The SQL standard uses the term "derived table" extensively;
vendors adopt it simply to refer to a table within a subclause.

2. Materialized Table: It means materializing the query result into a table,
similar to Db2 MQT (Materialized Query Tables). In addition, Snowflake
Dynamic Table's predecessor was also called Materialized Table.

3. Updating Table (From Timo)

4. Updating Materialized View (From Timo)

5. Refresh/Live Table (From Martijn)

As Martijn said, naming is a headache; looking forward to more valuable
input from everyone.

[1]
https://cloud.google.com/looker/docs/derived-tables#persistent_derived_tables
[2] https://www.ibm.com/docs/en/db2/11.5?topic=tables-materialized-query
[3]
https://community.denodo.com/docs/html/browse/6.0/vdp/vql/materialized_tables/creating_materialized_tables/creating_materialized_tables
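
For readers unfamiliar with the Db2 MQT analogy in candidate 2, a sketch of
its DDL (from memory; see the linked Db2 documentation [2] for the
authoritative syntax):

```sql
-- A Db2 materialized query table: the query result is materialized into a
-- table and refreshed on demand.
CREATE TABLE region_sales AS
  (SELECT region, SUM(amount) AS total_amount
   FROM sales
   GROUP BY region)
  DATA INITIALLY DEFERRED
  REFRESH DEFERRED
  MAINTAINED BY SYSTEM;

-- Populate/refresh the MQT on demand.
REFRESH TABLE region_sales;
```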

Best,
Ron

Ron liu  于2024年4月7日周日 15:55写道:

> Hi, Lorenzo
>
> Thank you for your insightful input.
>
> >>> I think the 2 above twisted the materialized view concept to more than
> just an optimization for accessing pre-computed aggregates/filters.
> I think that concept (at least in my mind) is now adherent to the
> semantics of the words themselves ("materialized" and "view") than on its
> implementations in DBMs, as just a view on raw data that, hopefully, is
> constantly updated with fresh results.
> That's why I understand Timo's et al. objections.
>
> Your understanding of Materialized Views is correct. However, in our
> scenario, an important feature is the support for Update & Delete
> operations, which the current Materialized Views cannot fulfill. As we
> discussed with Timo before, if Materialized Views needs to support data
> modifications, it would require an extension of new keywords, such as
> CREATING xxx (UPDATING) MATERIALIZED VIEW.
>
> >>> Still, I don't understand why we need another type of special table.
> Could you dive deep into the reasons why not simply adding the FRESHNESS
> parameter to standard tables?
>
> Firstly, I need to emphasize that we cannot achieve the design goal of
> FLIP through the CREATE TABLE syntax combined with a FRESHNESS parameter.
> The proposal of this FLIP is to use Dynamic Table + Continuous Query, and
> combine it with FRESHNESS to realize a streaming-batch unification.
> However, CREATE TABLE is merely a metadata operation and cannot
> automatically start a background refresh job. To achieve the design goal of
> FLIP with standard tables, it would require extending the CTAS[1] syntax to
> introduce the FRESHNESS keyword. We considered this design initially, but
> it has the following problems:
>
> 1. Using the FRESHNESS keyword to distinguish whether a table created
> through CTAS is a standard table or a "special" standard table with an
> ongoing background refresh job is very obscure for users.
> 2. It intrudes on the semantics of the CTAS syntax. Currently, tables
> created using CTAS only add table metadata to the Catalog and do not record
> attributes such as query. There are also no ongoing background refresh
> jobs, and the data writing operation happens only once at table creation.
> 3. For the framework, when performing a certain kind of ALTER TABLE
> operation, it is unclear how to distinguish the behavior of a table created
> with FRESHNESS from one created without it, which will also cause
> confusion.
>
> In terms of the design goal of combining Dynamic Table + Continuous Query,
> the FLIP proposal cannot be realized by only extending the current standard
> tables, so a new kind of dynamic table needs to be introduced as a
> first-level concept.
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/create/#as-select_statement
>
> Best,
> Ron
>
>  于2024年4月3日周三 22:25写道:
>
>> Hello everybody!
>> Thanks for the FLIP as it looks amazing (and I think the prove is this
>> deep discussion it is provoking :))
>>
>> I have a couple of comments to add to this:
>>
>> Even though I get the reason why you rejected MATERIALIZED VIEW, I still
>> like it a lot, and I would like to provide pointers on how the materialized
>> view concept twisted in last years:
>>
>> • Material

Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-04-09 Thread Ron liu
Hi, Timo

Thanks for your reply.

I agree with you that naming is sometimes harder. When no one has a clear
preference, voting on the name is a good solution, so I'll send a separate
email for the vote, clarify the voting rules, and then let everyone vote.

One other point to confirm: in your ranking there is an option for
Materialized View. Does it stand for the UPDATING Materialized View that
you mentioned earlier in the discussion? If we use Materialized View, I
think it needs to be extended.

Best,
Ron

Timo Walther  于2024年4月9日周二 17:20写道:

> Hi Ron,
>
> yes naming is hard. But it will have large impact on trainings,
> presentations, and the mental model of users. Maybe the easiest is to
> collect ranking by everyone with some short justification:
>
>
> My ranking (from good to not so good):
>
> 1. Refresh Table -> states what it does
> 2. Materialized Table -> similar to SQL materialized view but a table
> 3. Live Table -> nice buzzword, but maybe still too close to dynamic
> tables?
> 4. Materialized View -> a bit broader than standard but still very similar
> 5. Derived table -> taken by standard
>
> Regards,
> Timo
>
>
>
> On 07.04.24 11:34, Ron liu wrote:
> > Hi, Dev
> >
> > This is a summary letter. After several rounds of discussion, there is a
> > strong consensus about the FLIP proposal and the issues it aims to
> address.
> > The current point of disagreement is the naming of the new concept. I
> have
> > summarized the candidates as follows:
> >
> > 1. Derived Table (Inspired by Google Lookers)
> >  - Pros: Google Lookers has introduced this concept, which is
> designed
> > for building Looker's automated modeling, aligning with our purpose for
> the
> > stream-batch automatic pipeline.
> >
> >  - Cons: The SQL standard uses derived table term extensively,
> vendors
> > adopt this for simply referring to a table within a subclause.
> >
> > 2. Materialized Table: It means materialize the query result to table,
> > similar to Db2 MQT (Materialized Query Tables). In addition, Snowflake
> > Dynamic Table's predecessor is also called Materialized Table.
> >
> > 3. Updating Table (From Timo)
> >
> > 4. Updating Materialized View (From Timo)
> >
> > 5. Refresh/Live Table (From Martijn)
> >
> > As Martijn said, naming is a headache, looking forward to more valuable
> > input from everyone.
> >
> > [1]
> >
> https://cloud.google.com/looker/docs/derived-tables#persistent_derived_tables
> > [2] https://www.ibm.com/docs/en/db2/11.5?topic=tables-materialized-query
> > [3]
> >
> https://community.denodo.com/docs/html/browse/6.0/vdp/vql/materialized_tables/creating_materialized_tables/creating_materialized_tables
> >
> > Best,
> > Ron
> >
> > Ron liu  于2024年4月7日周日 15:55写道:
> >
> >> Hi, Lorenzo
> >>
> >> Thank you for your insightful input.
> >>
> >>>>> I think the 2 above twisted the materialized view concept to more
> than
> >> just an optimization for accessing pre-computed aggregates/filters.
> >> I think that concept (at least in my mind) is now adherent to the
> >> semantics of the words themselves ("materialized" and "view") than on
> its
> >> implementations in DBMs, as just a view on raw data that, hopefully, is
> >> constantly updated with fresh results.
> >> That's why I understand Timo's et al. objections.
> >>
> >> Your understanding of Materialized Views is correct. However, in our
> >> scenario, an important feature is the support for Update & Delete
> >> operations, which the current Materialized Views cannot fulfill. As we
> >> discussed with Timo before, if Materialized Views needs to support data
> >> modifications, it would require an extension of new keywords, such as
> >> CREATING xxx (UPDATING) MATERIALIZED VIEW.
> >>
> >>>>> Still, I don't understand why we need another type of special table.
> >> Could you dive deep into the reasons why not simply adding the FRESHNESS
> >> parameter to standard tables?
> >>
> >> Firstly, I need to emphasize that we cannot achieve the design goal of
> >> FLIP through the CREATE TABLE syntax combined with a FRESHNESS
> parameter.
> >> The proposal of this FLIP is to use Dynamic Table + Continuous Query,
> and
> >> combine it with FRESHNESS to realize a streaming-batch unification.
> >> However, CREATE TABLE is merely a metadata operation and cannot
> >> automatic

Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-04-09 Thread Ron liu
Hi, Dev

Sorry, my previous statement was not quite accurate. We will hold a vote
for the name within this thread.

Best,
Ron


Ron liu  于2024年4月9日周二 19:29写道:

> Hi, Timo
>
> Thanks for your reply.
>
> I agree with you that sometimes naming is more difficult. When no one has
> a clear preference, voting on the name is a good solution, so I'll send a
> separate email for the vote, clarify the rules for the vote, then let
> everyone vote.
>
> One other point to confirm, in your ranking there is an option for
> Materialized View, does it stand for the UPDATING Materialized View that
> you mentioned earlier in the discussion? If using Materialized View I think
> it is needed to extend it.
>
> Best,
> Ron
>
> Timo Walther  于2024年4月9日周二 17:20写道:
>
>> Hi Ron,
>>
>> yes naming is hard. But it will have large impact on trainings,
>> presentations, and the mental model of users. Maybe the easiest is to
>> collect ranking by everyone with some short justification:
>>
>>
>> My ranking (from good to not so good):
>>
>> 1. Refresh Table -> states what it does
>> 2. Materialized Table -> similar to SQL materialized view but a table
>> 3. Live Table -> nice buzzword, but maybe still too close to dynamic
>> tables?
>> 4. Materialized View -> a bit broader than standard but still very similar
>> 5. Derived table -> taken by standard
>>
>> Regards,
>> Timo
>>
>>
>>
>> On 07.04.24 11:34, Ron liu wrote:
>> > Hi, Dev
>> >
>> > This is a summary letter. After several rounds of discussion, there is a
>> > strong consensus about the FLIP proposal and the issues it aims to
>> address.
>> > The current point of disagreement is the naming of the new concept. I
>> have
>> > summarized the candidates as follows:
>> >
>> > 1. Derived Table (Inspired by Google Lookers)
>> >  - Pros: Google Lookers has introduced this concept, which is
>> designed
>> > for building Looker's automated modeling, aligning with our purpose for
>> the
>> > stream-batch automatic pipeline.
>> >
>> >  - Cons: The SQL standard uses derived table term extensively,
>> vendors
>> > adopt this for simply referring to a table within a subclause.
>> >
>> > 2. Materialized Table: It means materialize the query result to table,
>> > similar to Db2 MQT (Materialized Query Tables). In addition, Snowflake
>> > Dynamic Table's predecessor is also called Materialized Table.
>> >
>> > 3. Updating Table (From Timo)
>> >
>> > 4. Updating Materialized View (From Timo)
>> >
>> > 5. Refresh/Live Table (From Martijn)
>> >
>> > As Martijn said, naming is a headache, looking forward to more valuable
>> > input from everyone.
>> >
>> > [1]
>> >
>> https://cloud.google.com/looker/docs/derived-tables#persistent_derived_tables
>> > [2]
>> https://www.ibm.com/docs/en/db2/11.5?topic=tables-materialized-query
>> > [3]
>> >
>> https://community.denodo.com/docs/html/browse/6.0/vdp/vql/materialized_tables/creating_materialized_tables/creating_materialized_tables
>> >
>> > Best,
>> > Ron
>> >
>> > Ron liu  于2024年4月7日周日 15:55写道:
>> >
>> >> Hi, Lorenzo
>> >>
>> >> Thank you for your insightful input.
>> >>
>> >>>>> I think the 2 above twisted the materialized view concept to more
>> than
>> >> just an optimization for accessing pre-computed aggregates/filters.
>> >> I think that concept (at least in my mind) is now adherent to the
>> >> semantics of the words themselves ("materialized" and "view") than on
>> its
>> >> implementations in DBMs, as just a view on raw data that, hopefully, is
>> >> constantly updated with fresh results.
>> >> That's why I understand Timo's et al. objections.
>> >>
>> >> Your understanding of Materialized Views is correct. However, in our
>> >> scenario, an important feature is the support for Update & Delete
>> >> operations, which the current Materialized Views cannot fulfill. As we
>> >> discussed with Timo before, if Materialized View needs to support data
>> >> modifications, it would require extending the syntax with new keywords, such as
>> >> CREATING xxx (UPDATING) MATERIALIZED VIEW.
>> >>
>> >>>>> Still, I don't understand why we need anoth

Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-04-09 Thread Ron liu
Hi, Dev

After several rounds of discussion, there is currently no consensus on the
name of the new concept. Timo has proposed that we decide the name through
a vote. This is a good solution when there is no clear preference, so we
will adopt this approach.

Regarding the name of the new concept, there are currently five candidates:
1. Derived Table -> taken by SQL standard
2. Materialized Table -> similar to SQL materialized view but a table
3. Live Table -> similar to dynamic tables
4. Refresh Table -> states what it does
5. Materialized View -> needs to extend the standard to support modifying
data

For the above five candidates, everyone can give their rankings based on
their preferences. You can rank all five options or only some of them.
We will use a scoring rule, where *the first rank gets 5 points, the
second rank gets 4 points, the third rank gets 3 points, the fourth rank
gets 2 points, and the fifth rank gets 1 point*.
After the voting closes, I will score all the candidates based on
everyone's votes, and the candidate with the highest score will be chosen
as the name for the new concept.

The voting will last up to 72 hours and is expected to close this Friday. I
look forward to everyone voting on the name in this thread. Of course, we
also welcome new input regarding the name.

Best,
Ron
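For clarity, the scoring rule above can be sketched in a few lines of Python. This is a minimal illustration of ranked-choice tallying; the ballots shown are hypothetical examples, not actual votes from this thread:

```python
# Minimal sketch of the ranked-choice scoring described above, assuming each
# voter submits an ordered list of candidates (best first). The ballots below
# are hypothetical examples, not actual votes from this thread.
from collections import defaultdict

POINTS = {1: 5, 2: 4, 3: 3, 4: 2, 5: 1}  # rank -> points

def tally(ballots):
    """Sum points per candidate across all ranked ballots."""
    scores = defaultdict(int)
    for ballot in ballots:
        for rank, name in enumerate(ballot, start=1):
            scores[name] += POINTS[rank]
    return dict(scores)

ballots = [
    ["Derived Table", "Materialized Table", "Live Table", "Materialized View"],
    ["Refresh Table", "Materialized Table", "Live Table"],
]
print(tally(ballots))
```

The candidate with the highest total is then chosen; voters may rank fewer than five options, so a shorter ballot simply awards fewer points.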

Ron liu  于2024年4月9日周二 19:49写道:

> Hi, Dev
>
> Sorry, my previous statement was not quite accurate. We will hold a
> vote for the name within this thread.
>
> Best,
> Ron
>
>
> Ron liu  于2024年4月9日周二 19:29写道:
>
>> Hi, Timo
>>
>> Thanks for your reply.
>>
>> I agree with you that sometimes naming is more difficult. When no one has
>> a clear preference, voting on the name is a good solution, so I'll send a
>> separate email for the vote, clarify the rules for the vote, then let
>> everyone vote.
>>
>> One other point to confirm, in your ranking there is an option for
>> Materialized View, does it stand for the UPDATING Materialized View that
>> you mentioned earlier in the discussion? If using Materialized View I think
>> it is needed to extend it.
>>
>> Best,
>> Ron
>>
>> Timo Walther  于2024年4月9日周二 17:20写道:
>>
>>> Hi Ron,
>>>
>>> yes naming is hard. But it will have large impact on trainings,
>>> presentations, and the mental model of users. Maybe the easiest is to
>>> collect ranking by everyone with some short justification:
>>>
>>>
>>> My ranking (from good to not so good):
>>>
>>> 1. Refresh Table -> states what it does
>>> 2. Materialized Table -> similar to SQL materialized view but a table
>>> 3. Live Table -> nice buzzword, but maybe still too close to dynamic
>>> tables?
>>> 4. Materialized View -> a bit broader than standard but still very
>>> similar
>>> 5. Derived table -> taken by standard
>>>
>>> Regards,
>>> Timo
>>>
>>>
>>>
>>> On 07.04.24 11:34, Ron liu wrote:
>>> > Hi, Dev
>>> >
>>> > This is a summary letter. After several rounds of discussion, there is
>>> a
>>> > strong consensus about the FLIP proposal and the issues it aims to
>>> address.
>>> > The current point of disagreement is the naming of the new concept. I
>>> have
>>> > summarized the candidates as follows:
>>> >
>>> > 1. Derived Table (Inspired by Google Looker)
>>> >  - Pros: Google Looker has introduced this concept, which is
>>> designed
>>> > for building Looker's automated modeling, aligning with our purpose
>>> for the
>>> > stream-batch automatic pipeline.
>>> >
>>> >  - Cons: The SQL standard uses derived table term extensively,
>>> vendors
>>> > adopt this for simply referring to a table within a subclause.
>>> >
>>> > 2. Materialized Table: It means materialize the query result to table,
>>> > similar to Db2 MQT (Materialized Query Tables). In addition, Snowflake
>>> > Dynamic Table's predecessor is also called Materialized Table.
>>> >
>>> > 3. Updating Table (From Timo)
>>> >
>>> > 4. Updating Materialized View (From Timo)
>>> >
>>> > 5. Refresh/Live Table (From Martijn)
>>> >
>>> > As Martijn said, naming is a headache, looking forward to more valuable
>>> > input from everyone.
>>> >
>>> > [1]
>>> >
>>> https://cloud.google.com/looker/docs/derived-tables#persistent_derived_tables
>>>

Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-04-09 Thread Ron liu
Hi, Dev

My rankings are:

1. Derived Table
2. Materialized Table
3. Live Table
4. Materialized View

Best,
Ron



Ron liu  于2024年4月9日周二 20:07写道:

> Hi, Dev
>
> After several rounds of discussion, there is currently no consensus on the
> name of the new concept. Timo has proposed that we decide the name through
> a vote. This is a good solution when there is no clear preference, so we
> will adopt this approach.
>
> Regarding the name of the new concept, there are currently five candidates:
> 1. Derived Table -> taken by SQL standard
> 2. Materialized Table -> similar to SQL materialized view but a table
> 3. Live Table -> similar to dynamic tables
> 4. Refresh Table -> states what it does
> 5. Materialized View -> needs to extend the standard to support modifying
> data
>
> For the above five candidates, everyone can give your rankings based on
> your preferences. You can choose up to five options or only choose some of
> them.
> We will use a scoring rule, where *the first rank gets 5 points, second
> rank gets 4 points, third rank gets 3 points, fourth rank gets 2 points,
> and fifth rank gets 1 point*.
> After the voting closes, I will score all the candidates based on
> everyone's votes, and the candidate with the highest score will be chosen
> as the name for the new concept.
>
> The voting will last up to 72 hours and is expected to close this Friday.
> I look forward to everyone voting on the name in this thread. Of course, we
> also welcome new input regarding the name.
>
> Best,
> Ron
>
> Ron liu  于2024年4月9日周二 19:49写道:
>
>> Hi, Dev
>>
>> Sorry, my previous statement was not quite accurate. We will hold a
>> vote for the name within this thread.
>>
>> Best,
>> Ron
>>
>>
>> Ron liu  于2024年4月9日周二 19:29写道:
>>
>>> Hi, Timo
>>>
>>> Thanks for your reply.
>>>
>>> I agree with you that sometimes naming is more difficult. When no one
>>> has a clear preference, voting on the name is a good solution, so I'll send
>>> a separate email for the vote, clarify the rules for the vote, then let
>>> everyone vote.
>>>
>>> One other point to confirm, in your ranking there is an option for
>>> Materialized View, does it stand for the UPDATING Materialized View that
>>> you mentioned earlier in the discussion? If using Materialized View I think
>>> it is needed to extend it.
>>>
>>> Best,
>>> Ron
>>>
>>> Timo Walther  于2024年4月9日周二 17:20写道:
>>>
>>>> Hi Ron,
>>>>
>>>> yes naming is hard. But it will have large impact on trainings,
>>>> presentations, and the mental model of users. Maybe the easiest is to
>>>> collect ranking by everyone with some short justification:
>>>>
>>>>
>>>> My ranking (from good to not so good):
>>>>
>>>> 1. Refresh Table -> states what it does
>>>> 2. Materialized Table -> similar to SQL materialized view but a table
>>>> 3. Live Table -> nice buzzword, but maybe still too close to dynamic
>>>> tables?
>>>> 4. Materialized View -> a bit broader than standard but still very
>>>> similar
>>>> 5. Derived table -> taken by standard
>>>>
>>>> Regards,
>>>> Timo
>>>>
>>>>
>>>>
>>>> On 07.04.24 11:34, Ron liu wrote:
>>>> > Hi, Dev
>>>> >
>>>> > This is a summary letter. After several rounds of discussion, there
>>>> is a
>>>> > strong consensus about the FLIP proposal and the issues it aims to
>>>> address.
>>>> > The current point of disagreement is the naming of the new concept. I
>>>> have
>>>> > summarized the candidates as follows:
>>>> >
>>>> > 1. Derived Table (Inspired by Google Looker)
>>>> >  - Pros: Google Looker has introduced this concept, which is
>>>> designed
>>>> > for building Looker's automated modeling, aligning with our purpose
>>>> for the
>>>> > stream-batch automatic pipeline.
>>>> >
>>>> >  - Cons: The SQL standard uses derived table term extensively,
>>>> vendors
>>>> > adopt this for simply referring to a table within a subclause.
>>>> >
>>>> > 2. Materialized Table: It means materialize the query result to table,
>>>> > similar to Db2 MQT (Materialized Query Tables). In addition, Snowflak

Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-04-12 Thread Ron liu
tly is a little weird.
> > > >
> > > > Thanks,
> > > >
> > > > Jiangjie (Becket) Qin
> > > >
> > > >
> > > >
> > > > On Tue, Apr 9, 2024 at 5:46 AM Lincoln Lee 
> > > wrote:
> > > >
> > > > > Thanks Ron and Timo for your proposal!
> > > > >
> > > > > Here is my ranking:
> > > > >
> > > > > 1. Derived table -> extend the persistent semantics of derived
> table
> > in
> > > > SQL
> > > > >standard, with a strong association with query, and has industry
> > > > > precedents
> > > > >such as Google Looker.
> > > > >
> > > > > 2. Live Table ->  an alternative for 'dynamic table'
> > > > >
> > > > > 3. Materialized Table -> combination of the Materialized View and
> > > Table,
> > > > > but
> > > > > still a table which accept data changes
> > > > >
> > > > > 4. Materialized View -> need to extend understanding of the view to
> > > > accept
> > > > > data changes
> > > > >
> > > > > The reason for not adding 'Refresh Table' is I don't want to tell
> the
> > > > user
> > > > > to 'refresh a refresh table'.
> > > > >
> > > > >
> > > > > Best,
> > > > > Lincoln Lee
> > > > >
> > > > >
> > > > > Ron liu  于2024年4月9日周二 20:11写道:
> > > > >
> > > > > > Hi, Dev
> > > > > >
> > > > > > My rankings are:
> > > > > >
> > > > > > 1. Derived Table
> > > > > > 2. Materialized Table
> > > > > > 3. Live Table
> > > > > > 4. Materialized View
> > > > > >
> > > > > > Best,
> > > > > > Ron
> > > > > >
> > > > > >
> > > > > >
> > > > > > Ron liu  于2024年4月9日周二 20:07写道:
> > > > > >
> > > > > > > Hi, Dev
> > > > > > >
> > > > > > > After several rounds of discussion, there is currently no
> > consensus
> > > > on
> > > > > > the
> > > > > > > name of the new concept. Timo has proposed that we decide the
> > name
> > > > > > through
> > > > > > > a vote. This is a good solution when there is no clear
> > preference,
> > > so
> > > > > we
> > > > > > > will adopt this approach.
> > > > > > >
> > > > > > > Regarding the name of the new concept, there are currently five
> > > > > > candidates:
> > > > > > > 1. Derived Table -> taken by SQL standard
> > > > > > > 2. Materialized Table -> similar to SQL materialized view but a
> > > table
> > > > > > > 3. Live Table -> similar to dynamic tables
> > > > > > > 4. Refresh Table -> states what it does
> > > > > > > 5. Materialized View -> needs to extend the standard to support
> > > > > modifying
> > > > > > > data
> > > > > > >
> > > > > > > For the above five candidates, everyone can give your rankings
> > > based
> > > > on
> > > > > > > your preferences. You can choose up to five options or only
> > choose
> > > > some
> > > > > > of
> > > > > > > them.
> > > > > > > We will use a scoring rule, where the* first rank gets 5
> points,
> > > > second
> > > > > > > rank gets 4 points, third rank gets 3 points, fourth rank gets
> 2
> > > > > points,
> > > > > > > and fifth rank gets 1 point*.
> > > > > > > After the voting closes, I will score all the candidates based
> on
> > > > > > > everyone's votes, and the candidate with the highest score will
> > be
> > > > > chosen
> > > > > > > as the name for the new concept.
> > > > > > >
> > > > > > > The voting will last up to 72 hours and is expected to close
> this
> > > > > Friday.
> > > > > &g

Re: [ANNOUNCE] New Apache Flink PMC Member - Jing Ge

2024-04-12 Thread Ron liu
Congratulations, Jing!

Best,
Ron

Junrui Lee  于2024年4月12日周五 18:54写道:

> Congratulations, Jing!
>
> Best,
> Junrui
>
> Aleksandr Pilipenko  于2024年4月12日周五 18:28写道:
>
> > Congratulations, Jing!
> >
> > Best Regards,
> > Aleksandr
> >
>


Re: [ANNOUNCE] New Apache Flink PMC Member - Lincoln Lee

2024-04-12 Thread Ron liu
Congratulations, Lincoln!

Best,
Ron

Junrui Lee  于2024年4月12日周五 18:54写道:

> Congratulations, Lincoln!
>
> Best,
> Junrui
>
> Aleksandr Pilipenko  于2024年4月12日周五 18:29写道:
>
> > Congratulations, Lincoln!
> >
> > Best Regards
> > Aleksandr
> >
>


Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-04-12 Thread Ron liu
FLIP-282 [1] has also introduced the Update & Delete API for modifying tables.

1.
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=235838061

Best,
Ron

Ron liu  于2024年4月12日周五 19:49写道:

> Hi, jgrier
>
> Thanks for your insightful input.
>
> First of all, I very much agree with you that striving to make Flink SQL
> more user-friendly is the right direction, including simplifying job
> execution parameters, execution modes, data processing pipeline
> definitions and maintenance, and so on.
> The goal of this proposal is likewise to simplify the data processing
> pipeline by proposing a new Dynamic Table, combining Dynamic Table +
> Continuous, so that users can focus more on the business itself. Our goal
> is not to create new business scenarios; it's just that the current Table
> can't support this goal, so we need to propose a new type of Dynamic Table.
>
> In the traditional Hive warehouse and Lakehouse scenarios, the common
> requirement from users begins with ingesting DB data such as MySQL and logs
> in real time into the ODS layer of the data warehouse. They then define a
> series of ETL jobs to process and layer the raw data, with the general data
> flow being ODS -> DWD -> DWS -> ADS, ultimately serving different users.
>
> During the business process, the following scenarios may need to modify
> the data:
> 1. Creating partitioned tables and manually backfilling certain historical
> partitions to correct data, meaning overwriting partitions is necessary.
> 2. Deleting a set of rows for regulatory compliance, updating a set of
> rows for data correction, such as deleting sensitive user information in a
> GDPR scenario.
> 3. As business requirements change, adding some columns is necessary, but
> without refreshing historical partition data, so the new columns would
> apply only to the latest partitions.
>
> Regarding the definition of View, the SQL standard has the following
> restrictions:
> 1. Partitioned view is not supported.
> 2. Modification of the data generated by views is not supported.
> 3. Alteration of a View's schema, such as adding columns, is not
> supported.
>
> Please correct me if my understanding is wrong.
>
> Materialized view, representing the result of a select query and serving
> as an index optimization technique mainly for query rewriting and
> computation acceleration, shares the same limitations as View. If
> we use materialized view, it can't meet our needs directly; we would have
> to extend its semantics, which conflicts with the standard. If we use a
> table, we don't have these concerns. Also, assuming we extended the
> materialized view semantics to allow modification, this would result in
> its inability to support query rewriting.
>
> Our proposal is indeed similar to the ability of materialized view, but
> considering the following two factors: firstly, we should try to follow the
> standard as much as possible without conflicting with it, and secondly,
> materialized view does not directly satisfy the scenario of modifying data,
> so using Table would be more appropriate.
>
> Although materialized view is also one of the candidates, it is not the
> most suitable option.
>
>
> > I'm actually against all of the other proposed names so I rank them
> equally
> last.  I don't think we need yet another new concept for this.  I think
> that will just add to users' confusion and learning curve which is already
> substantial with Flink.  We need to make things easier rather than harder.
>
> Also, just to clarify (and sorry if my previous statement was not that
> accurate): this is not a new concept, just a new type of table, similar in
> capability to a materialized view, but one that simplifies the data
> processing pipeline, which is also aligned with the long-term vision of
> Flink SQL.
>
>
> Best,
> Ron
>
>
> Jamie Grier  于2024年4月11日周四 05:59写道:
>
>> Sorry for coming very late to this thread.  I have not contributed much to
>> Flink publicly for quite some time but I have been involved with Flink,
>> daily, for years now and I'm keenly interested in where we take Flink SQL
>> going forward.
>>
>> Thanks for the proposal!!  I think it's definitely a step in the right
>> direction and I'm thrilled this is happening.  The end state I have in
>> mind
>> is that we get rid of execution modes as something users have to think
>> about and instead make sure the SQL a user writes completely describes
>> their intent.  In the case of this proposal the intent a user has is that
>> the system continually maintains an object (whatever we dec

Re: [ANNOUNCE] New Apache Flink Committer - Zakelly Lan

2024-04-14 Thread Ron liu
Congratulations!

Best,
Ron

Yuan Mei  于2024年4月15日周一 10:51写道:

> Hi everyone,
>
> On behalf of the PMC, I'm happy to let you know that Zakelly Lan has become
> a new Flink Committer!
>
> Zakelly has been continuously contributing to the Flink project since 2020,
> with a focus area on Checkpointing, State as well as frocksdb (the default
> on-disk state db).
>
> He leads several FLIPs to improve checkpoints and state APIs, including
> File Merging for Checkpoints and configuration/API reorganizations. He is
> also one of the main contributors to the recent efforts of "disaggregated
> state management for Flink 2.0" and drives the entire discussion in the
> mailing thread, demonstrating outstanding technical depth and breadth of
> knowledge.
>
> Beyond his technical contributions, Zakelly is passionate about helping the
> community in numerous ways. He spent quite some time setting up the Flink
> Speed Center and rebuilding the benchmark pipeline after the original one
> was out of lease. He helps build frocksdb and tests for the upcoming
> frocksdb release (bump rocksdb from 6.20.3->8.10).
>
> Please join me in congratulating Zakelly for becoming an Apache Flink
> committer!
>
> Best,
> Yuan (on behalf of the Flink PMC)
>


[VOTE] FLIP-435: Introduce a New Materialized Table for Simplifying Data Pipelines

2024-04-16 Thread Ron liu
Hi Dev,

Thank you to everyone for the feedback on FLIP-435: Introduce a New
Materialized Table for Simplifying Data Pipelines[1][2].

I'd like to start a vote for it. The vote will be open for at least 72
hours unless there is an objection or not enough votes.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-435%3A+Introduce+a+New+Materialized+Table+for+Simplifying+Data+Pipelines
[2] https://lists.apache.org/thread/c1gnn3bvbfs8v1trlf975t327s4rsffs

Best,
Ron


Re: [VOTE] FLIP-435: Introduce a New Materialized Table for Simplifying Data Pipelines

2024-04-16 Thread Ron liu
+1(binding)

Best,
Ron

Ron liu  于2024年4月17日周三 14:27写道:

> Hi Dev,
>
> Thank you to everyone for the feedback on FLIP-435: Introduce a New
> Materialized Table for Simplifying Data Pipelines[1][2].
>
> I'd like to start a vote for it. The vote will be open for at least 72
> hours unless there is an objection or not enough votes.
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-435%3A+Introduce+a+New+Materialized+Table+for+Simplifying+Data+Pipelines
> [2] https://lists.apache.org/thread/c1gnn3bvbfs8v1trlf975t327s4rsffs
>
> Best,
> Ron
>


Re: [DISCUSS] FLIP-445: Support dynamic parallelism inference for HiveSource

2024-04-17 Thread Ron liu
Hi, Xia

Thanks for driving this FLIP.

This proposal looks good to me overall. However, I have the following minor
questions:

1. The FLIP introduces `table.exec.hive.infer-source-parallelism.mode` as a
new parameter whose value is the enum class `InferMode`; I think the
`InferMode` class should also be introduced in the Public Interfaces section!
2. You mentioned in the FLIP that the default value of
`table.exec.hive.infer-source-parallelism.max` is 1024, but I checked the
code and the default value seems to be 1000?
3. I also agree with Muhammet's idea that there is no need to introduce the
option `table.exec.hive.infer-source-parallelism.enabled`, and that
expanding the InferMode values will fulfill the need. There is another
issue to consider here, though: how are
`table.exec.hive.infer-source-parallelism` and
`table.exec.hive.infer-source-parallelism.mode` kept compatible?
4. FLIP-367 supports setting a Source's parallelism individually. If
HiveSource also supports this feature in the future, while the default
value of `table.exec.hive.infer-source-parallelism.mode` is
`InferMode.DYNAMIC`, will the parallelism be dynamically derived or will
the manually set parallelism take effect, and which has the higher
priority?

Best,
Ron

Xia Sun  于2024年4月17日周三 12:08写道:

> Hi Jeyhun, Muhammet,
> Thanks for all the feedback!
>
> > Could you please mention the default values for the new configurations
> > (e.g., table.exec.hive.infer-source-parallelism.mode,
> > table.exec.hive.infer-source-parallelism.enabled,
> > etc) ?
>
>
> Thanks for your suggestion. I have supplemented the explanation regarding
> the default values.
>
> > Since we are introducing the mode as a configuration option,
> > could it make sense to have `InferMode.NONE` option also?
> > The `NONE` option would disable the inference.
>
>
> This is a good idea. Looking ahead, it could eliminate the need for
> introducing
> a new configuration option. I haven't identified any potential
> compatibility issues
> as yet. If there are no further ideas from others, I'll go ahead and update
> the FLIP to
> introducing InferMode.NONE.
>
> Best,
> Xia
>
> Muhammet Orazov  于2024年4月17日周三 10:31写道:
>
> > Hello Xia,
> >
> > Thanks for the FLIP!
> >
> > Since we are introducing the mode as a configuration option,
> > could it make sense to have `InferMode.NONE` option also?
> > The `NONE` option would disable the inference.
> >
> > This way we deprecate the `table.exec.hive.infer-source-parallelism`
> > and no additional `table.exec.hive.infer-source-parallelism.enabled`
> > option is required.
> >
> > What do you think?
> >
> > Best,
> > Muhammet
> >
> > On 2024-04-16 07:07, Xia Sun wrote:
> > > Hi everyone,
> > > I would like to start a discussion on FLIP-445: Support dynamic
> > > parallelism
> > > inference for HiveSource[1].
> > >
> > > FLIP-379[2] has introduced dynamic source parallelism inference for
> > > batch
> > > jobs, which can utilize runtime information to more accurately decide
> > > the
> > > source parallelism. As a follow-up task, we plan to implement the
> > > dynamic
> > > parallelism inference interface for HiveSource, and also switch the
> > > default
> > > static parallelism inference to dynamic parallelism inference.
> > >
> > > Looking forward to your feedback and suggestions, thanks.
> > >
> > > [1]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-445%3A+Support+dynamic+parallelism+inference+for+HiveSource
> > > [2]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-379%3A+Dynamic+source+parallelism+inference+for+batch+jobs
> > >
> > > Best regards,
> > > Xia
> >
>


Re: [DISCUSS] FLIP-445: Support dynamic parallelism inference for HiveSource

2024-04-18 Thread Ron Liu
Hi, Xia

Thanks for your reply.

> That means, in terms
of priority, `table.exec.hive.infer-source-parallelism` >
`table.exec.hive.infer-source-parallelism.mode`.

I still have some confusion: if `table.exec.hive.infer-source-parallelism`
takes priority over `table.exec.hive.infer-source-parallelism.mode`, and
the default value of `table.exec.hive.infer-source-parallelism` is
currently true, does that mean static parallelism inference always works?
Or perhaps after this FLIP, the default behavior of
`table.exec.hive.infer-source-parallelism` changes to indicate dynamic
parallelism inference when enabled.
I think you should list in the FLIP, in a table, the behaviors of these
two options when they coexist; only then can users know how dynamic and
static parallelism inference work.

Best,
Ron
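To make the precedence question concrete, the behavior being discussed can be sketched as a small Python function. This is only a hedged illustration of the assumed priority order (manually set parallelism > main switch > infer mode); the function name and return values are illustrative, not the FLIP's final semantics:

```python
# Hedged sketch of the option precedence discussed above. Assumptions: a
# manually set source parallelism wins over any inference; the legacy boolean
# `table.exec.hive.infer-source-parallelism` acts as the main switch; the new
# `table.exec.hive.infer-source-parallelism.mode` then selects the strategy.
from enum import Enum

class InferMode(Enum):
    STATIC = "static"
    DYNAMIC = "dynamic"
    NONE = "none"

def effective_inference(infer_enabled, infer_mode, manual_parallelism=None):
    """Return which parallelism strategy would apply under the assumed rules."""
    if manual_parallelism is not None:
        return f"manual:{manual_parallelism}"   # user-set parallelism wins
    if not infer_enabled or infer_mode is InferMode.NONE:
        return "no-inference"                   # inference fully disabled
    return infer_mode.value                     # static or dynamic inference
```

Enumerating such a function over all option combinations would produce exactly the kind of behavior table requested above.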

Xia Sun  于2024年4月18日周四 16:33写道:

> Hi Ron and Lijie,
> Thanks for joining the discussion and sharing your suggestions.
>
> > the InferMode class should also be introduced in the Public Interfaces
> > section!
>
>
> Thanks for the reminder, I have now added the InferMode class to the Public
> Interfaces section as well.
>
> > `table.exec.hive.infer-source-parallelism.max` is 1024, I checked through
> > the code that the default value is 1000?
>
>
> I have checked and the default value of
> `table.exec.hive.infer-source-parallelism.max` is indeed 1000. This has
> been corrected in the FLIP.
>
> > how are`table.exec.hive.infer-source-parallelism` and
> > `table.exec.hive.infer-source-parallelism.mode` compatible?
>
>
> This is indeed a critical point. The current plan is to deprecate
> `table.exec.hive.infer-source-parallelism` but still utilize it as the main
> switch for enabling automatic parallelism inference. That means, in terms
> of priority, `table.exec.hive.infer-source-parallelism` >
> `table.exec.hive.infer-source-parallelism.mode`. In future versions, if
> `table.exec.hive.infer-source-parallelism` is removed, this logic will also
> need to be revised, leaving only
> `table.exec.hive.infer-source-parallelism.mode` as the basis for deciding
> whether to enable parallelism inference. I have also added this description
> to the FLIP.
>
>
> > In FLIP-367 it is supported to be able to set the Source's parallelism
> > individually, if in the future HiveSource also supports this feature,
> > however, the default value of
> > `table.exec.hive.infer-source-parallelism.mode` is `InferMode.DYNAMIC`,
> at
> > this point will the parallelism be dynamically derived or will the
> manually
> > set parallelism take effect, and who has the higher priority?
>
>
> From my understanding, 'manually set parallelism' has the higher priority,
> just like one of the preconditions for the effectiveness of dynamic
> parallelism inference in the AdaptiveBatchScheduler is that the vertex's
> parallelism isn't set. I believe whether it's static inference or dynamic
> inference, the manually set parallelism by the user should be respected.
>
> > The `InferMode.NONE` option.
>
> Currently, 'adding InferMode.NONE' seems to be the prevailing opinion. I
> will add InferMode.NONE as one of the Enum options in InferMode class.
>
> Best,
> Xia
>
> Lijie Wang  于2024年4月18日周四 13:50写道:
>
> > Thanks for driving the discussion.
> >
> > +1 for the proposal and +1 for the `InferMode.NONE` option.
> >
> > Best,
> > Lijie
> >
> > Ron liu  于2024年4月18日周四 11:36写道:
> >
> > > Hi, Xia
> > >
> > > Thanks for driving this FLIP.
> > >
> > > This proposal looks good to me overall. However, I have the following
> > minor
> > > questions:
> > >
> > > 1. FLIP introduced `table.exec.hive.infer-source-parallelism.mode` as a
> > new
> > > parameter, and the value is the enum class `InferMode`, I think the
> > > InferMode class should also be introduced in the Public Interfaces
> > section!
> > > 2. You mentioned in FLIP that the default value of
> > > `table.exec.hive.infer-source-parallelism.max` is 1024, I checked
> through
> > > the code that the default value is 1000?
> > > 3. I also agree with Muhammet's idea that there is no need to introduce
> > the
> > > option `table.exec.hive.infer-source-parallelism.enabled`, and that
> > > expanding the InferMode values will fulfill the need. There is another
> > > issue to consider here though, how are
> > > `table.exec.hive.infer-source-parallelism` and
> > > `table.exec.hive.infer-source-parallelism.mode` compatible?
> > > 4. In FLIP-367 it is supported to be able to set the Source's
> parallelism
> > > ind

Re: [DISCUSS] FLIP-445: Support dynamic parallelism inference for HiveSource

2024-04-18 Thread Ron Liu
Hi, Xia

Thanks for updating, looks good to me.

Best,
Ron

Xia Sun  于2024年4月18日周四 19:11写道:

> Hi Ron,
> Yes, presenting it in a table might be more intuitive. I have already added
> the table in the "Public Interfaces | New Config Option" chapter of FLIP.
> PTAL~
>
> Ron Liu  于2024年4月18日周四 18:10写道:
>
> > Hi, Xia
> >
> > Thanks for your reply.
> >
> > > That means, in terms
> > of priority, `table.exec.hive.infer-source-parallelism` >
> > `table.exec.hive.infer-source-parallelism.mode`.
> >
> > I still have some confusion, if the
> > `table.exec.hive.infer-source-parallelism`
> > >`table.exec.hive.infer-source-parallelism.mode`, currently
> > `table.exec.hive.infer-source-parallelism` default value is true, that
> > means always static parallelism inference work? Or perhaps after this
> FLIP,
> > we changed the default behavior of
> > `table.exec.hive.infer-source-parallelism` to indicate dynamic
> parallelism
> > inference when enabled.
> > I think you should list the various behaviors of these two options that
> > coexist in FLIP by a table, only then users can know how the dynamic and
> > static parallelism inference work.
> >
> > Best,
> > Ron
> >
> > Xia Sun  于2024年4月18日周四 16:33写道:
> >
> > > Hi Ron and Lijie,
> > > Thanks for joining the discussion and sharing your suggestions.
> > >
> > > > the InferMode class should also be introduced in the Public
> Interfaces
> > > > section!
> > >
> > >
> > > Thanks for the reminder, I have now added the InferMode class to the
> > Public
> > > Interfaces section as well.
> > >
> > > > `table.exec.hive.infer-source-parallelism.max` is 1024, I checked
> > through
> > > > the code that the default value is 1000?
> > >
> > >
> > > I have checked and the default value of
> > > `table.exec.hive.infer-source-parallelism.max` is indeed 1000. This has
> > > been corrected in the FLIP.
> > >
> > > > how are`table.exec.hive.infer-source-parallelism` and
> > > > `table.exec.hive.infer-source-parallelism.mode` compatible?
> > >
> > >
> > > This is indeed a critical point. The current plan is to deprecate
> > > `table.exec.hive.infer-source-parallelism` but still utilize it as the
> > main
> > > switch for enabling automatic parallelism inference. That means, in
> terms
> > > of priority, `table.exec.hive.infer-source-parallelism` >
> > > `table.exec.hive.infer-source-parallelism.mode`. In future versions, if
> > > `table.exec.hive.infer-source-parallelism` is removed, this logic will
> > also
> > > need to be revised, leaving only
> > > `table.exec.hive.infer-source-parallelism.mode` as the basis for
> deciding
> > > whether to enable parallelism inference. I have also added this
> > description
> > > to the FLIP.
> > >
> > >
> > > > In FLIP-367 it is supported to be able to set the Source's
> parallelism
> > > > individually, if in the future HiveSource also supports this feature,
> > > > however, the default value of
> > > > `table.exec.hive.infer-source-parallelism.mode` is
> `InferMode.DYNAMIC`,
> > > at
> > > > this point will the parallelism be dynamically derived or will the
> > > manually
> > > > set parallelism take effect, and who has the higher priority?
> > >
> > >
> > > From my understanding, 'manually set parallelism' has the higher
> > priority,
> > > just like one of the preconditions for the effectiveness of dynamic
> > > parallelism inference in the AdaptiveBatchScheduler is that the
> vertex's
> > > parallelism isn't set. I believe whether it's static inference or
> dynamic
> > > inference, the manually set parallelism by the user should be
> respected.
> > >
> > > > The `InferMode.NONE` option.
> > >
> > > Currently, 'adding InferMode.NONE' seems to be the prevailing opinion.
> I
> > > will add InferMode.NONE as one of the Enum options in InferMode class.
> > >
> > > Best,
> > > Xia
> > >
> > > On Thu, Apr 18, 2024 at 13:50, Lijie Wang wrote:
> > >
> > > > Thanks for driving the discussion.
> > > >
> > > > +1 for the proposal and +1 for the `InferMode.NONE` option.
> > > >
> > > > Best,
> > > > Lijie
> > > >
> >

[RESULT][VOTE] FLIP-435: Introduce a New Materialized Table for Simplifying Data Pipelines

2024-04-21 Thread Ron Liu
Hi, Dev

I'm happy to announce that FLIP-435: Introduce a New Materialized Table for
Simplifying Data Pipelines[1] has been accepted with 13 approving votes (8
binding) [2]

- Ron Liu(binding)
- Feng Jin
- Rui Fan(binding)
- Yuepeng Pan
- Ahmed Hamdy
- Ferenc Csaky
- Lincoln Lee(binding)
- Leonard Xu(binding)
- Jark Wu(binding)
- Yun Tang(binding)
- Jinsong Li(binding)
- Zhongqiang Gong
- Martijn Visser(binding)

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-435%3A+Introduce+a+New+Materialized+Table+for+Simplifying+Data+Pipelines
[2] https://lists.apache.org/thread/woj27nsmx5xd7p87ryfo8h6gx37n3wlx

Best,
Ron


[DISCUSS] FLIP-448: Introduce Pluggable Workflow Scheduler Interface for Materialized Table

2024-04-21 Thread Ron Liu
Hi, Dev

I would like to start a discussion about FLIP-448: Introduce Pluggable
Workflow Scheduler Interface for Materialized Table.

In FLIP-435[1], we proposed Materialized Table, which has two types of data
refresh modes: Full Refresh & Continuous Refresh Mode. In Full Refresh
mode, the Materialized Table relies on a workflow scheduler to perform
periodic refresh operation to achieve the desired data freshness.

There are numerous open-source workflow schedulers available, with popular
ones including Airflow and DolphinScheduler. To enable Materialized Table
to work with different workflow schedulers, we propose a pluggable workflow
scheduler interface for Materialized Table in this FLIP.

For more details, see FLIP-448 [2]. Looking forward to your feedback.

[1] https://lists.apache.org/thread/c1gnn3bvbfs8v1trlf975t327s4rsffs
[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-448%3A+Introduce+Pluggable+Workflow+Scheduler+Interface+for+Materialized+Table

Best,
Ron


Re: [DISCUSS] FLIP-448: Introduce Pluggable Workflow Scheduler Interface for Materialized Table

2024-04-25 Thread Ron Liu
 if the interface
is provided, we don't get the complete information and how does this
interface show the information about the background? So I don't think it is
necessary.


[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-435%3A+Introduce+a+New+Materialized+Table+for+Simplifying+Data+Pipelines

Best,
Ron



On Thu, Apr 25, 2024 at 00:46, Feng Jin wrote:

> Hi Ron
>
> Thank you for initiating this FLIP.
>
> My current questions are as follows:
>
> 1. From my current understanding, the workflow handle should not be bound
> to the Dynamic Table. Therefore, if the workflow is modified, does it mean
> that the scheduling information corresponding to the Dynamic Table will be
> lost?
>
> 2. Regarding the status information of the workflow, I am wondering if it
> is necessary to provide an interface to display the backend scheduling
> information? This would make it more convenient to view the execution
> status of backend jobs.
>
>
> Best,
> Feng
>
>
> On Wed, Apr 24, 2024 at 3:24 PM 
> wrote:
>
> > Hello Ron Liu! Thank you for your FLIP!
> >
> > Here are my considerations:
> >
> > 1.
> > About the Operations interfaces, how can they be empty?
> > Should not they provide at least a `run` or `execute` method (similar to
> > the command pattern)?
> > In this way, their implementation can wrap all the implementations
> details
> > of particular schedulers, and the scheduler can simply execute the
> command.
> > In general, I think a simple sequence diagram showcasing the interaction
> > between the interfaces would be awesome to better understand the concept.
> >
> > 2.
> > What about the RefreshHandler, I cannot find a definition of its
> interface
> > here.
> > Is it out of scope for this FLIP?
> >
> > 3.
> > For the SqlGatewayService arguments:
> >
> > boolean isPeriodic,
> > @Nullable String scheduleTime,
> > @Nullable String scheduleTimeFormat,
> >
> > If it is periodic, where is the period?
> > For the scheduleTime and format, why not simply pass an instance of
> > LocalDateTime or similar? The gateway should not have the responsibility
> to
> > parse the time.
> >
> > 4.
> > For the REST API:
> > wouldn't it be better (more REST) to move the `mt_identifier` to the URL?
> > E.g.: v3/materialized_tables/<mt_identifier>/refresh
> >
> > Thank you!
> > On Apr 22, 2024 at 08:42 +0200, Ron Liu , wrote:
> > > Hi, Dev
> > >
> > > I would like to start a discussion about FLIP-448: Introduce Pluggable
> > > Workflow Scheduler Interface for Materialized Table.
> > >
> > > In FLIP-435[1], we proposed Materialized Table, which has two types of
> > data
> > > refresh modes: Full Refresh & Continuous Refresh Mode. In Full Refresh
> > > mode, the Materialized Table relies on a workflow scheduler to perform
> > > periodic refresh operation to achieve the desired data freshness.
> > >
> > > There are numerous open-source workflow schedulers available, with
> > popular
> > > ones including Airflow and DolphinScheduler. To enable Materialized
> Table
> > > to work with different workflow schedulers, we propose a pluggable
> > workflow
> > > scheduler interface for Materialized Table in this FLIP.
> > >
> > > For more details, see FLIP-448 [2]. Looking forward to your feedback.
> > >
> > > [1] https://lists.apache.org/thread/c1gnn3bvbfs8v1trlf975t327s4rsffs
> > > [2]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-448%3A+Introduce+Pluggable+Workflow+Scheduler+Interface+for+Materialized+Table
> > >
> > > Best,
> > > Ron
> >
>


Re: [VOTE] FLIP-445: Support dynamic parallelism inference for HiveSource

2024-04-25 Thread Ron Liu
+1(binding)

Best,
Ron

On Fri, Apr 26, 2024 at 12:55, Rui Fan <1996fan...@gmail.com> wrote:

> +1(binding)
>
> Best,
> Rui
>
> On Fri, Apr 26, 2024 at 10:26 AM Muhammet Orazov
>  wrote:
>
> > Hey Xia,
> >
> > +1 (non-binding)
> >
> > Thanks and best,
> > Muhammet
> >
> > On 2024-04-26 02:21, Xia Sun wrote:
> > > Hi everyone,
> > >
> > > I'd like to start a vote on FLIP-445: Support dynamic parallelism
> > > inference
> > > for HiveSource[1] which has been discussed in this thread [2].
> > >
> > > The vote will be open for at least 72 hours unless there is an
> > > objection or
> > > not enough votes.
> > >
> > >
> > > [1]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-445%3A+Support+dynamic+parallelism+inference+for+HiveSource
> > > [2] https://lists.apache.org/thread/4k1qx6lodhbkknkhjyl0lq9bx8fcpjvn
> > >
> > >
> > > Best,
> > > Xia
> >
>


Re: [DISCUSS] FLIP-448: Introduce Pluggable Workflow Scheduler Interface for Materialized Table

2024-04-28 Thread Ron Liu
Hi, Shengkai

Thanks for your feedback and suggestion, it looks very useful for this
proposal, regarding your question I made the following optimization:

> *WorkflowScheduler*
> 1. How to get the exception details if `modifyRefreshWorkflow` fails?
> 2. Could you give us an example about how to configure the scheduler?

1. Added a new WorkflowException. The related method signatures of
WorkflowScheduler will throw WorkflowException when creating or modifying a
workflow encounters an exception, so that the framework can detect and
handle it.

2. Added a new Configuration section, introduced a new Option, and gave an
example of how to define the Scheduler in flink-conf.yaml.
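For illustration only, such an entry in flink-conf.yaml could look like the
sketch below; the `workflow-scheduler.type` key is the option discussed in
this thread, while the value and the endpoint key are hypothetical:

```yaml
# Select the workflow scheduler plugin; this option has no default value,
# so enabling a scheduler is always an explicit user action.
workflow-scheduler.type: dolphinscheduler

# Hypothetical plugin-specific option; real key names depend on the plugin.
workflow-scheduler.dolphinscheduler.endpoint: http://localhost:12345
```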

> *SQL Gateway*
> 1. SqlGatewayService requires Session as the input, but the REST API
doesn't need any Session information.
> 2. Use "-" instead of "_" in the REST URI and camel case for fields in
request/response
> 3. Do we need scheduleTime and scheduleTimeFormat together?

1. If it is designed as a synchronous API, it may suffer from network
jitter, thread resource exhaustion and other problems, which I had not
considered before. An asynchronous API, although it increases the cost of
use for the user, is friendlier to the SqlGatewayService as well as to the
client's thread resources. In summary, as discussed offline, I also tend to
think that all APIs of the SQL Gateway should be unified: all of them
should be asynchronous and bound to a session. I have updated the REST API
section in the FLIP.

2. thanks for the reminder, it has been updated

3. After rethinking, I think it can indeed be simpler: there is no need to
pass in a custom time format. scheduleTime can be unified to the SQL
standard timestamp format 'yyyy-MM-dd HH:mm:ss', which is able to satisfy
the time-related needs of the materialized table.
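
As a side illustration (not part of the FLIP), parsing a scheduleTime in
that format with java.time could look like the sketch below; the class and
method names are made up:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class ScheduleTimeDemo {
    // SQL-standard timestamp pattern proposed above for scheduleTime.
    static final DateTimeFormatter SCHEDULE_TIME_FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    static LocalDateTime parseScheduleTime(String scheduleTime) {
        // Throws DateTimeParseException for malformed input, which the
        // framework could surface to the caller.
        return LocalDateTime.parse(scheduleTime, SCHEDULE_TIME_FORMAT);
    }

    public static void main(String[] args) {
        System.out.println(parseScheduleTime("2024-05-07 14:30:00"));
    }
}
```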

Based on your feedback, I have optimized and updated the FLIP related
section.

Best,
Ron


On Sun, Apr 28, 2024 at 15:47, Shengkai Fang wrote:

> Hi, Liu.
>
> Thanks for your proposal. I have some questions about the FLIP:
>
> *WorkflowScheduler*
>
> 1. How to get the exception details if `modifyRefreshWorkflow` fails?
> 2. Could you give us an example about how to configure the scheduler?
>
> *SQL Gateway*
>
> 1. SqlGatewayService requires Session as the input, but the REST API
> doesn't need any Session information.
>
> From the perspective of a gateway developer, I tend to unify the API of the
> SQL gateway, binding all concepts to the session. On the one hand, this
> approach allows us to reduce maintenance and understanding costs, as we
> only need to maintain one set of architecture to complete basic concepts.
> On the other hand, the benefits of an asynchronous architecture are
> evident: we maintain state on the server side. If the request is a long
> connection, even in the face of network layer jitter, we can still find the
> original result through session and operation handles.
>
> Using asynchronous APIs may increase the development cost for users, but
> from a platform perspective, if a request remains in a blocking state for a
> long time, it also becomes a burden on the platform's JVM. This is because
> thread switching and maintenance require certain resources.
>
> 2. Use "-" instead of "_" in the REST URI and camel case for fields in
> request/response
>
> Please follow the Flink REST Design.
>
> 3. Do we need scheduleTime and scheduleTimeFormat together?
>
> I think we can use SQL timestamp format or ISO timestamp format. It is not
> necessary to pass time in any specific format.
>
> https://en.wikipedia.org/wiki/ISO_8601
>
> Best,
> Shengkai
>


Re: [DISCUSS] FLIP-448: Introduce Pluggable Workflow Scheduler Interface for Materialized Table

2024-04-28 Thread Ron Liu
Hi, Lorenzo

> I have a question there: how can the gateway update the refreshHandler in
the Catalog before getting it from the scheduler?

The refreshHandler in CatalogMaterializedTable is null before getting it from
the scheduler, you can look at the CatalogMaterializedTable.Builder[1] for
more details.

> You have a typo here: WorkflowScheudler -> WorkflowScheduler :)

Fix it now, thanks very much.

> For the operations part, I still think that the FLIP would benefit from
providing a specific pattern for operations. You could either propose a
command pattern [1] or a visitor pattern (where the scheduler visits the
operation to get relevant info) [2] for those operations at your choice.

Thank you for your input, I find it very useful. I tried to understand your
thinking through code and implemented the following pseudo code using the
visitor design pattern:
1. first defined WorkflowOperationVisitor, providing several overloaded
visit methods.

public interface WorkflowOperationVisitor {

    <T> T visit(CreateWorkflowOperation<T> createWorkflowOperation);

    void visit(ModifyWorkflowOperation<?> operation);
}

2. then in the WorkflowOperation add the accept method.

@PublicEvolving
public interface WorkflowOperation {

    void accept(WorkflowOperationVisitor visitor);
}


3. In the WorkflowScheduler, call the implementation class of
WorkflowOperationVisitor to complete the corresponding operations.

I appreciate this design pattern purely from a code-design point of view,
but from the perspective of our specific scenario:
1. For CreateWorkflowOperation, the visit method needs to return a
RefreshHandler; for ModifyWorkflowOperation, such as suspend and resume,
the visit method doesn't need to return a RefreshHandler. Currently, the
visit methods of WorkflowOperationVisitor can't be unified across the
different WorkflowOperation parameters, so I think the visitor may not be
applicable here.

2. In addition, I think using the visitor pattern adds complexity for the
WorkflowScheduler implementer, who needs to implement one more interface,
WorkflowOperationVisitor. This interface is not used by the engine, so I
don't see any benefit from this design at the moment.

3. Furthermore, I don't see the benefit of it at the moment; simpler is
better here. Similar to the design of CatalogModificationEvent[2] and
CatalogModificationListener[3], the developer only needs an instanceof
judgment.

To summarize, I don't think there is a need to introduce command or visitor
pattern at present.
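
To make the comparison concrete, here is a minimal, self-contained sketch
of the instanceof-style dispatch preferred above; the class names mirror
this discussion, but every body is invented for illustration and is not the
actual FLIP code:

```java
interface WorkflowOperation {}

class CreateWorkflowOperation implements WorkflowOperation {
    final String cron;
    CreateWorkflowOperation(String cron) { this.cron = cron; }
}

class ModifyWorkflowOperation implements WorkflowOperation {}

public class InstanceofDispatchDemo {
    // Plain instanceof dispatch, in the spirit of CatalogModificationListener:
    // no extra visitor interface for the scheduler implementer to learn.
    static String handle(WorkflowOperation op) {
        if (op instanceof CreateWorkflowOperation) {
            return "create workflow, cron=" + ((CreateWorkflowOperation) op).cron;
        } else if (op instanceof ModifyWorkflowOperation) {
            return "modify workflow";
        }
        throw new UnsupportedOperationException("unknown operation: " + op);
    }

    public static void main(String[] args) {
        System.out.println(handle(new CreateWorkflowOperation("0 * * * *")));
        System.out.println(handle(new ModifyWorkflowOperation()));
    }
}
```

The trade-off is an instanceof chain instead of compile-time
exhaustiveness, which seems acceptable for a small, stable set of
operations.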

> About the REST API, I will wait for your offline discussion :)

After discussing with Shengkai offline, there is no need for this REST API
to support multiple tables to be refreshed at the same time, so it would be
more appropriate to put the materialized table identifier in the path of
the URL, thanks for the suggestion.

[1]
https://github.com/apache/flink/blob/e412402ca4dfc438e28fb990dc53ea7809430aee/flink-table/flink-table-common/src/main/java/org/apache/flink/table/catalog/CatalogMaterializedTable.java#L264
[2]
https://github.com/apache/flink/blob/b1544e4e513d2b75b350c20dbb1c17a8232c22fd/flink-table/flink-table-api-java/src/main/java/org/apache/flink/table/catalog/listener/CatalogModificationEvent.java#L28
[3]
https://github.com/apache/flink/blob/b1544e4e513d2b75b350c20dbb1c17a8232c22fd/flink-table/flink-table-api-java/src/main/java/org/apache/flink/table/catalog/listener/CatalogModificationListener.java#L31

Best,
Ron

On Sun, Apr 28, 2024 at 23:53, Ron Liu wrote:

> Hi, Shengkai
>
> Thanks for your feedback and suggestion, it looks very useful for this
> proposal, regarding your question I made the following optimization:
>
> > *WorkflowScheduler*
> > 1. How to get the exception details if `modifyRefreshWorkflow` fails?
> > 2. Could you give us an example about how to configure the scheduler?
>
> 1. Added a new WorkflowException, WorkflowScheduler's related method
> signature will throw WorkflowException, when creating or modifying Workflow
> encountered an exception, so that the framework will sense and deal with it.
>
> 2. Added a new Configuration section, introduced a new Option, and gave an
> example of how to define the Scheduler in flink-conf.yaml.
>
> > *SQL Gateway*
> > 1. SqlGatewayService requires Session as the input, but the REST API
> doesn't need any Session information.
> > 2. Use "-" instead of "_" in the REST URI and camel case for fields in
> request/response
> > 3. Do we need scheduleTime and scheduleTimeFormat together?
>
> 1. If it is designed as a synchronous API, it may lead to network jitter,
> thread resource exhaustion and other problems, which I have not considered
> before. The asynchronous API, although increasing the cost of use for the
> user, is friendly to the SqlGatewayService, as well as the Client thread
> resources. In summary as discussed offline, so I al

Re: [DISCUSS] FLIP-448: Introduce Pluggable Workflow Scheduler Interface for Materialized Table

2024-05-04 Thread Ron Liu
Hi, Lincoln

Thanks for join this discussion.

After rethinking, I think your suggestion makes sense. Although currently
deleting the workflow on the scheduler and relying only on the
RefreshHandler is enough, if we support cascading deletion in the future,
the DeleteWorkflowOperation can provide the necessary information without
the need to introduce a new interface.

I've updated the public interface section of FLIP.

Best,
Ron

On Tue, Apr 30, 2024 at 21:27, Lincoln Lee wrote:

> Thanks Ron for starting this flip! It will complete the user story for
> flip-435[1].
>
> Regarding the WorkflowOperation, I have a question about whether we
> should add Delete/DropWorkflowOperation as well for when the
> Materialized Table is dropped or refresh mode changed from full to
> continuous?
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-435%3A+Introduce+a+New+Materialized+Table+for+Simplifying+Data+Pipelines?src=contextnavpagetreemode
>
>
> Best,
> Lincoln Lee
>
>
> On Tue, Apr 30, 2024 at 15:37,  wrote:
>
> > Hello Ron, thank you for your detailed answers!
> >
> > For the Visitor pattern, I thought about it the other way around, so that
> > operations visit the scheduler, and not vice-versa :) In this way
> > operations can get the required information in order to be executed in a
> > tailored way.
> >
> > Thank you for your effort, but, as you say:
> > > furthermore, I think the current does not see the benefits of the time,
> > simpler instead of better, similar to the design of
> > CatalogModificationEvent[2] and CatalogModificationListener[3], the
> > developer only needs instanceof judgment.
> >
> > In java, most of the times, `instanceof` is considered an anti-pattern,
> > that's why I was also thinking about a command pattern (every operations
> > defines an `execute` method). However, I also understand this part is not
> > crucial for the FLIP under discussion, and the implementation details can
> > simply wait for the PRs to come.
> >
> > > After discussing with Shengkai offline, there is no need for this REST
> > API
> > to support multiple tables to be refreshed at the same time, so it would
> be
> > more appropriate to put the materialized table identifier in the path of
> > the URL, thanks for the suggestion.
> >
> > Very good!
> >
> > Thank you!
> > On Apr 29, 2024 at 05:04 +0200, Ron Liu , wrote:
> > > Hi, Lorenzo
> > >
> > > > I have a question there: how can the gateway update the
> refreshHandler
> > in
> > > the Catalog before getting it from the scheduler?
> > >
> > > The refreshHandler in CatalogMateriazedTable is null before getting it
> > from
> > > the scheduler, you can look at the CatalogMaterializedTable.Builder[1]
> > for
> > > more details.
> > >
> > > > You have a typo here: WorkflowScheudler -> WorkflowScheduler :)
> > >
> > > Fix it now, thanks very much.
> > >
> > > > For the operations part, I still think that the FLIP would benefit
> from
> > > providing a specific pattern for operations. You could either propose a
> > > command pattern [1] or a visitor pattern (where the scheduler visits
> the
> > > operation to get relevant info) [2] for those operations at your
> choice.
> > >
> > > Thank you for your input, I find it very useful. I tried to understand
> > your
> > > thinking through code and implemented the following pseudo code using
> the
> > > visitor design pattern:
> > > 1. first defined WorkflowOperationVisitor, providing several overloaded
> > > visit methods.
> > >
> > > public interface WorkflowOperationVisitor {
> > >
> > >  T visit(CreateWorkflowOperation
> > > createWorkflowOperation);
> > >
> > > void visit(ModifyWorkflowOperation operation);
> > > }
> > >
> > > 2. then in the WorkflowOperation add the accept method.
> > >
> > > @PublicEvolving
> > > public interface WorkflowOperation {
> > >
> > > void accept(WorkflowOperationVisitor visitor);
> > > }
> > >
> > >
> > > 3. in the WorkflowScheduler call the implementation class of
> > > WorkflowOperationVisitor, complete the corresponding operations.
> > >
> > > I recognize this design pattern purely from a code design point of
> view,
> > > but from the point of our specific scenario:
> > > 1. For CreateWorkflowOperation, the visit method needs to return
> > > RefreshHandler, for ModifyWorkflowOp

Re: [DISCUSS] FLIP-448: Introduce Pluggable Workflow Scheduler Interface for Materialized Table

2024-05-06 Thread Ron Liu
Hi, Xuyang

Thanks for joining this discussion

> 1. In the sequence diagram, it appears that there is a missing step for
obtaining the refresh handler from the catalog during the suspend operation.

Good catch

> 2. The term "cascade refresh" does not seem to be mentioned in FLIP-435.
The workflow it creates is marked as a "one-time workflow". This is
different

from a "periodic workflow," and it appears to be a one-off execution. Is
this actually referring to the Refresh command in FLIP-435?

Cascade refresh is future work; we don't propose the corresponding
syntax in FLIP-435. However, intuitively, it would be an extension of the
Refresh command in FLIP-435.

> 3. The workflow-scheduler.type has no default value; should it be set to
CRON by default?

Firstly, CRON is not a workflow scheduler. Secondly, I believe that
configuring the Scheduler should be an action that users are aware of, and
default values should not be set.

> 4. It appears that in the section on `public interfaces`, within
`WorkflowOperation`, `CreatePeriodicWorkflowOperation` should be changed to

`CreateWorkflowOperation`, right?

Sorry, I don't get your point. Can you give more description?

Best,
Ron

On Mon, May 6, 2024 at 20:26, Xuyang wrote:

> Hi, Ron.
>
> Thanks for driving this. After reading the entire flip, I have the
> following questions:
>
>
>
>
> 1. In the sequence diagram, it appears that there is a missing step for
> obtaining the refresh handler from the catalog during the suspend operation.
>
>
>
>
> 2. The term "cascade refresh" does not seem to be mentioned in FLIP-435.
> The workflow it creates is marked as a "one-time workflow". This is
> different
>
> from a "periodic workflow," and it appears to be a one-off execution. Is
> this actually referring to the Refresh command in FLIP-435?
>
>
>
>
> 3. The workflow-scheduler.type has no default value; should it be set to
> CRON by default?
>
>
>
>
> 4. It appears that in the section on `public interfaces`, within
> `WorkflowOperation`, `CreatePeriodicWorkflowOperation` should be changed to
>
> `CreateWorkflowOperation`, right?
>
>
>
>
> --
>
> Best!
> Xuyang
>
>
>
>
>
> At 2024-04-22 14:41:39, "Ron Liu"  wrote:
> >Hi, Dev
> >
> >I would like to start a discussion about FLIP-448: Introduce Pluggable
> >Workflow Scheduler Interface for Materialized Table.
> >
> >In FLIP-435[1], we proposed Materialized Table, which has two types of
> data
> >refresh modes: Full Refresh & Continuous Refresh Mode. In Full Refresh
> >mode, the Materialized Table relies on a workflow scheduler to perform
> >periodic refresh operation to achieve the desired data freshness.
> >
> >There are numerous open-source workflow schedulers available, with popular
> >ones including Airflow and DolphinScheduler. To enable Materialized Table
> >to work with different workflow schedulers, we propose a pluggable
> workflow
> >scheduler interface for Materialized Table in this FLIP.
> >
> >For more details, see FLIP-448 [2]. Looking forward to your feedback.
> >
> >[1] https://lists.apache.org/thread/c1gnn3bvbfs8v1trlf975t327s4rsffs
> >[2]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-448%3A+Introduce+Pluggable+Workflow+Scheduler+Interface+for+Materialized+Table
> >
> >Best,
> >Ron
>


Re: [DISCUSS] FLIP-448: Introduce Pluggable Workflow Scheduler Interface for Materialized Table

2024-05-06 Thread Ron Liu
> 4. It appears that in the section on `public interfaces`, within
`WorkflowOperation`, `CreatePeriodicWorkflowOperation` should be changed to

`CreateWorkflowOperation`, right?

After discussing with Xuyang offline, we need to support both periodic
workflows and one-time workflows, and they need different information: for
example, a periodic workflow needs a cron expression, while a one-time
workflow needs the refresh partition, the downstream cascaded materialized
tables, etc. Therefore, CreateWorkflowOperation will correspondingly have
two different implementation classes, which will be cleaner for both the
implementer and the caller.

Best,
Ron

On Mon, May 6, 2024 at 20:48, Ron Liu wrote:

> Hi, Xuyang
>
> Thanks for joining this discussion
>
> > 1. In the sequence diagram, it appears that there is a missing step for
> obtaining the refresh handler from the catalog during the suspend operation.
>
> Good catch
>
> > 2. The term "cascade refresh" does not seem to be mentioned in FLIP-435.
> The workflow it creates is marked as a "one-time workflow". This is
> different
>
> from a "periodic workflow," and it appears to be a one-off execution. Is
> this actually referring to the Refresh command in FLIP-435?
>
> The cascade refresh is a future work, we don't propose the corresponding
> syntax in FLIP-435. However, intuitively, it would be an extension of the
> Refresh command in FLIP-435.
>
> > 3. The workflow-scheduler.type has no default value; should it be set to
> CRON by default?
>
> Firstly, CRON is not a workflow scheduler. Secondly, I believe that
> configuring the Scheduler should be an action that users are aware of, and
> default values should not be set.
>
> > 4. It appears that in the section on `public interfaces`, within
> `WorkflowOperation`, `CreatePeriodicWorkflowOperation` should be changed to
>
> `CreateWorkflowOperation`, right?
>
> Sorry, I don't get your point. Can you give more description?
>
> Best,
> Ron
>
> On Mon, May 6, 2024 at 20:26, Xuyang wrote:
>
>> Hi, Ron.
>>
>> Thanks for driving this. After reading the entire flip, I have the
>> following questions:
>>
>>
>>
>>
>> 1. In the sequence diagram, it appears that there is a missing step for
>> obtaining the refresh handler from the catalog during the suspend operation.
>>
>>
>>
>>
>> 2. The term "cascade refresh" does not seem to be mentioned in FLIP-435.
>> The workflow it creates is marked as a "one-time workflow". This is
>> different
>>
>> from a "periodic workflow," and it appears to be a one-off execution. Is
>> this actually referring to the Refresh command in FLIP-435?
>>
>>
>>
>>
>> 3. The workflow-scheduler.type has no default value; should it be set to
>> CRON by default?
>>
>>
>>
>>
>> 4. It appears that in the section on `public interfaces`, within
>> `WorkflowOperation`, `CreatePeriodicWorkflowOperation` should be changed to
>>
>> `CreateWorkflowOperation`, right?
>>
>>
>>
>>
>> --
>>
>> Best!
>> Xuyang
>>
>>
>>
>>
>>
>> At 2024-04-22 14:41:39, "Ron Liu"  wrote:
>> >Hi, Dev
>> >
>> >I would like to start a discussion about FLIP-448: Introduce Pluggable
>> >Workflow Scheduler Interface for Materialized Table.
>> >
>> >In FLIP-435[1], we proposed Materialized Table, which has two types of
>> data
>> >refresh modes: Full Refresh & Continuous Refresh Mode. In Full Refresh
>> >mode, the Materialized Table relies on a workflow scheduler to perform
>> >periodic refresh operation to achieve the desired data freshness.
>> >
>> >There are numerous open-source workflow schedulers available, with
>> popular
>> >ones including Airflow and DolphinScheduler. To enable Materialized Table
>> >to work with different workflow schedulers, we propose a pluggable
>> workflow
>> >scheduler interface for Materialized Table in this FLIP.
>> >
>> >For more details, see FLIP-448 [2]. Looking forward to your feedback.
>> >
>> >[1] https://lists.apache.org/thread/c1gnn3bvbfs8v1trlf975t327s4rsffs
>> >[2]
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-448%3A+Introduce+Pluggable+Workflow+Scheduler+Interface+for+Materialized+Table
>> >
>> >Best,
>> >Ron
>>
>


Re: [DISCUSS] FLIP-448: Introduce Pluggable Workflow Scheduler Interface for Materialized Table

2024-05-07 Thread Ron Liu
Hi, dev

Following the recent PoC[1], and drawing on the excellent code design
within Flink, I have made the following optimizations to the Public
Interfaces section of FLIP:

1. I have renamed WorkflowOperation to RefreshWorkflow. This change better
conveys its purpose. RefreshWorkflow is used to provide the necessary
information required for creating, modifying, and deleting workflows. Using
WorkflowOperation could mislead people into thinking it is a command
operation, whereas in fact, it does not represent an operation but merely
provides the essential context information for performing operations on
workflows. The specific operations are completed within WorkflowScheduler.
Additionally, I felt that using WorkflowOperation could potentially
conflict with the Operation[2] interface in the table module.
2. I have refined the signatures of the modifyRefreshWorkflow and
deleteRefreshWorkflow interface methods in WorkflowScheduler. The
refreshHandler parameter of type T is now provided by ModifyRefreshWorkflow
and DeleteRefreshWorkflow, which makes the overall interface design more
symmetrical and clean.

[1] https://github.com/lsyldliu/flink/tree/FLIP-448-PoC
[2]
https://github.com/apache/flink/blob/29736b8c01924b7da03d4bcbfd9c812a8e5a08b4/flink-table/flink-table-api-java/src/main/java/org/apache/flink/table/operations/Operation.java
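
As a reading aid, the refined shape could hang together roughly as in the
sketch below; the type names follow this thread, but the generic bounds,
fields and the inline scheduler are assumptions, not the actual FLIP-448
interfaces:

```java
interface RefreshHandler {}

interface RefreshWorkflow<T extends RefreshHandler> {}

class CreateRefreshWorkflow<T extends RefreshHandler> implements RefreshWorkflow<T> {
    final String cronExpression;
    CreateRefreshWorkflow(String cronExpression) { this.cronExpression = cronExpression; }
}

class ModifyRefreshWorkflow<T extends RefreshHandler> implements RefreshWorkflow<T> {
    // The refresh handler now travels with the workflow object itself.
    final T refreshHandler;
    ModifyRefreshWorkflow(T refreshHandler) { this.refreshHandler = refreshHandler; }
}

interface WorkflowScheduler<T extends RefreshHandler> {
    T createRefreshWorkflow(CreateRefreshWorkflow<T> workflow);
    void modifyRefreshWorkflow(ModifyRefreshWorkflow<T> workflow);
}

public class SchedulerSketch {
    static class StringHandler implements RefreshHandler {
        final String id;
        StringHandler(String id) { this.id = id; }
    }

    public static void main(String[] args) {
        WorkflowScheduler<StringHandler> scheduler = new WorkflowScheduler<StringHandler>() {
            public StringHandler createRefreshWorkflow(CreateRefreshWorkflow<StringHandler> w) {
                return new StringHandler("workflow-for-" + w.cronExpression);
            }
            public void modifyRefreshWorkflow(ModifyRefreshWorkflow<StringHandler> w) {
                System.out.println("suspending " + w.refreshHandler.id);
            }
        };
        StringHandler handler =
                scheduler.createRefreshWorkflow(new CreateRefreshWorkflow<>("0 * * * *"));
        scheduler.modifyRefreshWorkflow(new ModifyRefreshWorkflow<>(handler));
    }
}
```

Note how modifyRefreshWorkflow no longer takes a separate refreshHandler
argument; the handler is carried by ModifyRefreshWorkflow itself, which is
the symmetry described above.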

Best,
Ron

On Tue, May 7, 2024 at 14:30, Ron Liu wrote:

> > 4. It appears that in the section on `public interfaces`, within
> `WorkflowOperation`, `CreatePeriodicWorkflowOperation` should be changed to
>
> `CreateWorkflowOperation`, right?
>
> After discussing with Xuyang offline, we need to support periodic workflow
> and one-time workflow, they need different information, for example,
> periodic workflow needs cron expression, one-time workflow needs refresh
> partition, downstream cascade materialized table, etc. Therefore,
> CreateWorkflowOperation correspondingly will have two different
> implementation classes, which will be cleaner for both the implementer and
> the caller.
>
> Best,
> Ron
>
> On Mon, May 6, 2024 at 20:48, Ron Liu wrote:
>
>> Hi, Xuyang
>>
>> Thanks for joining this discussion
>>
>> > 1. In the sequence diagram, it appears that there is a missing step for
>> obtaining the refresh handler from the catalog during the suspend operation.
>>
>> Good catch
>>
>> > 2. The term "cascade refresh" does not seem to be mentioned in
>> FLIP-435. The workflow it creates is marked as a "one-time workflow". This
>> is different
>>
>> from a "periodic workflow," and it appears to be a one-off execution. Is
>> this actually referring to the Refresh command in FLIP-435?
>>
>> The cascade refresh is a future work, we don't propose the corresponding
>> syntax in FLIP-435. However, intuitively, it would be an extension of the
>> Refresh command in FLIP-435.
>>
>> > 3. The workflow-scheduler.type has no default value; should it be set
>> to CRON by default?
>>
>> Firstly, CRON is not a workflow scheduler. Secondly, I believe that
>> configuring the Scheduler should be an action that users are aware of, and
>> default values should not be set.
>>
>> > 4. It appears that in the section on `public interfaces`, within
>> `WorkflowOperation`, `CreatePeriodicWorkflowOperation` should be changed to
>>
>> `CreateWorkflowOperation`, right?
>>
>> Sorry, I don't get your point. Can you give more description?
>>
>> Best,
>> Ron
>>
>> On Mon, May 6, 2024 at 20:26, Xuyang wrote:
>>
>>> Hi, Ron.
>>>
>>> Thanks for driving this. After reading the entire flip, I have the
>>> following questions:
>>>
>>>
>>>
>>>
>>> 1. In the sequence diagram, it appears that there is a missing step for
>>> obtaining the refresh handler from the catalog during the suspend operation.
>>>
>>>
>>>
>>>
>>> 2. The term "cascade refresh" does not seem to be mentioned in FLIP-435.
>>> The workflow it creates is marked as a "one-time workflow". This is
>>> different
>>>
>>> from a "periodic workflow," and it appears to be a one-off execution. Is
>>> this actually referring to the Refresh command in FLIP-435?
>>>
>>>
>>>
>>>
>>> 3. The workflow-scheduler.type has no default value; should it be set to
>>> CRON by default?
>>>
>>>
>>>
>>>
>>> 4. It appears that in the section on `public interfaces`, within
>>> `WorkflowOperation`, `CreatePeriodicWorkflowOperation` should be changed to
>>>
>>> `CreateWorkflowOperation`, right?

Re: [DISCUSS] FLIP-448: Introduce Pluggable Workflow Scheduler Interface for Materialized Table

2024-05-08 Thread Ron Liu
Hi, Dev

Thank you all for joining this thread and giving your comments and
suggestions, they have helped improve this proposal and I look forward to
further feedback.
If there are no further comments, I'd like to close the discussion and
start the voting one day later.

Best,
Ron

Ron Liu wrote on Tue, May 7, 2024 at 20:51:

> Hi, dev
>
> Following the recent PoC[1], and drawing on the excellent code design
> within Flink, I have made the following optimizations to the Public
> Interfaces section of FLIP:
>
> 1. I have renamed WorkflowOperation to RefreshWorkflow. This change better
> conveys its purpose. RefreshWorkflow is used to provide the necessary
> information required for creating, modifying, and deleting workflows. Using
> WorkflowOperation could mislead people into thinking it is a command
> operation, whereas in fact, it does not represent an operation but merely
> provides the essential context information for performing operations on
> workflows. The specific operations are completed within WorkflowScheduler.
> Additionally, I felt that using WorkflowOperation could potentially
> conflict with the Operation[2] interface in the table module.
> 2. I have refined the signatures of the modifyRefreshWorkflow and
> deleteRefreshWorkflow interface methods in WorkflowScheduler. The parameter
> T refreshHandler is now provided by ModifyRefreshWorkflow and
> deleteRefreshWorkflow, which makes the overall interface design more
> symmetrical and clean.
>
> [1] https://github.com/lsyldliu/flink/tree/FLIP-448-PoC
> [2]
> https://github.com/apache/flink/blob/29736b8c01924b7da03d4bcbfd9c812a8e5a08b4/flink-table/flink-table-api-java/src/main/java/org/apache/flink/table/operations/Operation.java
>
> Best,
> Ron
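
For illustration, the refined FLIP-448 interfaces described above might look roughly like the following self-contained sketch. Only the names RefreshWorkflow, WorkflowScheduler, modifyRefreshWorkflow, and deleteRefreshWorkflow come from this thread; the generic handler type, the field layout, and the in-memory scheduler are assumptions made purely for illustration, not the FLIP's actual code.

```java
// Hedged sketch of the FLIP-448 design discussed above; any name not
// mentioned in the thread (fields, the in-memory scheduler) is an
// illustrative assumption.
public class Flip448Sketch {

    /** Context needed to perform an operation on a refresh workflow. */
    interface RefreshWorkflow {}

    /** Information needed to create a periodic refresh workflow. */
    static final class CreateRefreshWorkflow implements RefreshWorkflow {
        final String materializedTable;
        final String cronExpression;

        CreateRefreshWorkflow(String materializedTable, String cronExpression) {
            this.materializedTable = materializedTable;
            this.cronExpression = cronExpression;
        }
    }

    /** Delete carries the scheduler-specific refresh handler itself,
     *  keeping the scheduler methods symmetrical, as suggested above. */
    static final class DeleteRefreshWorkflow<T> implements RefreshWorkflow {
        final T refreshHandler;

        DeleteRefreshWorkflow(T refreshHandler) {
            this.refreshHandler = refreshHandler;
        }
    }

    /** Pluggable scheduler: returns a handler on create, consumes it later. */
    interface WorkflowScheduler<T> {
        T createRefreshWorkflow(CreateRefreshWorkflow create);

        void deleteRefreshWorkflow(DeleteRefreshWorkflow<T> delete);
    }

    /** Toy in-memory stand-in for a real scheduler plugin. */
    static final class InMemoryScheduler implements WorkflowScheduler<String> {
        private final java.util.Map<String, String> workflows = new java.util.HashMap<>();
        private int counter = 0;

        @Override
        public String createRefreshWorkflow(CreateRefreshWorkflow create) {
            String handler = "job-" + (++counter);
            workflows.put(handler, create.materializedTable + " @ " + create.cronExpression);
            return handler;
        }

        @Override
        public void deleteRefreshWorkflow(DeleteRefreshWorkflow<String> delete) {
            workflows.remove(delete.refreshHandler);
        }
    }

    public static void main(String[] args) {
        WorkflowScheduler<String> scheduler = new InMemoryScheduler();
        String handler = scheduler.createRefreshWorkflow(
                new CreateRefreshWorkflow("my_mt", "0 0 * * * ?"));
        System.out.println("created " + handler);
        scheduler.deleteRefreshWorkflow(new DeleteRefreshWorkflow<>(handler));
        System.out.println("deleted " + handler);
    }
}
```

The point of having the refresh handler carried by the modify/delete workflow objects, rather than passed as a separate argument, is that each operation object becomes fully self-describing, which matches the "symmetrical and clean" goal stated above.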
>
> Ron Liu wrote on Tue, May 7, 2024 at 14:30:
>
>> > 4. It appears that in the section on `public interfaces`, within
>> `WorkflowOperation`, `CreatePeriodicWorkflowOperation` should be changed to
>>
>> `CreateWorkflowOperation`, right?
>>
>> After discussing with Xuyang offline, we need to support both periodic and
>> one-time workflows, and they need different information: for example, a
>> periodic workflow needs a cron expression, while a one-time workflow needs
>> the refresh partition, downstream cascade materialized tables, etc.
>> Therefore, CreateWorkflowOperation will correspondingly have two different
>> implementation classes, which will be cleaner for both the implementer and
>> the caller.
>>
>> Best,
>> Ron
>>
>> Ron Liu wrote on Mon, May 6, 2024 at 20:48:
>>
>>> Hi, Xuyang
>>>
>>> Thanks for joining this discussion
>>>
>>> > 1. In the sequence diagram, it appears that there is a missing step
>>> for obtaining the refresh handler from the catalog during the suspend
>>> operation.
>>>
>>> Good catch
>>>
>>> > 2. The term "cascade refresh" does not seem to be mentioned in
>>> FLIP-435. The workflow it creates is marked as a "one-time workflow". This
>>> is different
>>>
>>> from a "periodic workflow," and it appears to be a one-off execution. Is
>>> this actually referring to the Refresh command in FLIP-435?
>>>
>>> The cascade refresh is a future work, we don't propose the corresponding
>>> syntax in FLIP-435. However, intuitively, it would be an extension of the
>>> Refresh command in FLIP-435.
>>>
>>> > 3. The workflow-scheduler.type has no default value; should it be set
>>> to CRON by default?
>>>
>>> Firstly, CRON is not a workflow scheduler. Secondly, I believe that
>>> configuring the Scheduler should be an action that users are aware of, and
>>> default values should not be set.
>>>
>>> > 4. It appears that in the section on `public interfaces`, within
>>> `WorkflowOperation`, `CreatePeriodicWorkflowOperation` should be changed to
>>>
>>> `CreateWorkflowOperation`, right?
>>>
>>> Sorry, I don't get your point. Can you give more description?
>>>
>>> Best,
>>> Ron
>>>
>>> Xuyang wrote on Mon, May 6, 2024 at 20:26:
>>>
>>>> Hi, Ron.
>>>>
>>>> Thanks for driving this. After reading the entire flip, I have the
>>>> following questions:
>>>>
>>>>
>>>>
>>>>
>>>> 1. In the sequence diagram, it appears that there is a missing step for
>>>> obtaining the refresh handler from the catalog during the suspend 
>>>> operation.
>>>>
>>>>
>>>>
>>>>
>>>> 2. The term "cascade refresh" does not seem to be mentioned in FLIP-435.

[VOTE] FLIP-448: Introduce Pluggable Workflow Scheduler Interface for Materialized Table

2024-05-08 Thread Ron Liu
Hi Dev, Thank you to everyone for the feedback on FLIP-448: Introduce
Pluggable Workflow Scheduler Interface for Materialized Table[1][2]. I'd
like to start a vote for it. The vote will be open for at least 72 hours
unless there is an objection or not enough votes. [1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-448%3A+Introduce+Pluggable+Workflow+Scheduler+Interface+for+Materialized+Table

[2] https://lists.apache.org/thread/57xfo6p25rbrhcg01dhyok46zt6jc5q1
Best, Ron


Re: [VOTE] FLIP-448: Introduce Pluggable Workflow Scheduler Interface for Materialized Table

2024-05-08 Thread Ron Liu
Sorry for the re-post, just to format this email content.

Hi Dev

Thank you to everyone for the feedback on FLIP-448: Introduce Pluggable
Workflow Scheduler Interface for Materialized Table[1][2].
I'd like to start a vote for it. The vote will be open for at least 72
hours unless there is an objection or not enough votes.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-448%3A+Introduce+Pluggable+Workflow+Scheduler+Interface+for+Materialized+Table

[2] https://lists.apache.org/thread/57xfo6p25rbrhcg01dhyok46zt6jc5q1

Best,
Ron

Ron Liu wrote on Thu, May 9, 2024 at 13:52:

> Hi Dev, Thank you to everyone for the feedback on FLIP-448: Introduce
> Pluggable Workflow Scheduler Interface for Materialized Table[1][2]. I'd
> like to start a vote for it. The vote will be open for at least 72 hours
> unless there is an objection or not enough votes. [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-448%3A+Introduce+Pluggable+Workflow+Scheduler+Interface+for+Materialized+Table
>
> [2] https://lists.apache.org/thread/57xfo6p25rbrhcg01dhyok46zt6jc5q1
> Best, Ron
>


Re: Re: [VOTE] FLIP-448: Introduce Pluggable Workflow Scheduler Interface for Materialized Table

2024-05-09 Thread Ron Liu
+1(binding)

Best,
Ron

Jark Wu wrote on Fri, May 10, 2024 at 09:51:

> +1 (binding)
>
> Best,
> Jark
>
> On Thu, 9 May 2024 at 21:27, Lincoln Lee  wrote:
>
> > +1 (binding)
> >
> > Best,
> > Lincoln Lee
> >
> >
> > Feng Jin wrote on Thu, May 9, 2024 at 19:45:
> >
> > > +1 (non-binding)
> > >
> > >
> > > Best,
> > > Feng
> > >
> > >
> > > On Thu, May 9, 2024 at 7:37 PM Xuyang  wrote:
> > >
> > > > +1 (non-binding)
> > > >
> > > >
> > > > --
> > > >
> > > > Best!
> > > > Xuyang
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > At 2024-05-09 13:57:07, "Ron Liu"  wrote:
> > > > >Sorry for the re-post, just to format this email content.
> > > > >
> > > > >Hi Dev
> > > > >
> > > > >Thank you to everyone for the feedback on FLIP-448: Introduce
> > Pluggable
> > > > >Workflow Scheduler Interface for Materialized Table[1][2].
> > > > >I'd like to start a vote for it. The vote will be open for at least
> 72
> > > > >hours unless there is an objection or not enough votes.
> > > > >
> > > > >[1]
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-448%3A+Introduce+Pluggable+Workflow+Scheduler+Interface+for+Materialized+Table
> > > > >
> > > > >[2]
> https://lists.apache.org/thread/57xfo6p25rbrhcg01dhyok46zt6jc5q1
> > > > >
> > > > >Best,
> > > > >Ron
> > > > >
> > > > >Ron Liu wrote on Thu, May 9, 2024 at 13:52:
> > > > >
> > > > >> Hi Dev, Thank you to everyone for the feedback on FLIP-448:
> > Introduce
> > > > >> Pluggable Workflow Scheduler Interface for Materialized
> Table[1][2].
> > > I'd
> > > > >> like to start a vote for it. The vote will be open for at least 72
> > > hours
> > > > >> unless there is an objection or not enough votes. [1]
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-448%3A+Introduce+Pluggable+Workflow+Scheduler+Interface+for+Materialized+Table
> > > > >>
> > > > >> [2]
> > https://lists.apache.org/thread/57xfo6p25rbrhcg01dhyok46zt6jc5q1
> > > > >> Best, Ron
> > > > >>
> > > >
> > >
> >
>


[RESULT][VOTE] FLIP-448: Introduce Pluggable Workflow Scheduler Interface for Materialized Table

2024-05-13 Thread Ron Liu
Hi, Dev

I'm happy to announce that FLIP-448: Introduce Pluggable Workflow Scheduler
Interface for Materialized Table[1] has been accepted with 8 approving
votes (4 binding) [2].

- Xuyang
- Feng Jin
- Lincoln Lee(binding)
- Jark Wu(binding)
- Ron Liu(binding)
- Shengkai Fang(binding)
- Keith Lee
- Ahmed Hamdy

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-448%3A+Introduce+Pluggable+Workflow+Scheduler+Interface+for+Materialized+Table
[2] https://lists.apache.org/thread/8qvh3brgvo46xprv4mxq4kyhyy0tsvny

Best,
Ron


Re: [DISCUSSION] FLIP-457: Improve Table/SQL Configuration for Flink 2.0

2024-05-19 Thread Ron Liu
Hi, Lincoln

>  2. Regarding the options in HashAggCodeGenerator, since this new feature
has gone
through a couple of release cycles and could be considered for
PublicEvolving now,
cc @Ron Liu   WDYT?

Thanks for cc'ing me, +1 for making these options public now.

Best,
Ron

Benchao Li wrote on Mon, May 20, 2024 at 13:08:

> I agree with Lincoln about the experimental features.
>
> Some of these configurations do not even have proper implementation,
> take 'table.exec.range-sort.enabled' as an example, there was a
> discussion[1] about it before.
>
> [1] https://lists.apache.org/thread/q5h3obx36pf9po28r0jzmwnmvtyjmwdr
>
> Lincoln Lee wrote on Mon, May 20, 2024 at 12:01:
> >
> > Hi Jane,
> >
> > Thanks for the proposal!
> >
> > +1 for the changes except for these annotated as experimental ones.
> >
> > For the options annotated as experimental,
> >
> > +1 for the moving of IncrementalAggregateRule & RelNodeBlock.
> >
> > For the rest of the options, there are some suggestions:
> >
> > 1. for the batch related parameters, it's recommended to either delete
> > them (leaving the necessary default values in place) or leave them as
> they
> > are. Including:
> > FlinkRelMdRowCount
> > FlinkRexUtil
> > BatchPhysicalSortRule
> > JoinDeriveNullFilterRule
> > BatchPhysicalJoinRuleBase
> > BatchPhysicalSortMergeJoinRule
> >
> > What I understand about the history of these options is that they were
> once
> > used for fine
> > tuning for tpc testing, and the current flink planner no longer relies on
> > these internal
> > options when testing tpc[1]. In addition, these options are too obscure
> for
> > SQL users,
> > and some of them are actually magic numbers.
> >
> > 2. Regarding the options in HashAggCodeGenerator, since this new feature
> > has gone
> > through a couple of release cycles and could be considered for
> > PublicEvolving now,
> > cc @Ron Liu   WDYT?
> >
> > 3. Regarding WindowEmitStrategy, IIUC it is currently unsupported on TVF
> > window, so
> > it's recommended to keep it untouched for now and follow up in
> > FLINK-29692[2]. cc @Xuyang 
> >
> > [1]
> >
> https://github.com/ververica/flink-sql-benchmark/blob/master/tools/common/flink-conf.yaml
> > [2] https://issues.apache.org/jira/browse/FLINK-29692
> >
> >
> > Best,
> > Lincoln Lee
> >
> >
> > Yubin Li wrote on Fri, May 17, 2024 at 10:49:
> >
> > > Hi Jane,
> > >
> > > Thanks Jane for driving this proposal!
> > >
> > > This makes sense for users, +1 for that.
> > >
> > > Best,
> > > Yubin
> > >
> > > On Thu, May 16, 2024 at 3:17 PM Jark Wu  wrote:
> > > >
> > > > Hi Jane,
> > > >
> > > > Thanks for the proposal. +1 from my side.
> > > >
> > > >
> > > > Best,
> > > > Jark
> > > >
> > > > On Thu, 16 May 2024 at 10:28, Xuannan Su 
> wrote:
> > > >
> > > > > Hi Jane,
> > > > >
> > > > > Thanks for driving this effort! And +1 for the proposed changes.
> > > > >
> > > > > I have one comment on the migration plan.
> > > > >
> > > > > For options to be moved to another module/package, I think we have
> to
> > > > > mark the old option deprecated in 1.20 for it to be removed in 2.0,
> > > > > according to the API compatibility guarantees[1]. We can introduce
> the
> > > > > new option in 1.20 with the same option key in the intended class.
> > > > > WDYT?
> > > > >
> > > > > Best,
> > > > > Xuannan
> > > > >
> > > > > [1]
> > > > >
> > >
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/upgrading/#api-compatibility-guarantees
> > > > >
> > > > >
> > > > >
> > > > > On Wed, May 15, 2024 at 6:20 PM Jane Chan 
> > > wrote:
> > > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I'd like to start a discussion on FLIP-457: Improve Table/SQL
> > > > > Configuration
> > > > > > for Flink 2.0 [1]. This FLIP revisited all Table/SQL
> configurations
> > > to
> > > > > > improve user-friendliness and maintainability as Flink moves
> toward
> > > 2.0.
> > > > > >
> > > > > > I am looking forward to your feedback.
> > > > > >
> > > > > > Best regards,
> > > > > > Jane
> > > > > >
> > > > > > [1]
> > > > > >
> > > > >
> > >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=307136992
> > > > >
> > >
>
>
>
> --
>
> Best,
> Benchao Li
>
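
Xuannan's migration note above (mark the old option deprecated in 1.20 and introduce the new option with the same key in the intended class) essentially relies on a resolution order like the one sketched below. This is a loose, framework-free illustration; the option keys are hypothetical, and Flink's real ConfigOption/fallback-key machinery is more involved.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class FallbackKeySketch {

    // Prefer the current key; fall back to deprecated keys so that old
    // configurations keep working during the migration window.
    static Optional<String> resolve(
            Map<String, String> conf, String key, String... deprecatedKeys) {
        if (conf.containsKey(key)) {
            return Optional.of(conf.get(key));
        }
        for (String deprecated : deprecatedKeys) {
            if (conf.containsKey(deprecated)) {
                return Optional.of(conf.get(deprecated));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // The user still sets only the old (hypothetical) key.
        conf.put("table.optimizer.agg.enabled", "true");
        String value = resolve(conf,
                "table.exec.agg.enabled",          // hypothetical new key
                "table.optimizer.agg.enabled")     // hypothetical deprecated key
                .orElse("false");
        System.out.println(value);
    }
}
```

With this lookup order, shipping the new option in 1.20 while the old key is still honored gives users a full release cycle to migrate before the deprecated key is removed in 2.0.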


Re: [DISCUSS] FLIP-463: Schema Definition in CREATE TABLE AS Statement

2024-06-16 Thread Ron Liu
Hi, Sergio

Sorry for later joining this thread.

Thanks for driving this proposal, it looks great.

I have a few questions:

1. Many SQL-based data processing systems, however, support schema
definitions within their CTAS statements

Is it possible for you to list these systems and make them part of the
Reference chapter? That would make it easier for everyone to understand.


2. This proposal supports defining primary keys in CTAS, which raises two
possible issues. First, are nullable columns allowed to be referenced in the
primary key definition, given that all the columns deduced from the SELECT
query may be nullable? Second, if a UNIQUE KEY cannot be deduced from the
SELECT query, or the deduced UNIQUE KEY does not match the defined primary
key, data duplication can occur; in that case, what does the engine rely on
to ensure the uniqueness of the data?


Best
Ron
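
For readers following along, the kind of statement under discussion looks roughly like the following. This is a hedged illustration of the proposed FLIP-463 syntax with made-up table and column names; note that Flink requires primary keys to be declared NOT ENFORCED, and per the discussion the primary-key column must be deducible as NOT NULL from the query schema.

```sql
-- Illustrative only: proposed CTAS-with-schema syntax from FLIP-463.
-- `target_t` and `source_t` are hypothetical names.
CREATE TABLE target_t (
    id INT NOT NULL,
    PRIMARY KEY (id) NOT ENFORCED
) AS SELECT id, name FROM source_t;
```

The explicit schema part is merged with the schema derived from the SELECT query, so `name` is inherited from `source_t` while `id` picks up the NOT NULL constraint and primary key declared in the CREATE part.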


Jeyhun Karimov wrote on Fri, Jun 14, 2024 at 01:51:

> Thanks Sergio and Timo for your answers.
> Sounds good to me.
> Looking forward to this feature.
>
> Regards,
> Jeyhun
>
> On Thu, Jun 13, 2024 at 4:48 PM Sergio Pena 
> wrote:
>
> > Sure Yuxia, I just added the support for RTAS statements too.
> >
> > - Sergio
> >
> > On Wed, Jun 12, 2024 at 8:22 PM yuxia 
> wrote:
> >
> > > Hi, Sergio.
> > > Thanks for driving the FLIP. Given we also have the REPLACE TABLE AS
> > > Statement[1], and it's almost the same as the CREATE TABLE AS Statement,
> > > would you mind also supporting schema definition for the REPLACE TABLE AS
> > > Statement in this FLIP? It would be great to align the REPLACE TABLE AS
> > > Statement
> > > to the CREATE TABLE AS Statement
> > >
> > >
> > > [1]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-303%3A+Support+REPLACE+TABLE+AS+SELECT+statement
> > >
> > > Best regards,
> > > Yuxia
> > >
> > > - Original Message -
> > > From: "Timo Walther"
> > > To: "dev"
> > > Sent: Wednesday, June 12, 2024, 10:19:14 PM
> > > Subject: Re: [DISCUSS] FLIP-463: Schema Definition in CREATE TABLE AS
> > Statement
> > >
> > > > I just noticed the CREATE TABLE LIKE statement allows the definition
> > >  > of new columns in the CREATE part. The difference
> > >  > with this CTAS proposal is that TABLE LIKE appends the new columns
> at
> > >  > the end of the schema instead of adding them
> > >  > at the beginning like this proposal and Mysql do.
> > >
> > > This should be fine. The LIKE rather "extends from" the right table.
> > > Whereas the SELECT in CTAS rather extends the left schema definition.
> > > Given that "the extended part" is always appended, we could argue that
> > > the current CTAS behavior in the FLIP is acceptable.
> > >
> > >  > If you want to rename a column in the create part, then that column
> > >  > position must be in the same position as the query column.
> > >  > I didn't like the Postgres approach because it does not let us add
> > >  > columns that do not exist in the query schema.
> > >
> > > Flink offers similar functionality in INSERT INTO. INSERT INTO also
> > > allows syntax like: `INSERT INTO (b, c) SELECT * FROM t`. So given that
> > > our CTAS can be seen as a CREATE TABLE + INSERT INTO, I would adopt
> > > Jeyhun's comment in the FLIP. If users don't want to add additional
> > > schema parts, the same column reordering should be available as if one
> > > wrote an INSERT INTO.
> > >
> > > Regards,
> > > Timo
> > >
> > >
> > >
> > >
> > > On 12.06.24 04:30, Yanquan Lv wrote:
> > > > Hi Sergio, thanks for driving it, +1 for this.
> > > >
> > > > I have some comments:
> > > > 1. If we have a source table with primary keys and partition keys
> > > defined,
> > > > what is the default behavior if PARTITIONED and DISTRIBUTED are not
> > > > specified in the CTAS statement? Should they not be inherited by
> > > > default?
> > > > 2. I suggest providing a complete syntax that includes
> table_properties
> > > > like FLIP-218.
> > > >
> > > >
> > > > Sergio Pena wrote on Wed, Jun 12, 2024 at 03:54:
> > > >
> > > >> I just noticed the CREATE TABLE LIKE statement allows the definition
> > of
> > > new
> > > >> columns in the CREATE part. The difference
> > > >> with this CTAS proposal is that TABLE LIKE appends the new columns
> at
> > > the
> > > >> end of the schema instead of adding them
> > > >> at the beginning like this proposal and Mysql do.
> > > >>
> > > >>> create table t1(id int, name string);
> > >  create table s1(a int, b string) like t1;
> > >  describe s1;
> > > >>
> > > >> +-+---+--++
> > > >>> | Column Name | Data Type | Nullable | Extras |
> > > >>> +-+---+--++
> > > >>> | id  | INT   | NULL ||
> > > >>> | name| STRING| NULL ||
> > > >>> | a   | INT   | NULL ||
> > > >>> | b   | STRING| NULL ||
> > > >>> +-+---+--++
> > > >>
> > > >>
> > > >>
> > > >> The CREATE TABLE LIKE also does not let the definition of 

Re: [ANNOUNCE] New Apache Flink Committer - Hang Ruan

2024-06-17 Thread Ron Liu
Congratulations, Hang!

Best,
Ron

Geng Biao wrote on Mon, Jun 17, 2024 at 12:35:

> Congrats, Hang!
> Best,
> Biao Geng
>
> Sent from Outlook for iOS
> 
> From: Zakelly Lan
> Sent: Monday, June 17, 2024 12:12:10 PM
> To: dev@flink.apache.org
> Subject: Re: [ANNOUNCE] New Apache Flink Committer - Hang Ruan
>
> Congratulations, Hang!
>
>
> Best,
> Zakelly
>
> On Mon, Jun 17, 2024 at 12:07 PM Yanquan Lv  wrote:
>
> > Congratulations, Hang!
> >
> > > Samrat Deb wrote on Mon, Jun 17, 2024 at 11:32:
> >
> > > Congratulations Hang Ruan !
> > >
> > > Bests,
> > > Samrat
> > >
> > > On Mon, Jun 17, 2024 at 8:47 AM Leonard Xu  wrote:
> > >
> > > > Hi everyone,
> > > > On behalf of the PMC, I'm happy to let you know that Hang Ruan has
> > become
> > > > a new Flink Committer !
> > > >
> > > > Hang Ruan has been continuously contributing to the Flink project
> since
> > > > August 2021. Since then, he has continuously contributed to Flink,
> > Flink
> > > > CDC, and various Flink connector repositories, including
> > > > flink-connector-kafka, flink-connector-elasticsearch,
> > > flink-connector-aws,
> > > > flink-connector-rabbitmq, flink-connector-pulsar, and
> > > > flink-connector-mongodb. Hang Ruan focuses on the improvements
> related
> > to
> > > > connectors and catalogs and initiated FLIP-274. He is most recognized
> > as
> > > a
> > > > core contributor and maintainer for the Flink CDC project,
> contributing
> > > > many features such as MySQL CDC newly table addition and the Schema
> > > > Evolution feature.
> > > >
> > > > Beyond his technical contributions, Hang Ruan is an active member of
> > the
> > > > Flink community. He regularly engages in discussions on the Flink dev
> > > > mailing list and the user-zh and user mailing lists, participates in
> > FLIP
> > > > discussions, assists with user Q&A, and consistently volunteers for
> > > release
> > > > verifications.
> > > >
> > > > Please join me in congratulating Hang Ruan for becoming an Apache
> Flink
> > > > committer!
> > > >
> > > > Best,
> > > > Leonard (on behalf of the Flink PMC)
> > >
> >
>


Re: [ANNOUNCE] New Apache Flink Committer - Zhongqiang Gong

2024-06-17 Thread Ron Liu
Congratulations, Zhongqiang!

Best,
Ron

Geng Biao wrote on Mon, Jun 17, 2024 at 12:35:

> Congratulations, Zhongqiang!
> Best,
> Biao Geng
>
> Sent from Outlook for iOS
> 
> From: Zakelly Lan
> Sent: Monday, June 17, 2024 12:11:47 PM
> To: dev@flink.apache.org
> Subject: Re: [ANNOUNCE] New Apache Flink Committer - Zhongqiang Gong
>
> Congratulations, Zhongqiang!
>
>
> Best,
> Zakelly
>
> On Mon, Jun 17, 2024 at 12:05 PM Shawn Huang  wrote:
>
> > Congratulations !
> >
> > Best,
> > Shawn Huang
> >
> >
> > Yuepeng Pan wrote on Mon, Jun 17, 2024 at 12:03:
> >
> > > Congratulations ! Best regards Yuepeng Pan
> > >
> > >
> > >
> > >
> > >
> > > At 2024-06-17 11:20:30, "Leonard Xu"  wrote:
> > > >Hi everyone,
> > > >On behalf of the PMC, I'm happy to announce that Zhongqiang Gong has
> > > become a new Flink Committer!
> > > >
> > > >Zhongqiang has been an active Flink community member since November
> > 2021,
> > > contributing numerous PRs to both the Flink and Flink CDC repositories.
> > As
> > > a core contributor to Flink CDC, he developed the Oracle and SQL Server
> > CDC
> > > Connectors and managed essential website and CI migrations during the
> > > donation of Flink CDC to Apache Flink.
> > > >
> > > >Beyond his technical contributions, Zhongqiang actively participates
> in
> > > discussions on the Flink dev mailing list and responds to threads on
> the
> > > user and user-zh mailing lists. As an Apache StreamPark (incubating)
> > > Committer, he promotes Flink SQL and Flink CDC technologies at meetups
> > and
> > > within the StreamPark community.
> > > >
> > > >Please join me in congratulating Zhongqiang Gong for becoming an
> Apache
> > > Flink committer!
> > > >
> > > >Best,
> > > >Leonard (on behalf of the Flink PMC)
> > >
> >
>


Re: [DISCUSS] FLIP-463: Schema Definition in CREATE TABLE AS Statement

2024-06-17 Thread Ron Liu
Hi, Sergio

Thanks for your reply.

> PKEYS are not inherited from the select query. Also UNIQUE KEY is not
supported on the create statements yet.
All columns queried in the select part form a query schema with only the
column name, type and null/not null properties.
Based on this, I don't see there will be an issue with uniqueness. Users
may define a pkey on the not null columns
only.

Sorry if I wasn't clear about this issue. My question is: if a primary key
is defined on column a in the target table of a CTAS, and the data fetched
by the SELECT query can't guarantee uniqueness on column a, then what is
relied upon to ensure uniqueness?
Is it something that the engine does, or does it depend on the storage
mechanism of the target table?

Best,
Ron


Sergio Pena wrote on Tue, Jun 18, 2024 at 01:24:

> Hi Ron,
> Thanks for your feedback.
>
> Is it possible for you to list these systems and make it part of the
> > Reference chapter? Which would make it easier for everyone to understand.
> >
>
> I updated the flip to include the ones I found (mysql, postgres, oracle).
> Though postgres/oracle semantics are different, they at least allow you
> some schema changes in the create part.
> I think spark sql supports it too, but I couldn't find a way to test it, so
> I didn't include it in the list.
>
> One is whether Nullable columns are
>
> allowed to be referenced in the primary key definition, because all the
>
> columns deduced based on Select Query may be Nullable;
>
>
> I updated the flip. But in short, CTAS will use the same rules and
> validations as other CREATE statements.
> Pkeys are not allowed on NULL columns, so CTAS would fail too. The CTAS
> inherits the NULL and NOT NULL
> constraints from the query schema, so PKEY can be used on the NOT NULL cols
> only.
>
> if the UNIQUE KEY cannot be deduced based on Select Query, or the
> > deduced UNIQUE KEY does not match the defined primary key, which will
> lead
> > to data duplication, the engine needs to what to ensure the uniqueness of
> > the data?
> >
>
> PKEYS are not inherited from the select query. Also UNIQUE KEY is not
> supported on the create statements yet.
> All columns queried in the select part form a query schema with only the
> column name, type and null/not null properties.
> Based on this, I don't see there will be an issue with uniqueness. Users
> may define a pkey on the not null columns
> only.
>
> - Sergio
>
>
> On Sun, Jun 16, 2024 at 9:55 PM Ron Liu  wrote:
>
> > Hi, Sergio
> >
> > Sorry for later joining this thread.
> >
> > Thanks for driving this proposal, it looks great.
> >
> > I have a few questions:
> >
> > 1. Many SQL-based data processing systems, however, support schema
> > definitions within their CTAS statements
> >
> > Is it possible for you to list these systems and make it part of the
> > Reference chapter? Which would make it easier for everyone to understand.
> >
> >
> > 2. This proposal proposes to support defining primary keys in CTAS, then
> > there are two possible issues here. One is whether Nullable columns are
> > allowed to be referenced in the primary key definition, because all the
> > columns deduced based on Select Query may be Nullable; and the second is
> > that if the UNIQUE KEY cannot be deduced based on Select Query, or the
> > deduced UNIQUE KEY does not match the defined primary key, which will
> lead
> > to data duplication, the engine needs to what to ensure the uniqueness of
> > the data?
> >
> >
> > Best
> > Ron
> >
> >
> Jeyhun Karimov wrote on Fri, Jun 14, 2024 at 01:51:
> >
> > > Thanks Sergio and Timo for your answers.
> > > Sounds good to me.
> > > Looking forward to this feature.
> > >
> > > Regards,
> > > Jeyhun
> > >
> > > On Thu, Jun 13, 2024 at 4:48 PM Sergio Pena wrote:
> > >
> > > > Sure Yuxia, I just added the support for RTAS statements too.
> > > >
> > > > - Sergio
> > > >
> > > > On Wed, Jun 12, 2024 at 8:22 PM yuxia 
> > > wrote:
> > > >
> > > > > Hi, Sergio.
> > > > > Thanks for driving the FLIP. Given we also have the REPLACE TABLE AS
> > > > > Statement[1], and it's almost the same as the CREATE TABLE AS Statement,
> > > > > would you mind also supporting schema definition for the REPLACE TABLE
> > > > > AS Statement in this FLIP? It would be great to align the REPLACE TABLE
> > > > > AS Statement to the CREATE TABLE AS Statement

Re: [DISCUSS] FLIP-468: Introducing StreamGraph-Based Job Submission.

2024-07-11 Thread Ron Liu
Hi, Junrui

The FLIP proposal looks good to me.

I have the same question as Fabian:

> For join strategies, they are only
applicable when using an optimizer (that's currently not part of Flink's
runtime) with the Table API or Flink SQL. How do we plan to connect the
optimizer with Flink's runtime?

For batch scenarios, if we want to better support dynamic plan tuning
strategies, the fundamental solution is still to move the SQL optimizer into
flink-runtime.

Best,
Ron

David Morávek wrote on Thu, Jul 11, 2024 at 19:17:

> Hi Junrui,
>
> Thank you for drafting the FLIP. I really appreciate the direction it’s
> taking. We’ve discussed similar approaches multiple times, and it’s great
> to see this progress.
>
> I have a few questions and thoughts:
>
>
> * 1. Transformations in StreamGraphGenerator:*
> Should we consider taking this a step further by working on a list of
> transformations (inputs of StreamGraphGenerator)?
>
> public StreamGraphGenerator(
> List<Transformation<?>> transformations,
> ExecutionConfig executionConfig,
> CheckpointConfig checkpointConfig,
> ReadableConfig configuration) {
>
> We could potentially merge ExecutionConfig and CheckpointConfig into
> ReadableConfig. This approach might offer us even more flexibility.
>
>
> *2. StreamGraph for Recovery Purposes:*
> Should we avoid using StreamGraph for recovery purposes? The existing
> JG-based recovery code paths took years to perfect, and it doesn’t seem
> necessary to replace them. We only need SG for cases where we want to
> regenerate the JG.
> Additionally, translating SG into JG before persisting it in HA could be
> beneficial, as it allows us to catch potential issues early on.
>
>
> * 3. Moving Away from Java Serializables:*
> It would be great to start moving away from Java Serializables as much as
> possible. Could we instead define proper versioned serializers, possibly
> based on a well-defined protobuf blueprint? This change could help us avoid
> ongoing issues associated with Serializables.
>
> Looking forward to your thoughts.
>
> Best,
> D.
>
> On Thu, Jul 11, 2024 at 12:58 PM Fabian Paul  wrote:
>
> > Thanks for drafting this FLIP. I really like the idea of introducing a
> > concept in Flink that is close to a logical plan submission.
> >
> > I have a few questions about the proposal and its future evolvability.
> >
> > - What is the future plan for job submissions in Flink? With the current
> > proposal, Flink will support JobGraph/StreamGraph/compiled plan
> > submissions? It might be confusing for users and complicate the existing
> > job submission logic significantly.
> > - The FLIP mentions multiple areas of optimization, first operator
> chaining
> > and second dynamic switches between join strategies. I think from a Flink
> > perspective, these are, at the moment, separate concerns.  For operator
> > chaining, I agree with the current proposal, which is a concept that
> > applies generally to Flink's runtime. For join strategies, they are only
> > applicable when using an optimizer (that's currently not part of Flink's
> > runtime) with the Table API or Flink SQL. How do we plan to connect the
> > optimizer with Flink's runtime?
> > - With table/SQL API we already expose a compiled plan to support stable
> > version upgrades. It would be great to explore a joint plan to also
> offer
> > stable version upgrades with a potentially persistent streamgraph.
> >
> > Best,
> > Fabian
> >
>


Re: [VOTE] Release 1.20.0, release candidate #1

2024-07-22 Thread Ron Liu
Hi, Weijie

Sorry about the newly discovered bug affecting the release process.

The fix pr of https://issues.apache.org/jira/browse/FLINK-35872 has been
merged.

Best,
Ron

Feng Jin wrote on Mon, Jul 22, 2024 at 10:57:

> Hi, weijie
>
> -1 (non-binding)
>
> During our testing process, we discovered a critical bug that impacts the
> correctness of the materialized table.
> A fix pr [1] is now prepared and will be merged within the next two days.
>
> I apologize for any inconvenience during the release process.
>
>
> [1]. https://issues.apache.org/jira/browse/FLINK-35872
>
>
> Best,
>
> Feng
>
>
> On Fri, Jul 19, 2024 at 5:45 PM Xintong Song 
> wrote:
>
> > +1 (binding)
> >
> > - reviewed flink-web PR
> > - verified checksum and signature
> > - verified source archives don't contain binaries
> > - built from source
> > - tried example jobs on a standalone cluster, and everything looks fine
> >
> > Best,
> >
> > Xintong
> >
> >
> >
> > On Thu, Jul 18, 2024 at 4:25 PM Rui Fan <1996fan...@gmail.com> wrote:
> >
> > > +1(binding)
> > >
> > > - Reviewed the flink-web PR (Left some comments)
> > > - Checked Github release tag
> > > - Verified signatures
> > > - Verified sha512 (hashsums)
> > > - The source archives don't contain any binaries
> > > - Build the source with Maven 3 and java8 (Checked the license as well)
> > > - Start the cluster locally with jdk8, and run the StateMachineExample
> > job,
> > > it works fine.
> > >
> > > Best,
> > > Rui
> > >
> > > On Mon, Jul 15, 2024 at 10:59 PM weijie guo wrote:
> > >
> > > > Hi everyone,
> > > >
> > > >
> > > > Please review and vote on the release candidate #1 for the version
> > > 1.20.0,
> > > >
> > > > as follows:
> > > >
> > > >
> > > > [ ] +1, Approve the release
> > > >
> > > > [ ] -1, Do not approve the release (please provide specific comments)
> > > >
> > > >
> > > > The complete staging area is available for your review, which
> includes:
> > > >
> > > > * JIRA release notes [1], and the pull request adding release note
> for
> > > >
> > > > users [2]
> > > >
> > > > * the official Apache source release and binary convenience releases
> to
> > > be
> > > >
> > > > deployed to dist.apache.org [3], which are signed with the key with
> > > >
> > > > fingerprint 8D56AE6E7082699A4870750EA4E8C4C05EE6861F  [4],
> > > >
> > > > * all artifacts to be deployed to the Maven Central Repository [5],
> > > >
> > > > * source code tag "release-1.20.0-rc1" [6],
> > > >
> > > > * website pull request listing the new release and adding
> announcement
> > > blog
> > > >
> > > > post [7].
> > > >
> > > >
> > > > The vote will be open for at least 72 hours. It is adopted by
> majority
> > > >
> > > > approval, with at least 3 PMC affirmative votes.
> > > >
> > > >
> > > > [1]
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12354210
> > > >
> > > > [2] https://github.com/apache/flink/pull/25091
> > > >
> > > > [3] https://dist.apache.org/repos/dist/dev/flink/flink-1.20.0-rc1/
> > > >
> > > > [4] https://dist.apache.org/repos/dist/release/flink/KEYS
> > > >
> > > > [5]
> > > >
> > https://repository.apache.org/content/repositories/orgapacheflink-1744/
> > > >
> > > > [6] https://github.com/apache/flink/releases/tag/release-1.20.0-rc1
> > > >
> > > > [7] https://github.com/apache/flink-web/pull/751
> > > >
> > > >
> > > > Best,
> > > >
> > > > Robert, Rui, Ufuk, Weijie
> > > >
> > >
> >
>


Re: [DISCUSS] FLIP-468: Introducing StreamGraph-Based Job Submission.

2024-07-23 Thread Ron Liu
iate any insights that you might
> > offer on this matter.
> >
> > *2. StreamGraph for Recovery Purposes:*
> >
> >
> > In fact, this conflicts with our desire to make adaptive optimization to
> > the StreamGraph at runtime, as under such scenarios, the StreamGraph is
> the
> > complete expression of a job's logic, not the JobGraph. More details can
> > refer to the specific details in FLIP-469.
> >
> > I have reviewed the code related to the JobGraphStore and discovered that
> > it can be extended to store both StreamGraph and JobGraph simultaneously.
> > As for your concerns, can we consider the following: in batch mode, we
> use
> > Job Recovery based on StreamGraph, whereas for stream mode, we continue
> to
> > use the original JobGraph recovery and the StreamGraph would be converted
> > to a JobGraph right at the beginning.
> >
> > 3. Moving Away from Java Serializables:
> >
> >
> > Are you suggesting that Java serialization has limitations, and that we
> > should explore alternative serialization approaches? I agree that this
> is a
> > valuable consideration for the future. Do you think this should be
> included
> > in this FLIP? I would prefer to address it as a separate FLIP.
> >
> > Best,
> > Junrui
> >
> > David Morávek  于2024年7月12日周五 14:47写道:
> >
> >> >
> >> > For batch scenario, if we want to better support dynamic plan tuning
> >> > strategies, the fundamental solution is still to put SQL Optimizer to
> >> > flink-runtime.
> >>
> >>
> >> One accompanying question is: how do you envision this to work for
> >> streaming where you need to ensure state compatibility after the plan
> >> change? FLIP-496 seems to only focus on batch.
> >>
> >> Best,
> >> D.
> >>
> >> On Fri, Jul 12, 2024 at 4:29 AM Ron Liu  wrote:
> >>
> >> > Hi, Junrui
> >> >
> >> > The FLIP proposal looks good to me.
> >> >
> >> > I have the same question as Fabian:
> >> >
> >> > > For join strategies, they are only
> >> > applicable when using an optimizer (that's currently not part of
> Flink's
> >> > runtime) with the Table API or Flink SQL. How do we plan to connect
> the
> >> > optimizer with Flink's runtime?
> >> >
> >> > For batch scenario, if we want to better support dynamic plan tuning
> >> > strategies, the fundamental solution is still to put SQL Optimizer to
> >> > flink-runtime.
> >> >
> >> > Best,
> >> > Ron
> >> >
> >> > David Morávek  于2024年7月11日周四 19:17写道:
> >> >
> >> > > Hi Junrui,
> >> > >
> >> > > Thank you for drafting the FLIP. I really appreciate the direction
> >> it’s
> >> > > taking. We’ve discussed similar approaches multiple times, and it’s
> >> great
> >> > > to see this progress.
> >> > >
> >> > > I have a few questions and thoughts:
> >> > >
> >> > >
> >> > > * 1. Transformations in StreamGraphGenerator:*
> >> > > Should we consider taking this a step further by working on a list
> of
> >> > > transformations (inputs of StreamGraphGenerator)?
> >> > >
> >> > > public StreamGraphGenerator(
> >> > > List<Transformation<?>> transformations,
> >> > > ExecutionConfig executionConfig,
> >> > > CheckpointConfig checkpointConfig,
> >> > > ReadableConfig configuration) {
> >> > >
> >> > > We could potentially merge ExecutionConfig and CheckpointConfig into
> >> > > ReadableConfig. This approach might offer us even more flexibility.
> >> > >
> >> > >
> >> > > *2. StreamGraph for Recovery Purposes:*
> >> > > Should we avoid using StreamGraph for recovery purposes? The
> existing
> >> > > JG-based recovery code paths took years to perfect, and it doesn’t
> >> seem
> >> > > necessary to replace them. We only need SG for cases where we want
> to
> >> > > regenerate the JG.
> >> > > Additionally, translating SG into JG before persisting it in HA
> could
> >> be
> >> > > beneficial, as it allows us to catch potential issues early on.
> >> > >
> >> > >
> >> >

Re: [DISCUSS] FLIP-469: Supports Adaptive Optimization of StreamGraph

2024-07-23 Thread Ron Liu
Hi, Junrui

Thanks for the proposal. This design allows the Flink engine to become
smarter by doing more dynamic optimizations at runtime, so +1 from my side.

For your FLIP, I've one minor question.

Regarding the StreamGraphOptimizationStrategy, you mentioned introducing
the option
`execution.batch.adaptive.stream-graph-optimization.strategies` (List type)
and passing it to the runtime. Is there a better way to pass it to the
runtime than using a configuration parameter? Another thing: suppose two
optimization strategies a and b must be executed in order, with a executed
first and then b. How does the user perceive that a must come before b when
setting the parameter, and can the List type always preserve order?
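To make the ordering concern concrete, here is a minimal sketch (the
strategy class names are hypothetical, and the semicolon separator is only
an assumption about how a list-type option value might be written) of why
the runtime must iterate the configured list in exactly the user's order:

```java
import java.util.Arrays;
import java.util.List;

public class StrategyOrderSketch {

    // Parse a semicolon-separated strategy list, keeping the order in which
    // the user wrote the entries.
    static List<String> parseStrategies(String raw) {
        return Arrays.asList(raw.split(";"));
    }

    public static void main(String[] args) {
        // Hypothetical strategy class names; "AStrategy" must run before
        // "BStrategy", so the runtime must apply them in exactly this order.
        List<String> strategies =
                parseStrategies("com.example.AStrategy;com.example.BStrategy");
        for (String strategy : strategies) {
            System.out.println(strategy);
        }
    }
}
```

If the option's value type does not guarantee this iteration order, the
dependency between a and b cannot be expressed through configuration alone.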


Best,
Ron

Junrui Lee  于2024年7月18日周四 12:12写道:

> Hi, Weijie
>
> Thanks for your feedback!
>
> `StreamGraphOptimizationStrategy` is a reasonable abstract, I'd like to
> know what built-in strategy implementations you have in mind so far?
>
> We will introduce two optimization strategies:
> AdaptiveBroadcastJoinOptimizeStrategy, which dynamically determines and
> switches to BroadcastJoin, and SkewedJoinOptimizeStrategy, which addresses
> data skew issues.
>
> For the so-called pending operators, can we show it in different colors
> on the UI.
>
> Yes, we will use different colors (such as green) to display the pending
> operators.
>
> Best regards,
> Junrui
>
> weijie guo  于2024年7月17日周三 20:15写道:
>
> > Thanks for the proposal!
> >
> > I like this idea as it gives Flink's adaptive batching processing more
> room
> > to imagine and optimize.
> >
> > So, +1 from my side.
> >
> > I just have two questions:
> >
> > 1. `StreamGraphOptimizationStrategy` is a reasonable abstract, I'd like
> to
> > know what built-in strategy implementations you have in mind so far?
> >
> > 2. For the so-called pending operators, can we show it in different
> colors
> > on the UI.
> >
> >
> > Best regards,
> >
> > Weijie
> >
> >
> > Zhu Zhu  于2024年7月17日周三 16:49写道:
> >
> > > Thanks Junrui for the updates. The proposal looks good to me.
> > > With the stream graph added to the REST API result, I think we are
> > > also quite close to enable Flink to expand a job vertex to show its
> > > operator-chain topology.
> > >
> > > Thanks,
> > > Zhu
> > >
> > > Junrui Lee  于2024年7月15日周一 14:58写道:
> > >
> > > > Hi Zhu,
> > > >
> > > > Thanks for your feedback.
> > > >
> > > > Following your suggestion, I have updated the public interface
> section
> > of
> > > > the FLIP with the following additions:
> > > >
> > > > 1. UI:
> > > > The job topology will display a hybrid of the current JobGraph along
> > with
> > > > downstream components yet to be converted to a StreamGraph. On the
> > > topology
> > > > graph display page, there will be a "Show Pending Operators" button
> in
> > > the
> > > > upper right corner for users to switch back to a job topology that
> only
> > > > includes JobVertices.
> > > >
> > > > 2. Rest API:
> > > > Add a new field "stream-graph-plan" will be added to the job details
> > REST
> > > > API, which represents the runtime Stream graph. The field
> > "job-vertex-id"
> > > > is valid only when the StreamNode has been converted to a JobVertex,
> > and
> > > it
> > > > will hold the ID of the corresponding JobVertex for that StreamNode.
> > > >
> > > > For further information, please feel free to review the public
> > interface
> > > > section of FLIP-469
> > > >
> > > > Best,
> > > > Junrui
> > > >
> > > > Zhu Zhu  于2024年7月15日周一 10:29写道:
> > > >
> > > > > +1 for the FLIP
> > > > >
> > > > > It is useful to adaptively optimize logical execution plans(stream
> > > > > operators and
> > > > > stream edges) for batch jobs.
> > > > >
> > > > > One question:
> > > > > The FLIP already proposed to update the REST API & Web UI to show
> > > > operators
> > > > > that are not yet converted to job vertices. However, I think it
> would
> > > be
> > > > > better if Flink can display these operators as part of the graph,
> > > > allowing
> > > > > users to have an overview of the processing logic graph at early
> > stages
> > > > of
> > > > > the job execution.
> > > > > This change would also involve the public interface, so instead of
> > > > > postponing
> > > > > it to a later FLIP, I prefer to have a design for it in this FLIP.
> > > WDYT?
> > > > >
> > > > > Thanks,
> > > > > Zhu
> > > > >
> > > > > Junrui Lee  于2024年7月11日周四 11:27写道:
> > > > >
> > > > > > Hi devs,
> > > > > >
> > > > > > Xia Sun, Lei Yang, and I would like to initiate a discussion
> about
> > > > > > FLIP-469: Supports Adaptive Optimization of StreamGraph.
> > > > > >
> > > > > > This FLIP is the second in the series on adaptive optimization of
> > > > > > StreamGraph and follows up on FLIP-468 [1]. As we proposed in
> > > FLIP-468
> > > > to
> > > > > > enable the scheduler to recognize and access the StreamGraph, in
> > this
> > > > > FLIP,
> > > > > > we will propos

Re: [DISCUSS] FLIP-470: Support Adaptive Broadcast Join

2024-07-23 Thread Ron Liu
Hi, Xia

This FLIP looks good to me, +1.

I've two questions:

1.
>> Accordingly, in terms of implementation, we will delay the codegen and
creation of the join operators until runtime.

How do you delay codegen until runtime, given that the current runtime is
not aware of the SQL planner? In other words, how should I understand this
sentence?

2. FLIP-469 mentions passing the StreamGraphOptimizationStrategy to the
runtime via the option
`execution.batch.adaptive.stream-graph-optimization.strategies`. If the SQL
planner has multiple different optimization strategies, such as broadcast
join and skew join, at what point are these strategies uniformly configured
into `execution.batch.adaptive.stream-graph-optimization.strategies`?
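To illustrate what I mean by "configure uniformly", here is a rough sketch
(all names and logic here are hypothetical, not actual planner code) of a
step that collects the enabled strategies, in execution order, into the
single list-type option value:

```java
import java.util.ArrayList;
import java.util.List;

public class UniformStrategyConfigSketch {

    // Hypothetical: collect the enabled strategies in the order they must
    // run, then serialize them into one list-type option value.
    static String buildStrategyOption(boolean broadcastJoinEnabled,
                                      boolean skewedJoinEnabled) {
        List<String> strategies = new ArrayList<>();
        if (broadcastJoinEnabled) {
            strategies.add("com.example.AdaptiveBroadcastJoinOptimizationStrategy");
        }
        if (skewedJoinEnabled) {
            strategies.add("com.example.SkewedJoinOptimizeStrategy");
        }
        return String.join(";", strategies);
    }

    public static void main(String[] args) {
        System.out.println(buildStrategyOption(true, true));
    }
}
```

My question is essentially where in the planning pipeline such a step would
live, and whether it runs for every query or only when the corresponding
optimizations are enabled.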



Zhu Zhu  于2024年7月19日周五 17:41写道:

> +1 for the FLIP
>
> It's a good start to adaptively optimize the logical execution plan with
> runtime information.
>
> Thanks,
> Zhu
>
> Xia Sun  于2024年7月18日周四 18:23写道:
>
> > Hi devs,
> >
> > Junrui Lee, Lei Yang, and I would like to initiate a discussion about
> > FLIP-470: Support Adaptive Broadcast Join[1].
> >
> > In general, Broadcast Hash Join is currently the most efficient join
> > strategy available in Flink. However, its prerequisite is that the input
> > data on one side must be sufficiently small; otherwise, it may lead to
> > memory overuse or other issues. Currently, due to the lack of precise
> > statistics, it is difficult to make accurate estimations during the Flink
> > SQL Planning phase. For example, when an upstream Filter operator is
> > present, it is easy to overestimate the size of the table, whereas with
> > an expansion operator, the table size tends to be underestimated.
> Moreover,
> > once the join operator is determined, it cannot be modified at runtime.
> >
> > To address this issue, we plan to introduce Adaptive Broadcast Join
> > capability based on FLIP-468: Introducing StreamGraph-Based Job
> > Submission[2]
> > and FLIP-469: Supports Adaptive Optimization of StreamGraph[3]. This will
> > allow the join operator to be dynamically optimized to Broadcast Join
> based
> > on the actual input data volume at runtime and fallback when the
> > optimization
> > conditions are not met.
> >
> > For more details, please refer to FLIP-470[1]. We look forward to your
> > feedback.
> >
> > Best,
> > Junrui Lee, Lei Yang and Xia Sun
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-470%3A+Support+Adaptive+Broadcast+Join
> > [2]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-468%3A+Introducing+StreamGraph-Based+Job+Submission
> > [3]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-469%3A+Supports+Adaptive+Optimization+of+StreamGraph
> >
>


Re: [DISCUSS] FLIP-469: Supports Adaptive Optimization of StreamGraph

2024-07-23 Thread Ron Liu
One additional question: for a given JobEvent, are all
StreamGraphOptimizationStrategy implementations traversed and executed once?

Best,
Ron

Ron Liu  于2024年7月24日周三 13:59写道:

> Hi, Junrui
>
> Thanks for the proposal, this design allows the Flink engine to become
> smarter by doing more dynamic optimizations at runtime. so +1 from my side.
>
> For your FLIP, I've one minor question.
>
> Regarding the StreamGraphOptimizationStrategy, you mentioned introducing
> the option
> `execution.batch.adaptive.stream-graph-optimization.strategies`(List Type)
> and passing it to the runtime, is there a better way to pass it to the
> runtime?
>  Is there a better way to pass it to the runtime than using the
> configuration parameter? Another thing is that suppose two optimization
> strategies a, b are executed in order, a must be executed first, then b.
> How do you let the user perceive that a must be in front of a when setting
> parameters, and a must be behind, and can the list type always be
> order-preserving?
>
>
> Best,
> Ron
>
> Junrui Lee  于2024年7月18日周四 12:12写道:
>
>> Hi, Weijie
>>
>> Thanks for your feedback!
>>
>> `StreamGraphOptimizationStrategy` is a reasonable abstract, I'd like to
>> know what built-in strategy implementations you have in mind so far?
>>
>> We will introduce two optimization strategies:
>> AdaptiveBroadcastJoinOptimizeStrategy, which dynamically determines and
>> switches to BroadcastJoin, and SkewedJoinOptimizeStrategy, which addresses
>> data skew issues.
>>
>> For the so-called pending operators, can we show it in different colors
>> on the UI.
>>
>> Yes, we will use different colors (such as green) to display the pending
>> operators.
>>
>> Best regards,
>> Junrui
>>
>> weijie guo  于2024年7月17日周三 20:15写道:
>>
>> > Thanks for the proposal!
>> >
>> > I like this idea as it gives Flink's adaptive batching processing more
>> room
>> > to imagine and optimize.
>> >
>> > So, +1 from my side.
>> >
>> > I just have two questions:
>> >
>> > 1. `StreamGraphOptimizationStrategy` is a reasonable abstract, I'd like
>> to
>> > know what built-in strategy implementations you have in mind so far?
>> >
>> > 2. For the so-called pending operators, can we show it in different
>> colors
>> > on the UI.
>> >
>> >
>> > Best regards,
>> >
>> > Weijie
>> >
>> >
>> > Zhu Zhu  于2024年7月17日周三 16:49写道:
>> >
>> > > Thanks Junrui for the updates. The proposal looks good to me.
>> > > With the stream graph added to the REST API result, I think we are
>> > > also quite close to enable Flink to expand a job vertex to show its
>> > > operator-chain topology.
>> > >
>> > > Thanks,
>> > > Zhu
>> > >
>> > > Junrui Lee  于2024年7月15日周一 14:58写道:
>> > >
>> > > > Hi Zhu,
>> > > >
>> > > > Thanks for your feedback.
>> > > >
>> > > > Following your suggestion, I have updated the public interface
>> section
>> > of
>> > > > the FLIP with the following additions:
>> > > >
>> > > > 1. UI:
>> > > > The job topology will display a hybrid of the current JobGraph along
>> > with
>> > > > downstream components yet to be converted to a StreamGraph. On the
>> > > topology
>> > > > graph display page, there will be a "Show Pending Operators" button
>> in
>> > > the
>> > > > upper right corner for users to switch back to a job topology that
>> only
>> > > > includes JobVertices.
>> > > >
>> > > > 2. Rest API:
>> > > > Add a new field "stream-graph-plan" will be added to the job details
>> > REST
>> > > > API, which represents the runtime Stream graph. The field
>> > "job-vertex-id"
>> > > > is valid only when the StreamNode has been converted to a JobVertex,
>> > and
>> > > it
>> > > > will hold the ID of the corresponding JobVertex for that StreamNode.
>> > > >
>> > > > For further information, please feel free to review the public
>> > interface
>> > > > section of FLIP-469
>> > > >
>> > > > Best,
>> > > > Junrui
>> > > >
>> > > > Zhu Zhu  于2024年7月15日周一 10:29写道:

Re: [VOTE] FLIP-468: Introducing StreamGraph-Based Job Submission

2024-07-25 Thread Ron Liu
+1(binding)

Best,
Ron

Rui Fan <1996fan...@gmail.com> 于2024年7月26日周五 11:50写道:

> Thanks Junrui for driving this proposal!
>
> +1(binding)
>
> Best,
> Rui
>
> On Fri, Jul 26, 2024 at 11:02 AM Junrui Lee  wrote:
>
> > Hi everyone,
> >
> > Thanks for all the feedback about FLIP-468: Introducing StreamGraph-Based
> > Job Submission [1]. The discussion thread can be found here [2].
> >
> > The vote will be open for at least 72 hours unless there are any
> objections
> > or insufficient votes.
> >
> > Best,
> >
> > Junrui
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-468%3A+Introducing+StreamGraph-Based+Job+Submission
> >
> > [2] https://lists.apache.org/thread/l65mst4prx09rmloydn64z1w1zp8coyz
> >
>


Re: [DISCUSS] FLIP-470: Support Adaptive Broadcast Join

2024-07-25 Thread Ron Liu
Hi, Xia

Thanks for your reply. It looks good to me.


Best,
Ron

Xia Sun  于2024年7月26日周五 10:49写道:

> Hi Ron,
>
> Thanks for your feedback!
>
> -> creation of the join operators until runtime
>
>
> That means when creating the AdaptiveJoinOperatorFactory, we will not
> immediately create the JoinOperator. Instead, we only pass in the necessary
> parameters for creating the JoinOperator. The appropriate JoinOperator will
> be created during the StreamGraphOptimizationStrategy optimization phase.
>
> You mentioned that the runtime's visibility into the table planner is
> indeed an issue. It includes two aspects,
> (1) we plan to place both implementations of the
> AdaptiveBroadcastJoinOptimizationStrategy and AdaptiveJoinOperatorFactory
> in the table layer. During the runtime phase, we will obtain the
> AdaptiveBroadcastJoinOptimizationStrategy through class loading. Therefore,
> the flink-runtime does not need to be aware of the table layer's
> implementation.
> (2) Since the dynamic codegen in the AdaptiveJoinOperatorFactory needs to
> be aware of the table planner, we will consider placing the
> AdaptiveJoinOperatorFactory in the table planner module as well.
>
>
>  -> When did you configure these optimization strategies uniformly into
> > `execution.batch.adaptive.stream-graph-optimization.strategies`
>
>
> Thank you for pointing out this issue. When there are multiple
> StreamGraphOptimizationStrategies, the optimization order at the runtime
> phase will strictly follow the order specified in the configuration option
> `execution.batch.adaptive.stream-graph-optimization.strategies`. Therefore,
> it is necessary to have a unified configuration during the sql planner
> phase to ensure the correct optimization order. Currently, we are
> considering performing this unified configuration in
> BatchPlanner#afterTranslation().
>
> For simplicity, as long as the adaptive broadcast join/skewed join
> optimization features are enabled (e.g.,
> `table.optimizer.adaptive.join.broadcast-threshold` is not -1), the
> corresponding strategy will be configured. This optimization is independent
> of the specific SQL query, although it might not produce any actual effect.
>
> Best,
> Xia
>
> Ron Liu  于2024年7月24日周三 14:10写道:
>
> > Hi, Xia
> >
> > This FLIP looks good to me, +1.
> >
> > I've two questions:
> >
> > 1.
> > >> Accordingly, in terms of implementation, we will delay the codegen and
> > creation of the join operators until runtime.
> >
> > How are you delaying codegen to runtime, the current runtime is not SQL
> > planner aware. in other words, how do I understand this sentence?
> >
> > 2. FLIP-469 mentions passing StreamGraphOptimizationStrategy to runtime
> via
> > option `execution.batch.adaptive.stream-graph-optimization.strategies`.
> In
> > SQL planner if you have multiple different optimization strategies like
> > broadcast join, skew join, etc...  When did you configure these
> > optimization strategies uniformly into
> > `execution.batch.adaptive.stream-graph-optimization.strategies`?
> >
> >
> >
> > Zhu Zhu  于2024年7月19日周五 17:41写道:
> >
> > > +1 for the FLIP
> > >
> > > It's a good start to adaptively optimize the logical execution plan
> with
> > > runtime information.
> > >
> > > Thanks,
> > > Zhu
> > >
> > > Xia Sun  于2024年7月18日周四 18:23写道:
> > >
> > > > Hi devs,
> > > >
> > > > Junrui Lee, Lei Yang, and I would like to initiate a discussion about
> > > > FLIP-470: Support Adaptive Broadcast Join[1].
> > > >
> > > > In general, Broadcast Hash Join is currently the most efficient join
> > > > strategy available in Flink. However, its prerequisite is that the
> > input
> > > > data on one side must be sufficiently small; otherwise, it may lead
> to
> > > > memory overuse or other issues. Currently, due to the lack of precise
> > > > statistics, it is difficult to make accurate estimations during the
> > Flink
> > > > SQL Planning phase. For example, when an upstream Filter operator is
> > > > present, it is easy to overestimate the size of the table, whereas
> with
> > > > an expansion operator, the table size tends to be underestimated.
> > > Moreover,
> > > > once the join operator is determined, it cannot be modified at
> runtime.
> > > >
> > > > To address this issue, we plan to introduce Adaptive Broadcast Join
> > > > capability based on FLIP-468: Intro

Re: [DISCUSS] FLIP-469: Supports Adaptive Optimization of StreamGraph

2024-07-25 Thread Ron Liu
Thanks for your updated, the FLIP looks good to me.

Junrui Lee  于2024年7月26日周五 12:10写道:

> Hi All,
>
> After an offline discussion with Ron, I have made the following updates to
> FLIP-469:
>
> Introduced the OperatorsFinished class to represent the state of finished
> operators, as well as the size and distribution of the data they produced.
> The StreamGraphOptimizer strategy will now depend on OperatorsFinished
> objects instead of JobEvent. JobEvent will be used only at runtime. When
> the AdaptiveExecutionHandler receives a JobVertexFinishedEvent, it will
> convert it to an OperatorsFinished object and send it to the
> StreamGraphOptimizer to attempt to trigger an optimization of the
> StreamGraph.
>
> Best,
> Junrui
>
> Junrui Lee  于2024年7月24日周三 15:46写道:
>
> > Hi Ron,
> >
> > Thank you for your questions regarding the
> > StreamGraphOptimizationStrategy. Here are my responses:
> >
> > 1.is there a better way to pass it to the
> > runtime?
> >
> > At the moment, we have not thought of a better way.
> >
> > Considering that this is a class from the table layer, it can only be
> seen
> > at the runtime layer through class loading. Additionally, this is also a
> > configuration item that needs to be converged into the configuration
> > (aligned with the goal of Flink 2.0 configuration refactor work).
> >
> > 2.can the list type always be
> > order-preserving
> >
> > Yes, we will ensure that the order in which this config parameter is
> > loaded is consistent with the order set by the user. The execution order
> > will also follow this sequence.
> >
> > 3.are all
> > StreamGraphOptimizationStrategy traversed and executed once?
> >
> > Yes, they are traversed and executed only once.
> >
> > Best regards,
> > Junrui
> >
> > Ron Liu  于2024年7月24日周三 14:12写道:
> >
> >> One additional question: for a JobEvent, are all
> >> StreamGraphOptimizationStrategy traversed and executed once?
> >>
> >> Best,
> >> Ron
> >>
> >> Ron Liu  于2024年7月24日周三 13:59写道:
> >>
> >> > Hi, Junrui
> >> >
> >> > Thanks for the proposal, this design allows the Flink engine to become
> >> > smarter by doing more dynamic optimizations at runtime. so +1 from my
> >> side.
> >> >
> >> > For your FLIP, I've one minor question.
> >> >
> >> > Regarding the StreamGraphOptimizationStrategy, you mentioned
> introducing
> >> > the option
> >> > `execution.batch.adaptive.stream-graph-optimization.strategies`(List
> >> Type)
> >> > and passing it to the runtime, is there a better way to pass it to the
> >> > runtime?
> >> >  Is there a better way to pass it to the runtime than using the
> >> > configuration parameter? Another thing is that suppose two
> optimization
> >> > strategies a, b are executed in order, a must be executed first, then
> b.
> >> > How do you let the user perceive that a must be in front of a when
> >> setting
> >> > parameters, and a must be behind, and can the list type always be
> >> > order-preserving?
> >> >
> >> >
> >> > Best,
> >> > Ron
> >> >
> >> > Junrui Lee  于2024年7月18日周四 12:12写道:
> >> >
> >> >> Hi, Weijie
> >> >>
> >> >> Thanks for your feedback!
> >> >>
> >> >> `StreamGraphOptimizationStrategy` is a reasonable abstract, I'd like
> to
> >> >> know what built-in strategy implementations you have in mind so far?
> >> >>
> >> >> We will introduce two optimization strategies:
> >> >> AdaptiveBroadcastJoinOptimizeStrategy, which dynamically determines
> and
> >> >> switches to BroadcastJoin, and SkewedJoinOptimizeStrategy, which
> >> addresses
> >> >> data skew issues.
> >> >>
> >> >> For the so-called pending operators, can we show it in different
> colors
> >> >> on the UI.
> >> >>
> >> >> Yes, we will use different colors (such as green) to display the
> >> pending
> >> >> operators.
> >> >>
> >> >> Best regards,
> >> >> Junrui
> >> >>
> >> >> weijie guo  于2024年7月17日周三 20:15写道:
> >> >>
> >> >> > Thanks for the proposal!
> >> >> >
> >>

Re: [VOTE] FLIP-469: Supports Adaptive Optimization of StreamGraph

2024-07-28 Thread Ron Liu
+1(binding)

Best,
Ron

weijie guo  于2024年7月29日周一 10:03写道:

> +1(binding)
>
> Best regards,
>
> Weijie
>
>
> Junrui Lee  于2024年7月29日周一 09:38写道:
>
> > Hi everyone,
> >
> > Thanks for all the feedback about FLIP-469: Supports Adaptive
> Optimization
> > of StreamGraph [1]. The discussion thread can be found here [2].
> >
> > The vote will be open for at least 72 hours unless there are any
> objections
> > or insufficient votes.
> >
> > Best,
> >
> > Junrui
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-469%3A+Supports+Adaptive+Optimization+of+StreamGraph
> >
> > [2] https://lists.apache.org/thread/zs7sqpzvcvdb9y42ym6ndtn1fn7m2592
> >
>


Re: [VOTE] FLIP-470: Support Adaptive Broadcast Join

2024-08-14 Thread Ron Liu
+1(binding)

Best,
Ron

Zhu Zhu  于2024年8月12日周一 13:35写道:

> +1 (binding)
>
> Thanks,
> Zhu
>
> Lincoln Lee  于2024年8月12日周一 13:09写道:
>
> > +1
> >
> > Best,
> > Lincoln Lee
> >
> >
> > Xia Sun  于2024年8月12日周一 09:57写道:
> >
> > > Hi everyone,
> > >
> > > Thanks for all the feedback about FLIP-470: Support Adaptive Broadcast
> > > Join[1].
> > > The discussion thread can be found here[2].
> > >
> > > The vote will be open for at least 72 hours unless there are any
> > objections
> > > or insufficient votes.
> > >
> > > Best
> > > Xia
> > >
> > > [1]
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-470%3A+Support+Adaptive+Broadcast+Join
> > >
> > > [2] https://lists.apache.org/thread/cmkh3n6l86vx4v16chhrjr22g6w0zort
> > >
> >
>


Re: [VOTE] FLIP-473: Introduce New SQL Operators Based on Asynchronous State APIs

2024-08-21 Thread Ron Liu
+1(binding)

Best,
Ron

Zakelly Lan  于2024年8月21日周三 11:16写道:

> +1 (binding)
>
> Best,
> Zakelly
>
> On Wed, Aug 21, 2024 at 10:33 AM Xuyang  wrote:
>
> > Hi, everyone.
> >
> > I would like to start a vote on FLIP-473: Introduce New SQL Operators
> Based
> >
> > on Asynchronous State APIs [1]. The discussion thread can be found here
> > [2].
> >
> > The vote will be open for at least 72 hours unless there are any
> objections
> >
> > or insufficient votes.
> >
> >
> >
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-473+Introduce+New+SQL+Operators+Based+on+Asynchronous+State+APIs
> >
> > [2] https://lists.apache.org/thread/k6x03x7vjtn3gl1vrknkx8zvyn319bk9
> >
> >
> >
> >
> > --
> >
> > Best!
> > Xuyang
>


Re: [DISCUSS] FLIP-480: Support to deploy script in application mode

2024-10-29 Thread Ron Liu
I also have some questions:

1. Are all SQL statements, such as DDL, DML, and SELECT, supported?
2. How is the JobID determined, and how are the JobID and ClusterId
returned from the application cluster?
3. How is a JAR specified by the user downloaded dynamically when
submitting the SQL script, and is it possible to specify a local JAR?
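On question 2, the background is that in application mode the job is
submitted inside the cluster, so the client cannot simply read the JobID
back from a submission call. One conceivable approach (purely a sketch of
mine, not something the FLIP specifies) is deriving the JobID
deterministically from the cluster id, so the client can compute it locally:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class JobIdSketch {

    // Derive a stable 32-hex-char job id from a cluster id, so a client
    // could know it without a round trip to the cluster. Illustrative only.
    static String deriveJobId(String clusterId) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(clusterId.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(deriveJobId("my-flink-cluster"));
    }
}
```

Whether the FLIP intends something like this, or instead has the cluster
report the JobID back, is exactly what I would like to clarify.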

Best,
Ron

Ron Liu  于2024年10月30日周三 10:57写道:

> Hi, Shengkai
>
> Thanks for initiating this FLIP; supporting application mode for SQL
> Gateway is great work. The FLIP design looks good to me.
>
>
> I've read the FLIP-316 which mentions supporting deploying SQL job to
> application clusters for interactive or non-interactive gateway mode.
> But I noticed that you say this FLIP focuses on supporting deploy sql
> scripts to the application cluster, does it mean that it only supports
> non-interactive gateway mode?
>
>
> Best,
> Ron
>
> Shengkai Fang  于2024年10月29日周二 14:46写道:
>
>> Hi, HongShun. Thanks a lot for your response!
>>
>> > I wonder what is the scope of this FLIP, only aim for k8s, not including
>> yarn?
>>
>> This FLIP also works for the yarn-application mode. But the yarn
>> deployment doesn't support shipping the artifacts to the remote side.
>> Please correct me if I'm wrong.
>>
>> > When talking about "custom", you mean these also will have some builtin
>> implementations? If it exists, how to get their location in dfs based on
>> SQL? Depending on some configuration or just convention over
>> configuration.
>>
>> I think the built-in artifacts are catalogs/connectors/UDFs that are
>> located in the $FLINK_HOME/lib directory.
>>
>> > Is the FLIP-316 still in need later?
>>
>> Yes. I think FLIP-316 is a great idea to use json plan to run the SQL Job
>> and it brings great convenience to users to submit job in application mode
>> in interactive mode.
>>
>> Best,
>> Shengkai
>>
>>
>>
>>
>> Shengkai Fang  于2024年10月29日周二 14:25写道:
>>
>> > Hi, Feng.
>> >
>> > Thanks for your response.
>> >
>> > > Will FLIP-316 merge into Flink 2.0 too ?
>> >
>> > I don't have time to finish the FLIP-316. So it depends on whether
>> anyone
>> > else can help to continue the discussion.
>> >
>> > > Will SqlDriver use the same one?
>> >
>> > Yes. We should reuse the same driver. I think the driver is the
>> entrypoint
>> > for the SQL script.
>> >
>> >
>> > > The details SQL-client deploy SQL File to Cluster may not be very
>> clear ?
>> >
>> > I have pushed a PoC branch about the change. Please take a look at
>> > https://github.com/fsk119/flink/tree/application-mode (I don't test it
>> > yet). At the mean time, I add a new method in the SqlGatewayService to
>> > describe the change.
>> >
>> > Best,
>> > Shengkai
>> >
>> >
>> >
>> > Feng Jin  于2024年10月25日周五 21:15写道:
>> >
>> >> Hi, Shenkai
>> >>
>> >> Thank you for initiating this FLIP, I understand that supporting
>> >> application mode for SQL gateway is very important. There are two small
>> >> issues.
>> >>
>> >> > FLIP-480 is different from FLIP-316
>> >>
>> >>
>> >>1. Will FLIP-316 merge into Flink 2.0 too ?
>> >>
>> >>
>> >>2. Will SqlDriver use the same one?
>> >>
>> >>
>> >> The details SQL-client deploy SQL File to Cluster may not be very
>> clear ?
>> >>
>> >> I guess that some modifications need to be made to the client here,
>> >> when deploying scripts in application mode, we need to call the newly
>> >> added
>> >> interface of the gateway service.
>> >>
>> >>
>> >> Best,
>> >> Feng
>> >>
>> >>
>> >> On Thu, Oct 24, 2024 at 4:27 PM Shengkai Fang 
>> wrote:
>> >>
>> >> > Hi, everyone.
>> >> >
>> >> > I'd like to initiate a discussion about FLIP-480: Support to deploy
>> >> script
>> >> > in application mode[1].
>> >> >
>> >> > FLIP-480 aims to solve the problem that a table program can not
>> run in
>> >> > application mode. Compared to FLIP-316[2], FLIP-480 tries to compile
>> >> the
>> >> > script on the JM side, which is free from the limitation of the JSON
>> >> > plan (the JSON plan only serializes identifiers for temporary objects).
>> >> >
>> >> > For more details, please refer to the FLIP[1]. Welcome any feedback
>> and
>> >> > suggestions for improvement.
>> >> >
>> >> > Best,
>> >> > Shengkai
>> >> >
>> >> > [1]
>> >> >
>> >> >
>> >>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-480%3A+Support+to+deploy+SQL+script+in+application+mode
>> >> > [2]
>> >> >
>> >> >
>> >>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-316%3A+Support+application+mode+for+SQL+Gateway?src=contextnavpagetreemode
>> >> >
>> >>
>> >
>>
>


Re: [DISCUSS] FLIP-480: Support to deploy script in application mode

2024-10-29 Thread Ron Liu
Hi, Shengkai

Thanks for initiating this FLIP; supporting application mode for SQL
Gateway is great work. The FLIP design looks good to me.


I've read FLIP-316, which mentions supporting the deployment of SQL jobs to
application clusters in both interactive and non-interactive gateway modes.
But I noticed that this FLIP focuses on deploying SQL
scripts to the application cluster; does that mean it only supports
non-interactive gateway mode?


Best,
Ron

Shengkai Fang  于2024年10月29日周二 14:46写道:

> Hi, HongShun. Thanks a lot for your response!
>
> > I wonder what is the scope of this FLIP, only aim for k8s, not including
> yarn?
>
> This FLIP also works for the yarn-application mode. But the yarn deployment
> doesn't support shipping the artifacts to the remote side. Please
> correct me if I'm wrong.
>
> > When talking about "custom", you mean these also will have some builtin
> implementations? If it exists, how to get their location in dfs based on
> SQL? Depending on some configuration or just convention over configuration.
>
> I think the builtin artifacts are catalogs/connectors/udf that are located
> at the $FLINK_HOME/lib directory.
>
> > Is the FLIP-316 still in need later?
>
> Yes. I think FLIP-316 is a great idea to use json plan to run the SQL Job
> and it brings great convenience to users to submit job in application mode
> in interactive mode.
>
> Best,
> Shengkai
>
>
>
>
> Shengkai Fang  于2024年10月29日周二 14:25写道:
>
> > Hi, Feng.
> >
> > Thanks for your response.
> >
> > > Will FLIP-316 merge into Flink 2.0 too ?
> >
> > I don't have time to finish the FLIP-316. So it depends on whether anyone
> > else can help to continue the discussion.
> >
> > > Will SqlDriver use the same one?
> >
> > Yes. We should reuse the same driver. I think the driver is the
> entrypoint
> > for the SQL script.
> >
> >
> > > The details SQL-client deploy SQL File to Cluster may not be very
> clear ?
> >
> > I have pushed a PoC branch about the change. Please take a look at
> > https://github.com/fsk119/flink/tree/application-mode (I don't test it
> > yet). At the mean time, I add a new method in the SqlGatewayService to
> > describe the change.
> >
> > Best,
> > Shengkai
> >
> >
> >
> > Feng Jin  于2024年10月25日周五 21:15写道:
> >
> >> Hi, Shenkai
> >>
> >> Thank you for initiating this FLIP, I understand that supporting
> >> application mode for SQL gateway is very important. There are two small
> >> issues.
> >>
> >> > FLIP-480 is different from FLIP-316
> >>
> >>
> >>1. Will FLIP-316 merge into Flink 2.0 too ?
> >>
> >>
> >>2. Will SqlDriver use the same one?
> >>
> >>
> >> The details SQL-client deploy SQL File to Cluster may not be very clear
> ?
> >>
> >> I guess that some modifications need to be made to the client here,
> >> when deploying scripts in application mode, we need to call the newly
> >> added
> >> interface of the gateway service.
> >>
> >>
> >> Best,
> >> Feng
> >>
> >>
> >> On Thu, Oct 24, 2024 at 4:27 PM Shengkai Fang 
> wrote:
> >>
> >> > Hi, everyone.
> >> >
> >> > I'd like to initiate a discussion about FLIP-480: Support to deploy
> >> script
> >> > in application mode[1].
> >> >
> >> > FLIP-480 aims to solve the problem that a table program can not run
> in
> >> > application mode. Compared to FLIP-316[2], FLIP-480 tries to compile
> >> the
> >> > script on the JM side, which is free from the limitation of the JSON
> >> > plan (the JSON plan only serializes identifiers for temporary objects).
> >> >
> >> > For more details, please refer to the FLIP[1]. Welcome any feedback
> and
> >> > suggestions for improvement.
> >> >
> >> > Best,
> >> > Shengkai
> >> >
> >> > [1]
> >> >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-480%3A+Support+to+deploy+SQL+script+in+application+mode
> >> > [2]
> >> >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-316%3A+Support+application+mode+for+SQL+Gateway?src=contextnavpagetreemode
> >> >
> >>
> >
>


Re: [DISCUSS] FLIP-480: Support to deploy script in application mode

2024-10-30 Thread Ron Liu
Hi, Shengkai

Thanks for your quick response. It looks good to me.

Best
Ron

Shengkai Fang  于2024年10月31日周四 10:08写道:

> Hi, Ron!
>
> >  I noticed that you say this FLIP focuses on supporting deploy sql
> scripts to the application cluster, does it mean that it only supports
> non-interactive gateway mode?
>
> Yes. This FLIP only supports deploying a script in non-interactive mode.
>
> > Whether all SQL commands such as DDL & DML & SELECT are supported.
>
> We support all SQL commands, and the execution results are visible in the
> JM log. But an application cluster has the limitation that only one job is
> allowed to run in the dedicated cluster.
>
> >  How to dynamically download the JAR specified by the user when
> submitting the sql script, and whether it is possible to specify a local
> jar?
>
> This is a good question. I think it's totally up to the deployment api. For
> example, kubernetes deployment provides the option
> `kubernetes-artifacts-local-upload-enabled`[1] to upload the artifact to
> the DFS, but yarn deployment doesn't support shipping the artifacts to DFS in
> application mode. If runtime API can provide unified interface, I think we
> can use the unified API to upload local artifacts. Alternatively, we can
> provide a special service that allows sql-gateway to support pulling jar.
> You can read the future work for more details.
>
> [1]
>
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#kubernetes-artifacts-local-upload-enabled
>
>
> Shengkai Fang  于2024年10月31日周四 09:30写道:
>
> > Hi, Feng!
> >
> > > if only clusterID is available, it may not be very convenient to
> connect
> > to this application later on.
> >
> > If FLIP-479 is accepted, I think we can just adapt the sql-gateway
> > behaviour to the behaviour that FLIP-479 mentioned.
> >
> >
> > Best,
> > Shengkai
> >
> >
>


Re: [VOTE]FLIP-480: Support to deploy script in application mode

2024-11-11 Thread Ron Liu
+1(binding)

Best,
Ron

Gyula Fóra  于2024年11月12日周二 01:27写道:

> + 1(binding)
>
> Thanks for answering my concerns/questions.
>
> Gyula
>
> On Fri, Nov 8, 2024 at 11:16 AM Gyula Fóra  wrote:
>
> > Hey!
> >
> > Sorry, bit late to the party, I have added a concern to the discussion
> > related to the gateway submission vote.
> >
> > I would like to clarify that before we close this vote.
> >
> > Cheers,
> > Gyula
> >
> > On Fri, Nov 8, 2024 at 10:57 AM Feng Jin  wrote:
> >
> >> +1 (non-binding)
> >>
> >>
> >> Best,
> >> Feng Jin
> >>
> >> On Fri, Nov 8, 2024 at 5:37 PM yuanfeng hu  wrote:
> >>
> >> > +1(no-binding)
> >> >
> >> >
> >> > Shengkai Fang  于2024年11月8日周五 15:11写道:
> >> >
> >> > > Hi everyone,
> >> > >
> >> > > I'd like to start a vote on FLIP-480: Support to deploy script in
> >> > > application mode[1]. The discussion can be found here[2].
> >> > >
> >> > > The vote will be open for at least 72 hours unless there are any
> >> > objections
> >> > > or insufficient votes.
> >> > >
> >> > > Best,
> >> > > Shengkai
> >> > >
> >> > > [1]
> >> > >
> >> > >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-480%3A+Support+to+deploy+SQL+script+in+application+mode
> >> > > [2]
> https://lists.apache.org/thread/g3ohzbogww1g8zl7zlmn84fsk29qr568
> >> > >
> >> >
> >> >
> >> > --
> >> > Best,
> >> > Yuanfeng
> >> >
> >>
> >
>


Re: [ANNOUNCE] New Apache Flink Committer - Junrui Li

2024-11-07 Thread Ron Liu
Congratulations Junrui

Best,
Ron

wenjin  于2024年11月7日周四 11:57写道:

> Congratulations Junrui~
>
> Best regards,
> Wenjin
>
> > 2024年11月5日 19:59,Zhu Zhu  写道:
> >
> > Hi everyone,
> >
> > On behalf of the PMC, I'm happy to announce that Junrui Li has become a
> > new Flink Committer!
> >
> > Junrui has been an active contributor to the Apache Flink project for two
> > years. He has been the driver and major developer of 8 FLIPs, and has
> > contributed 100+ commits with tens of thousands of lines of code.
> >
> > His contributions mainly focus on enhancing Flink batch execution
> > capabilities, including enabling parallelism inference by
> default(FLIP-283),
> > supporting progress recovery after JM failover(FLIP-383), and supporting
> > adaptive optimization of logical execution plan (FLIP-468/469).
> Furthermore,
> > Junrui did a lot of work to improve Flink's configuration layer,
> addressing
> > technical debt and enhancing its user-friendliness. He is also active in
> > mailing lists, participating in discussions and answering user questions.
> >
> > Please join me in congratulating Junrui Li for becoming an Apache Flink
> > committer.
> >
> > Best,
> > Zhu (on behalf of the Flink PMC)
>
>


Re: [VOTE] CHI: Stale PR cleanup

2025-01-06 Thread Ron Liu
+1(binding)

Best,
Ron

David Radley  于2025年1月7日周二 00:43写道:

> Thanks for driving this Tom.
> +1 (non-binding)
> David
>
> From: Gyula Fóra 
> Date: Monday, 6 January 2025 at 16:33
> To: dev@flink.apache.org 
> Subject: [EXTERNAL] Re: [VOTE] CHI: Stale PR cleanup
> +1 (binding)
> Gyula
>
> On Mon, Jan 6, 2025 at 5:12 PM Tom Cooper  wrote:
>
> > Hi all,
> >
> > Following on from my proposal [1] at the end of last year. I was hoping
> to
> > get a vote on enabling the stale PR Github Action in the Flink
> repository.
> >
> > There was positive feedback on the proposal and the most popular initial
> > configuration was notifying when a PR has been untouched for 6 months and
> > then closing after a further 3 months of inactivity.
> > We can review this configuration at a later date, with the aim of
> bringing
> > it in line with other Apache projects (3 months inactive with 1 month to
> > reactivate).
> >
> > Regards,
> >
> > Tom Cooper
> > https://tomcooper.dev
> >
> > [1] https://lists.apache.org/thread/6yoclzmvymxors8vlpt4nn9r7t3stcsz
> >
>
> Unless otherwise stated above:
>
> IBM United Kingdom Limited
> Registered in England and Wales with number 741598
> Registered office: Building C, IBM Hursley Office, Hursley Park Road,
> Winchester, Hampshire SO21 2JN
>


Re: Re:[DISSCUSS] Removing default error methods from the new public api stack in the table api when it implements both the new and deprecated public api stacks and the old stack is removed

2025-01-02 Thread Ron Liu
Hi, Xuyang

Thanks for initiating this proposal. +1 for removing the default
implementations: since 2.0 is a breaking-change release, we don't need to
keep API compatibility.
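To make the compile-time vs. runtime trade-off in this discussion concrete, here is a plain-Java sketch (the interface and class names are hypothetical, not Flink's actual types): a compatibility default that throws only fails when invoked, while removing the default would turn the missing override into a compile error.

```java
// Sketch of the compatibility pattern under discussion (hypothetical names).
interface Factory {
    // New-stack method with a throwing default kept for old implementations.
    default String factoryIdentifier() {
        throw new UnsupportedOperationException("factoryIdentifier is not implemented");
    }
}

// A legacy factory that never overrode the new method: it still compiles...
class LegacyCatalogFactory implements Factory {}

public class DefaultMethodDemo {
    // ...but the missing implementation only surfaces when the method is called.
    public static boolean failsAtRuntime() {
        try {
            new LegacyCatalogFactory().factoryIdentifier();
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(failsAtRuntime()); // prints "true"
        // If the default were removed instead, LegacyCatalogFactory would no
        // longer compile, surfacing the problem before 2.0 rather than at runtime.
    }
}
```

This illustrates why option 2 surfaces breakage earlier, at the cost of a compile-time break for very old connectors.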

Best,
Ron

Feng Jin  于2025年1月3日周五 13:36写道:

> +1 Directly remove the default implementation, as this allows issues to be
> detected at compile time rather than at runtime. Even if we keep these
> methods, it doesn’t seem to make much sense.
>
>
> Best,
> Feng Jin
>
>
> On Fri, Jan 3, 2025 at 12:57 PM Xuyang  wrote:
>
> > +1 to delete the overrided default methods that throw exceptions.
> >
> >
> >
> >
> > --
> >
> > Best!
> > Xuyang
> >
> >
> >
> >
> > 在 2025-01-03 12:55:38,"Xuyang"  写道:
> >
> > The discuss will be open for at least 72 hours to collect dicisions.
> >
> > --
> >
> > Best!
> > Xuyang
> >
> >
> >
> >
> >
> > At 2025-01-03 12:06:38, "Xuyang"  wrote:
> > >Hi, all.
> > >
> > >Feng Jin, Ron Liu and I are currently working on removing
> deprecated
> > public APIs from the table module and have encountered an issue.
> > >
> > >Take CatalogFactory as an example, the CatalogFactory class implements
> > both the new api Factory and the deprecated api TableFactory. Within
> > >
> > >CatalogFactory, all methods of the new api Factory have default
> > implementations that currently return errors, providing compatibility
> with
> > the old
> > >
> > >api TableFactory. As we move forward with the removal of the deprecated
> > api TableFactory, we are uncertain about whether to remove the default
> > >
> > >methods from the new api Factory.
> > >
> > >There are two options:
> > >
> > >1. Leave it unchanged for now: The advantage of this approach is that we
> > adhere to the principle of not modifying public APIs at all. However, I
> > believe that having missed the release-2.0, we may not have another
> > opportunity to remove it in the near future. Additionally, this option is
> > not very user-friendly for custom connectors, as they would only discover
> > the absence of certain methods at runtime.
> > >
> > >2. Directly remove the default implementations that directly throw
> > exceptions: The benefit of this approach is that it informs users at
> > compile time that they must implement certain methods. The downside is
> > that we are
> > "slightly" modifying the public API methods. Connectors that are very old
> > and only compatible with the old API stack without integrating the new
> API
> > stack would encounter compile-time errors (although in any case, they
> would
> > ultimately face runtime errors, wouldn’t they?).
> > >
> > >We would appreciate any insights or suggestions regarding this decision.
> > Thank you!
> > >
> > >
> > >
> > >
> > >Please note that similar situations exist with ModuleFactory and others;
> > we won't discuss each one individually going forward.
> > >
> > >
> > >
> > >
> > >--
> > >
> > >Best!
> > >Xuyang
> >
>


Re: [VOTE] FLIP-492: Support Query Modifications for Materialized Tables

2024-12-30 Thread Ron Liu
+1(binding)

Best,
Ron

Feng Jin  于2024年12月30日周一 09:47写道:

> Hi everyone,
>
> I'd like to start a vote on FLIP-492: Support Query Modifications for
> Materialized Tables [1]
> which has been discussed in this thread [2].
>
> The vote will be open for at least 72 hours unless there is an objection
> or not enough votes.
>
>
> [1].
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-492%3A+Support+Query+Modifications+for+Materialized+Tables
> [2]. https://lists.apache.org/thread/bvgtdwfbnlq5j5lb97dzyj66dlxt5mng
>
>
> Best,
> Feng Jin
>


Re: [VOTE] FLIP-489: Add missing dropTable/dropView methods to TableEnvironment

2024-12-15 Thread Ron Liu
+1(binding)

Best,
Ron

Leonard Xu  于2024年12月15日周日 16:12写道:

> Thanks Sergey for driving this FLIP,
>
> +1(binding)
>
> Best,
> Leonard
>
>
>
>
> > On Dec 13, 2024, at 6:48 PM, Timo Walther  wrote:
> >
> > +1 (binding)
> >
> > Thanks,
> > Timo
> >
> >
> > On 13.12.24 01:37, Jim Hughes wrote:
> >> Hi Sergey,
> >> +1 (non-binding)
> >> Cheers,
> >> Jim
> >> On Thu, Dec 12, 2024 at 2:51 PM Sergey Nuyanzin 
> wrote:
> >>> Hi everyone,
> >>>
> >>> I'd like to start a vote on FLIP-489: Add missing dropTable/dropView
> >>> methods
> >>> to TableEnvironment [1] which has been discussed in this thread [2].
> >>>
> >>> The vote will be open for at least 72 hours unless there is an
> objection
> >>> or not enough votes.
> >>>
> >>> [1]
> >>>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=332499781
> >>> [2] https://lists.apache.org/thread/3b6ohofz4hqvzxmxjwnls9m09s1ch7sk
> >>> --
> >>> Best regards,
> >>> Sergey
> >>>
> >
>
>


Re: [DISCUSS] FLIP-494: Add missing createTable/createView methods to TableEnvironment

2024-12-17 Thread Ron Liu
Hi Sergey

Thanks for driving this FLIP, +1.

It may be a minor typo for the createTable method doc.

```

 * {@code
 * tEnv.createTemporaryTable("MyTable", TableDescriptor.forConnector("datagen")
 * .schema(Schema.newBuilder()
 * .column("f0", DataTypes.STRING())
 * .build())
 * .option(DataGenOptions.ROWS_PER_SECOND, 10)
 * .option("fields.f0.kind", "random")
 * .build());
 * }

```

Should `tEnv.createTemporaryTable` be `tEnv.createTable`?


Best
Ron

Sergey Nuyanzin  于2024年12月18日周三 06:12写道:

> Hi all!
>
> I would like to open up for discussion another new small FLIP-494[1].
>
> Motivation
> This FLIP continues the journey of bringing Table API and FlinkSQL closer
> to each other.
> It adds missing create methods to TableEnvironment (alter functionality
> will be added in follow up FLIP(s)).
> There are already existing createTemporaryTable, createTable and
> createTemporaryView however still missing createView and createTable with
> ignoreIfExists flag
>
> [1]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=334760466
>
>
> --
> Best regards,
> Sergey
>


Re: [DISCUSS] FLIP-492: Support Query Modifications for Materialized Tables.

2024-12-18 Thread Ron Liu
Hi, Feng

The reply looks good to me. But I have one question: you mentioned the
`DESC MATERIALIZED TABLE` syntax in the FLIP, but we haven't provided this
syntax so far. I think we should add it to this FLIP if needed.

Best,
Ron

Feng Jin  于2024年12月18日周三 16:52写道:

> Hi Ron
>
> Thanks for your reply.
>
> >  Is it only possible to add columns at the end and not anywhere in
> table schema, some databases have this limitation, does lake storage such
> as Iceberg/Paimon have this limitation?
>
>
>  Currently, we can restrict adding columns only to the end of the schema.
> Although both Paimon and Iceberg already support adding columns anywhere,
> there are still some systems that do not. I will include this in the FLIP.
>
>
> > In the Refresh Task Behavior section you mention partition hints, is it
> possible to clarify what it is in the FLIP?
>
>
> I have added the relevant details.
>
>
> >  Are you able to articulate the default behavior?
>
>
> The detailed explanation for this part has been updated.
>
>
> >  How users can determine if states are compatible?
>
>
> Users can only rely on their experience to make modifications. Currently,
> the Flink framework does not guarantee that changes to SQL logic will
> maintain state compatibility.
>
> I think we can add some suggestions in the user documentation in the
> future. While the framework itself cannot ensure state compatibility, some
> simple modification scenarios can indeed be compatible.
>
> For now, the responsibility is left to the users.
>
>
> Even if recovery ultimately fails, users still have the option to roll
> back to the original query or start consuming from a new offset by
> disabling recovery parameters.
>
>
>
>
> Best,
> Feng
>
>
> On Tue, Dec 17, 2024 at 10:37 AM Ron Liu  wrote:
>
>> Hi Feng
>>
>> Thanks for initiating this FLIP, in lakehouse, Schema Evolution of tables
>> due to modification of business logic is a very common scenario, so
>> Materialized Table's support for modification of Query can greatly improve
>> flexibility and usability, and we've seen that other similar products in
>> the industry also support this capability.
>>
>> I read the content of this FLIP and the overall design looks good, +1.
>> However, I have some questions as follows:
>>
>> 1. By `ALTER MATERIALIZED TABLE ... AS select` statement to realize the
>> add column logic, is it only possible to add columns at the end and not
>> anywhere in table schema, some databases have this limitation, does lake
>> storage such as Iceberg/Paimon have this limitation?
>> 2. In the Refresh Task Behavior section you mention partition hints, is
>> it possible to clarify what it is in the FLIP?
>>
>> >>> *CONTINUOUS Mode: *Stops the old job and starts a new one with the
>> updated query.
>>
>>- The initial position of the new job is controlled by the source
>>parameters.
>>- For compatible logic changes, recovery parameters
>>(execution.state-recovery.path)  can be manually set if state 
>> compatibility
>>is confirmed.
>>
>>
>> 4. Are you able to articulate the default behavior?
>> 5. How users can determine if states are compatible?
>>
>> Best,
>> Ron
>>
>> Feng Jin  于2024年12月16日周一 10:49写道:
>>
>>> Hi, everyone,
>>>
>>> I’d like to initiate a discussion on FLIP-492: Support Query
>>> Modifications for Materialized Tables[1].
>>>
>>> In FLIP-435[2], we introduced *MATERIALIZED TABLES*. By defining query
>>> logic and specifying data freshness requirements, users can efficiently
>>> build data pipelines, greatly improving development productivity.
>>> FLIP-492 builds on this by addressing a common need: the ability to
>>> modify the query logic of an existing MATERIALIZED TABLE. Two approaches
>>> are proposed:
>>>
>>>
>>> *1. Modifying the Query Logic: ALTER MATERIALIZED TABLE AS *
>>> Retain historical data while modifying the query logic:
>>>
>>> ```
>>> ALTER MATERIALIZED TABLE [catalog_name.][db_name.]table_name AS 
>>> ```
>>>
>>>
>>> *2. Replacing the Table: CREATE OR REPLACE MATERIALIZED TABLE*
>>> Reconstruct the table with a new query, discarding all historical data:
>>>
>>> ```
>>> CREATE [OR REPLACE] MATERIALIZED TABLE
>>> [catalog_name.][db_name.]table_name
>>> [ ([]) ]
>>> [COMMENT table_comment]
>>> [PARTITIONED BY (partition_column_name1, partition_colum

Re: [DISCUSSION] flink-connector-hive first release of externalized connector

2025-01-13 Thread Ron Liu
+1

Best,
Ron

yuxia  于2025年1月14日周二 10:17写道:

> Thanks Sergey for driving this work again..
> +1 for externalizing hive connector to finish all connector externalizing
> for Flink..
>
> Best regards,
> Yuxia
>
> - 原始邮件 -
> 发件人: "Jark Wu" 
> 收件人: "dev" 
> 发送时间: 星期二, 2025年 1 月 14日 上午 10:06:42
> 主题: Re: [DISCUSSION] flink-connector-hive first release of externalized
> connector
>
> +1 for externalizing hive connector
>
> Best,
> Jark
>
> On Tue, 14 Jan 2025 at 10:01, Lincoln Lee  wrote:
>
> > Thanks Sergey for driving this important work!
> >
> > +1 for finishing the externalize.
> >
> >
> > Best,
> > Lincoln Lee
> >
> >
> > Sergey Nuyanzin  于2025年1月14日周二 06:16写道:
> >
> > > Hi everyone,
> > >
> > > This is yet another iteration to request/propose release of external
> Hive
> > > connector.
> > > Unfortunately the first attempt (about a year ago[1]) was not able to
> > > attract enough binding votes.
> > >
> > > There is already a discussion about this in the comments of [2].
> > > And also want to bring to ML.
> > > Now flink-connector-hive is updated to 1.20.0 and if we release it,
> then
> > > IIUC we could then remove it from 2.0.
> > > I can volunteer as an RM for this however I would need some help from
> > PMC.
> > >
> > > [1] https://lists.apache.org/thread/40wl148bmcjzw4jxxo5z3rv2o6387fj1
> > > [2] https://issues.apache.org/jira/browse/FLINK-37097
> > > --
> > > Best regards,
> > > Sergey
> > >
> >
>


Re: [VOTE] FLIP-494: Add missing createTable/createView methods to TableEnvironment

2024-12-23 Thread Ron Liu
+1(binding)

Best,
Ron

Sergey Nuyanzin  于2024年12月20日周五 18:52写道:

> Hi everyone,
>
> I'd like to start a vote on FLIP-494: Add missing createTable/createView
> methods to TableEnvironment [1] which has been discussed in this thread
> [2].
>
> The vote will be open for at least 72 hours unless there is an objection
> or not enough votes.
>
> [1]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=334760466
> [2] https://lists.apache.org/thread/k1dgx03xmbo3pw3hxtqjr8wcvz0vhdjs
>
> --
> Best regards,
> Sergey
>


Re: [DISCUSS] FLIP-506: Support Reuse Multiple Table Sinks in Planner

2025-02-12 Thread Ron Liu
Hi, Xiangyu

Thank you for proposing this FLIP, it's great work and looks very useful
for users.

I have the following two questions regarding the content of the FLIP:
1. Since sink reuse is very useful, should the default value of the newly
introduced option `table.optimizer.reuse-sink-enabled` be true, so that the
engine enables this optimization by default? Currently, for source reuse,
the default value of the `sql.optimizer.reuse.table-source.enabled` option
is also true, which means users don't need to opt in, so I think the engine
should turn on sink reuse optimization by default as well.
2. Regarding the sink digest, you mentioned disregarding the sink target
columns, which I think is a very good suggestion and would be very useful
if it can be done. One question: have you considered the technical
implementation options, and are they feasible?
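For readers following the thread, here is a minimal sketch of the workaround described in the FLIP that sink reuse aims to eliminate (the table and column names are hypothetical):

```sql
-- Today: partial-updating sink `t` from two sources requires manually
-- unioning the sources and padding missing columns with NULL casts.
INSERT INTO t
SELECT id, name, CAST(NULL AS STRING) AS addr FROM s1
UNION ALL
SELECT id, CAST(NULL AS STRING) AS name, addr FROM s2;

-- With `table.optimizer.reuse-sink-enabled` set to true, the two writes
-- could presumably be declared separately in one statement set and merged
-- into a single reused sink node by the planner:
EXECUTE STATEMENT SET
BEGIN
  INSERT INTO t (id, name) SELECT id, name FROM s1;
  INSERT INTO t (id, addr) SELECT id, addr FROM s2;
END;
```

This is only an illustration of the usability gap; the exact rewrite rules are defined in the FLIP.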

Best,
Ron

xiangyu feng  于2025年2月13日周四 12:56写道:

> Hi all,
>
> Thank you all for the comments.
>
> If there is no further comment, I will open the voting thread in 3 days.
>
> Regards,
> Xiangyu
>
> xiangyu feng  于2025年2月11日周二 14:17写道:
>
> > Link for Paimon LocalMerge Operator[1]
> >
> > [1]
> >
> https://paimon.apache.org/docs/master/maintenance/write-performance/#local-merging
> >
> > xiangyu feng  于2025年2月11日周二 14:03写道:
> >
> >> Follow the above,
> >>
> >> "And for SinkWriter, the data structure to be processed should be
> fixed."
> >>
> >> I'm not very sure why the data structure of SinkWriter should be fixed.
> >> Can you elaborate the scenario here?
> >>
> >>  "Is there a node or an operator to fill in the inconsistent field of
> >> Rowdata that passed from different Sources?"
> >>
> >> By `filling in the inconsistent field from different sources`, do you
> >> refer to implementations like the LocalMerge Operator [1] for Paimon?
> IMHO,
> >> this should not be included in the Sink Reuse. The merging behavior of
> >> multiple sources should be considered inside of the sink.
> >>
> >> Regards,
> >> Xiangyu Feng
> >>
> >> xiangyu feng  于2025年2月11日周二 13:46写道:
> >>
> >>> Hi Yanquan,
> >>>
> >>> Thx for reply. IIUC, the schema of CatalogTable should contain all
> >>> target columns for sources. If not, a SQL validation exception should
> be
> >>> raised for planner.
> >>>
> >>> Regards,
> >>> Xiangyu Feng
> >>>
> >>>
> >>>
> >>> Yanquan Lv  于2025年2月10日周一 16:25写道:
> >>>
>  Hi, Xiangyu. Thanks for driving this.
> 
>  I have a question to confirm:
>  Considering the case that different Sources use different columns[1],
>  will the Schema of CatalogTable[2] contain all target columns for
> Sources?
>  And for SinkWriter, the data structure to be processed should be
> fixed.
>  Is there a node or an operator to fill in the inconsistent field of
> Rowdata
>  that passed from different Sources?
> 
>  [1]
> 
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-506%3A+Support+Reuse+Multiple+Table+Sinks+in+Planner
>  [2]
> 
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sourcessinks/#planning
> 
> 
> 
>  > 2025年2月6日 17:06,xiangyu feng  写道:
>  >
>  > Hi devs,
>  >
>  > I'm opening this thread to discuss FLIP-506: Support Reuse Multiple
>  Table
>  > Sinks in Planner[1].
>  >
>  > Currently if users want to partial-update a downstream table from
>  multiple
>  > source tables in one datastream, they would have to manually union
> all
>  > source tables and add lots of "cast(null as string) as xxx" in Flink
>  SQL.
>  > This will make the SQL here hard to use and maintain.
>  >
>  > After discussing with Weijie Guo, we think that by supporting reuse
>  sink
>  > nodes in planner, the usability can be greatly improved in this
> case.
>  >
>  > Therefore, we propose to add a new option
>  > *`table.optimizer.reuse-sink-enabled`* here to support this feature.
>  More
>  > details can be found in the FLIP.
>  >
>  > Looking forward to your feedback, thanks.
>  >
>  > [1]
>  >
> 
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-506%3A+Support+Reuse+Multiple+Table+Sinks+in+Planner
>  >
>  > Best regards,
>  > Xiangyu Feng
> 
> 
>


Re: [DISCUSS] Planning Flink 2.1

2025-03-25 Thread Ron Liu
Hi, David

Thanks for your kind reminder; we are planning the next minor release, 2.1.

Thanks also to Xintong for the additional background information.

Best,
Ron

Xintong Song  于2025年3月26日周三 11:08写道:

> Thanks for the kick off and volunteering as release manager, Ron.
>
> +1 for start preparing release 2.1 and Ron as the release manager. The
> proposed feature freeze date sounds good to me.
>
> @David,
>
> Just to provide some background information.
>
> In the Flink community, we use the terminologies Major, Minor and Bugfix
> releases, represented by the three digits in the version number
> respectively. E.g., 1.20.1 represents Major version 1, Minor version 20 and
> Bugfix version 1.
>
> The major differences are:
> 1. API compatibility: @Public APIs are guaranteed compatible across
> different Minor and Bugfix releases of the same Major
> version. @PublicEvolving APIs are guaranteed compatible across Bugfix
> releases of the same Minor version.
> 2. Bugfix releases typically only include bugfixes, but not feature /
> improvement code changes. Major and Minor releases include both feature /
> improvements and bugfixes.
>
> The Flink community applies a time-based release planning [1] for Major /
> Minor releases. We try to deliver new features with a new release
> (typically Minor, unless we decide to break the API compatibility with a
> Major release) roughly every 4 months. As for bugfix releases, they are
> usually planned on-demand. E.g., to make a critical bugfix available to
> users, or observed significant bugfix issues have been merged for the
> release branch.
>
> Best,
>
> Xintong
>
>
> [1] https://cwiki.apache.org/confluence/display/FLINK/Time-based+releases
>
>
>
> On Wed, Mar 26, 2025 at 12:14 AM David Radley 
> wrote:
>
> > Hi ,
> > I am wondering what the thinking is about calling the next release 2.1
> > rather than 2.0.1.  This numbering implies to me you are considering a
> > minor release rather than a point release. For example
> > https://commons.apache.org/releases/versioning.html,
> >
> > Kind regards, David.
> >
> >
> > From: Jark Wu 
> > Date: Tuesday, 25 March 2025 at 06:52
> > To: dev@flink.apache.org 
> > Subject: [EXTERNAL] Re: [DISCUSS] Planning Flink 2.1
> > Thanks, Ron Liu for kicking off the 2.1 release.
> >
> > +1 for Ron Liu to be the release manager and
> > +1 for the feature freeze date
> >
> > Best,
> > Jark
> >
> > On Mon, 24 Mar 2025 at 16:33, Ron Liu  wrote:
> >
> > > Hi everyone,
> > > With the release announcement of Flink 2.0, it's a good time to kick
> off
> > > discussion of the next release 2.1.
> > >
> > > - Release Managers
> > >
> > > I'd like to volunteer as one of the release managers this time. It has
> > been
> > > good practice to have a team of release managers from different
> > > backgrounds, so please raise your hand if you'd like to volunteer and
> get
> > > involved.
> > >
> > >
> > > - Timeline
> > >
> > > Flink 2.0 has been released. With a target release cycle of 4 months,
> > > we propose a feature freeze date of *June 21, 2025*.
> > >
> > >
> > > - Collecting Features
> > >
> > > As usual, we've created a wiki page[1] for collecting new features in
> > 2.1.
> > >
> > > In the meantime, the release management team will be finalized in the
> > next
> > > few days, and we'll continue to create Jira Boards and Sync meetings
> > > to make it easy for everyone to get an overview and track progress.
> > >
> > >
> > > Best,
> > > Ron
> > >
> > > [1] https://cwiki.apache.org/confluence/display/FLINK/2.1+Release
> > >
> >
> > Unless otherwise stated above:
> >
> > IBM United Kingdom Limited
> > Registered in England and Wales with number 741598
> > Registered office: Building C, IBM Hursley Office, Hursley Park Road,
> > Winchester, Hampshire SO21 2JN
> >
>


[KINDLY REMINDER] Flink 2.1 Release Sync 23/04/2025

2025-04-22 Thread Ron Liu
Hi, everyone!

The first release sync will be held at the following time.
- 14:00, Apr 23, BJT (GMT+8)
- 08:00, Apr 23, CET (GMT+2)
- 23:00, Apr 22, PT (GMT-7)

Everyone is welcome to join via Google Meet [1]. The topics to be discussed
can be found at the Status / Follow-ups section of the release wiki page
[2]. If there's other topics that you'd like to discuss, please feel free
to add to the list.

Best,
Ron

[1] https://meet.google.com/pog-pebb-zrj
[2] https://cwiki.apache.org/confluence/display/FLINK/2.1+Release


Re: [ANNOUNCE] New Apache Flink Committer - Yanquan Lv

2025-04-17 Thread Ron Liu
Congratulations!

Best,
Ron

shuai xu  于2025年4月17日周四 15:50写道:

> Congratulations, Yanquan!
>
> Best,
> Xu Shuai
>
> > 2025年4月17日 14:27,Xuyang  写道:
> >
> > Congratulations, Yanquan!
>
>


Re: Re: [ANNOUNCE] New Apache Flink Committer - Xiqian Yu

2025-04-17 Thread Ron Liu
Congratulations!

Best,
Ron

Xuyang  于2025年4月17日周四 14:22写道:

> Congratulations, Xiqian!
>
>
>
>
>
> --
>
> Best!
> Xuyang
>
>
>
>
>
> 在 2025-04-17 14:01:28,"shuai xu"  写道:
> >Congratulations!
> >
> >Best,
> >Xu Shuai
> >
> >> 2025年4月17日 13:54,Lincoln Lee  写道:
> >>
> >> Congratulations, Xiqian!
> >>
> >>
> >> Best,
> >> Lincoln Lee
> >>
> >>
> >> Shengkai Fang  于2025年4月17日周四 11:51写道:
> >>
> >>> Congratulations!
> >>>
> >>> Best
> >>> Shengkai
> >>>
> >>> Hongshun Wang  于2025年4月17日周四 10:39写道:
> >>>
>  Congratulations!
> 
>  Best
>  Hongshun
> 
> > 2025年4月16日 15:25,Leonard Xu  写道:
> >
> > becoming
> 
> 
> >>>
>


[SUMMARY] Flink 2.1 Release Sync 04/23/2025

2025-04-23 Thread Ron Liu
Hi, devs

Today we had our first meeting for Flink 2.1 release cycle. I'd like to
share the info
synced in the meeting.

1. Feature freezing and release:

   - Feature Freeze: June 21, 2025, 00:00 CEST(UTC+2)
   - Release: End of July 2025

2. Update 2.1 Release page:

Flink 2.1 release has been kicked off. We'd like to invite you to update
your development plan on the release page[1].

4. Blocker Issues

There are no blocker issues, but there are two unstable tests. I have created
the corresponding issues to track them:
- https://issues.apache.org/jira/projects/FLINK/issues/FLINK-37701 (stably reproducible)
- https://issues.apache.org/jira/projects/FLINK/issues/FLINK-37702

The next release sync up meeting will be on May 7, 2025. Please feel free
to join us[2]!

Best,
Ron

[1] https://cwiki.apache.org/confluence/display/FLINK/2.1+Release
[2] https://meet.google.com/pog-pebb-zrj


Re: [DISCUSS] Planning Flink 2.1

2025-04-10 Thread Ron Liu
Hi everyone,

The discussion has been ongoing for a long time, and there is currently 1
RM candidate, so the final release manager of 2.1 is Ron Liu. We will do
the first release sync on April 23, 2025, at 9 am (UTC+2) and 3 pm (UTC+8).
Welcome to the meeting!

Best regards,
Ron

Ron Liu  于2025年3月26日周三 11:44写道:

> Hi, David
>
> Thanks for your kind reminder, we are planning the next minor release 2.1.
>
> Thanks also to Xingtong for additional background information.
>
> Best,
> Ron
>
> Xintong Song  于2025年3月26日周三 11:08写道:
>
>> Thanks for the kick off and volunteering as release manager, Ron.
>>
>> +1 for start preparing release 2.1 and Ron as the release manager. The
>> proposed feature freeze date sounds good to me.
>>
>> @David,
>>
>> Just to provide some background information.
>>
>> In the Flink community, we use the terminologies Major, Minor and Bugfix
>> releases, represented by the three digits in the version number
>> respectively. E.g., 1.20.1 represents Major version 1, Minor version 20
>> and
>> Bugfix version 1.
>>
>> The major differences are:
>> 1. API compatibility: @Public APIs are guaranteed compatible across
>> different Minor and Bugfix releases of the same Major
>> version. @PublicEvolving APIs are guaranteed compatible across Bugfix
>> releases of the same Minor version.
>> 2. Bugfix releases typically only include bugfixes, but not feature /
>> improvement code changes. Major and Minor releases include both feature /
>> improvements and bugfixes.
>>
>> The Flink community applies a time-based release planning [1] for Major /
>> Minor releases. We try to deliver new features with a new release
>> (typically Minor, unless we decide to break the API compatibility with a
>> Major release) roughly every 4 months. As for bugfix releases, they are
>> usually planned on-demand. E.g., to make a critical bugfix available to
>> users, or observed significant bugfix issues have been merged for the
>> release branch.
>>
>> Best,
>>
>> Xintong
>>
>>
>> [1] https://cwiki.apache.org/confluence/display/FLINK/Time-based+releases
>>
>>
>>
>> On Wed, Mar 26, 2025 at 12:14 AM David Radley 
>> wrote:
>>
>> > Hi ,
>> > I am wondering what the thinking is about calling the next release 2.1
>> > rather than 2.0.1.  This numbering implies to me you are considering a
>> > minor release rather than a point release. For example
>> > https://commons.apache.org/releases/versioning.html,
>> >
>> > Kind regards, David.
>> >
>> >
>> > From: Jark Wu 
>> > Date: Tuesday, 25 March 2025 at 06:52
>> > To: dev@flink.apache.org 
>> > Subject: [EXTERNAL] Re: [DISCUSS] Planning Flink 2.1
>> > Thanks, Ron Liu for kicking off the 2.1 release.
>> >
>> > +1 for Ron Liu to be the release manager and
>> > +1 for the feature freeze date
>> >
>> > Best,
>> > Jark
>> >
>> > On Mon, 24 Mar 2025 at 16:33, Ron Liu  wrote:
>> >
>> > > Hi everyone,
>> > > With the release announcement of Flink 2.0, it's a good time to kick
>> off
>> > > discussion of the next release 2.1.
>> > >
>> > > - Release Managers
>> > >
>> > > I'd like to volunteer as one of the release managers this time. It has
>> > been
>> > > good practice to have a team of release managers from different
>> > > backgrounds, so please raise your hand if you'd like to volunteer and
>> get
>> > > involved.
>> > >
>> > >
>> > > - Timeline
>> > >
>> > > Flink 2.0 has been released. With a target release cycle of 4 months,
>> > > we propose a feature freeze date of *June 21, 2025*.
>> > >
>> > >
>> > > - Collecting Features
>> > >
>> > > As usual, we've created a wiki page[1] for collecting new features in
>> > 2.1.
>> > >
>> > > In the meantime, the release management team will be finalized in the
>> > next
>> > > few days, and we'll continue to create Jira Boards and Sync meetings
>> > > to make it easy for everyone to get an overview and track progress.
>> > >
>> > >
>> > > Best,
>> > > Ron
>> > >
>> > > [1] https://cwiki.apache.org/confluence/display/FLINK/2.1+Release
>> > >
>> >
>> >
>>
>


Recording Feature FLIP to Flink 2.1 Release Wiki

2025-04-17 Thread Ron Liu
Hi everyone,

Flink 2.1 has officially entered the release cycle, as the release manager,
I kindly remind you to register the relevant Feature FLIPs to be completed
in this release to the wiki document [1], so that we can better track the
progress and push Flink to keep moving forward.

[1] https://cwiki.apache.org/confluence/display/FLINK/2.1+Release

Thanks!

Best,
Ron


[KINDLY REMINDER] Flink 2.1 Release Sync 07/05/2025

2025-05-06 Thread Ron Liu
Hi, everyone!

The next release sync will be held at the following time.
- 14:00, May 7, BJT (GMT+8)
- 08:00, May 7, CET (GMT+2)
- 23:00, May 6, PT (GMT-7)

Everyone is welcome to join via Google Meet [1]. The topics to be discussed
can be found at the Status / Follow-ups section of the release wiki page
[2]. If there's other topics that you'd like to discuss, please feel free
to add to the list.

Best,
Ron

[1] https://meet.google.com/pog-pebb-zrj
[2] https://cwiki.apache.org/confluence/display/FLINK/2.1+Release


Re: [DISCUSS] FLIP-519: Introduce async lookup key ordered mode

2025-04-24 Thread Ron Liu
Hi, Shuai

Thanks for driving this proposal. The FLIP looks good to me overall, +1 for
it.

I have a small question about the table's contents. When
'table.exec.async-lookup.key-ordered-enabled=true' and
'table.exec.async-lookup.output-mode=ORDERED', the key-ordered column for the
lookup join should be 'yes', not 'no'.
ORDERED naturally guarantees ordering; it just doesn't depend on the
current implementation of this proposal, which is a bit confusing here.


Best,
Ron

shuai xu  于2025年4月18日周五 10:50写道:

> Hi Xuyang,
>
>
> Thanks for your response. Actually, ALLOW_UNORDERED is only enabled when
> facing append-only streams for higher throughput. KEY_ORDERED is designed
> for scenarios where an upsert key exists. Its goal is to achieve higher
> performance compared to the ORDERED mode while ensuring correctness is not
> compromised. In a word, ALLOW_UNORDERED mode is not used in the presence
> of an upsert key.
>
> Aside from the literal difference in sequence, KEY_ORDERED focuses on the
> processing order, while ALLOW_UNORDERED pertains to the output order. In
> ALLOW_UNORDERED mode, the processing order may be either ordered or
> unordered.
>
>
> Best,
> Xu Shuai
>
> > 2025年4月17日 14:53,Xuyang  写道:
> >
> > Hi, Shuai.
> >
> > This is a valuable addition to the current AsyncLookupJoin, and I’m
> > generally in favor of it.
> >
> > I have one question. Why do we need to introduce additional parameters
> > to control KEY_ORDERED and ALLOW_UNORDERED? In other words, what
> > scenarios require allowing users to perform completely unordered async
> > lookup joins in the presence of an upsert key?
> >
> >
> >
> >
> > --
> >
> >Best!
> >Xuyang
> >
> >
> >
> >
> >
> > 在 2025-04-11 10:39:46,"shuai xu"  写道:
> >> Hi all,
> >>
> >> This FLIP will primarily focus on the implementation within the table
> module. As for support in the DataStream API, it will be addressed in a
> separate FLIP.
> >>
> >>> 2025年4月8日 09:57,shuai xu  写道:
> >>>
> >>> Hi devs,
> >>>
> >>> I'd like to start a discussion on FLIP-519: Introduce async lookup key
> >>> ordered mode[1].
> >>>
> >>> The Flink system currently supports both record-level ordered and
> >>> unordered output modes for asynchronous lookup joins. However, it does
> >>> not guarantee the processing order of records sharing the same key.
> >>>
> >>> As highlighted in [2], there are two key requirements for enhancing
> >>> async io operations:
> >>> 1. Ensuring the processing order of records with the same key is a
> >>> common requirement in DataStream.
> >>> 2. Sequential processing of records sharing the same upsertKey when
> >>> performing lookup join in Flink SQL is essential for maintaining
> >>> correctness.
> >>>
> >>> This optimization aims to balance correctness and performance for
> >>> stateful streaming workloads.Then the FLIP introduce a new operator
> >>> KeyedAsyncWaitOperator to supports the optimization. Besides, a new
> >>> option is added to control the behaviour avoid influencing existing
> >>> jobs.
> >>>
> >>> please find more details in the FLIP wiki document[1]. Looking forward
> >>> to your feedback.
> >>>
> >>> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-519%3A++Introduce+async+lookup+key+ordered+mode
> >>> [2] https://lists.apache.org/thread/wczzjhw8g0jcbs8lw2jhtrkw858cmx5n
> >>>
> >>> Best,
> >>> Xu Shuai
>
>


Re: [DISCUSS] FLIP-527 Support EXPLAIN EXECUTE STATEMENT SET Syntax in Flink SQL

2025-04-28 Thread Ron Liu
Hi, Ramin

Thanks for starting this proposal.

After glancing at the content of the FLIP, I have a question: how valuable is
it to introduce this new syntax? If users want to EXPLAIN a STATEMENT SET,
the EXPLAIN keyword itself has to be written anyway,
so how much extra effort does it cost users to simply drop the EXECUTE
keyword? From this point of view, I don't think it makes much sense to
introduce this syntax.

Best,
Ron

David Radley  于2025年4月28日周一 21:51写道:

> Hi Ramin,
> Thanks for making the changes.
> As far as I can see the EXECUTE STATEMENT SET is Flink specific; so I am
> not sure this is part of the SQL standard. So, I guess adding the explain
> syntax as an alias for the existing explain would be ok.
>
> I assume Flink decided not to use stored procedures syntax to run the
> multiple inserts – which would be what I would have expected here.
>
>Kind regards David.
>
> From: Ramin Gharib 
> Date: Monday, 28 April 2025 at 13:41
> To: dev@flink.apache.org 
> Subject: [EXTERNAL] Re: [DISCUSS] FLIP-527 Support EXPLAIN EXECUTE
> STATEMENT SET Syntax in Flink SQL
> Hello David,
>
> Many thanks for the response! I have moved the discussion to a new link.
> The FLIP number was wrong (now it is at 528). You can find the new
> link below
>
> https://lists.apache.org/thread/28llyjdks9hrj5rvpbxhd3zd7bfqwrw8
>
> I have fixed the inconsistency in the text.
>
> Thanks for raising the point about SQL standard compliance with EXPLAIN
> EXECUTE STATEMENT SET. I appreciate your concern.
>
> I've checked PostgreSQL, MySQL, SQL Server, and Oracle, and found that
> their EXPLAIN is designed for single queries. EXECUTE STATEMENT SET appears
> to be specific to Flink SQL.
>
> Best,
>
> On Mon, Apr 28, 2025 at 1:57 PM David Radley 
> wrote:
>
> > Hi Ramin,
> > Thanks for the Flip.
> >
> > In the text you say
> > EXECUTE STATEMENT SET
> > BEGIN
> >INSERT INTO `sales` SELECT ...;
> >INSERT INTO `products` SELECT ...;
> > END;
> >
> > Twice, I think you are intending the 2nd text to be different. Maybe
> > EXPLAIN STATEMENT SET
> >
> > I am worried that in the SQL standard EXECUTE and EXPLAIN are at the same
> > level when SQL parsing – so it would not be compliant SQL to allow
> EXPLAIN
> > EXECUTE. Is there precedence for this in another implementation / SQL
> > reference?
> >
> > Kind regards, David.
> >
> >
> > From: Ramin Gharib  on behalf of Ramin Gharib
> > 
> > Date: Monday, 28 April 2025 at 12:04
> > To: dev@flink.apache.org 
> > Subject: [EXTERNAL] [DISCUSS] FLIP-527 Support EXPLAIN EXECUTE STATEMENT
> > SET Syntax in Flink SQL
> > Hi Flink Community,
> > ​
> >
> > I want to discuss FLIP-527: Support EXPLAIN EXECUTE STATEMENT SET Syntax
> in
> > Flink SQL.
> >
> > This FLIP proposes to extend the EXPLAIN statement in Flink SQL to
> support
> > the EXECUTE STATEMENT SET syntax. The goal is to provide users with a way
> > to understand the execution plan of a batch of SQL statements before they
> > are executed together. This will enhance debugging, optimization, and
> > overall understanding of how Flink processes a series of dependent
> > operations.
> > ​
> >
> > You can find the full proposal details on the Flink Wiki at:
> > ​
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-527%3A+Support+%60EXPLAIN+EXECUTE+STATEMENT+SET%60+Syntax+in+Flink+SQL
> >
> >
> > ​I'm eager to hear your thoughts, feedback, and potential concerns
> > regarding this proposal.
> >
> > Thanks,
> >
> >
> > ​Ramin Gharib
> >
> > [image: Confluent] 
> >
> > *Ramin Gharib *he/him/his
> >
> > Senior Software Engineer
> >
> > *Follow us: *[image: Blog]
> > <
> >
> https://www.confluent.io/blog?utm_source=footer&utm_medium=email&utm_campaign=ch.email-signature_type.community_content.blog
> > >[image:
> > Twitter] [image: LinkedIn]
> > [image: Slack]
> > [image: YouTube]
> > 
> >
> > [image: Try Confluent Cloud for Free]
> > <
> >
> https://www.confluent.io/get-started?utm_campaign=tm.fm-apac_cd.inbound&utm_source=gmail&utm_medium=organic
> > >
> >
> > Sent with Notion Mail 
> >
> >
>
>

Re: [DISCUSS] FLIP-516: Multi-Way Join Operator

2025-05-07 Thread Ron Liu
Hi, Gustavo

Sorry for the late participation in the FLIP discussion, this is a great
feature to solve the headache of the stream join, Big +1.

Regarding the new config option `table.optimizer.multi-join.enabled`, I
have a question: is this option only effective in streaming mode? What is
its behavior in batch mode?

Best,
Ron

Gustavo de Morais  于2025年5月8日周四 00:03写道:

> Hey Ferenc, that's a great callout. I'll make sure we add some
> documentation regarding general advice on when to use multi-way joins (pros
> and cons).
>
> Am Di., 6. Mai 2025 um 17:23 Uhr schrieb Ferenc Csaky
> :
>
> > Hi,
> >
> > I think the FLIP is in a fairly good state, +1 for the idea and the given
> > design. This may be considered already, but IMO we should also add some
> > high-level details, pros, and cons of enabling this feature to the
> website
> > other than the config option description.
> >
> > Best,
> > Ferenc
> >
> >
> >
> >
> > On Friday, May 2nd, 2025 at 14:47, Gustavo de Morais <
> > gustavopg...@gmail.com> wrote:
> >
> > >
> > >
> > > Hey everyone,
> > >
> > > I'd be great to start voting next week. Let me know if there are
> further
> > > questions or feedback.
> > >
> > > Thanks,
> > > Gustavo
> > >
> > > Am Mi., 30. Apr. 2025 um 15:07 Uhr schrieb Gustavo de Morais <
> > > gustavopg...@gmail.com>:
> > >
> > > > Hey Arvid and David, thanks for the feedback!
> > > >
> > > > The limitations are in the flip, I just had pasted a wrong link and
> > fixed
> > > > it. Let me know if there are other incorrect links.
> > > >
> > > > Yes, the thought of using statistics has potential. I've also spent
> > > > some time on that. The precise statistics required here would however
> > > > be the amount of intermediate state/matches for each level, and this
> > > > is information we only have at runtime/inside the operator. For that,
> > > > we could look
> into
> > an
> > > > adaptive multi-way join in a next iteration and the user could
> > determine
> > > > a max amount of state he's willing to store. This has potential but
> > would
> > > > be a topic for a next FLIP, I added some information on that under
> the
> > > > rejected alternatives.
> > > >
> > > > Kind regards,
> > > > Gustavo
> > > >
> > > > Am Mo., 28. Apr. 2025 um 14:18 Uhr schrieb David Radley <
> > > > david_rad...@uk.ibm.com>:
> > > >
> > > > > Hi Gustavo,This sounds like a great idea.
> > > > > I notice the link limitations<
> > > > >
> >
> https://confluentinc.atlassian.net/wiki/spaces/FLINK/pages/4342875697/FLIP-516+Multi-Way+Join+Operator#Limitations
> > >
> > > > > in the Flip points outside of the document to something I do not
> have
> > > > > access to. Please could you include the limitations in the flip
> > itself.
> > > > >
> > > > > You mention re ordered binary joins might be less efficient by
> > turning
> > > > > them into a multi join. I wonder what the pros and cons are. I
> > wonder can
> > > > > we use statistics to decide whether we should do a multi way join?
> > In this
> > > > > case we could have an enum configuration something like:
> > > > > table.optimizer.join= binary-join, multi-join, auto.
> > > > >
> > > > > Kind regards, David.
> > > > >
> > > > > From: Arvid Heise ar...@apache.org
> > > > > Date: Monday, 28 April 2025 at 12:47
> > > > > To: dev@flink.apache.org dev@flink.apache.org
> > > > > Subject: [EXTERNAL] Re: [DISCUSS] FLIP-516: Multi-Way Join Operator
> > > > > Hi Gustavo,
> > > > >
> > > > > the idea and approach LGTM. +1 to proceed.
> > > > >
> > > > > Best,
> > > > >
> > > > > Arvid
> > > > >
> > > > > On Thu, Apr 24, 2025 at 4:58 PM Gustavo de Morais <
> > gustavopg...@gmail.com
> > > > >
> > > > > wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > I'd like to propose FLIP-516: Multi-Way Join Operator [1] for
> > > > > > discussion.
> > > > > >
> > > > > > Chained non-temporal joins in Flink SQL often cause a "big state
> > issue"
> > > > > > due
> > > > > > to large intermediate results, impacting performance and
> > stability. This
> > > > > > FLIP introduces a StreamingMultiJoinOperator to tackle this by
> > joining
> > > > > > multiple inputs (that need to share a common key) simultaneously
> > within
> > > > > > one
> > > > > > operator.
> > > > > >
> > > > > > The main goal is achieving zero intermediate state for these
> > common join
> > > > > > patterns, significantly reducing state size. This initial version
> > > > > > requires
> > > > > > a common partitioning key and focuses on INNER/LEFT joins, with
> > plans
> > > > > > for
> > > > > > future expansion. The operator is opt-in via
> > > > > > table.optimizer.multi-join.enabled (default false). PR with the
> > initial
> > > > > > version of the operator is available [2].
> > > > > >
> > > > > > Happy to be contributing to this community, and looking forward
> to
> > your
> > > > > > feedback and thoughts.
> > > > > >
> > > > > > Kind regards,
> > > > > > Gustavo de Morais
> > > > > >
> > > > > > [1]
> > > > >
> > > > >
> >
> https://
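As a rough illustration of the zero-intermediate-state idea described in this thread: when every input shares one join key, an N-way inner join can keep only per-input state and probe the other inputs on every arrival, so no binary-join intermediate results are materialized. This is an invented sketch under those assumptions, not FLIP-516's actual StreamingMultiJoinOperator.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative N-way inner join on a common key (invented sketch):
// each input stores only its own rows per key; an arriving row probes
// the other inputs' state directly.
public class MultiJoinSketch {

    final int numInputs;
    // state.get(i): rows received on input i, grouped by the common join key
    final List<Map<String, List<String>>> state = new ArrayList<>();
    final List<List<String>> output = new ArrayList<>();

    MultiJoinSketch(int numInputs) {
        this.numInputs = numInputs;
        for (int i = 0; i < numInputs; i++) state.add(new HashMap<>());
    }

    void onRecord(int input, String key, String row) {
        state.get(input).computeIfAbsent(key, k -> new ArrayList<>()).add(row);
        emit(input, key, row, 0, new ArrayList<>());
    }

    // Enumerate every join combination that contains the newly arrived row.
    private void emit(int input, String key, String row, int i, List<String> acc) {
        if (i == numInputs) {
            output.add(new ArrayList<>(acc));
            return;
        }
        List<String> candidates = (i == input)
                ? List.of(row)
                : state.get(i).getOrDefault(key, List.of());
        for (String r : candidates) {
            acc.add(r);
            emit(input, key, row, i + 1, acc);
            acc.remove(acc.size() - 1);
        }
    }

    public static void main(String[] args) {
        MultiJoinSketch join = new MultiJoinSketch(3);
        join.onRecord(0, "k1", "orders:o1");
        join.onRecord(1, "k1", "payments:p1");
        join.onRecord(2, "k1", "shipments:s1"); // completes the first match
        System.out.println(join.output); // prints [[orders:o1, payments:p1, shipments:s1]]
    }
}
```

A real operator would of course also handle retractions, LEFT joins, and state TTL; the sketch only shows why the common-key restriction removes intermediate state.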

Re: [VOTE] FLIP-521: Integrating Variant Type into Flink: Enabling Efficient Semi-Structured Data Processing

2025-05-07 Thread Ron Liu
+1(binding)

Best,
Ron

weijie guo  于2025年5月7日周三 21:15写道:

> +1(binding)
>
> Best regards,
>
> Weijie
>
>
> Xintong Song  于2025年5月7日周三 19:37写道:
>
> > +1 (binding)
> >
> > Best,
> >
> > Xintong
> >
> >
> >
> > On Wed, May 7, 2025 at 7:00 PM Lincoln Lee 
> wrote:
> >
> > > +1 (binding)
> > >
> > >
> > > Best,
> > > Lincoln Lee
> > >
> > >
> > > Gyula Fóra  于2025年5月7日周三 18:43写道:
> > >
> > > > +1 (binding)
> > > >
> > > > Gyula
> > > >
> > > > On Wed, May 7, 2025 at 12:40 PM Timo Walther 
> > wrote:
> > > >
> > > > > +1 (binding)
> > > > >
> > > > > Thanks for working on this!
> > > > >
> > > > > Best,
> > > > > Timo
> > > > >
> > > > > On 07.05.25 11:08, Xuannan Su wrote:
> > > > > > Hi everyone,
> > > > > >
> > > > > > I'd like to start a vote on FLIP-521: Integrating Variant Type
> into
> > > > > > Flink: Enabling Efficient Semi-Structured Data Processing[1],
> which
> > > > > > has been discussed in this thread[2].
> > > > > >
> > > > > > The vote will be open for at least 72 hours unless there is an
> > > > objection
> > > > > > or not enough votes.
> > > > > >
> > > > > > Best,
> > > > > > Xuannan
> > > > > >
> > > > > > [1]
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-521%3A+Integrating+Variant+Type+into+Flink%3A+Enabling+Efficient+Semi-Structured+Data+Processing#FLIP521:IntegratingVariantTypeintoFlink:EnablingEfficientSemiStructuredDataProcessing-Casting
> > > > > > [2]
> > https://lists.apache.org/thread/qlnhlsvdl6blcbxff700fz1tdk6ssmqg
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>


Re: [DISCUSS] FLIP-506: Support Reuse Multiple Table Sinks in Planner

2025-02-18 Thread Ron Liu
Hi Xiangyu,

Thanks for your reply; the updates LGTM overall.

1. Regarding the naming of the interface, what do you think about calling
it SupportsTargetColumnWriting? Here I would like to emphasize the support
for partial column writing, and I personally think the naming can be
aligned with SupportsWritingMetadata.

2. Regarding the interface methods, is it necessary to provide a default
implementation, do most of the stores support partial column writing?


Best,
Ron

Cong Cheng  于2025年2月18日周二 16:12写道:

> Hi Xiangyu,
>
> Introduce a new sink ability interface named `SupportsTargetColumnUpdate`,
> > this interface will tell the planner if the sink has considered the
> target
> > columns information in its implementation;
>
>
> I think it makes a lot of sense, +1 for this ability.
>
> Sorry to all that I sent the draft of the content twice; something went
> wrong with the Enter key on my keyboard.
>
> Best,
> Cong Cheng
>
> xiangyu feng  于2025年2月18日周二 15:06写道:
>
> > Hi Kevin,
> >
> > Thx for ur valuable suggestion. I've made a few changes to current
> FLIP[1].
> >
> > 1, Introduce a new sink ability interface named
> > `SupportsTargetColumnUpdate`, this interface will tell the planner if the
> > sink has considered the target columns information in its implementation;
> >
> > 2, This ability will return true by default for safety consideration;
> >
> > 3, When generating node digest for sink reuse, the digest will only
> ignore
> > the target column infos when this ability return false. This will require
> > extra work for specific sink.
> >
> > By applying these changes, we can safely enable the sink reuse feature by
> > default even for sinks like JDBC . And for sinks like Paimon, we can also
> > reuse the sink node across multiple partial-update streams with different
> > target columns by revising paimon sink to implement this interface.
> >
> > Glad to hear you back for these updates.
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-506%3A+Support+Reuse+Multiple+Table+Sinks+in+Planner
> >
> >
> >
> > Kevin Cheng  于2025年2月14日周五 16:13写道:
> >
> >> Hi Xiangyu and Ron,
> >>
> >> I agree that sink reuse can be enabled by default from Flink Planner
> >> perspective. But the planner should be informed by Sink Connector that
> >> whether the planner can reuse different sink with different target
> >> columns.
> >>
> >> Take JDBC sink as an example, under partial update circumstances, the
> JDBC
> >> needs to know the target sink or update columns of every record.
> However,
> >> the target columns info is discarded in current FLIP design.
> >>
> >> Best,
> >> Xiangyu
> >>
> >> xiangyu feng  于2025年2月14日周五 13:51写道:
> >>
> >> > Hi Ron,
> >> >
> >> > After second thought, taking sink reuse as a long awaited feature
> >> > with significant benefits for users, I agree to enable this in the
> first
> >> > place.  Similar features like `table.optimizer.reuse-sub-plan-enabled`
> >> and
> >> > `table.optimizer.reuse-source-enabled` are also enabled by default.
> From
> >> > this point of view, sink reuse should be the same.
> >> >
> >> > The Flip[1] has been revised accordingly. Thx for suggestion.
> >> >
> >> > [1]
> >> >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-506%3A+Support+Reuse+Multiple+Table+Sinks+in+Planner
> >> >
> >> > Regards,
> >> > Xiangyu
> >> >
> >> >
> >> >
> >> >
> >> > Ron Liu  于2025年2月14日周五 12:10写道:
> >> >
> >> > > Hi, Xiangyu
> >> > >
> >> > > >>> I prefer to set the default value of this option as 'false' in the
> >> > first
> >> > > place.  Setting as true might introduce unexpected behavior for
> users
> >> > when
> >> > > operating existing jobs. Maybe we should introduce this feature at
> >> first
> >> > > and discuss enabling this feature as default in a separated thread.
> >> WDYT?
> >> > >
> >> > > 1. What unexpected behaviors do you think this might introduce?  For
> >> Sink
> >> > > nodes, which are generally stateless, I intuitively understand that
> no
> >> > > state compatibility issues will be int
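A hypothetical sketch of how the ability interface discussed in this thread could interact with the planner's reuse check. The interface shape, method signature, and class names below are assumptions for illustration, not FLIP-506's final API: without the ability, two INSERTs may only share a sink node when their target columns are identical; with it, the planner may merge them regardless.

```java
import java.util.Arrays;

// Hypothetical shape of the ability interface discussed in this thread;
// the real FLIP-506 definition may differ. A sink implementing it tells
// the planner that it handles target columns itself, so sink nodes may be
// reused even when their target columns differ.
interface SupportsTargetColumnWriting {
    boolean applyTargetColumns(int[][] targetColumns);
}

public class SinkReuseSketch {

    // Planner-side check (invented): compare target columns unless the
    // sink declares the ability.
    static boolean canReuse(Object sink, int[][] columnsA, int[][] columnsB) {
        if (sink instanceof SupportsTargetColumnWriting) {
            return true; // sink handles per-statement target columns itself
        }
        return Arrays.deepEquals(columnsA, columnsB);
    }

    public static void main(String[] args) {
        int[][] a = {{0}, {1}}; // e.g. INSERT INTO t (c0, c1) ...
        int[][] b = {{0}, {2}}; // e.g. INSERT INTO t (c0, c2) ...
        System.out.println(canReuse(new Object(), a, b)); // prints false
        SupportsTargetColumnWriting sink = cols -> true;
        System.out.println(canReuse(sink, a, b));         // prints true
    }
}
```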

Re: [DISCUSS] FLIP-506: Support Reuse Multiple Table Sinks in Planner

2025-02-19 Thread Ron Liu
Hi, Xiangyu

Thanks for the updates, LGTM.

Best,
Ron

xiangyu feng  于2025年2月19日周三 17:13写道:

> Hi Ron,
>
> Thx for ur advice.  I've made the changes to current FLIP[1] including
> renaming the interface and remove the default implementation. As we have
> discussed, the target columns will be compared in sink reuse if the sink
> has not implemented the `SupportsTargetColumnWriting` ability. This will
> make sure the sink reuse feature can still be safely enabled by default.
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-506%3A+Support+Reuse+Multiple+Table+Sinks+in+Planner
>
> Best,
> Xiangyu Feng
>
> Ron Liu  于2025年2月19日周三 10:25写道:
>
>> Hi Xiangyu,
>>
>> Thanks for your reply; the updates LGTM overall.
>>
>> 1. Regarding the naming of the interface, what do you think about calling
>> it SupportsTargetColumnWriting? Here I would like to emphasize the support
>> for partial column writing, and I personally think the naming can be
>> aligned with SupportsWritingMetadata.
>>
>> 2. Regarding the interface methods, is it necessary to provide a default
>> implementation, do most of the stores support partial column writing?
>>
>>
>> Best,
>> Ron
>>
>> Cong Cheng  于2025年2月18日周二 16:12写道:
>>
>>> Hi Xiangyu,
>>>
>>> Introduce a new sink ability interface named
>>> `SupportsTargetColumnUpdate`,
>>> > this interface will tell the planner if the sink has considered the
>>> target
>>> > columns information in its implementation;
>>>
>>>
>>> I think it makes a lot of sense, +1 for this ability.
>>>
>>> Sorry to all that I sent the draft of the content twice; something went
>>> wrong with the Enter key on my keyboard.
>>>
>>> Best,
>>> Cong Cheng
>>>
>>> xiangyu feng  于2025年2月18日周二 15:06写道:
>>>
>>> > Hi Kevin,
>>> >
>>> > Thx for ur valuable suggestion. I've made a few changes to current
>>> FLIP[1].
>>> >
>>> > 1, Introduce a new sink ability interface named
>>> > `SupportsTargetColumnUpdate`, this interface will tell the planner if
>>> the
>>> > sink has considered the target columns information in its
>>> implementation;
>>> >
>>> > 2, This ability will return true by default for safety consideration;
>>> >
>>> > 3, When generating node digest for sink reuse, the digest will only
>>> ignore
>>> > the target column infos when this ability return false. This will
>>> require
>>> > extra work for specific sink.
>>> >
>>> > By applying these changes, we can safely enable the sink reuse feature
>>> by
>>> > default even for sinks like JDBC . And for sinks like Paimon, we can
>>> also
>>> > reuse the sink node across multiple partial-update streams with
>>> different
>>> > target columns by revising paimon sink to implement this interface.
>>> >
>>> > Glad to hear you back for these updates.
>>> >
>>> > [1]
>>> >
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-506%3A+Support+Reuse+Multiple+Table+Sinks+in+Planner
>>> >
>>> >
>>> >
>>> > Kevin Cheng  于2025年2月14日周五 16:13写道:
>>> >
>>> >> Hi Xiangyu and Ron,
>>> >>
>>> >> I agree that sink reuse can be enabled by default from Flink Planner
>>> >> perspective. But the planner should be informed by Sink Connector that
>>> >> whether the planner can reuse different sink with different target
>>> >> columns.
>>> >>
>>> >> Take JDBC sink as an example, under partial update circumstances, the
>>> JDBC
>>> >> needs to know the target sink or update columns of every record.
>>> However,
>>> >> the target columns info is discarded in current FLIP design.
>>> >>
>>> >> Best,
>>> >> Xiangyu
>>> >>
>>> >> xiangyu feng  于2025年2月14日周五 13:51写道:
>>> >>
>>> >> > Hi Ron,
>>> >> >
>>> >> > After second thought, taking sink reuse as a long awaited feature
>>> >> > with significant benefits for users, I agree to enable this in the
>>> first
>>> >> > place.  Simil

Re: [VOTE] FLIP-506: Support Reuse Multiple Table Sinks in Planner

2025-03-10 Thread Ron Liu
+1(binding)

Best,
Ron

Jingsong Li  于2025年3月10日周一 15:44写道:

> +1
>
> On Mon, Mar 10, 2025 at 1:56 PM Lincoln Lee 
> wrote:
> >
> > +1 (binding)
> >
> >
> > Best,
> > Lincoln Lee
> >
> >
> > xiangyu feng  于2025年3月10日周一 13:33写道:
> >
> > > Hi devs,
> > >
> > > All comments in the discussion thread[1] have been resolved. I would
> like
> > > to proceed this voting process.
> > >
> > > [1] https://lists.apache.org/thread/r1wo9sf3d1725fhwzrttvv56k4rc782m
> > >
> > > Regards,
> > > Xiangyu Feng
> > >
> > > Leonard Xu  于2025年3月10日周一 12:01写道:
> > >
> > > > +1 (binding)
> > > >
> > > > Best,
> > > > Leonard
> > > >
> > > > > 2025年2月25日 10:12,weijie guo  写道:
> > > > >
> > > > > +1(binding)
> > > > >
> > > > > Best regards,
> > > > >
> > > > > Weijie
> > > > >
> > > > >
> > > > > Zhanghao Chen  于2025年2月23日周日 16:36写道:
> > > > >
> > > > >> +1 (non-binding)
> > > > >>
> > > > >> Thanks for driving this. It's a nice useability improvement for
> > > > performing
> > > > >> partial-updates on datalakes.
> > > > >>
> > > > >>
> > > > >> Best,
> > > > >> Zhanghao Chen
> > > > >> 
> > > > >> From: xiangyu feng 
> > > > >> Sent: Sunday, February 23, 2025 10:44
> > > > >> To: dev@flink.apache.org 
> > > > >> Subject: [VOTE] FLIP-506: Support Reuse Multiple Table Sinks in
> > > Planner
> > > > >>
> > > > >> Hi all,
> > > > >>
> > > > >> I would like to start the vote for FLIP-506: Support Reuse
> Multiple
> > > > Table
> > > > >> Sinks in Planner[1].
> > > > >> This FLIP was discussed in this thread [2].
> > > > >>
> > > > >> The vote will be open for at least 72 hours unless there is an
> > > > objection or
> > > > >> insufficient votes.
> > > > >>
> > > > >> [1]
> > > > >>
> > > > >>
> > > >
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-506%3A+Support+Reuse+Multiple+Table+Sinks+in+Planner
> > > > >> [2]
> https://lists.apache.org/thread/r1wo9sf3d1725fhwzrttvv56k4rc782m
> > > > >>
> > > > >> Regards,
> > > > >> Xiangyu Feng
> > > > >>
> > > >
> > > >
> > >
>


Re: [DISCUSS] FLIP-506: Support Reuse Multiple Table Sinks in Planner

2025-02-13 Thread Ron Liu
Hi, Xiangyu

>>> I prefer to set the default value of this option as 'false' in the first
place.  Setting it as 'true' might introduce unexpected behavior for users when
operating existing jobs. Maybe we should introduce this feature at first
and discuss enabling this feature as default in a separated thread. WDYT?

1. What unexpected behaviors do you think this might introduce?  For Sink
nodes, which are generally stateless, I intuitively understand that no
state compatibility issues will be introduced after Sink reuse.

2. Since Sink reuse benefits users, why not enable this feature by default
on the first day it is introduced? If your concern is potential unhandled
corner cases in the implementation, I consider those to be bugs. We should
prioritize fixing them rather than blocking the default enablement of this
optimization.

3. If we don't enable it by default now, when should we? What specific
milestones or actions are needed during the waiting period?  Your concerns
about unintended behaviors would still exist even if we enable it later.
Why delay resolving this in a separate discussion instead of finalizing it
here?

4. From our internal practice, users still want to enjoy the benefits of
this feature by default.
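To make the trade-off concrete for readers following the thread: if the option defaulted to 'false', every job wanting the optimization would need an explicit opt-in, roughly like the sketch below (the option name is the one proposed in FLIP-506 and may change before release; table and column names are illustrative):

```sql
-- Hypothetical opt-in if sink reuse defaulted to 'false'
-- (option name as proposed in FLIP-506, subject to change).
SET 'table.optimizer.reuse-sink-enabled' = 'true';

-- Two partial inserts into the same sink, submitted as one job.
EXECUTE STATEMENT SET
BEGIN
  INSERT INTO sink(pk, f1) SELECT pk, f1 FROM table1;
  INSERT INTO sink(pk, f2) SELECT pk, f2 FROM table2;
END;
```

With the default set to 'true', the two partial inserts above could share one sink node without any extra configuration.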


Best,
Ron

xiangyu feng  于2025年2月13日周四 15:57写道:

>  Hi Ron,
>
> Thx for quick response.
>
> - should the default value be true for the newly introduced option
> `table.optimizer.reuse-sink-enabled`?
>
> I prefer to set the default value of this option as 'false' in the first
> place.  Setting it as 'true' might introduce unexpected behavior for users when
> operating existing jobs. Maybe we should introduce this feature at first
> and discuss enabling this feature as default in a separated thread. WDYT?
>
> - have you considered the technical implementation options and are they
> feasible?
>
> Yes, we have already implemented the POC internally. It works well.
>
> Looking forward for your feedback.
>
> Best,
> Xiangyu
>
> Ron Liu  于2025年2月13日周四 14:55写道:
>
> > Hi, Xiangyu
> >
> > Thank you for proposing this FLIP, it's great work and looks very useful
> > for users.
> >
> > I have the following two questions regarding the content of the FLIP:
> > 1. Since sink reuse is very useful, should the default value be true for
> > the newly introduced option `table.optimizer.reuse-sink-enabled`, and
> > should the engine enable this optimization by default? Currently for
> source
> > reuse, the default value of the `table.optimizer.reuse-source-enabled`
> > option is also true, which requires no extra user action by default, so I
> > think the engine should turn on sink reuse optimization by default.
> > 2. Regarding Sink Digest, you mentioned disregarding the sink target
> > column, which I think is a very good suggestion, and very useful if it
> can
> > be done. I have a question: have you considered the technical
> > implementation options and are they feasible?
> >
> > Best,
> > Ron
> >
> > xiangyu feng  于2025年2月13日周四 12:56写道:
> >
> > > Hi all,
> > >
> > > Thank you all for the comments.
> > >
> > > If there is no further comment, I will open the voting thread in 3
> days.
> > >
> > > Regards,
> > > Xiangyu
> > >
> > > xiangyu feng  于2025年2月11日周二 14:17写道:
> > >
> > > > Link for Paimon LocalMerge Operator[1]
> > > >
> > > > [1]
> > > >
> > >
> >
> https://paimon.apache.org/docs/master/maintenance/write-performance/#local-merging
> > > >
> > > > xiangyu feng  于2025年2月11日周二 14:03写道:
> > > >
> > > >> Follow the above,
> > > >>
> > > >> "And for SinkWriter, the data structure to be processed should be
> > > fixed."
> > > >>
> > > >> I'm not very sure why the data structure of SinkWriter should be
> > fixed.
> > > >> Can you elaborate the scenario here?
> > > >>
> > > >>  "Is there a node or an operator to fill in the inconsistent field
> of
> > > >> Rowdata that passed from different Sources?"
> > > >>
> > > >> By `filling in the inconsistent field from different sources`, do
> you
> > > >> refer to implementations like the LocalMerge Operator [1] for
> Paimon?
> > > IMHO,
> > > >> this should not be included in the Sink Reuse. The merging behavior
> of
>>> >> multiple sources should be considered inside the sink.
> > > >>
> > > >> Regards,
> > > >> Xiangyu Feng

Re: [ANNOUNCE] New Apache Flink Committer - Xuyang

2025-02-19 Thread Ron Liu
Congratulations, you deserve it.

Best,
Ron


Lincoln Lee  于2025年2月20日周四 09:50写道:

> Hi everyone,
>
> On behalf of the PMC, I'm happy to announce that Xuyang has become a
> new Flink Committer!
>
> Xuyang has been contributing to the Flink community since Sep 15, 2021; he
> has
> driven and contributed to 5 FLIPs, submitted over 100 commits and more than
> 100,000 lines of code changes.
>
> He primarily focuses on the table-related modules. He has completed the
> support
> of SQL Join Hints, advancing the integration of SQL operators with the new
> disaggregated state, and also addressing technical debt, including
> improving
> TVF window functionality and performing extensive code cleanup for
> deprecated
> APIs in the table module for version 2.0. Additionally, he's very active in
> the mailing
> list, answering and resolving user issues, and participating in community
> discussions.
>
> Please join me in congratulating Xuyang for becoming an Apache Flink
> committer.
>
>
> Cheers,
> Lincoln Lee (On behalf of the Flink PMC)
>


Re: [ANNOUNCE] New Apache Flink Committer - Feng Jin

2025-02-20 Thread Ron Liu
Congratulations

Best,
Ron

Yuepeng Pan  于2025年2月21日周五 09:57写道:

> Congratulations !
>
>
> Best,
> Yuepeng.
>
>
>
>
>
> At 2025-02-21 09:52:46, "Lincoln Lee"  wrote:
> >Hi everyone,
> >
> >On behalf of the PMC, I'm happy to announce that Feng Jin has become a
> >new Flink Committer!
> >
> >Feng Jin started contributing to the Apache Flink project on Nov 7, 2020
> >and
> >has been active for two years, driving and contributing to 6 FLIPs,
> >completed
> >70+ commits with tens of thousands of lines of code changes.
> >
> >In the evolution of new Flink SQL features, he has contributed support for
> >the
> >new Time Travel syntax, participated in the development of Call
> >Procedure
> >and further supported named parameters. For the recent releases, he has
> >been continuously improving the functionalities of the new Materialized
> >Table.
> >Meanwhile, he is also active in mailing lists, participating in
> discussions
> >and
> >help answering user questions.
> >
> >Please join me in congratulating Feng Jin for becoming an Apache Flink
> >committer.
> >
> >
> >Cheers,
> >Lincoln Lee (On behalf of the Flink PMC)
>


Re: [DISCUSS] FLIP-506: Support Reuse Multiple Table Sinks in Planner

2025-02-24 Thread Ron Liu
Hi, Xiangyu

Thanks for pinging me, the interface name looks good to me.


Best,
Ron

xiangyu feng  于2025年2月24日周一 17:59写道:

> Hi Lincoln,
>
> Thx for suggestion.
>
> -- 1st, for the sql example in the Motivation part, why is cast nulls
> included in the select clause after union all multiple inputs?
>
> I do have misplaced the CAST NULLs, revised in the FLIP[1].
>
> -- 2nd, simplifying the multi partial-insert example, what would the
> equivalent sql look like to the user after applying the optimization
> provided by this FLIP?
> ```
> INSERT INTO sink(pk, f1) SELECT ... FROM table1;
> INSERT INTO sink(pk, f2) SELECT ... FROM table2;
> INSERT INTO sink(pk, f3) SELECT ... FROM table3;
> ```
>
> Actually, this is already the simplified SQL. The optimization here
> is that, with this SQL, users can submit a unified datastream job writing to
> the sink table without concerns about concurrency issues.
>
> -- 3rd, for sink digest description in proposed changes, IIUC, it should
> be `b` for `not include` and `c` for `include`?
>
> I think there was an ambiguity in the previous naming of the interface:
> `SupportsTargetColumnWriting`. The name originally meant that the
> target columns have been **used** in the sink, so the sink digest should
> **include** this information.
> On second thought, I've changed the name of this interface to
> `SupportsTargetColumnReusing` to remove this ambiguity in the FLIP[1]. When
> this ability is enabled, the sink can support reuse with different target
> columns, which means the sink digest should **not include** this
> information.
>
> @Lincoln, looking forward to hearing back from you.
>
> Also, @Ron Liu @Weijie Guo I would like to hear more from you about this
> new interface naming.
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-506%3A+Support+Reuse+Multiple+Table+Sinks+in+Planner
>
> Regards,
> Xiangyu Feng
>
>
> Lincoln Lee  于2025年2月23日周日 22:43写道:
>
>> Hi xiangyu,
>>
>> Sorry for my late reply! I have some questions for the FLIP:
>>
>> 1st, for the sql example in the Motivation part, why is cast nulls
>> included
>> in
>> the select clause after union all multiple inputs? Related to the
>> partial-insert
>> example later, should the cast nulls be in the select clause inside the
>> union all?
>> ```
>> -- Flink SQL
>> INSERT INTO sink
>> SELECT
>> id1,
>> CAST(NULL AS STRING) AS f1,
>> CAST(NULL AS STRING) AS f2,
>> CAST(NULL AS STRING) AS f3,
>> CAST(NULL AS STRING) AS f4,
>> CAST(NULL AS STRING) AS f5,
>> CAST(NULL AS STRING) AS f6,
>> CAST(NULL AS STRING) AS f7,
>> CAST(NULL AS STRING) AS f8,
>> CAST(NULL AS STRING) AS f9,
>> ... ...
>> FROM (
>> SELECT  ... ...
>> FROMtable1
>> UNION ALL
>> SELECT  ... ...
>> FROMtable2
>> ```
>>
>> 2nd, simplifying the multi partial-insert example, what would the
>> equivalent
>> sql look like to the user after applying the optimization provided by this
>> FLIP?
>> ```
>> INSERT INTO sink(pk, f1) SELECT ... FROM table1;
>> INSERT INTO sink(pk, f2) SELECT ... FROM table2;
>> INSERT INTO sink(pk, f3) SELECT ... FROM table3;
>> ```
>>
>> 3rd, for sink digest description in proposed changes, IIUC, it should be
>> `b` for `not include` and `c` for `include`?
>> ```
>>
>> Factors *considered* for sink node digest depends on circumstance:
>>
>>1. sink target columns
>>1. The sink node digest will *include* the target columns if the sink
>>   has not implemented the target column writing ability interface.
>>   2. The sink node digest will *include* the target columns when the
>>   sink has enabled the target column writing ability
>>   3. The sink node digest will *not include* the target columns when
>>   the sink has not enabled the target column writing ability
>>
>> ```
>>
>>
>> Best,
>> Lincoln Lee
>>
>>
>> xiangyu feng  于2025年2月20日周四 09:41写道:
>>
>> > Hi all,
>> >
>> > Thank you all for the comments.
>> >
>> > If there is no further comment, I will open the voting thread in 3 days.
>> >
>> > Regards,
>> > Xiangyu
>> >
>> >
>> > Ron Liu 于2025年2月19日 周三17:58写道:
>> >
>> > > Hi, Xiangyu
>> > >
>> > > Thanks for the updates, LGTM
>> > >
>> > > Best,
>> > > Ron

[DISCUSS] Planning Flink 2.1

2025-03-24 Thread Ron Liu
Hi everyone,
With the release announcement of Flink 2.0, it's a good time to kick off
the discussion of the next release, 2.1.

- Release Managers

I'd like to volunteer as one of the release managers this time. It has been
good practice to have a team of release managers from different
backgrounds, so please raise you hand if you'd like to volunteer and get
involved.


- Timeline

Flink 2.0 has been released. With a target release cycle of 4 months,
we propose a feature freeze date of *June 21, 2025*.


- Collecting Features

As usual, we've created a wiki page[1] for collecting new features in 2.1.

In the meantime, the release management team will be finalized in the next
few days, and we'll continue to create Jira Boards and Sync meetings
to make it easy for everyone to get an overview and track progress.


Best,
Ron

[1] https://cwiki.apache.org/confluence/display/FLINK/2.1+Release


Re: [VOTE]FLIP-519: Introduce async lookup key ordered mode

2025-05-11 Thread Ron Liu
+1(binding)

Best,
Ron

Lincoln Lee  于2025年5月12日周一 09:55写道:

> +1 (binding)
>
>
> Best,
> Lincoln Lee
>
>
> shuai xu  于2025年5月9日周五 17:46写道:
>
> > Hi devs,
> >
> > Thank you for providing feedback on FLIP-519: Introduce async lookup key
> > ordered mode[1].
> >  I'd like to start a vote on this FLIP. Here is the discussion thread[2].
> >
> >
> > The vote will be open for at least 72 hours unless there is an objection
> > or insufficient votes.
> >
> > Best,
> > Xu Shuai
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-519%3A++Introduce+async+lookup+key+ordered+mode
> > [2] https://lists.apache.org/thread/z3th724l6vnylgv601gvcbdy4oy2wy7r
>


Re: [VOTE] FLIP-516: Multi-Way Join Operator

2025-05-11 Thread Ron Liu
+1 (binding)

Best,
Ron

Hao Li  于2025年5月10日周六 02:35写道:

> +1 (non-binding)
>
> Thanks,
> Hao
>
> On Fri, May 9, 2025 at 3:41 AM Arvid Heise  wrote:
>
> > +1 (binding)
> >
> > Cheers
> >
> > On Wed, May 7, 2025 at 6:37 PM Gustavo de Morais  >
> > wrote:
> >
> > > Hi everyone,
> > >
> > > I'd like to start voting on FLIP-516: Multi-Way Join Operator [1]. The
> > > discussion can be found in this thread [2].
> > > The vote will be open for at least 72 hours, unless there is an
> objection
> > > or not enough votes.
> > >
> > > Thanks,
> > >
> > > Gustavo de Morais
> > >
> > > [1]
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-516%3A+Multi-Way+Join+Operator
> > >
> > > [2] https://lists.apache.org/thread/9b9cy8o2qjt7w2n7j0g4bbrwvy9n61xv
> > >
> >
>


Re: Re: [DISCUSS] FLIP-486: Introduce a new DeltaJoin

2025-05-07 Thread Ron Liu
Hi, Xuyang

Thanks for initiating this FLIP, which is of great value in solving the
Streaming Join state problem that troubles many users. Big +1 for it.

For the overall design of FLIP, I have the following questions:

1. Can you explain the Join types currently supported in the FLIP, such as
Inner, Left, and Right Join?

2. You mentioned strong consistency semantics and eventual consistency
semantics, and gave the corresponding derivation formulas. Under eventual
consistency semantics,
you mentioned that some intermediate data will be introduced. What confuses
me here is how eventual consistency can be achieved if the Sink does not
support idempotent updates by Join key; after all, the Join operator
outputs some duplicate data.

3. Due to storage reasons, the first step is to only consider supporting
eventual consistency semantics, but Paimon and other lake storages support
Snapshot, which can achieve strong consistency semantics. If strong
consistency is considered in the future, can it be well implemented based
on the current design?

4. Regarding the LRU cache part: if a record in the source table is
updated, how is the cache updated automatically? Otherwise, I feel that
stale data may be read from the cache.

5. In terms of implementation, have you considered batching input records
and then querying them in batches to improve operator performance?

6. Since Flink SQL syntax does not support indexes now, how can you create
indexes on the source table and pass them to Flink? I mean, how can users
use Delta Join?

7. FLIP-516 solves the Streaming Join problem via Multi-Way Join. When both
Join optimizations are enabled, which one will take precedence, Delta Join
or Multi-Way Join? What is the relationship between the two?
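For context on question 7, the kind of query both optimizations target is an ordinary regular streaming join, whose state otherwise grows with both inputs (an illustrative sketch only; the table names are made up, and whether delta join applies depends on the FLIP's source and index requirements):

```sql
-- A regular streaming join: without delta join, both sides are
-- materialized in join state. Delta join would instead look up the
-- opposite source table (via its index) when a record arrives.
SELECT o.order_id, o.amount, s.ship_time
FROM orders AS o
JOIN shipments AS s
  ON o.order_id = s.order_id;
```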


Best,
Ron

Xuyang  于2025年5月7日周三 10:02写道:

> Hi, Weijie.
>
> Thanks for your review! Let me try to answer these questions.
>
> 1. The nodes between source and join must be included in the whitelist,
> such as Calc,
>
> DropUpdateBefore, WatermarkAssigner, Exchange, etc. You can see more in
> the chapter
>
> "Limitations about Delta Join with Eventual Consistency".
>
> 2. Currently, we have only one vague plan. First and foremost, we need to
> ensure that storage
>
> engines like Paimon, Fluss, etc. support snapshot awareness during
> scans and include
>
> snapshots during lookups. Overall, it is likely to be supported in 2.2/2.3
> later...
>
>
>
>
> --
>
> Best!
> Xuyang
>
>
>
>
>
> At 2025-04-29 11:53:48, "weijie guo"  wrote:
> >Hi Xuyang,
> >
> >Thanks for driving this! The state of two-stream join is a very
> >headache-inducing problem in stream computing systems.
> >
> >After reading this FLIP, I have two question:
> >
> >1. Can the two sides of the join operator be any stream or must it be
> >immediately followed by the LookupSource? If it is any stream, how should
> >the stateful operators in the path be handled?
> >2. Are there any plans to support strong consistency in the future? This
> >should be helpful for the scenarios of incremental computing.
> >
> >
> >
> >Best regards,
> >
> >Weijie
> >
> >
> >Xuyang  于2025年4月25日周五 10:52写道:
> >
> >> Hi, devs.
> >>
> >> I'd like to start a discussion on FLIP-486: Introduce a new
> DeltaJoin[1].
> >>
> >>
> >>
> >>
> >> In Flink streaming jobs, the large state of Join nodes has been a
> >> persistent concern for users.
> >>
> >> Since streaming jobs are long-running, the state of join generally
> >> increases in size over time.
> >>
> >> Although users can set the state TTL to mitigate this issue, it is not
> >> applicable to all scenarios
> >>
> >> and does not provide a fundamental solution.
> >>
> >> An oversized state can lead to a series of problems, including but not
> >> limited to:
> >>
> >> 1. Resource bottlenecks in individual tasks
> >>
> >> 2. Slow checkpointing, which affects job stability during the
> >> checkpointing process
> >>
> >> 3. Long recovery time from state
> >>
> >> In addition, when analyzing the join state, we find that in some
> >> scenarios, the state within the join
> >>
> >> actually contains redundant data from the source tables.
> >>
> >>
> >>
> >>
> >> To address this issue, we aim to introduce Delta Join, which is based on
> >> the core idea of leveraging
> >>
> >> a bidirectional lookup join approach to reuse source table data as a
> >> substitute for join state.
> >>
> >>
> >>
> >>
> >> You can find more details in this Flip. I'm looking forward to your
> >> comments and feedback.
> >>
> >>
> >>
> >>
> >> [1]
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-486%3A+Introduce+A+New+DeltaJoin
> >>
> >>
> >>
> >>
> >> --
> >>
> >> Best!
> >> Xuyang
>

