Re: [DISCUSS] FLIP-500: Support Join Extension in DataStream V2 API

Xu Huang Mon, 06 Jan 2025 03:36:00 -0800

Hi, Fabian.

Thanks for your question. Let me elaborate:


First of all, what we provide is not the Join operation of relational
algebra, but the ability to join/combine two data streams. There will
always be cases where the user's inputs cannot be represented simply by
the table abstraction, but where we can still join them on some field of
the records in the streams.
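
To illustrate what "joining two streams on some field of the records" can
look like, here is a minimal plain-Java sketch. It does not use the FLIP-500
or DataStream V2 API. The class SymmetricKeyedJoin, its methods onLeft and
onRight, and the in-memory maps standing in for state are all hypothetical
names chosen for this example. Emitting matches immediately on arrival is
an assumption of the sketch, not a statement about the extension's
semantics.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical illustration only; this is NOT the FLIP-500 / DataStream V2 API.
final class SymmetricKeyedJoin<K, L, R, O> {
    private final Function<L, K> leftKey;      // extracts the join field from a left record
    private final Function<R, K> rightKey;     // extracts the join field from a right record
    private final BiFunction<L, R, O> combine; // builds one output per matching pair

    // Plain maps stand in for keyed state (an assumption of this sketch).
    private final Map<K, List<L>> leftSeen = new HashMap<>();
    private final Map<K, List<R>> rightSeen = new HashMap<>();

    SymmetricKeyedJoin(Function<L, K> leftKey,
                       Function<R, K> rightKey,
                       BiFunction<L, R, O> combine) {
        this.leftKey = leftKey;
        this.rightKey = rightKey;
        this.combine = combine;
    }

    // Buffer a left record, then pair it with every buffered right record of the same key.
    List<O> onLeft(L record) {
        K key = leftKey.apply(record);
        leftSeen.computeIfAbsent(key, k -> new ArrayList<>()).add(record);
        List<O> out = new ArrayList<>();
        for (R other : rightSeen.getOrDefault(key, Collections.emptyList())) {
            out.add(combine.apply(record, other));
        }
        return out;
    }

    // Buffer a right record, then pair it with every buffered left record of the same key.
    List<O> onRight(R record) {
        K key = rightKey.apply(record);
        rightSeen.computeIfAbsent(key, k -> new ArrayList<>()).add(record);
        List<O> out = new ArrayList<>();
        for (L other : leftSeen.getOrDefault(key, Collections.emptyList())) {
            out.add(combine.apply(other, record));
        }
        return out;
    }
}
```

Neither input needs a table schema here: the caller only supplies a key
extractor per side and a function that combines a matching pair. The
buffers in this sketch grow without bound, which is fine for bounded
inputs but would need some cleanup policy on unbounded streams.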

Secondly, what we provide is just an extension and not part of the
base/primitive semantics of DataStream V2 (see the sections "Fundamental
primitives" and "High-Level Extensions" in FLIP-408 [1]). I don't think
this will confuse users. Providing such an optional extension is more like
syntactic sugar around the base API: even if we don't provide it, users
will find a way to do something similar.


Finally, the deprecated Flink DataSet API does support Join, and
DataStream V2 is meant to carry over the capabilities of both DataStream V1
and DataSet. So we thought it would make sense to provide this capability
in the form of an extension rather than a base API.


Best

Xu Huang

[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=284789682#FLIP408:[Umbrella]IntroduceDataStreamAPIV2-Fundamentalprimitives

Fabian Paul <[email protected]> wrote on Mon, Jan 6, 2025 at 16:59:

> Hi Xu,
>
> Thanks for drafting the FLIP. I have a question regarding the
> motivation for the change.
>
> So far, the datastream API doesn't support any relational operations,
> and if users want to use joins/groupBy etc., they usually use SQL or
> the table API.
> With this FLIP, the difference between the APIs becomes more blurry,
> and users might be confused about the different join semantics of the
> API layers.
>
> I would like to know the arguments behind introducing the statements
> to the datastream API against using the table API. It should also be
> part of the FLIP to understand better how this change helps Flink
> users.
>
> Best,
> Fabian
>
>
> On Mon, Jan 6, 2025 at 9:25 AM Junrui Lee <[email protected]> wrote:
> >
> > Hi Xu,
> >
> > Thanks for your work. I have a small question: In JoinExtension, when a
> > record is received, is it immediately joined with all the data received
> > from the other side, or does it wait until both streams are finished
> > before joining? Also, does it work on both bounded and unbounded streams?
> >
> > Xu Huang <[email protected]> wrote on Sat, Jan 4, 2025 at 11:50:
> >
> > > Hi Devs,
> > >
> > > Weijie Guo and I would like to initiate a discussion about FLIP-500:
> > > Support Join Extension in DataStream V2 API [1].
> > >
> > > In relational algebra, Join is used to co-group two datasets and combine
> > > the data based on specific conditions. For stream computing systems, the
> > > data of the two streams is cached (usually through State) when the Join
> > > operation is performed. When data from either stream arrives, it can be
> > > matched with data from the other stream. Therefore, Join has been widely
> > > used in multi-stream aggregate scenarios.
> > >
> > > To make it easy for users to use Join in DataStream V2, this FLIP will
> > > implement the Join extension in DataStream V2.
> > >
> > > For more details, please refer to FLIP-500 [1]. We look forward to your
> > > feedback.
> > >
> > >
> > > Best,
> > >
> > > Xu Huang
> > >
> > >
> > > [1] https://cwiki.apache.org/confluence/x/ywz0Ew
> > >
>
