Re: [DISCUSS] FLIP-500: Support Join Extension in DataStream V2 API

Timo Walther Tue, 07 Jan 2025 02:38:10 -0800

Hi Huang,

I agree with Fabian that we should be careful with adding features toDataStream API V2. Providing it as an extension rather than a base APIis definitely a good direction. Prepared built-in functions andalgorithms should rather go into a lib module.

Nevertheless, the DataStream API should focus on the primitives forstream processing. I'm not sure if the join implementation in DataStreamAPI V1 was ever useful. Instead of digging into details of the joinimplementation or custom triggers, the past has shown that it is ofteneasier for users to just understand ProcessFunction and configure it theway they like. Streaming joins are usually highly customized and usecase specific.


Regards,
Timo




On 06.01.25 12:35, Xu Huang wrote:

Hi, Fabian.

Thanks for your question, let me elaborate:

First of all what we provide is not Join operation in relational algebra,
but the ability to join/combine two data streams. There will always be
cases where the user's inputs cannot be represented simply by the table
abstraction, but we can join them by some field of record on the streams.

Secondly, what we provide is just an extension and not the base/primitive
semantics of DataStream V2(see the selection of "Fundamental primitives"
and "High-Level Extensions" in FLIP-408[1]). I don't think this will
confuse users, providing such an optional extension is more like syntactic
sugar around the base api for users, even if we don't provide it, users
will find a way to do something similar.


Finally, the deprecated Flink DataSet API does support Join, DataStream V2
carries the ability of DataStream V1 and DataSet. So we thought it would
make sense to provide this capability in the form of an extension rather
than a base API.


Best

Xu Huang

[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=284789682#FLIP408:[Umbrella]IntroduceDataStreamAPIV2-Fundamentalprimitives

Fabian Paul <fp...@apache.org> 于2025年1月6日周一 16:59写道：

Hi Xu,

Thanks for drafting the FLIP. I have a question regarding the
motivation for the change.

So far, the datastream API doesn't support any relational operations,
and if users want to use joins/groupBy etc., they usually use SQL or
the table API.
With this FLIP, the difference between the APIs becomes more blurry,
and users might be confused about the different join semantics of the
API layers.

I would like to know the arguments behind introducing the statements
to the datastream API against using the table API. It should also be
part of the FLIP to understand better how this change helps Flink
users.

Best,
Fabian


On Mon, Jan 6, 2025 at 9:25 AM Junrui Lee <jrlee....@gmail.com> wrote:


Hi Xu,

Thanks for your work. I have a small question: In JoinExtension, when a
record is received, is it immediately joined with all the data received
from the other side, or does it wait until both streams are finished

before

joining? Also, does it work on both bounded and unbounded streams?

Xu Huang <huangxu.wal...@gmail.com> 于2025年1月4日周六 11:50写道：

Hi Devs,

Weijie Guo and I would like to initiate a discussion about FLIP-500:
Support Join Extension in DataStream V2 API [1].

In relational algebra, Join are used to co-group two datasets and

combine

the data based on specific conditions. For stream computing systems,

the

data of the two streams is cached (usually through State) when the Join
operation is performed. When data from either stream arrives, it can be
matched with data from another stream. Therefore, Join has been widely

used

in multi-stream aggregate scenarios.

To make it easy for users to use Join in DataStream V2, this FLIP will
implement the Join extension in DataStream V2.

For more details, please refer to FLIP-500 [1]. We look forward to your
feedback.


Best,

Xu Huang


[1] https://cwiki.apache.org/confluence/x/ywz0Ew

Re: [DISCUSS] FLIP-500: Support Join Extension in DataStream V2 API

Reply via email to