Hi Huang,

I agree with Fabian that we should be careful with adding features to DataStream API V2. Providing it as an extension rather than a base API is definitely a good direction. Prepared built-in functions and algorithms should rather go into a lib module.

Nevertheless, the DataStream API should focus on the primitives for stream processing. I'm not sure if the join implementation in DataStream API V1 was ever useful. Instead of digging into details of the join implementation or custom triggers, the past has shown that it is often easier for users to just understand ProcessFunction and configure it the way they like. Streaming joins are usually highly customized and use case specific.

Regards,
Timo




On 06.01.25 12:35, Xu Huang wrote:
Hi, Fabian.

Thanks for your question, let me elaborate:

First of all what we provide is not Join operation in relational algebra,
but the ability to join/combine two data streams. There will always be
cases where the user's inputs cannot be represented simply by the table
abstraction, but we can join them by some field of record on the streams.

Secondly, what we provide is just an extension and not the base/primitive
semantics of DataStream V2(see the selection of "Fundamental primitives"
and "High-Level Extensions" in FLIP-408[1]). I don't think this will
confuse users, providing such an optional extension is more like syntactic
sugar around the base api for users, even if we don't provide it, users
will find a way to do something similar.


Finally, the deprecated Flink DataSet API does support Join, DataStream V2
carries the ability of DataStream V1 and DataSet. So we thought it would
make sense to provide this capability in the form of an extension rather
than a base API.


Best

Xu Huang

[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=284789682#FLIP408:[Umbrella]IntroduceDataStreamAPIV2-Fundamentalprimitives

Fabian Paul <fp...@apache.org> 于2025年1月6日周一 16:59写道:

Hi Xu,

Thanks for drafting the FLIP. I have a question regarding the
motivation for the change.

So far, the datastream API doesn't support any relational operations,
and if users want to use joins/groupBy etc., they usually use SQL or
the table API.
With this FLIP, the difference between the APIs becomes more blurry,
and users might be confused about the different join semantics of the
API layers.

I would like to know the arguments behind introducing the statements
to the datastream API against using the table API. It should also be
part of the FLIP to understand better how this change helps Flink
users.

Best,
Fabian


On Mon, Jan 6, 2025 at 9:25 AM Junrui Lee <jrlee....@gmail.com> wrote:

Hi Xu,

Thanks for your work. I have a small question: In JoinExtension, when a
record is received, is it immediately joined with all the data received
from the other side, or does it wait until both streams are finished
before
joining? Also, does it work on both bounded and unbounded streams?

Xu Huang <huangxu.wal...@gmail.com> 于2025年1月4日周六 11:50写道:

Hi Devs,

Weijie Guo and I would like to initiate a discussion about FLIP-500:
Support Join Extension in DataStream V2 API [1].

In relational algebra, Join are used to co-group two datasets and
combine
the data based on specific conditions. For stream computing systems,
the
data of the two streams is cached (usually through State) when the Join
operation is performed. When data from either stream arrives, it can be
matched with data from another stream. Therefore, Join has been widely
used
in multi-stream aggregate scenarios.

To make it easy for users to use Join in DataStream V2, this FLIP will
implement the Join extension in DataStream V2.

For more details, please refer to FLIP-500 [1]. We look forward to your
feedback.


Best,

Xu Huang


[1] https://cwiki.apache.org/confluence/x/ywz0Ew




Reply via email to