Re: [PR] [WIP][FLINK-37148][docs] Add documents for DataStream V2 [flink]

via GitHub Wed, 22 Jan 2025 23:57:34 -0800


reswqa commented on code in PR #26057:
URL: https://github.com/apache/flink/pull/26057#discussion_r1926520983



##########
docs/content/docs/dev/datastream-v2/building_blocks.md:
##########
@@ -0,0 +1,215 @@
+---
+title: Building Blocks
+weight: 3
+type: docs
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Building Blocks
+
+DataStream, Partitioning, ProcessFunction are the most fundamental elements of 
DataStream API and respectively represent:
+
+- What are the types of data streams
+
+- How data is partitioned
+
+- How to perform operations / processing on data streams
+
+They are also the core parts of the fundamental primitives provided by 
DataStream API.
+
+## DataStream
+
+Data flows on the stream may be divided into multiple partitions. According to 
how the data is partitioned on the stream, we divide it into the following 
categories:
+
+- Global Stream: Force single partition/parallelism, and the correctness of 
data depends on this.
+
+- Partition Stream:
+  - Divide data into multiple partitions. State is only available within the 
partition. One partition can only be processed by one task, but one task can 
handle one or multiple partitions.
+According to the partitioning approach, it can be further divided into the 
following two categories:
+
+    - Keyed Partition Stream: Each key is a partition, and the partition to 
which the data belongs is deterministic.
+
+    - Non-Keyed Partition Stream: Each parallelism is a partition, and the 
partition to which the data belongs is nondeterministic.
+
+- Broadcast Stream: Each partition contains the same data.
+
+## Partitioning
+
+Above we defined the stream and how it is partitioned. The next topic to 
discuss is how to convert
+between different partition types. We call these transformations partitioning.
+For example non-keyed partition stream can be transformed to keyed partition 
stream via a `KeyBy` partitioning.
+
+```java
+NonKeyedPartitionStream<Tuple<Integer, String>> stream = ...;
+KeyedPartitionStream<Integer, String> keyedStream = stream.keyBy(record -> 
record.f0);
+```
+
+Overall, we have the following four partitioning:
+ 
+- KeyBy: Let all data be repartitioned according to specified key.
+
+- Shuffle: Repartition and shuffle all data.
+
+- Global: Merge all partitions into one. 
+
+- Broadcast: Force partitions broadcast data to downstream.
+
+The specific transformation relationship is shown in the following table:
+
+| Partitioning | Global | Keyed | NonKeyed |      Broadcast      |
+|:------------:|:------:|:-----:|:--------:|:-------------------:|
+|    Global    |   ❎    | KeyBy | Shuffle  |      Broadcast     |
+|    Keyed     | Global | KeyBy | Shuffle  |      Broadcast      |
+|   NonKeyed   | Global | KeyBy | Shuffle  |      Broadcast      |
+|  Broadcast   |   ❎   |  ❎   |    ❎    |          ❎        |
+
+(A crossed box indicates that it is not supported or not required)
+
+One thing to note is: broadcast can only be used in conjunction with other 
inputs and cannot be directly converted to other streams.
+
+## ProcessFunction
+Once we have the data stream, we can apply operations on it. The operations 
that can be performed over
+DataStream are collectively called Process Function. It is the only entrypoint 
for defining all kinds
+of processing on the data streams.
+
+### Classification of ProcessFunction
+According to the number of input / output, they are classified as follows:
+
+|               Partitioning                |    Number of Inputs    |        
Number of Outputs        |
+|:-----------------------------------------:|:----------------------:|:-------------------------------:|
+|    OneInputStreamProcessFunction          |           1            |         
       1                |
+| TwoInputNonBroadcastStreamProcessFunction |           2            |         
       1                |
+|  TwoInputBroadcastStreamProcessFunction   |           2            |         
       1                |
+|      TwoOutputStreamProcessFunction       |           1            |         
       2                |
+
+Logically, process functions that support more inputs and outputs can be 
achieved by combining them, 
+but this implementation might be inefficient. If the call for this becomes 
louder, 
+we will consider supporting as many output edges as we want through a 
mechanism like OutputTag.
+But this loses the explicit generic type information that comes with using 
ProcessFunction.

Review Comment:
   removed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Re: [PR] [WIP][FLINK-37148][docs] Add documents for DataStream V2 [flink]

Reply via email to