reswqa commented on code in PR #26057: URL: https://github.com/apache/flink/pull/26057#discussion_r1926520983
##########
docs/content/docs/dev/datastream-v2/building_blocks.md:
##########
@@ -0,0 +1,215 @@
+---
+title: Building Blocks
+weight: 3
+type: docs
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Building Blocks
+
+DataStream, Partitioning, ProcessFunction are the most fundamental elements of DataStream API and respectively represent:
+
+- What are the types of data streams
+
+- How data is partitioned
+
+- How to perform operations / processing on data streams
+
+They are also the core parts of the fundamental primitives provided by DataStream API.
+
+## DataStream
+
+Data flows on the stream may be divided into multiple partitions. According to how the data is partitioned on the stream, we divide it into the following categories:
+
+- Global Stream: Force single partition/parallelism, and the correctness of data depends on this.
+
+- Partition Stream:
+  - Divide data into multiple partitions. State is only available within the partition. One partition can only be processed by one task, but one task can handle one or multiple partitions.
+According to the partitioning approach, it can be further divided into the following two categories:
+
+    - Keyed Partition Stream: Each key is a partition, and the partition to which the data belongs is deterministic.
+
+    - Non-Keyed Partition Stream: Each parallelism is a partition, and the partition to which the data belongs is nondeterministic.
+
+- Broadcast Stream: Each partition contains the same data.
+
+## Partitioning
+
+Above we defined the stream and how it is partitioned. The next topic to discuss is how to convert
+between different partition types. We call these transformations partitioning.
+For example non-keyed partition stream can be transformed to keyed partition stream via a `KeyBy` partitioning.
+
+```java
+NonKeyedPartitionStream<Tuple<Integer, String>> stream = ...;
+KeyedPartitionStream<Integer, String> keyedStream = stream.keyBy(record -> record.f0);
+```
+
+Overall, we have the following four partitioning:
+
+- KeyBy: Let all data be repartitioned according to specified key.
+
+- Shuffle: Repartition and shuffle all data.
+
+- Global: Merge all partitions into one.
+
+- Broadcast: Force partitions broadcast data to downstream.
+
+The specific transformation relationship is shown in the following table:
+
+| Partitioning | Global | Keyed | NonKeyed | Broadcast |
+|:------------:|:------:|:-----:|:--------:|:-------------------:|
+| Global | ❎ | KeyBy | Shuffle | Broadcast |
+| Keyed | Global | KeyBy | Shuffle | Broadcast |
+| NonKeyed | Global | KeyBy | Shuffle | Broadcast |
+| Broadcast | ❎ | ❎ | ❎ | ❎ |
+
+(A crossed box indicates that it is not supported or not required)
+
+One thing to note is: broadcast can only be used in conjunction with other inputs and cannot be directly converted to other streams.
+
+## ProcessFunction
+Once we have the data stream, we can apply operations on it. The operations that can be performed over
+DataStream are collectively called Process Function. It is the only entrypoint for defining all kinds
+of processing on the data streams.
+
+### Classification of ProcessFunction
+According to the number of input / output, they are classified as follows:
+
+| Partitioning | Number of Inputs | Number of Outputs |
+|:-----------------------------------------:|:----------------------:|:-------------------------------:|
+| OneInputStreamProcessFunction | 1 | 1 |
+| TwoInputNonBroadcastStreamProcessFunction | 2 | 1 |
+| TwoInputBroadcastStreamProcessFunction | 2 | 1 |
+| TwoOutputStreamProcessFunction | 1 | 2 |
+
+Logically, process functions that support more inputs and outputs can be achieved by combining them,
+but this implementation might be inefficient. If the call for this becomes louder,
+we will consider supporting as many output edges as we want through a mechanism like OutputTag.
+But this loses the explicit generic type information that comes with using ProcessFunction.

Review Comment:
   removed.