[GitHub] [flink] swuferhong commented on a diff in pull request #20513: [FLINK-28858][docs] Add document to describe join hints for batch sql

GitBox Tue, 16 Aug 2022 19:20:30 -0700


swuferhong commented on code in PR #20513:
URL: https://github.com/apache/flink/pull/20513#discussion_r947390429



##########
docs/content/docs/dev/table/sql/queries/hints.md:
##########
@@ -84,4 +84,222 @@ insert into kafka_table1 /*+ 
OPTIONS('sink.partitioner'='round-robin') */ select
 
 ```
 
+## Join Hints
+
+{{< label Batch >}}
+
+Join hints allow users to manually choose the join strategy in optimizer, and 
manually optimize the execution plan to improve the performance of the query. 
Now only support in batch mode.
+
+### Join Hints Strategies
+The following Join strategies are currently supported in Flink SQL for batch 
job:
+
+
+**BroadCast Join**
+
+
+In this Join strategy, the data on the build side (usually a small table) will 
be broadcast to each downstream operator, 
+and the data on the probe side (usually a large table) will be sent directly 
to the downstream operator with Forward. 
+Then the data on the build side (small table) will be built into a Hash Table 
for the probe side to query.
+
+**Hash Shuffle Join**
+
+
+In this Join strategy, the data on the Build side and the Probe side will be 
shuffled with the join key, and the data 
+with the same key will be distributed to the same downstream operator. Then 
the data on the build side (smaller table) 
+will be built into a Hash Table for the probe side to query.
+
+
+Compared with the `Hash Shuffle Join` strategy, the `BroadCast Join` strategy 
does not need to shuffle the probe side, which saves a lot of 
+shuffle time. Therefore, when a table is tiny, the `BroadCast Join` strategy 
is usually selected to avoid the shuffle 
+cost and improve computing performance. However, when the scale of the small 
table is large, the `BroadCast Join` strategy 
+is not applicable, because the overhead of redundant data transmission will 
exceed the overhead of shuffle.
+
+**Sort Merge Join**
+
+This Join strategy is aimed at the scenario of joining between two large 
tables or the scenario that the data at both 
+sides of the join is already in order. This strategy first shuffles the data 
on both sides of the join to the downstream 
+operator according to the Join Key. Then the downstream operator sorts the 
data before joining, and finally joins the data at both ends. 
+This strategy eliminates the need to load all data on one side into memory, 
thus relieving the pressure on computational memory.
+
+**Nested Loop Join**
+
+In this Join strategy, the probe side is used as an Outer Loop, and the build 
side is used as an Inner Loop, and the data is joined through two-layer loops.
+
+### Syntax

Review Comment:
   > > Can we move the syntax to a section alongside the 'Dynamic Table 
Options', name it 'Query Hints', and make 'Join Hints' as a child section 
(explain that the join hints syntax follows the query hints)
   > 
   > +1
   
   done!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [flink] swuferhong commented on a diff in pull request #20513: [FLINK-28858][docs] Add document to describe join hints for batch sql

Reply via email to