swuferhong commented on code in PR #20513: URL: https://github.com/apache/flink/pull/20513#discussion_r947390429
########## docs/content/docs/dev/table/sql/queries/hints.md: ########## @@ -84,4 +84,222 @@ insert into kafka_table1 /*+ OPTIONS('sink.partitioner'='round-robin') */ select ``` +## Join Hints + +{{< label Batch >}} + +Join hints allow users to manually choose the join strategy in optimizer, and manually optimize the execution plan to improve the performance of the query. Now only support in batch mode. + +### Join Hints Strategies +The following Join strategies are currently supported in Flink SQL for batch job: + + +**BroadCast Join** + + +In this Join strategy, the data on the build side (usually a small table) will be broadcast to each downstream operator, +and the data on the probe side (usually a large table) will be sent directly to the downstream operator with Forward. +Then the data on the build side (small table) will be built into a Hash Table for the probe side to query. + +**Hash Shuffle Join** + + +In this Join strategy, the data on the Build side and the Probe side will be shuffled with the join key, and the data +with the same key will be distributed to the same downstream operator. Then the data on the build side (smaller table) +will be built into a Hash Table for the probe side to query. + + +Compared with the `Hash Shuffle Join` strategy, the `BroadCast Join` strategy does not need to shuffle the probe side, which saves a lot of +shuffle time. Therefore, when a table is tiny, the `BroadCast Join` strategy is usually selected to avoid the shuffle +cost and improve computing performance. However, when the scale of the small table is large, the `BroadCast Join` strategy +is not applicable, because the overhead of redundant data transmission will exceed the overhead of shuffle. + +**Sort Merge Join** + +This Join strategy is aimed at the scenario of joining between two large tables or the scenario that the data at both +sides of the join is already in order. This strategy first shuffles the data on both sides of the join to the downstream +operator according to the Join Key. Then the downstream operator sorts the data before joining, and finally joins the data at both ends. +This strategy eliminates the need to load all data on one side into memory, thus relieving the pressure on computational memory. + +**Nested Loop Join** + +In this Join strategy, the probe side is used as an Outer Loop, and the build side is used as an Inner Loop, and the data is joined through two-layer loops. + +### Syntax Review Comment: > > Can we move the syntax to a section alongside the 'Dynamic Table Options', name it 'Query Hints', and make 'Join Hints' as a child section (explain that the join hints syntax follows the query hints) > > +1 done! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org