swuferhong commented on code in PR #20513: URL: https://github.com/apache/flink/pull/20513#discussion_r946257621
########## docs/content/docs/dev/table/sql/queries/hints.md: ########## @@ -84,4 +84,222 @@ insert into kafka_table1 /*+ OPTIONS('sink.partitioner'='round-robin') */ select ``` +## Join Hints + +{{< label Batch >}} + +Join hints allow users to manually choose the join strategy in optimizer, and manually optimize the execution plan to improve the performance of the query. Now only support in batch mode. + +### Join Hints Strategies +The following Join strategies are currently supported in Flink SQL for batch job: + + +**BroadCast Join** + + +In this Join strategy, the data on the build side (usually a small table) will be broadcast to each downstream operator, +and the data on the probe side (usually a large table) will be sent directly to the downstream operator with Forward. +Then the data on the build side (small table) will be built into a Hash Table for the probe side to query. + +**Hash Shuffle Join** + + +In this Join strategy, the data on the Build side and the Probe side will be shuffled with the join key, and the data +with the same key will be distributed to the same downstream operator. Then the data on the build side (smaller table) +will be built into a Hash Table for the probe side to query. + + +Compared with the `Hash Shuffle Join` strategy, the `BroadCast Join` strategy does not need to shuffle the probe side, which saves a lot of +shuffle time. Therefore, when a table is tiny, the `BroadCast Join` strategy is usually selected to avoid the shuffle +cost and improve computing performance. However, when the scale of the small table is large, the `BroadCast Join` strategy +is not applicable, because the overhead of redundant data transmission will exceed the overhead of shuffle. + +**Sort Merge Join** + +This Join strategy is aimed at the scenario of joining between two large tables or the scenario that the data at both +sides of the join is already in order. This strategy first shuffles the data on both sides of the join to the downstream +operator according to the Join Key. Then the downstream operator sorts the data before joining, and finally joins the data at both ends. +This strategy eliminates the need to load all data on one side into memory, thus relieving the pressure on computational memory. + +**Nested Loop Join** + +In this Join strategy, the probe side is used as an Outer Loop, and the build side is used as an Inner Loop, and the data is joined through two-layer loops. + +### Syntax +The join hints syntax in Flink: +```sql +# query: +SELECT /*+ hint_content[, hint_content] */ ... + +hint_content: + hint_strategy_name(hint_item[, hint_item]) + +hint_strategy_name: + supported_join_hint_name + +hint_item: + string_literal +``` + +### Examples + +**BroadCast Join** + + +*NOTE:* `BroadCast Join` strategy only supports join with equivalence join condition. And it not supports `Full Outer Join` join type. +```sql +CREATE TABLE t1 (id BIGINT, name STRING, age INT) WITH (...); +CREATE TABLE t2 (id BIGINT, name STRING, age INT) WITH (...); +CREATE TABLE t3 (id BIGINT, name STRING, age INT) WITH (...); + +-- Select broadcast join strategy with t1, t1 will be the broadcast table. +SELECT /*+ BROADCAST(t1) */ * FROM t1 JOIN t2 ON t1.id = t2.id; + +-- Select broadcast join strategy with t1 and t3. +-- For this case, when join between t1 and t2, t1 will be the broadcast table, +-- after that, when join between the result after t1 joins t2 and t3, t3 will be the broadcast table. +SELECT /*+ BROADCAST(t1, t3) */ * FROM t1 JOIN t2 ON t1.id = t2.id JOIN t3 ON t1.id = t3.id; + +-- BroadCast join strategy don't support non-equivalent join conditions. +-- For this case, default join strategy will be adopted instead of broadCast join strategy. +SELECT /*+ BROADCAST(t1) */ * FROM t1 join t2 ON t1.id > t2.id; + +-- BroadCast join strategy don't support full outer join type. Review Comment: > 'type' can be removed done! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org