swuferhong commented on code in PR #20513: URL: https://github.com/apache/flink/pull/20513#discussion_r946258022
########## docs/content/docs/dev/table/sql/queries/hints.md: ########## @@ -84,4 +84,219 @@ insert into kafka_table1 /*+ OPTIONS('sink.partitioner'='round-robin') */ select ``` +## Join Hints + +{{< label Batch >}} + +Join hints allow users to suggest the join strategy to optimizer in order to get a more optimal execution plan. +Currently, only supported in batch mode. + +### Supported Join Hints +The following Join strategies are currently supported in Flink SQL for batch mode: + + +**BroadCast Join** + +In this Join strategy, the data on the build side (ideally the smaller table) will be broadcast to each downstream operator, +and the data on the probe side (usually a large table) will be sent directly to the downstream operator with Forward. +Then the data on the build side (small table) will be built into a Hash Table for the probe side to query. + +**Hash Shuffle Join** + +In this Join strategy, the data on the Build side and the Probe side will be shuffled with the join key, and the data +with the same key will be distributed to the same downstream operator. Then the data on the build side (smaller table) +will be built into a Hash Table for the probe side to query. + + +Compared with the `Hash Shuffle Join` strategy, the `BroadCast Join` strategy does not need to shuffle the probe side, which saves a lot of +shuffle time. Therefore, when a table is tiny, the `BroadCast Join` strategy is usually selected to avoid the shuffle +cost and improve computing performance. However, when the scale of the small table is large, the `BroadCast Join` strategy +is not applicable, because the overhead of redundant data transmission will exceed the overhead of shuffle. + +**Sort Merge Join** + +This Join strategy is aimed at the scenario of joining between two large tables or the scenario that the data at both +sides of the join is already in order. This strategy first shuffles the data on both sides of the join to the downstream +operator according to the Join Key. Then the downstream operator sorts the data before joining, and finally joins the data at both ends. +This strategy eliminates the need to load all data on one side into memory, thus relieving the pressure on computational memory. + +**Nested Loop Join** + +In this Join strategy, the probe side is used as an Outer Loop, and the build side is used as an Inner Loop, and the data is joined through two-layer loops. + +### Syntax +The join hints syntax in Flink: +```sql +# query: +SELECT /*+ hint_content[, hint_content] */ ... + +hint_content: + hint_strategy_name(hint_item[, hint_item]) + +hint_strategy_name: + supported_join_hint_name + +hint_item: + string_literal +``` + +### Examples + +**BroadCast Join** + +*NOTE:* `BroadCast Join` strategy only supports join with equivalence join condition. And it not supports `Full Outer Join` join type. +```sql +CREATE TABLE t1 (id BIGINT, name STRING, age INT) WITH (...); +CREATE TABLE t2 (id BIGINT, name STRING, age INT) WITH (...); +CREATE TABLE t3 (id BIGINT, name STRING, age INT) WITH (...); + +-- Select broadcast join strategy with t1, t1 will be the broadcast table. +SELECT /*+ BROADCAST(t1) */ * FROM t1 JOIN t2 ON t1.id = t2.id; + +-- Select broadcast join strategy with t1 and t3. +-- For this case, when join between t1 and t2, t1 will be the broadcast table, +-- after that, when join between the result after t1 joins t2 and t3, t3 will be the broadcast table. +SELECT /*+ BROADCAST(t1, t3) */ * FROM t1 JOIN t2 ON t1.id = t2.id JOIN t3 ON t1.id = t3.id; + +-- BroadCast join strategy don't support non-equivalent join conditions. +-- For this case, default join strategy will be adopted instead of broadCast join strategy. +SELECT /*+ BROADCAST(t1) */ * FROM t1 join t2 ON t1.id > t2.id; + +-- BroadCast join strategy don't support full outer join type. +-- For this case, default join strategy will be adopted instead of broadCast join strategy. +SELECT /*+ BROADCAST(t1) */ * FROM t1 FULL OUTER JOIN t2 ON t1.id = t2.id; +``` + + +**Hash Shuffle Join** + +*NOTE:* `Hash Shuffle Join` strategy only supports join with equivalence join condition. +```sql +CREATE TABLE t1 (id BIGINT, name STRING, age INT) WITH (...); +CREATE TABLE t2 (id BIGINT, name STRING, age INT) WITH (...); +CREATE TABLE t3 (id BIGINT, name STRING, age INT) WITH (...); + +-- Select hash shuffle join strategy with t1, t1 will be the build side. +SELECT /*+ SHUFFLE_HASH(t1) */ * FROM t1 JOIN t2 ON t1.id = t2.id; + +-- Select hash shuffle join strategy with t1 and t3. +-- For this case, when join between t1 and t2, t1 will be the build side, +-- after that, when join between the result after t1 joins t2 and t3, t3 will be the build side. +SELECT /*+ SHUFFLE_HASH(t1, t3) */ * FROM t1 JOIN t2 ON t1.id = t2.id JOIN t3 ON t1.id = t3.id; + +-- Hash shuffle join strategy don't support non-equivalent join conditions. +-- For this case, default join strategy will be adopted instead of hash shuffle join strategy. +SELECT /*+ SHUFFLE_HASH(t1) */ * FROM t1 join t2 ON t1.id > t2.id; +``` + +**Sort Merge Join** + + +*NOTE:* `Sort Merge Join` strategy only supports join with equivalence join condition. +```sql +CREATE TABLE t1 (id BIGINT, name STRING, age INT) WITH (...); +CREATE TABLE t2 (id BIGINT, name STRING, age INT) WITH (...); +CREATE TABLE t3 (id BIGINT, name STRING, age INT) WITH (...); + +-- Sort merge join is adopted. +SELECT /*+ SHUFFLE_MERGE(t1) */ * FROM t1 JOIN t2 ON t1.id = t2.id; + +-- Sort merge join is both adopted in these two joins. +SELECT /*+ SHUFFLE_MERGE(t1, t3) */ * FROM t1 JOIN t2 ON t1.id = t2.id JOIN t3 ON t1.id = t3.id; + +-- Sort merge join strategy don't support non-equivalent join conditions. +-- For this case, default join strategy will be adopted instead of sort merge join strategy. +SELECT /*+ SHUFFLE_MERGE(t1) */ * FROM t1 join t2 ON t1.id > t2.id; +``` + + +**Nested Loop Join** + +*NOTE:* `Nest Loop Join` support both equivalent and non-equivalent join condition. +```sql +CREATE TABLE t1 (id BIGINT, name STRING, age INT) WITH (...); +CREATE TABLE t2 (id BIGINT, name STRING, age INT) WITH (...); +CREATE TABLE t3 (id BIGINT, name STRING, age INT) WITH (...); + +-- Select nest loop join strategy with t1, t1 will be the build side. +SELECT /*+ NEST_LOOP(t1) */ * FROM t1 JOIN t2 ON t1.id = t2.id; + +-- Select nest loop join strategy with t1 and t3. +-- For this case, when join between t1 and t2, t1 will be the build side, +-- after that, when join between the result after t1 joins t2 and t3, t3 will be the build side. +SELECT /*+ NEST_LOOP(t1, t3) */ * FROM t1 JOIN t2 ON t1.id = t2.id JOIN t3 ON t1.id = t3.id; +``` + +**Conflict Cases In Join Hints** + +For some cases, join hints will conflict, we formulate the behavior of join hint in Flink for the following conditions: + - View + - Sub query with table alias + - Conflict in one same join hint strategy + - Conflict in different join hint strategies +```sql +CREATE TABLE t1 (id BIGINT, name STRING, age INT) WITH (...); +CREATE TABLE t2 (id BIGINT, name STRING, age INT) WITH (...); +CREATE TABLE t3 (id BIGINT, name STRING, age INT) WITH (...); + + + +-- View Case + +CREATE VIEW view1 AS SELECT t1.* FROM t1 JOIN ON t2 ON t1.id = t2.id; + +-- For this view case, broadCast join strategy will not be propagated into view1, and join inside view1 will use default join strategy. +SELECT /*+ BROADCAST(t1) */ * FROM view1 JOIN t3 ON view1.id = t3.id; + +-- For this view case, the join inside view1 will use default join strategy, and join between view1 and t1 will use broadCast join strategy. +SELECT /*+ BROADCAST(t1) */ * FROM view1 JOIN t1 ON view1.id = t1.id; + + + +-- Sub Query with Table Alias Case + +-- For this alias case, an exception that the alias_table not existent will be thrown. +SELECT /*+ BROADCAST(alias_table) */ * FROM (SELECT * FROM t1) alias_table JOIN t2 on alias_table.id = t2.id; Review Comment: > currently, join hint with query-block alias is support Ok ########## docs/content/docs/dev/table/sql/queries/hints.md: ########## @@ -84,4 +84,219 @@ insert into kafka_table1 /*+ OPTIONS('sink.partitioner'='round-robin') */ select ``` +## Join Hints + +{{< label Batch >}} + +Join hints allow users to suggest the join strategy to optimizer in order to get a more optimal execution plan. +Currently, only supported in batch mode. + +### Supported Join Hints +The following Join strategies are currently supported in Flink SQL for batch mode: + + +**BroadCast Join** + +In this Join strategy, the data on the build side (ideally the smaller table) will be broadcast to each downstream operator, +and the data on the probe side (usually a large table) will be sent directly to the downstream operator with Forward. +Then the data on the build side (small table) will be built into a Hash Table for the probe side to query. + +**Hash Shuffle Join** + +In this Join strategy, the data on the Build side and the Probe side will be shuffled with the join key, and the data +with the same key will be distributed to the same downstream operator. Then the data on the build side (smaller table) +will be built into a Hash Table for the probe side to query. + + +Compared with the `Hash Shuffle Join` strategy, the `BroadCast Join` strategy does not need to shuffle the probe side, which saves a lot of +shuffle time. Therefore, when a table is tiny, the `BroadCast Join` strategy is usually selected to avoid the shuffle +cost and improve computing performance. However, when the scale of the small table is large, the `BroadCast Join` strategy +is not applicable, because the overhead of redundant data transmission will exceed the overhead of shuffle. + +**Sort Merge Join** + +This Join strategy is aimed at the scenario of joining between two large tables or the scenario that the data at both +sides of the join is already in order. This strategy first shuffles the data on both sides of the join to the downstream +operator according to the Join Key. Then the downstream operator sorts the data before joining, and finally joins the data at both ends. +This strategy eliminates the need to load all data on one side into memory, thus relieving the pressure on computational memory. + +**Nested Loop Join** + +In this Join strategy, the probe side is used as an Outer Loop, and the build side is used as an Inner Loop, and the data is joined through two-layer loops. + +### Syntax +The join hints syntax in Flink: +```sql +# query: +SELECT /*+ hint_content[, hint_content] */ ... + +hint_content: + hint_strategy_name(hint_item[, hint_item]) + +hint_strategy_name: + supported_join_hint_name + +hint_item: + string_literal +``` + +### Examples + +**BroadCast Join** + +*NOTE:* `BroadCast Join` strategy only supports join with equivalence join condition. And it not supports `Full Outer Join` join type. +```sql +CREATE TABLE t1 (id BIGINT, name STRING, age INT) WITH (...); +CREATE TABLE t2 (id BIGINT, name STRING, age INT) WITH (...); +CREATE TABLE t3 (id BIGINT, name STRING, age INT) WITH (...); + +-- Select broadcast join strategy with t1, t1 will be the broadcast table. +SELECT /*+ BROADCAST(t1) */ * FROM t1 JOIN t2 ON t1.id = t2.id; + +-- Select broadcast join strategy with t1 and t3. +-- For this case, when join between t1 and t2, t1 will be the broadcast table, +-- after that, when join between the result after t1 joins t2 and t3, t3 will be the broadcast table. +SELECT /*+ BROADCAST(t1, t3) */ * FROM t1 JOIN t2 ON t1.id = t2.id JOIN t3 ON t1.id = t3.id; + +-- BroadCast join strategy don't support non-equivalent join conditions. +-- For this case, default join strategy will be adopted instead of broadCast join strategy. +SELECT /*+ BROADCAST(t1) */ * FROM t1 join t2 ON t1.id > t2.id; + +-- BroadCast join strategy don't support full outer join type. +-- For this case, default join strategy will be adopted instead of broadCast join strategy. +SELECT /*+ BROADCAST(t1) */ * FROM t1 FULL OUTER JOIN t2 ON t1.id = t2.id; +``` + + +**Hash Shuffle Join** + +*NOTE:* `Hash Shuffle Join` strategy only supports join with equivalence join condition. +```sql +CREATE TABLE t1 (id BIGINT, name STRING, age INT) WITH (...); +CREATE TABLE t2 (id BIGINT, name STRING, age INT) WITH (...); +CREATE TABLE t3 (id BIGINT, name STRING, age INT) WITH (...); + +-- Select hash shuffle join strategy with t1, t1 will be the build side. +SELECT /*+ SHUFFLE_HASH(t1) */ * FROM t1 JOIN t2 ON t1.id = t2.id; + +-- Select hash shuffle join strategy with t1 and t3. +-- For this case, when join between t1 and t2, t1 will be the build side, +-- after that, when join between the result after t1 joins t2 and t3, t3 will be the build side. +SELECT /*+ SHUFFLE_HASH(t1, t3) */ * FROM t1 JOIN t2 ON t1.id = t2.id JOIN t3 ON t1.id = t3.id; + +-- Hash shuffle join strategy don't support non-equivalent join conditions. +-- For this case, default join strategy will be adopted instead of hash shuffle join strategy. +SELECT /*+ SHUFFLE_HASH(t1) */ * FROM t1 join t2 ON t1.id > t2.id; +``` + +**Sort Merge Join** + + +*NOTE:* `Sort Merge Join` strategy only supports join with equivalence join condition. +```sql +CREATE TABLE t1 (id BIGINT, name STRING, age INT) WITH (...); +CREATE TABLE t2 (id BIGINT, name STRING, age INT) WITH (...); +CREATE TABLE t3 (id BIGINT, name STRING, age INT) WITH (...); + +-- Sort merge join is adopted. +SELECT /*+ SHUFFLE_MERGE(t1) */ * FROM t1 JOIN t2 ON t1.id = t2.id; + +-- Sort merge join is both adopted in these two joins. +SELECT /*+ SHUFFLE_MERGE(t1, t3) */ * FROM t1 JOIN t2 ON t1.id = t2.id JOIN t3 ON t1.id = t3.id; + +-- Sort merge join strategy don't support non-equivalent join conditions. +-- For this case, default join strategy will be adopted instead of sort merge join strategy. +SELECT /*+ SHUFFLE_MERGE(t1) */ * FROM t1 join t2 ON t1.id > t2.id; +``` + + +**Nested Loop Join** + +*NOTE:* `Nest Loop Join` support both equivalent and non-equivalent join condition. +```sql +CREATE TABLE t1 (id BIGINT, name STRING, age INT) WITH (...); +CREATE TABLE t2 (id BIGINT, name STRING, age INT) WITH (...); +CREATE TABLE t3 (id BIGINT, name STRING, age INT) WITH (...); + +-- Select nest loop join strategy with t1, t1 will be the build side. +SELECT /*+ NEST_LOOP(t1) */ * FROM t1 JOIN t2 ON t1.id = t2.id; + +-- Select nest loop join strategy with t1 and t3. +-- For this case, when join between t1 and t2, t1 will be the build side, +-- after that, when join between the result after t1 joins t2 and t3, t3 will be the build side. +SELECT /*+ NEST_LOOP(t1, t3) */ * FROM t1 JOIN t2 ON t1.id = t2.id JOIN t3 ON t1.id = t3.id; +``` + +**Conflict Cases In Join Hints** + +For some cases, join hints will conflict, we formulate the behavior of join hint in Flink for the following conditions: + - View + - Sub query with table alias + - Conflict in one same join hint strategy + - Conflict in different join hint strategies +```sql +CREATE TABLE t1 (id BIGINT, name STRING, age INT) WITH (...); +CREATE TABLE t2 (id BIGINT, name STRING, age INT) WITH (...); +CREATE TABLE t3 (id BIGINT, name STRING, age INT) WITH (...); + + + +-- View Case + +CREATE VIEW view1 AS SELECT t1.* FROM t1 JOIN ON t2 ON t1.id = t2.id; + +-- For this view case, broadCast join strategy will not be propagated into view1, and join inside view1 will use default join strategy. +SELECT /*+ BROADCAST(t1) */ * FROM view1 JOIN t3 ON view1.id = t3.id; + +-- For this view case, the join inside view1 will use default join strategy, and join between view1 and t1 will use broadCast join strategy. +SELECT /*+ BROADCAST(t1) */ * FROM view1 JOIN t1 ON view1.id = t1.id; + + + +-- Sub Query with Table Alias Case + +-- For this alias case, an exception that the alias_table not existent will be thrown. +SELECT /*+ BROADCAST(alias_table) */ * FROM (SELECT * FROM t1) alias_table JOIN t2 on alias_table.id = t2.id; + +-- To avoid the above alias case, user need use view instead of table alias. Solution: Review Comment: > this part can be removed done! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org