[GitHub] [flink] swuferhong commented on a diff in pull request #20513: [FLINK-28858][docs] Add document to describe join hints for batch sql

GitBox Mon, 15 Aug 2022 18:20:38 -0700


swuferhong commented on code in PR #20513:
URL: https://github.com/apache/flink/pull/20513#discussion_r946258022



##########
docs/content/docs/dev/table/sql/queries/hints.md:
##########
@@ -84,4 +84,219 @@ insert into kafka_table1 /*+ 
OPTIONS('sink.partitioner'='round-robin') */ select
 
 ```
 
+## Join Hints
+
+{{< label Batch >}}
+
+Join hints allow users to suggest the join strategy to optimizer in order to 
get a more optimal execution plan. 
+Currently, only supported in batch mode.
+
+### Supported Join Hints
+The following Join strategies are currently supported in Flink SQL for batch 
mode:
+
+
+**BroadCast Join**
+
+In this Join strategy, the data on the build side (ideally the smaller table) 
will be broadcast to each downstream operator, 
+and the data on the probe side (usually a large table) will be sent directly 
to the downstream operator with Forward. 
+Then the data on the build side (small table) will be built into a Hash Table 
for the probe side to query.
+
+**Hash Shuffle Join**
+
+In this Join strategy, the data on the Build side and the Probe side will be 
shuffled with the join key, and the data 
+with the same key will be distributed to the same downstream operator. Then 
the data on the build side (smaller table) 
+will be built into a Hash Table for the probe side to query.
+
+
+Compared with the `Hash Shuffle Join` strategy, the `BroadCast Join` strategy 
does not need to shuffle the probe side, which saves a lot of 
+shuffle time. Therefore, when a table is tiny, the `BroadCast Join` strategy 
is usually selected to avoid the shuffle 
+cost and improve computing performance. However, when the scale of the small 
table is large, the `BroadCast Join` strategy 
+is not applicable, because the overhead of redundant data transmission will 
exceed the overhead of shuffle.
+
+**Sort Merge Join**
+
+This Join strategy is aimed at the scenario of joining between two large 
tables or the scenario that the data at both 
+sides of the join is already in order. This strategy first shuffles the data 
on both sides of the join to the downstream 
+operator according to the Join Key. Then the downstream operator sorts the 
data before joining, and finally joins the data at both ends. 
+This strategy eliminates the need to load all data on one side into memory, 
thus relieving the pressure on computational memory.
+
+**Nested Loop Join**
+
+In this Join strategy, the probe side is used as an Outer Loop, and the build 
side is used as an Inner Loop, and the data is joined through two-layer loops.
+
+### Syntax
+The join hints syntax in Flink:
+```sql
+# query:
+SELECT /*+ hint_content[, hint_content] */ ...
+
+hint_content:
+    hint_strategy_name(hint_item[, hint_item])
+
+hint_strategy_name:
+    supported_join_hint_name
+
+hint_item:
+    string_literal
+```
+
+### Examples
+
+**BroadCast Join**
+
+*NOTE:* `BroadCast Join` strategy only supports join with equivalence join 
condition. And it not supports `Full Outer Join` join type.
+```sql
+CREATE TABLE t1 (id BIGINT, name STRING, age INT) WITH (...);
+CREATE TABLE t2 (id BIGINT, name STRING, age INT) WITH (...);
+CREATE TABLE t3 (id BIGINT, name STRING, age INT) WITH (...);
+
+-- Select broadcast join strategy with t1, t1 will be the broadcast table.
+SELECT /*+ BROADCAST(t1) */ * FROM t1 JOIN t2 ON t1.id = t2.id;
+
+-- Select broadcast join strategy with t1 and t3.
+-- For this case, when join between t1 and t2, t1 will be the broadcast table,
+-- after that, when join between the result after t1 joins t2 and t3, t3 will 
be the broadcast table.
+SELECT /*+ BROADCAST(t1, t3) */ * FROM t1 JOIN t2 ON t1.id = t2.id JOIN t3 ON 
t1.id = t3.id;
+
+-- BroadCast join strategy don't support non-equivalent join conditions.
+-- For this case, default join strategy will be adopted instead of broadCast 
join strategy.
+SELECT /*+ BROADCAST(t1) */ * FROM t1 join t2 ON t1.id > t2.id;
+
+-- BroadCast join strategy don't support full outer join type.
+-- For this case, default join strategy will be adopted instead of broadCast 
join strategy.
+SELECT /*+ BROADCAST(t1) */ * FROM t1 FULL OUTER JOIN t2 ON t1.id = t2.id;
+```
+
+
+**Hash Shuffle Join**
+
+*NOTE:* `Hash Shuffle Join` strategy only supports join with equivalence join 
condition.
+```sql
+CREATE TABLE t1 (id BIGINT, name STRING, age INT) WITH (...);
+CREATE TABLE t2 (id BIGINT, name STRING, age INT) WITH (...);
+CREATE TABLE t3 (id BIGINT, name STRING, age INT) WITH (...);
+
+-- Select hash shuffle join strategy with t1, t1 will be the build side.
+SELECT /*+ SHUFFLE_HASH(t1) */ * FROM t1 JOIN t2 ON t1.id = t2.id;
+
+-- Select hash shuffle join strategy with t1 and t3.
+-- For this case, when join between t1 and t2, t1 will be the build side,
+-- after that, when join between the result after t1 joins t2 and t3, t3 will 
be the build side.
+SELECT /*+ SHUFFLE_HASH(t1, t3) */ * FROM t1 JOIN t2 ON t1.id = t2.id JOIN t3 
ON t1.id = t3.id;
+
+-- Hash shuffle join strategy don't support non-equivalent join conditions.
+-- For this case, default join strategy will be adopted instead of hash 
shuffle join strategy.
+SELECT /*+ SHUFFLE_HASH(t1) */ * FROM t1 join t2 ON t1.id > t2.id;
+```
+
+**Sort Merge Join**
+
+
+*NOTE:* `Sort Merge Join` strategy only supports join with equivalence join 
condition.
+```sql
+CREATE TABLE t1 (id BIGINT, name STRING, age INT) WITH (...);
+CREATE TABLE t2 (id BIGINT, name STRING, age INT) WITH (...);
+CREATE TABLE t3 (id BIGINT, name STRING, age INT) WITH (...);
+
+-- Sort merge join is adopted.
+SELECT /*+ SHUFFLE_MERGE(t1) */ * FROM t1 JOIN t2 ON t1.id = t2.id;
+
+-- Sort merge join is both adopted in these two joins.
+SELECT /*+ SHUFFLE_MERGE(t1, t3) */ * FROM t1 JOIN t2 ON t1.id = t2.id JOIN t3 
ON t1.id = t3.id;
+
+-- Sort merge join strategy don't support non-equivalent join conditions.
+-- For this case, default join strategy will be adopted instead of sort merge 
join strategy.
+SELECT /*+ SHUFFLE_MERGE(t1) */ * FROM t1 join t2 ON t1.id > t2.id;
+```
+
+
+**Nested Loop Join**
+
+*NOTE:* `Nest Loop Join` support both equivalent and non-equivalent join 
condition.
+```sql
+CREATE TABLE t1 (id BIGINT, name STRING, age INT) WITH (...);
+CREATE TABLE t2 (id BIGINT, name STRING, age INT) WITH (...);
+CREATE TABLE t3 (id BIGINT, name STRING, age INT) WITH (...);
+
+-- Select nest loop join strategy with t1, t1 will be the build side.
+SELECT /*+ NEST_LOOP(t1) */ * FROM t1 JOIN t2 ON t1.id = t2.id;
+
+-- Select nest loop join strategy with t1 and t3.
+-- For this case, when join between t1 and t2, t1 will be the build side,
+-- after that, when join between the result after t1 joins t2 and t3, t3 will 
be the build side.
+SELECT /*+ NEST_LOOP(t1, t3) */ * FROM t1 JOIN t2 ON t1.id = t2.id JOIN t3 ON 
t1.id = t3.id;
+```
+
+**Conflict Cases In Join Hints**
+
+For some cases, join hints will conflict, we formulate the behavior of join 
hint in Flink for the following conditions:
+ - View
+ - Sub query with table alias
+ - Conflict in one same join hint strategy
+ - Conflict in different join hint strategies
+```sql
+CREATE TABLE t1 (id BIGINT, name STRING, age INT) WITH (...);
+CREATE TABLE t2 (id BIGINT, name STRING, age INT) WITH (...);
+CREATE TABLE t3 (id BIGINT, name STRING, age INT) WITH (...);
+
+
+
+-- View Case
+
+CREATE VIEW view1 AS SELECT t1.* FROM t1 JOIN ON t2 ON t1.id = t2.id;
+
+-- For this view case, broadCast join strategy will not be propagated into 
view1, and join inside view1 will use default join strategy.
+SELECT /*+ BROADCAST(t1) */ * FROM view1 JOIN t3 ON view1.id = t3.id;
+
+-- For this view case, the join inside view1 will use default join strategy, 
and join between view1 and t1 will use broadCast join strategy.
+SELECT /*+ BROADCAST(t1) */ * FROM view1 JOIN t1 ON view1.id = t1.id;
+
+
+
+-- Sub Query with Table Alias Case
+
+-- For this alias case, an exception that the alias_table not existent will be 
thrown.
+SELECT /*+ BROADCAST(alias_table) */ * FROM (SELECT * FROM t1) alias_table 
JOIN t2 on alias_table.id = t2.id;

Review Comment:
   > currently, join hint with query-block alias is support
   
   Ok



##########
docs/content/docs/dev/table/sql/queries/hints.md:
##########
@@ -84,4 +84,219 @@ insert into kafka_table1 /*+ 
OPTIONS('sink.partitioner'='round-robin') */ select
 
 ```
 
+## Join Hints
+
+{{< label Batch >}}
+
+Join hints allow users to suggest the join strategy to optimizer in order to 
get a more optimal execution plan. 
+Currently, only supported in batch mode.
+
+### Supported Join Hints
+The following Join strategies are currently supported in Flink SQL for batch 
mode:
+
+
+**BroadCast Join**
+
+In this Join strategy, the data on the build side (ideally the smaller table) 
will be broadcast to each downstream operator, 
+and the data on the probe side (usually a large table) will be sent directly 
to the downstream operator with Forward. 
+Then the data on the build side (small table) will be built into a Hash Table 
for the probe side to query.
+
+**Hash Shuffle Join**
+
+In this Join strategy, the data on the Build side and the Probe side will be 
shuffled with the join key, and the data 
+with the same key will be distributed to the same downstream operator. Then 
the data on the build side (smaller table) 
+will be built into a Hash Table for the probe side to query.
+
+
+Compared with the `Hash Shuffle Join` strategy, the `BroadCast Join` strategy 
does not need to shuffle the probe side, which saves a lot of 
+shuffle time. Therefore, when a table is tiny, the `BroadCast Join` strategy 
is usually selected to avoid the shuffle 
+cost and improve computing performance. However, when the scale of the small 
table is large, the `BroadCast Join` strategy 
+is not applicable, because the overhead of redundant data transmission will 
exceed the overhead of shuffle.
+
+**Sort Merge Join**
+
+This Join strategy is aimed at the scenario of joining between two large 
tables or the scenario that the data at both 
+sides of the join is already in order. This strategy first shuffles the data 
on both sides of the join to the downstream 
+operator according to the Join Key. Then the downstream operator sorts the 
data before joining, and finally joins the data at both ends. 
+This strategy eliminates the need to load all data on one side into memory, 
thus relieving the pressure on computational memory.
+
+**Nested Loop Join**
+
+In this Join strategy, the probe side is used as an Outer Loop, and the build 
side is used as an Inner Loop, and the data is joined through two-layer loops.
+
+### Syntax
+The join hints syntax in Flink:
+```sql
+# query:
+SELECT /*+ hint_content[, hint_content] */ ...
+
+hint_content:
+    hint_strategy_name(hint_item[, hint_item])
+
+hint_strategy_name:
+    supported_join_hint_name
+
+hint_item:
+    string_literal
+```
+
+### Examples
+
+**BroadCast Join**
+
+*NOTE:* `BroadCast Join` strategy only supports join with equivalence join 
condition. And it not supports `Full Outer Join` join type.
+```sql
+CREATE TABLE t1 (id BIGINT, name STRING, age INT) WITH (...);
+CREATE TABLE t2 (id BIGINT, name STRING, age INT) WITH (...);
+CREATE TABLE t3 (id BIGINT, name STRING, age INT) WITH (...);
+
+-- Select broadcast join strategy with t1, t1 will be the broadcast table.
+SELECT /*+ BROADCAST(t1) */ * FROM t1 JOIN t2 ON t1.id = t2.id;
+
+-- Select broadcast join strategy with t1 and t3.
+-- For this case, when join between t1 and t2, t1 will be the broadcast table,
+-- after that, when join between the result after t1 joins t2 and t3, t3 will 
be the broadcast table.
+SELECT /*+ BROADCAST(t1, t3) */ * FROM t1 JOIN t2 ON t1.id = t2.id JOIN t3 ON 
t1.id = t3.id;
+
+-- BroadCast join strategy don't support non-equivalent join conditions.
+-- For this case, default join strategy will be adopted instead of broadCast 
join strategy.
+SELECT /*+ BROADCAST(t1) */ * FROM t1 join t2 ON t1.id > t2.id;
+
+-- BroadCast join strategy don't support full outer join type.
+-- For this case, default join strategy will be adopted instead of broadCast 
join strategy.
+SELECT /*+ BROADCAST(t1) */ * FROM t1 FULL OUTER JOIN t2 ON t1.id = t2.id;
+```
+
+
+**Hash Shuffle Join**
+
+*NOTE:* `Hash Shuffle Join` strategy only supports join with equivalence join 
condition.
+```sql
+CREATE TABLE t1 (id BIGINT, name STRING, age INT) WITH (...);
+CREATE TABLE t2 (id BIGINT, name STRING, age INT) WITH (...);
+CREATE TABLE t3 (id BIGINT, name STRING, age INT) WITH (...);
+
+-- Select hash shuffle join strategy with t1, t1 will be the build side.
+SELECT /*+ SHUFFLE_HASH(t1) */ * FROM t1 JOIN t2 ON t1.id = t2.id;
+
+-- Select hash shuffle join strategy with t1 and t3.
+-- For this case, when join between t1 and t2, t1 will be the build side,
+-- after that, when join between the result after t1 joins t2 and t3, t3 will 
be the build side.
+SELECT /*+ SHUFFLE_HASH(t1, t3) */ * FROM t1 JOIN t2 ON t1.id = t2.id JOIN t3 
ON t1.id = t3.id;
+
+-- Hash shuffle join strategy don't support non-equivalent join conditions.
+-- For this case, default join strategy will be adopted instead of hash 
shuffle join strategy.
+SELECT /*+ SHUFFLE_HASH(t1) */ * FROM t1 join t2 ON t1.id > t2.id;
+```
+
+**Sort Merge Join**
+
+
+*NOTE:* `Sort Merge Join` strategy only supports join with equivalence join 
condition.
+```sql
+CREATE TABLE t1 (id BIGINT, name STRING, age INT) WITH (...);
+CREATE TABLE t2 (id BIGINT, name STRING, age INT) WITH (...);
+CREATE TABLE t3 (id BIGINT, name STRING, age INT) WITH (...);
+
+-- Sort merge join is adopted.
+SELECT /*+ SHUFFLE_MERGE(t1) */ * FROM t1 JOIN t2 ON t1.id = t2.id;
+
+-- Sort merge join is both adopted in these two joins.
+SELECT /*+ SHUFFLE_MERGE(t1, t3) */ * FROM t1 JOIN t2 ON t1.id = t2.id JOIN t3 
ON t1.id = t3.id;
+
+-- Sort merge join strategy don't support non-equivalent join conditions.
+-- For this case, default join strategy will be adopted instead of sort merge 
join strategy.
+SELECT /*+ SHUFFLE_MERGE(t1) */ * FROM t1 join t2 ON t1.id > t2.id;
+```
+
+
+**Nested Loop Join**
+
+*NOTE:* `Nest Loop Join` support both equivalent and non-equivalent join 
condition.
+```sql
+CREATE TABLE t1 (id BIGINT, name STRING, age INT) WITH (...);
+CREATE TABLE t2 (id BIGINT, name STRING, age INT) WITH (...);
+CREATE TABLE t3 (id BIGINT, name STRING, age INT) WITH (...);
+
+-- Select nest loop join strategy with t1, t1 will be the build side.
+SELECT /*+ NEST_LOOP(t1) */ * FROM t1 JOIN t2 ON t1.id = t2.id;
+
+-- Select nest loop join strategy with t1 and t3.
+-- For this case, when join between t1 and t2, t1 will be the build side,
+-- after that, when join between the result after t1 joins t2 and t3, t3 will 
be the build side.
+SELECT /*+ NEST_LOOP(t1, t3) */ * FROM t1 JOIN t2 ON t1.id = t2.id JOIN t3 ON 
t1.id = t3.id;
+```
+
+**Conflict Cases In Join Hints**
+
+For some cases, join hints will conflict, we formulate the behavior of join 
hint in Flink for the following conditions:
+ - View
+ - Sub query with table alias
+ - Conflict in one same join hint strategy
+ - Conflict in different join hint strategies
+```sql
+CREATE TABLE t1 (id BIGINT, name STRING, age INT) WITH (...);
+CREATE TABLE t2 (id BIGINT, name STRING, age INT) WITH (...);
+CREATE TABLE t3 (id BIGINT, name STRING, age INT) WITH (...);
+
+
+
+-- View Case
+
+CREATE VIEW view1 AS SELECT t1.* FROM t1 JOIN ON t2 ON t1.id = t2.id;
+
+-- For this view case, broadCast join strategy will not be propagated into 
view1, and join inside view1 will use default join strategy.
+SELECT /*+ BROADCAST(t1) */ * FROM view1 JOIN t3 ON view1.id = t3.id;
+
+-- For this view case, the join inside view1 will use default join strategy, 
and join between view1 and t1 will use broadCast join strategy.
+SELECT /*+ BROADCAST(t1) */ * FROM view1 JOIN t1 ON view1.id = t1.id;
+
+
+
+-- Sub Query with Table Alias Case
+
+-- For this alias case, an exception that the alias_table not existent will be 
thrown.
+SELECT /*+ BROADCAST(alias_table) */ * FROM (SELECT * FROM t1) alias_table 
JOIN t2 on alias_table.id = t2.id;
+
+-- To avoid the above alias case, user need use view instead of table alias. 
Solution:

Review Comment:
   > this part can be removed
   
   done!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [flink] swuferhong commented on a diff in pull request #20513: [FLINK-28858][docs] Add document to describe join hints for batch sql

Reply via email to