[GitHub] [incubator-doris] HappenLee opened a new issue #4394: [Proposal] Support Bucket Shuffle Join for Doris

GitBox Tue, 18 Aug 2020 22:05:00 -0700


HappenLee opened a new issue #4394:
URL: https://github.com/apache/incubator-doris/issues/4394

## Motivation
At present, Doris support 3 type join: **shuffle join**, **broadcast
join**, **colocate join**.
Except colocate join，another join will lead to a lot of network consumption.

For example, there a SQL A join B, the cost of network.
* **broadcast join**: if table A is divided into three parts，the net work
cost is ``` 3B```
* **shuffle join**: the network cost is ```A + B```.

These network consumption not only leads to slow query, but also leads to
extra memory consumption during join.

Each Doris table have disrtribute info, if the join expr hit the distribute
info, we should use the distribute info to reduce the network consumption.

## What is bucket shuffle join

![image.png](https://upload-images.jianshu.io/upload_images/8552201-c383fe84aeee13bc.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

just like Hive's bucket map join, the picture show how it work. if there a
SQL A join B, and the join expr hit the distribute info of A. Bucket shuffle
join only need distribute table B, sent the data to proper table A part. So the
network cost is always ```B```.

So compared with the original join, obviously bucket shuffle join lead to
less network cost：

B < min(3B, A + B)

### It can bring us the following benefits:

1. First, Bucket Shuffle Join reduce the network cost and lead to a better
performance for some join. Especially when the bucket is cropped.

2. It does not strongly rely on the mechanism of collocate, so it is
transparent to users. There is no mandatory requirement for data distribution,
which will not lead to data skew.

3. It can provide more query optimization space for join reorder.

## POC of Bucket Shuffle Join

Now I've implemented a simple Bucket Shuffle join in Doris and test the
performance of it.

Now, we chose tpcds query 57. The query have 6 join operation, and 4 of them
can hit Bucket shuffle join.

| | Origin Doris | Bucket shuffle join |
| :----: | :----: | :----: |
| Time Cost | 27.7s | 16.4s |

It seems to work as well as we expected. I'll do more experiments to verify
its performance in the future

## Implementation

1. First, we should add a partition type in thrift type

2. FE able to plan and sense queries that can be used bucket shuffle join.
send data distribution info to BE

3. BE use the proper hash function to send proper data to proper instance of
BE.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org

[GitHub] [incubator-doris] HappenLee opened a new issue #4394: [Proposal] Support Bucket Shuffle Join for Doris

Reply via email to