Hi Happen I think we can only support bucket join of single partition table. But I have some concerns
1. In real use, most scenes are created with multiple partitions. The possible scenarios for this optimization are limited. 2. It will make the data stream sender more complex. Because it is necessary to transfer the whole partition distribution information from Fe, and the hash mode of data import should be coupled when data hash calculation. In addition, can you post the POC profile content? I want to know the specific reasons for the promotion. Thanks, Zhao Chun Lee Happen <happen...@hotmail.com> 于2020年8月19日周三 下午2:39写道: > > Hi Zhao Chun > > Let me explain the two questions you mentioned > > > 1. Doris have data partition and data distribution meta, a certain bucket > seq and partition can be located to a tablet. Each tablet may be > distributed on different machines. if we both consider data partition and > data distribution. The Network consumption of bucket shuffle join will > degenerate. More machine more degradation. > > So Bucket shuffle join only when a single partition is queried can it play > its role > > 2. Before the query, the partition clipping will be carried out by FE. In > many query scenarios, queries are limited to a single partition or even a > few buckets in a single partition. In other words, the better the effect of > partition cutting, The better the performance of Bucket shuffle join. > > So we only need take care of distribution column, but not the data > partition column. > > 3. if we support bucket join for tables with group info. It's not much > different from colocate join. Most scenarios colocate join do well thing > than Bucket shuffle join. It also brings the problem of data skew. > > I think Bucket Shuffle Join Is a special scene optimization, it is > transparent to users. If there are multi partition in left table, we still > can use > shuffle join, broadcast join like before. > > > Best Wish, > Happen Lee > ________________________________ > From: Zhao Chun <zh...@apache.org> > Sent: August 19, 2020 5:53 > To: dev@doris.apache.org <dev@doris.apache.org> > Subject: Re: [Proposal] Support Bucket Shuffle Join for Doris > > Hi Happen > > Good proposal. > > If we want to carry out this work, we need to pay attention to several > aspects. > > 1. The data partition and data distribution of Doris are two independent > modules. If you want to transfer data according to the data distribution, > the metadata information may be high, including all partition information > and bucket information. > > 2. Currently, Doris supports partition first and then distributed. Usually, > the two are based on different columns. This will cause the same > distribution columns to be on the same machine. For example, if table a is > partitioned according to the time column, and then the bucket is divided > according to the userid row, then the same userid in different partitions > will appear on different machines. In extreme cases, the same userid may > appear on all machines, which may degenerate into broadcast join when > joining the userid column. > > I think we should be able to support bucket join for tables with group > info. If we support all tables, there are still many points to consider. > > Thanks, > Zhao Chun > > > Lee Happen <happen...@hotmail.com> 于2020年8月19日周三 下午1:08写道: > > > > > Motivation > > > > At present, Doris support 3 type join: shuffle join, broadcast join, > > colocate join. > > Except colocate join,another join will lead to a lot of network > > consumption. > > > > For example, there a SQL A join B, the cost of network. > > > > * broadcast join: if table A is divided into three parts,the net work > > cost is 3B > > * shuffle join: the network cost is A + B. > > > > These network consumption not only leads to slow query, but also leads to > > extra memory consumption during join. > > > > Each Doris table have disrtribute info, if the join expr hit the > > distribute info, we should use the distribute info to reduce the network > > consumption. > > > > What is bucket shuffle join > > > > [image.png]< > > > https://camo.githubusercontent.com/ba3ac7db1e7c983ec0f555d332b1064c69a9dc2a/68747470733a2f2f75706c6f61642d696d616765732e6a69616e7368752e696f2f75706c6f61645f696d616765732f383535323230312d633338336665383461656565313362632e706e673f696d6167654d6f6772322f6175746f2d6f7269656e742f7374726970253743696d61676556696577322f322f772f31323430 > > > > > > > just like Hive's bucket map join, the picture show how it work. if there > a > > SQL A join B, and the join expr hit the distribute info of A. Bucket > > shuffle join only need distribute table B, sent the data to proper table > A > > part. So the network cost is always B. > > > > So compared with the original join, obviously bucket shuffle join lead to > > less network cost: > > > > B < min(3B, A + B) > > > > > > It can bring us the following benefits: > > > > 1. First, Bucket Shuffle Join reduce the network cost and lead to a > > better performance for some join. Especially when the bucket is cropped. > > > > 2. It does not strongly rely on the mechanism of collocate, so it is > > transparent to users. There is no mandatory requirement for data > > distribution, which will not lead to data skew. > > > > 3. It can provide more query optimization space for join reorder. > > > > POC of Bucket Shuffle Join > > > > Now I've implemented a simple Bucket Shuffle join in Doris and test the > > performance of it. > > > > Now, we chose tpcds query 57. The query have 6 join operation, and 4 of > > them can hit Bucket shuffle join. > > > > Origin Doris Bucket shuffle join > > Time Cost 27.7s 16.4s > > > > It seems to work as well as we expected. I'll do more experiments to > > verify its performance in the future > > > > Implementation > > > > 1. First, we should add a partition type in thrift type > > > > 2. FE able to plan and sense queries that can be used bucket shuffle > > join. send data distribution info to BE > > > > 3. BE use the proper hash function to send proper data to proper > > instance of BE. > > > > > > Best Wish > > Happen Lee > > > > > > > > >