> I have another question that whether rewrite sql and add a shuffle may cause performance drop in some scenes.
Yes, but that's the price random tables have to pay. Random's agg table means that it can no longer perform complete pre-aggregation operations. So there will be a loss of code performance. (It's the same no matter which solution we choose) Ling Miao 王博 <wangbo13...@gmail.com> 于2022年6月27日周一 12:37写道: > For Agg table, data distribution way should keep the consistent semantics, > from this perspective, I think solution 2 is better,this reduces the cost > of user understanding. > I have another question that whether rewrite sql and add a shuffle may > cause performance drop in some scenes. > > 蔡聪辉 <caiconghui2...@163.com> 于2022年6月26日周日 21:49写道: > > > > > > > > > Hi, all devs. Now Doris has supported random distribution for Duplicate > > Keys table and Aggregate Keys table without replace(replace_if_not_null) > > type column to solve data skew problem. > > > > But now, It brings some problem when we do some speical queris on > > aggregate table with random distribution. > > > > For example, if we have a aggreate table like following: > > > > > > > > > > Table agg_table > > > > > > > > > > ColumnNameType Aggregate Type > > > > k1char(5) > > > > v1 int max > > > > > > > > > > Origin Data Source, which include 5 rows : > > > > > > a 1 > > > > a 2 > > > > b 3 > > > > b 4 > > > > b 5 > > > > > > > > > > In hash distribution, the Doris can ensure that diffrent rows with same > > aggregate keys always in the same tablet(replica), > > > > > > > > > > so that the result for select * from agg_table is same with select * from > > (select k1, max(v1) from agg_table); > > > > > > > > > > but for the random distribution, aggregate keys table with diffrent rows > > with same aggregate keys may be in the difffrent tablet(replica), > > > > > > > > > > which may result in different reuslt from hash distribution. > > > > > > > > > > But what is the real root cause? This is because Doris leave out the > > aggregate function and group by statement if needed when do select column > > operation from aggregate table, > > > > and do aggregate function with group by operation only in the Doris > > storage layer. > > > > > > > > > > In the current design, count(1) and select just value column without > > aggregate funcion or group by stament for aggregate table now is diffrent > > from query reuslt on rollup which base on duiplicate table. > > > > > > > > > > For example, if we have duplcate table > > > > Table dup_table > > > > ColumnName Type > > > > k1 char(5) > > > > v1 int > > > > > > > > > > the the rollup is > > > > Rollup rollup_table > > > > ColumnNameType Aggregate Type > > > > k1char(5) > > > > v1 int max > > > > > > > > > > if we exectue query like "select * from dup_table " or "select count(1) > > from dup_table", we won't hit the rollup even if query rollup will cost > > less, this because select * or select count(1) > > > > not really match the v1 column with aggregate function like min. > > > > > > > > > > So. to make query result on random distribution consistent with hash > > distribution for aggregate table, here are two main solutions. > > > > 1. forbid directly select * from aggregate table, forbid select count(1) > > directly from aggregate table, and give the right sql to remind user that > > they should use aggregate function on value column or write group by > > statement explicitly. > > > > 2. keep the origin design, and we rewrite sql when user execute some > > special queries on aggregate table. > > > > > > These two solutions have their own advantages and disadvantages. You are > > welcome to give your opinions, feel free to discuss. > > > > > > -- > 王博 Wang Bo >