I support solution 2. We should provide the same query interface for different tables. The wrong query result is mainly caused by the query planning not taking into account the new data shuffle method. This problem is more like a bug than a feature, because I I think it's still a matter of fixing this issue, rather than restricting how users can use it
Thanks Yang Zhengguo 蔡聪辉 <caiconghui2...@163.com> 于2022年6月26日周日 21:50写道: > > > > Hi, all devs. Now Doris has supported random distribution for Duplicate > Keys table and Aggregate Keys table without replace(replace_if_not_null) > type column to solve data skew problem. > > But now, It brings some problem when we do some speical queris on > aggregate table with random distribution. > > For example, if we have a aggreate table like following: > > > > > Table agg_table > > > > > ColumnNameType Aggregate Type > > k1char(5) > > v1 int max > > > > > Origin Data Source, which include 5 rows : > > > a 1 > > a 2 > > b 3 > > b 4 > > b 5 > > > > > In hash distribution, the Doris can ensure that diffrent rows with same > aggregate keys always in the same tablet(replica), > > > > > so that the result for select * from agg_table is same with select * from > (select k1, max(v1) from agg_table); > > > > > but for the random distribution, aggregate keys table with diffrent rows > with same aggregate keys may be in the difffrent tablet(replica), > > > > > which may result in different reuslt from hash distribution. > > > > > But what is the real root cause? This is because Doris leave out the > aggregate function and group by statement if needed when do select column > operation from aggregate table, > > and do aggregate function with group by operation only in the Doris > storage layer. > > > > > In the current design, count(1) and select just value column without > aggregate funcion or group by stament for aggregate table now is diffrent > from query reuslt on rollup which base on duiplicate table. > > > > > For example, if we have duplcate table > > Table dup_table > > ColumnName Type > > k1 char(5) > > v1 int > > > > > the the rollup is > > Rollup rollup_table > > ColumnNameType Aggregate Type > > k1char(5) > > v1 int max > > > > > if we exectue query like "select * from dup_table " or "select count(1) > from dup_table", we won't hit the rollup even if query rollup will cost > less, this because select * or select count(1) > > not really match the v1 column with aggregate function like min. > > > > > So. to make query result on random distribution consistent with hash > distribution for aggregate table, here are two main solutions. > > 1. forbid directly select * from aggregate table, forbid select count(1) > directly from aggregate table, and give the right sql to remind user that > they should use aggregate function on value column or write group by > statement explicitly. > > 2. keep the origin design, and we rewrite sql when user execute some > special queries on aggregate table. > > > These two solutions have their own advantages and disadvantages. You are > welcome to give your opinions, feel free to discuss. > >