Hi, all devs. Now Doris has supported random distribution for Duplicate Keys 
table and Aggregate Keys table without replace(replace_if_not_null) type column 
to solve data skew problem.

But now, It brings some problem when we do some speical queris on aggregate 
table with random distribution. 

For example, if we have a aggreate table like following:




Table agg_table




ColumnNameType      Aggregate Type

k1char(5)

v1                     int        max 




Origin Data Source, which include 5 rows :                                      
                       

a  1

a  2

b  3

b  4

b  5




In hash distribution, the Doris can ensure that diffrent rows with same 
aggregate keys always in the same tablet(replica), 




so that the result for select * from agg_table is same with select * from 
(select k1, max(v1) from agg_table);




but for the random distribution, aggregate keys table with diffrent rows with 
same aggregate keys may be in the difffrent tablet(replica),




which may result in different reuslt from hash distribution.




But what is the real root cause? This is because Doris leave out the aggregate 
function and group by statement if needed when do select column operation from 
aggregate table,

and do aggregate function with group by operation only in the Doris storage 
layer.




In the current design, count(1) and select just value column without aggregate 
funcion or group by stament for aggregate table now is diffrent from query 
reuslt on rollup which base on duiplicate table.




For example, if we have duplcate table 

Table dup_table

ColumnName     Type 

k1    char(5)

v1                   int




the the rollup is 

   Rollup rollup_table

ColumnNameType      Aggregate Type

k1char(5)

v1                     int        max 




if we exectue query like "select * from dup_table " or "select count(1) from 
dup_table", we won't hit the rollup even if query rollup will cost less, this 
because select * or select count(1)

not really match the v1 column with aggregate function like min.




So. to make query result on random distribution consistent with hash 
distribution for aggregate table, here are two main solutions.

1. forbid directly select * from aggregate table, forbid select count(1) 
directly from aggregate table, and give the right sql to remind user that they 
should use aggregate function on value column or write group by statement 
explicitly.

2. keep the origin design, and we rewrite sql when user execute some special 
queries on aggregate table. 


These two solutions have their own advantages and disadvantages. You are 
welcome to give your opinions, feel free to discuss.

Reply via email to