I support solution 2.

We add a new shuffle type, so we have to expand our agg code to support the new 
cases.

















At 2022-06-26 21:49:19, "蔡聪辉" <caiconghui2...@163.com> wrote:
>
>
>
>Hi, all devs. Now Doris has supported random distribution for Duplicate Keys 
>table and Aggregate Keys table without replace(replace_if_not_null) type 
>column to solve data skew problem.
>
>But now, It brings some problem when we do some speical queris on aggregate 
>table with random distribution. 
>
>For example, if we have a aggreate table like following:
>
>
>
>
>Table agg_table
>
>
>
>
>ColumnNameType      Aggregate Type
>
>k1char(5)
>
>v1                     int        max 
>
>
>
>
>Origin Data Source, which include 5 rows :                                     
>                        
>
>a  1
>
>a  2
>
>b  3
>
>b  4
>
>b  5
>
>
>
>
>In hash distribution, the Doris can ensure that diffrent rows with same 
>aggregate keys always in the same tablet(replica), 
>
>
>
>
>so that the result for select * from agg_table is same with select * from 
>(select k1, max(v1) from agg_table);
>
>
>
>
>but for the random distribution, aggregate keys table with diffrent rows with 
>same aggregate keys may be in the difffrent tablet(replica),
>
>
>
>
>which may result in different reuslt from hash distribution.
>
>
>
>
>But what is the real root cause? This is because Doris leave out the aggregate 
>function and group by statement if needed when do select column operation from 
>aggregate table,
>
>and do aggregate function with group by operation only in the Doris storage 
>layer.
>
>
>
>
>In the current design, count(1) and select just value column without aggregate 
>funcion or group by stament for aggregate table now is diffrent from query 
>reuslt on rollup which base on duiplicate table.
>
>
>
>
>For example, if we have duplcate table 
>
>Table dup_table
>
>ColumnName     Type 
>
>k1    char(5)
>
>v1                   int
>
>
>
>
>the the rollup is 
>
>   Rollup rollup_table
>
>ColumnNameType      Aggregate Type
>
>k1char(5)
>
>v1                     int        max 
>
>
>
>
>if we exectue query like "select * from dup_table " or "select count(1) from 
>dup_table", we won't hit the rollup even if query rollup will cost less, this 
>because select * or select count(1)
>
>not really match the v1 column with aggregate function like min.
>
>
>
>
>So. to make query result on random distribution consistent with hash 
>distribution for aggregate table, here are two main solutions.
>
>1. forbid directly select * from aggregate table, forbid select count(1) 
>directly from aggregate table, and give the right sql to remind user that they 
>should use aggregate function on value column or write group by statement 
>explicitly.
>
>2. keep the origin design, and we rewrite sql when user execute some special 
>queries on aggregate table. 
>
>
>These two solutions have their own advantages and disadvantages. You are 
>welcome to give your opinions, feel free to discuss.
>

Reply via email to