Re: [Discuss][Random Distribution Aggregate Table Query] Make consistent query result on Aggregate Keys table for both hash distribution and random distribution.

王博 Sun, 26 Jun 2022 21:37:15 -0700

For Agg table,  data distribution way should keep the consistent semantics,
from this perspective, I think solution 2 is better,this reduces the cost
of user understanding.
I have another question that whether rewrite sql and add a shuffle may
cause performance drop in some scenes.


蔡聪辉 <[email protected]> 于2022年6月26日周日 21:49写道：

>
>
>
> Hi, all devs. Now Doris has supported random distribution for Duplicate
> Keys table and Aggregate Keys table without replace(replace_if_not_null)
> type column to solve data skew problem.
>
> But now, It brings some problem when we do some speical queris on
> aggregate table with random distribution.
>
> For example, if we have a aggreate table like following:
>
>
>
>
> Table agg_table
>
>
>
>
> ColumnNameType      Aggregate Type
>
> k1char(5)
>
> v1                     int        max
>
>
>
>
> Origin Data Source, which include 5 rows :
>
>
> a  1
>
> a  2
>
> b  3
>
> b  4
>
> b  5
>
>
>
>
> In hash distribution, the Doris can ensure that diffrent rows with same
> aggregate keys always in the same tablet(replica),
>
>
>
>
> so that the result for select * from agg_table is same with select * from
> (select k1, max(v1) from agg_table);
>
>
>
>
> but for the random distribution, aggregate keys table with diffrent rows
> with same aggregate keys may be in the difffrent tablet(replica),
>
>
>
>
> which may result in different reuslt from hash distribution.
>
>
>
>
> But what is the real root cause? This is because Doris leave out the
> aggregate function and group by statement if needed when do select column
> operation from aggregate table,
>
> and do aggregate function with group by operation only in the Doris
> storage layer.
>
>
>
>
> In the current design, count(1) and select just value column without
> aggregate funcion or group by stament for aggregate table now is diffrent
> from query reuslt on rollup which base on duiplicate table.
>
>
>
>
> For example, if we have duplcate table
>
> Table dup_table
>
> ColumnName     Type
>
> k1    char(5)
>
> v1                   int
>
>
>
>
> the the rollup is
>
>    Rollup rollup_table
>
> ColumnNameType      Aggregate Type
>
> k1char(5)
>
> v1                     int        max
>
>
>
>
> if we exectue query like "select * from dup_table " or "select count(1)
> from dup_table", we won't hit the rollup even if query rollup will cost
> less, this because select * or select count(1)
>
> not really match the v1 column with aggregate function like min.
>
>
>
>
> So. to make query result on random distribution consistent with hash
> distribution for aggregate table, here are two main solutions.
>
> 1. forbid directly select * from aggregate table, forbid select count(1)
> directly from aggregate table, and give the right sql to remind user that
> they should use aggregate function on value column or write group by
> statement explicitly.
>
> 2. keep the origin design, and we rewrite sql when user execute some
> special queries on aggregate table.
>
>
> These two solutions have their own advantages and disadvantages. You are
> welcome to give your opinions, feel free to discuss.
>
>

-- 
王博  Wang Bo

Re: [Discuss][Random Distribution Aggregate Table Query] Make consistent query result on Aggregate Keys table for both hash distribution and random distribution.

Reply via email to