> I have another question that whether rewrite sql and add a shuffle may
cause performance drop in some scenes.

Yes, but that's the price random tables have to pay.
Random's agg table means that it can no longer perform complete
pre-aggregation operations.
So there will be a loss of code performance. (It's the same no matter which
solution we choose)

Ling Miao

王博 <wangbo13...@gmail.com> 于2022年6月27日周一 12:37写道:

> For Agg table,  data distribution way should keep the consistent semantics,
> from this perspective, I think solution 2 is better,this reduces the cost
> of user understanding.
> I have another question that whether rewrite sql and add a shuffle may
> cause performance drop in some scenes.
>
> 蔡聪辉 <caiconghui2...@163.com> 于2022年6月26日周日 21:49写道:
>
> >
> >
> >
> > Hi, all devs. Now Doris has supported random distribution for Duplicate
> > Keys table and Aggregate Keys table without replace(replace_if_not_null)
> > type column to solve data skew problem.
> >
> > But now, It brings some problem when we do some speical queris on
> > aggregate table with random distribution.
> >
> > For example, if we have a aggreate table like following:
> >
> >
> >
> >
> > Table agg_table
> >
> >
> >
> >
> > ColumnNameType      Aggregate Type
> >
> > k1char(5)
> >
> > v1                     int        max
> >
> >
> >
> >
> > Origin Data Source, which include 5 rows :
> >
> >
> > a  1
> >
> > a  2
> >
> > b  3
> >
> > b  4
> >
> > b  5
> >
> >
> >
> >
> > In hash distribution, the Doris can ensure that diffrent rows with same
> > aggregate keys always in the same tablet(replica),
> >
> >
> >
> >
> > so that the result for select * from agg_table is same with select * from
> > (select k1, max(v1) from agg_table);
> >
> >
> >
> >
> > but for the random distribution, aggregate keys table with diffrent rows
> > with same aggregate keys may be in the difffrent tablet(replica),
> >
> >
> >
> >
> > which may result in different reuslt from hash distribution.
> >
> >
> >
> >
> > But what is the real root cause? This is because Doris leave out the
> > aggregate function and group by statement if needed when do select column
> > operation from aggregate table,
> >
> > and do aggregate function with group by operation only in the Doris
> > storage layer.
> >
> >
> >
> >
> > In the current design, count(1) and select just value column without
> > aggregate funcion or group by stament for aggregate table now is diffrent
> > from query reuslt on rollup which base on duiplicate table.
> >
> >
> >
> >
> > For example, if we have duplcate table
> >
> > Table dup_table
> >
> > ColumnName     Type
> >
> > k1    char(5)
> >
> > v1                   int
> >
> >
> >
> >
> > the the rollup is
> >
> >    Rollup rollup_table
> >
> > ColumnNameType      Aggregate Type
> >
> > k1char(5)
> >
> > v1                     int        max
> >
> >
> >
> >
> > if we exectue query like "select * from dup_table " or "select count(1)
> > from dup_table", we won't hit the rollup even if query rollup will cost
> > less, this because select * or select count(1)
> >
> > not really match the v1 column with aggregate function like min.
> >
> >
> >
> >
> > So. to make query result on random distribution consistent with hash
> > distribution for aggregate table, here are two main solutions.
> >
> > 1. forbid directly select * from aggregate table, forbid select count(1)
> > directly from aggregate table, and give the right sql to remind user that
> > they should use aggregate function on value column or write group by
> > statement explicitly.
> >
> > 2. keep the origin design, and we rewrite sql when user execute some
> > special queries on aggregate table.
> >
> >
> > These two solutions have their own advantages and disadvantages. You are
> > welcome to give your opinions, feel free to discuss.
> >
> >
>
> --
> 王博  Wang Bo
>

Reply via email to