solution 2, +1

It is better to make it easier for users to use.

Jianliang Qi

On Tue, Jun 28, 2022 at 2:36 PM Zhou Minghong <minghong.z...@163.com> wrote:

> I support solution 2.
>
>
> We add a new shuffle type, so we have to expand our agg code to support
> the new cases.
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> At 2022-06-26 21:49:19, "蔡聪辉" <caiconghui2...@163.com> wrote:
> >
> >
> >
> >Hi, all devs. Now Doris has supported random distribution for Duplicate
> Keys table and Aggregate Keys table without replace(replace_if_not_null)
> type column to solve data skew problem.
> >
> >But now, It brings some problem when we do some speical queris on
> aggregate table with random distribution.
> >
> >For example, if we have a aggreate table like following:
> >
> >
> >
> >
> >Table agg_table
> >
> >
> >
> >
> >ColumnNameType      Aggregate Type
> >
> >k1char(5)
> >
> >v1                     int        max
> >
> >
> >
> >
> >Origin Data Source, which include 5 rows :
>
> >
> >a  1
> >
> >a  2
> >
> >b  3
> >
> >b  4
> >
> >b  5
> >
> >
> >
> >
> >In hash distribution, the Doris can ensure that diffrent rows with same
> aggregate keys always in the same tablet(replica),
> >
> >
> >
> >
> >so that the result for select * from agg_table is same with select * from
> (select k1, max(v1) from agg_table);
> >
> >
> >
> >
> >but for the random distribution, aggregate keys table with diffrent rows
> with same aggregate keys may be in the difffrent tablet(replica),
> >
> >
> >
> >
> >which may result in different reuslt from hash distribution.
> >
> >
> >
> >
> >But what is the real root cause? This is because Doris leave out the
> aggregate function and group by statement if needed when do select column
> operation from aggregate table,
> >
> >and do aggregate function with group by operation only in the Doris
> storage layer.
> >
> >
> >
> >
> >In the current design, count(1) and select just value column without
> aggregate funcion or group by stament for aggregate table now is diffrent
> from query reuslt on rollup which base on duiplicate table.
> >
> >
> >
> >
> >For example, if we have duplcate table
> >
> >Table dup_table
> >
> >ColumnName     Type
> >
> >k1    char(5)
> >
> >v1                   int
> >
> >
> >
> >
> >the the rollup is
> >
> >   Rollup rollup_table
> >
> >ColumnNameType      Aggregate Type
> >
> >k1char(5)
> >
> >v1                     int        max
> >
> >
> >
> >
> >if we exectue query like "select * from dup_table " or "select count(1)
> from dup_table", we won't hit the rollup even if query rollup will cost
> less, this because select * or select count(1)
> >
> >not really match the v1 column with aggregate function like min.
> >
> >
> >
> >
> >So. to make query result on random distribution consistent with hash
> distribution for aggregate table, here are two main solutions.
> >
> >1. forbid directly select * from aggregate table, forbid select count(1)
> directly from aggregate table, and give the right sql to remind user that
> they should use aggregate function on value column or write group by
> statement explicitly.
> >
> >2. keep the origin design, and we rewrite sql when user execute some
> special queries on aggregate table.
> >
> >
> >These two solutions have their own advantages and disadvantages. You are
> welcome to give your opinions, feel free to discuss.
> >
>

Reply via email to