Re: [Discuss][Random Distribution Aggregate Table Query] Make consistent query result on Aggregate Keys table for both hash distribution and random distribution.

zhg yang Sun, 26 Jun 2022 20:22:35 -0700

I support solution 2. We should provide the same query interface for
different tables. The wrong query result is mainly caused by the query
planning not taking into account the new data shuffle method. This problem
is more like a bug than a feature, because I I think it's still a matter of
fixing this issue, rather than restricting how users can use it


Thanks
Yang Zhengguo


蔡聪辉 <[email protected]> 于2022年6月26日周日 21:50写道：

>
>
>
> Hi, all devs. Now Doris has supported random distribution for Duplicate
> Keys table and Aggregate Keys table without replace(replace_if_not_null)
> type column to solve data skew problem.
>
> But now, It brings some problem when we do some speical queris on
> aggregate table with random distribution.
>
> For example, if we have a aggreate table like following:
>
>
>
>
> Table agg_table
>
>
>
>
> ColumnNameType      Aggregate Type
>
> k1char(5)
>
> v1                     int        max
>
>
>
>
> Origin Data Source, which include 5 rows :
>
>
> a  1
>
> a  2
>
> b  3
>
> b  4
>
> b  5
>
>
>
>
> In hash distribution, the Doris can ensure that diffrent rows with same
> aggregate keys always in the same tablet(replica),
>
>
>
>
> so that the result for select * from agg_table is same with select * from
> (select k1, max(v1) from agg_table);
>
>
>
>
> but for the random distribution, aggregate keys table with diffrent rows
> with same aggregate keys may be in the difffrent tablet(replica),
>
>
>
>
> which may result in different reuslt from hash distribution.
>
>
>
>
> But what is the real root cause? This is because Doris leave out the
> aggregate function and group by statement if needed when do select column
> operation from aggregate table,
>
> and do aggregate function with group by operation only in the Doris
> storage layer.
>
>
>
>
> In the current design, count(1) and select just value column without
> aggregate funcion or group by stament for aggregate table now is diffrent
> from query reuslt on rollup which base on duiplicate table.
>
>
>
>
> For example, if we have duplcate table
>
> Table dup_table
>
> ColumnName     Type
>
> k1    char(5)
>
> v1                   int
>
>
>
>
> the the rollup is
>
>    Rollup rollup_table
>
> ColumnNameType      Aggregate Type
>
> k1char(5)
>
> v1                     int        max
>
>
>
>
> if we exectue query like "select * from dup_table " or "select count(1)
> from dup_table", we won't hit the rollup even if query rollup will cost
> less, this because select * or select count(1)
>
> not really match the v1 column with aggregate function like min.
>
>
>
>
> So. to make query result on random distribution consistent with hash
> distribution for aggregate table, here are two main solutions.
>
> 1. forbid directly select * from aggregate table, forbid select count(1)
> directly from aggregate table, and give the right sql to remind user that
> they should use aggregate function on value column or write group by
> statement explicitly.
>
> 2. keep the origin design, and we rewrite sql when user execute some
> special queries on aggregate table.
>
>
> These two solutions have their own advantages and disadvantages. You are
> welcome to give your opinions, feel free to discuss.
>
>

Re: [Discuss][Random Distribution Aggregate Table Query] Make consistent query result on Aggregate Keys table for both hash distribution and random distribution.

Reply via email to