Hi, all devs. Now Doris has supported random distribution for Duplicate Keys
table and Aggregate Keys table without replace(replace_if_not_null) type column
to solve data skew problem.
But now, It brings some problem when we do some speical queris on aggregate
table with random distribution.
For example, if we have a aggreate table like following:
Table agg_table
ColumnNameType Aggregate Type
k1char(5)
v1 int max
Origin Data Source, which include 5 rows :
a 1
a 2
b 3
b 4
b 5
In hash distribution, the Doris can ensure that diffrent rows with same
aggregate keys always in the same tablet(replica),
so that the result for select * from agg_table is same with select * from
(select k1, max(v1) from agg_table);
but for the random distribution, aggregate keys table with diffrent rows with
same aggregate keys may be in the difffrent tablet(replica),
which may result in different reuslt from hash distribution.
But what is the real root cause? This is because Doris leave out the aggregate
function and group by statement if needed when do select column operation from
aggregate table,
and do aggregate function with group by operation only in the Doris storage
layer.
In the current design, count(1) and select just value column without aggregate
funcion or group by stament for aggregate table now is diffrent from query
reuslt on rollup which base on duiplicate table.
For example, if we have duplcate table
Table dup_table
ColumnName Type
k1 char(5)
v1 int
the the rollup is
Rollup rollup_table
ColumnNameType Aggregate Type
k1char(5)
v1 int max
if we exectue query like "select * from dup_table " or "select count(1) from
dup_table", we won't hit the rollup even if query rollup will cost less, this
because select * or select count(1)
not really match the v1 column with aggregate function like min.
So. to make query result on random distribution consistent with hash
distribution for aggregate table, here are two main solutions.
1. forbid directly select * from aggregate table, forbid select count(1)
directly from aggregate table, and give the right sql to remind user that they
should use aggregate function on value column or write group by statement
explicitly.
2. keep the origin design, and we rewrite sql when user execute some special
queries on aggregate table.
These two solutions have their own advantages and disadvantages. You are
welcome to give your opinions, feel free to discuss.