Hello,
I was playing with some statistics information, and I started having weird 
results in Calcite execution for correlated aggregations that seems to be 
related to BindableTableScan and RelMdColumnUniqueness
Specifically, it seems that RelMdColumnUniqueness does not respect 
BindableTableScan’s  projections list when asking the statistics about specific 
column positions.

For example:
In our system the ID field of a table is considered unique, and this column is 
always at column position 0 of the table schema.
The table provides the bitset {0} in the key’s field of the Statistics object 
from table.getStatistic.

Lets look at this base query:
select Id, (select count(C0_21) from T2 where C0_21 = T1.Id) from T1
For each row of T1, I want the id, and the number of C0_21’s  which match.
We should the same number of rows from this query as T1.

When we provider no key’s information to in the table statistics, we  get the 
correct results:
ID, Count
1, 2
2, 0
3, 1
4, 2

When we provider the {0} bitset for key’s information in the table statistics, 
we do not get the correct results, we get (potentially) repeated rows of T1 for 
each match in T2.
ID, Count
1, 1
1, 1
2, 0
3, 1
4, 1
4, 1

Here are some print-outs from the CONVERTED and PLAN_BEFORE_IMPLEMENTATION 
hooks:
CONVERTED
LogicalProject(variablesSet=[[$cor0]], Id=[$0], EXPR$1=[$SCALAR_QUERY(
│....LogicalAggregate(group=[{}], EXPR$0=[COUNT($0)])
│....└── LogicalProject(C0_21=[$1])
│........└── LogicalFilter(condition=[=($1, $cor0.ID)])
│............└── QueryTableScan(table=[T2], Schema=[ID:Dimension, 
C0_21:Dimension, C0_22:Decimal(3, 3)])
│)])
└── QueryTableScan(table=[T1], Schema=[ID:Dimension, C0_12:Decimal(3, 3), 
C0_13:Text, C0_14:Timestamp])

PLAN_BEFORE_IMPLEMENTATION with no key information
EnumerableCalc(expr#0..2=[{inputs}], expr#3=[IS NULL($t2)], expr#4=[0:BIGINT], 
expr#5=[CASE($t3, $t4, $t2)], Id=[$t0], EXPR$1=[$t5])
..EnumerableHashJoin(condition=[=($0, $1)], joinType=[left])
....EnumerableInterpreter
......BindableTableScan(table=[[QUERY, T1]], projects=[[0]])
....EnumerableAggregate(group=[{0}], EXPR$0=[COUNT($0)])
......EnumerableInterpreter
........BindableTableScan(table=[[QUERY, T2]], filters=[[IS NOT NULL($1)]], 
projects=[[1]])

PLAN_BEFORE_IMPLEMENTATION with {0} key information
EnumerableCalc(expr#0..2=[{inputs}], expr#3=[IS NULL($t2)], expr#4=[0:BIGINT], 
expr#5=[CASE($t3, $t4, $t2)], Id=[$t0], EXPR$1=[$t5])
..EnumerableHashJoin(condition=[=($0, $1)], joinType=[left])
....EnumerableInterpreter
......BindableTableScan(table=[[QUERY, T1]], projects=[[0]])
....EnumerableCalc(expr#0=[{inputs}], expr#1=[IS NOT NULL($t0)], 
expr#2=[1:BIGINT], expr#3=[0:BIGINT], expr#4=[CASE($t1, $t2, $t3)], 
C0_21=[$t0], $f1=[$t4])
......EnumerableInterpreter
........BindableTableScan(table=[[QUERY, T2]], filters=[[IS NOT NULL($1)]], 
projects=[[1]])

Notice that the EnumerableAggregate is missing when the key information is 
present.
If you set a breakpoint, you can ask 
RelMedataQuery.areColumnsUnique(bindableTableScan, {0}) and it will return 
true, which is not correct, since column 0 maps to column 1 in the table 
(C0_21) which is not unique.
I bet that this fact is then used by AggregateRemoveRule to incorrectly drop 
the aggregate, which results in multiple rows from T2 per T1.ID


RelMdColumnUniqueness.areColumnsUnqueue(TableScan) will attempt to unwrap a 
ColumnUniqueness.Handler from the table, but one is not given.
I feel like BindableTableScan should provide some wrapper to remap the 
projected columns.
We don’t do much configuration on when executing in the “native”/Enumerable 
Calcite pathway, so I don’t know if/what configurations we might be missing, 
but I don’t think we should need to augment the table statistics depending on 
what kind of table scan is provided.
Its also not clear to me how we would provide an alternative 
RelMdColumnUniqueness handler for BindableTableScan in this pathway (which is 
through CalcitePrepareImpl)

If this is not a configuration issue, I will file a bug JIRA

-Ian Bertolacci

Reply via email to