Hello, I was playing with some statistics information, and I started having weird results in Calcite execution for correlated aggregations that seems to be related to BindableTableScan and RelMdColumnUniqueness Specifically, it seems that RelMdColumnUniqueness does not respect BindableTableScan’s projections list when asking the statistics about specific column positions.
For example: In our system the ID field of a table is considered unique, and this column is always at column position 0 of the table schema. The table provides the bitset {0} in the key’s field of the Statistics object from table.getStatistic. Lets look at this base query: select Id, (select count(C0_21) from T2 where C0_21 = T1.Id) from T1 For each row of T1, I want the id, and the number of C0_21’s which match. We should the same number of rows from this query as T1. When we provider no key’s information to in the table statistics, we get the correct results: ID, Count 1, 2 2, 0 3, 1 4, 2 When we provider the {0} bitset for key’s information in the table statistics, we do not get the correct results, we get (potentially) repeated rows of T1 for each match in T2. ID, Count 1, 1 1, 1 2, 0 3, 1 4, 1 4, 1 Here are some print-outs from the CONVERTED and PLAN_BEFORE_IMPLEMENTATION hooks: CONVERTED LogicalProject(variablesSet=[[$cor0]], Id=[$0], EXPR$1=[$SCALAR_QUERY( │....LogicalAggregate(group=[{}], EXPR$0=[COUNT($0)]) │....└── LogicalProject(C0_21=[$1]) │........└── LogicalFilter(condition=[=($1, $cor0.ID)]) │............└── QueryTableScan(table=[T2], Schema=[ID:Dimension, C0_21:Dimension, C0_22:Decimal(3, 3)]) │)]) └── QueryTableScan(table=[T1], Schema=[ID:Dimension, C0_12:Decimal(3, 3), C0_13:Text, C0_14:Timestamp]) PLAN_BEFORE_IMPLEMENTATION with no key information EnumerableCalc(expr#0..2=[{inputs}], expr#3=[IS NULL($t2)], expr#4=[0:BIGINT], expr#5=[CASE($t3, $t4, $t2)], Id=[$t0], EXPR$1=[$t5]) ..EnumerableHashJoin(condition=[=($0, $1)], joinType=[left]) ....EnumerableInterpreter ......BindableTableScan(table=[[QUERY, T1]], projects=[[0]]) ....EnumerableAggregate(group=[{0}], EXPR$0=[COUNT($0)]) ......EnumerableInterpreter ........BindableTableScan(table=[[QUERY, T2]], filters=[[IS NOT NULL($1)]], projects=[[1]]) PLAN_BEFORE_IMPLEMENTATION with {0} key information EnumerableCalc(expr#0..2=[{inputs}], expr#3=[IS NULL($t2)], expr#4=[0:BIGINT], expr#5=[CASE($t3, $t4, $t2)], Id=[$t0], EXPR$1=[$t5]) ..EnumerableHashJoin(condition=[=($0, $1)], joinType=[left]) ....EnumerableInterpreter ......BindableTableScan(table=[[QUERY, T1]], projects=[[0]]) ....EnumerableCalc(expr#0=[{inputs}], expr#1=[IS NOT NULL($t0)], expr#2=[1:BIGINT], expr#3=[0:BIGINT], expr#4=[CASE($t1, $t2, $t3)], C0_21=[$t0], $f1=[$t4]) ......EnumerableInterpreter ........BindableTableScan(table=[[QUERY, T2]], filters=[[IS NOT NULL($1)]], projects=[[1]]) Notice that the EnumerableAggregate is missing when the key information is present. If you set a breakpoint, you can ask RelMedataQuery.areColumnsUnique(bindableTableScan, {0}) and it will return true, which is not correct, since column 0 maps to column 1 in the table (C0_21) which is not unique. I bet that this fact is then used by AggregateRemoveRule to incorrectly drop the aggregate, which results in multiple rows from T2 per T1.ID RelMdColumnUniqueness.areColumnsUnqueue(TableScan) will attempt to unwrap a ColumnUniqueness.Handler from the table, but one is not given. I feel like BindableTableScan should provide some wrapper to remap the projected columns. We don’t do much configuration on when executing in the “native”/Enumerable Calcite pathway, so I don’t know if/what configurations we might be missing, but I don’t think we should need to augment the table statistics depending on what kind of table scan is provided. Its also not clear to me how we would provide an alternative RelMdColumnUniqueness handler for BindableTableScan in this pathway (which is through CalcitePrepareImpl) If this is not a configuration issue, I will file a bug JIRA -Ian Bertolacci