Alexander & Vineet,
One further comment about “NOT IN”. SQL in general is fairly close to
relational algebra, but “NOT IN” is one of the places where the gap is widest.
“NOT IN” is difficult in general to execute efficiently, because of the problem
of NULL values (at Oracle, we always recommended to users to rewrite as NOT
EXISTS if possible). The gap between SQL and relational algebra is apparent
when a short SQL query becomes a complex RelNode tree.
There is a silver lining: the RelNode tree, being relational algebra, has
well-behaved semantics. Once you’re in RelNode land, you can freely apply
transformation rules to make it efficient.
Vineet,
If the planner rules produce a plan that is sub-optimal I wouldn’t call it a
bug but a missing feature. (It would be a bug if the planner over-reached and
created a plan that gave wrong results, so I always tend to be conservative
about adding rules.)
Usually it’s OK if we make a mess in SQL-to-RelNode conversion. A few redundant
projects and filters are no problem, and can be easily removed later with rules
that reduce constants and propagate predicates throughout the tree. But for the
general case of NOT IN, we have to add a self-join to deal with the possibility
that the key has NULL values. After constant reduction has kicked in and we
have realized that NULL key values are not possible, it is not easy to remove
that self-join.
Here is a very simple query where this happens:
sqlline> !connect jdbc:calcite:model=core/src/test/resources/hsqldb-model.json
sa ""
sqlline> !set outputformat csv
sqlline> explain plan for select * from scott.emp where deptno not in (
> select deptno from scott.dept where deptno = 20);
'PLAN'
'EnumerableCalc(expr#0..11=[{inputs}], expr#12=[0], expr#13=[=($t8, $t12)],
expr#14=[false], expr#15=[IS NOT NULL($t11)], expr#16=[true], expr#17=[IS
NULL($t7)], expr#18=[null], expr#19=[<($t9, $t8)], expr#20=[CASE($t13, $t14,
$t15, $t16, $t17, $t18, $t19, $t16, $t14)], expr#21=[NOT($t20)],
proj#0..7=[{exprs}], $condition=[$t21])
EnumerableJoin(condition=[=($7, $10)], joinType=[left])
EnumerableCalc(expr#0..9=[{inputs}], EMPNO=[$t2], ENAME=[$t3], JOB=[$t4],
MGR=[$t5], HIREDATE=[$t6], SAL=[$t7], COMM=[$t8], DEPTNO=[$t9], c=[$t0],
ck=[$t1])
EnumerableJoin(condition=[true], joinType=[inner])
JdbcToEnumerableConverter
JdbcAggregate(group=[{}], c=[COUNT()], ck=[COUNT($0)])
JdbcFilter(condition=[=(CAST($0):INTEGER NOT NULL, 20)])
JdbcTableScan(table=[[SCOTT, DEPT]])
JdbcToEnumerableConverter
JdbcTableScan(table=[[SCOTT, EMP]])
JdbcToEnumerableConverter
JdbcAggregate(group=[{0, 1}])
JdbcProject(DEPTNO=[$0], i=[true])
JdbcFilter(condition=[=(CAST($0):INTEGER NOT NULL, 20)])
JdbcTableScan(table=[[SCOTT, DEPT]])
'
1 row selected (0.067 seconds)
Note that there are two scans of DEPT, but one is sufficient because DEPTNO is
never null. In the JdbcAggregate, c always equals ck, and therefore the CASE
can be simplified, and therefore the scan of DEPT that produces c and ck can be
dropped, but Calcite rules cannot deduce that fact.
Can you please log a JIRA case for this? See if you can find some other queries
(maybe using IN rather than NOT IN, or whose key columns are not so obviously
NOT NULL) and include these in the JIRA case also.
I doubt we can fix using a planner rule. The best solution may be to use
RelMetadataQuery.getPulledUpPredicates() to simplify the CASE before we add the
join.
Julian
> On Nov 1, 2016, at 8:49 AM, Alexander Shoshin <[email protected]>
> wrote:
>
> Julian, thank you for help.
>
> I had a wrong picture of NULL values processing. So, it looks like there is
> some problem in my planner rules.
> As for the AST, I was confused by the wrong Flink "explain()" function
> description :)
>
>
> Regards,
> Alexander
>
> -----Original Message-----
> From: Julian Hyde [mailto:[email protected]]
> Sent: Monday, October 31, 2016 10:43 PM
> To: [email protected]
> Subject: Re: Problems with abstract syntax tree
>
> The behavior of NOT IN in SQL is complicated when there are NULL values
> around. In particular, if one "word" value from the sub-query is null, then
> the outer query must return 0 rows. (Why? Because "word NOT IN ('foo', 'bar'
> null)" would evaluate to UNKNOWN for every row.)
>
> It is valid to deduce that "word" in the sub-query is never null, because of
> the "WHERE word = 'hello'" condition. I would have hoped that a constant
> reduction could do that, and then maybe the CASE expression can be simplified.
>
> By the way, to be pedantic, what we are talking about here is the RelNode
> tree, the relational algebra, which comes out of the SqlToRelConverter. The
> AST is the SqlNode tree, which comes out of the parser and goes into the
> SqlToRelConverter.
>
> On Mon, Oct 31, 2016 at 8:46 AM, Alexander Shoshin
> <[email protected]> wrote:
>> Hello, everybody.
>>
>> Trying to resolve an Apache Flink issue I got some troubles with Calcite.
>> Can you help me to understand is there a problem in Calcite or just in wrong
>> settings passed to Calcite functions?
>>
>> I have a simple table "Words" with one column named "word" and a query with
>> NOT IN operator:
>> val query = "SELECT word FROM Words WHERE word NOT IN (SELECT word FROM
>> Words WHERE word = 'hello')"
>>
>> This query parsed by org.apache.calcite.sql.parser.SqlParser.parseStmt() and
>> then transformed to a relational tree by
>> org.apache.calcite.sql2rel.SqlToRelConverter.convertQuery(...).
>>
>> As a result I see the following abstract syntax tree
>> LogicalProject(word=[$0])
>> LogicalFilter(condition=[NOT(CASE(=($1, 0), false, IS NOT NULL($5), true,
>> IS NULL($3), null, <($2, $1), null, false))])
>> LogicalJoin(condition=[=($3, $4)], joinType=[left])
>> LogicalProject($f0=[$0], $f1=[$1], $f2=[$2], $f3=[$0])
>> LogicalJoin(condition=[true], joinType=[inner])
>> EnumerableTableScan(table=[[Words]])
>> LogicalAggregate(group=[{}], agg#0=[COUNT()], agg#1=[COUNT($0)])
>> LogicalProject($f0=[$0], $f1=[true])
>> LogicalProject(word=[$0])
>> LogicalFilter(condition=[=($0, 'hello')])
>> EnumerableTableScan(table=[[Words]])
>> LogicalAggregate(group=[{0}], agg#0=[MIN($1)])
>> LogicalProject($f0=[$0], $f1=[true])
>> LogicalProject(word=[$0])
>> LogicalFilter(condition=[=($0, 'hello')])
>> EnumerableTableScan(table=[[Words]])
>>
>> which fails later during query plan optimization (while calling
>> org.apache.calcite.tools.Programs.RuleSetProgram.run()).
>>
>> I think it might be because of a very complex abstract syntax tree generated
>> by Calcite. Shouldn't it be more simple? This one looks good for me:
>> LogicalProject(word=[$0])
>> LogicalFilter(condition=[IS NULL($2)])
>> LogicalJoin(condition=[=($0, $1)], joinType=[left])
>> EnumerableTableScan(table=[[Words]])
>> LogicalProject($f0=[$0], $f1=[true])
>> LogicalProject(word=[$0])
>> LogicalFilter(condition=[=($0, 'hello')])
>> EnumerableTableScan(table=[[Words]])
>>
>> And when I write a query using LEFT OUTER JOIN to receive this syntax tree -
>> the optimization works fine. And the query execution result is the same as
>> must be in case of using NOT IN. So am I wrong with a supposition about bad
>> abstract syntax tree or not? I will be glad to receive any comments.
>>
>> Regards,
>> Alexander