Alexander & Vineet,

One further comment about “NOT IN”. SQL in general is fairly close to 
relational algebra, but “NOT IN” is one of the places where the gap is widest. 
“NOT IN” is difficult in general to execute efficiently, because of the problem 
of NULL values (at Oracle, we always recommended to users to rewrite as NOT 
EXISTS if possible). The gap between SQL and relational algebra is apparent 
when a short SQL query becomes a complex RelNode tree.

There is a silver lining: the RelNode tree, being relational algebra, has 
well-behaved semantics. Once you’re in RelNode land, you can freely apply 
transformation rules to make it efficient.

Vineet,

If the planner rules produce a plan that is sub-optimal I wouldn’t call it a 
bug but a missing feature. (It would be a bug if the planner over-reached and 
created a plan that gave wrong results, so I always tend to be conservative 
about adding rules.)

Usually it’s OK if we make a mess in SQL-to-RelNode conversion. A few redundant 
projects and filters are no problem, and can be easily removed later with rules 
that reduce constants and propagate predicates throughout the tree. But for the 
general case of NOT IN, we have to add a self-join to deal with the possibility 
that the key has NULL values. After constant reduction has kicked in and we 
have realized that NULL key values are not possible, it is not easy to remove 
that self-join.

Here is a very simple query where this happens:

sqlline> !connect jdbc:calcite:model=core/src/test/resources/hsqldb-model.json 
sa ""
sqlline> !set outputformat csv
sqlline> explain plan for select * from scott.emp where deptno not in (
       > select deptno from scott.dept where deptno = 20);
'PLAN'
'EnumerableCalc(expr#0..11=[{inputs}], expr#12=[0], expr#13=[=($t8, $t12)], 
expr#14=[false], expr#15=[IS NOT NULL($t11)], expr#16=[true], expr#17=[IS 
NULL($t7)], expr#18=[null], expr#19=[<($t9, $t8)], expr#20=[CASE($t13, $t14, 
$t15, $t16, $t17, $t18, $t19, $t16, $t14)], expr#21=[NOT($t20)], 
proj#0..7=[{exprs}], $condition=[$t21])
  EnumerableJoin(condition=[=($7, $10)], joinType=[left])
    EnumerableCalc(expr#0..9=[{inputs}], EMPNO=[$t2], ENAME=[$t3], JOB=[$t4], 
MGR=[$t5], HIREDATE=[$t6], SAL=[$t7], COMM=[$t8], DEPTNO=[$t9], c=[$t0], 
ck=[$t1])
      EnumerableJoin(condition=[true], joinType=[inner])
        JdbcToEnumerableConverter
          JdbcAggregate(group=[{}], c=[COUNT()], ck=[COUNT($0)])
            JdbcFilter(condition=[=(CAST($0):INTEGER NOT NULL, 20)])
              JdbcTableScan(table=[[SCOTT, DEPT]])
        JdbcToEnumerableConverter
          JdbcTableScan(table=[[SCOTT, EMP]])
    JdbcToEnumerableConverter
      JdbcAggregate(group=[{0, 1}])
        JdbcProject(DEPTNO=[$0], i=[true])
          JdbcFilter(condition=[=(CAST($0):INTEGER NOT NULL, 20)])
            JdbcTableScan(table=[[SCOTT, DEPT]])
'
1 row selected (0.067 seconds)

Note that there are two scans of DEPT, but one is sufficient because DEPTNO is 
never null. In the JdbcAggregate, c always equals ck, and therefore the CASE 
can be simplified, and therefore the scan of DEPT that produces c and ck can be 
dropped, but Calcite rules cannot deduce that fact.

Can you please log a JIRA case for this? See if you can find some other queries 
(maybe using IN rather than NOT IN, or whose key columns are not so obviously 
NOT NULL) and include these in the JIRA case also.

I doubt we can fix using a planner rule. The best solution may be to use 
RelMetadataQuery.getPulledUpPredicates() to simplify the CASE before we add the 
join.

Julian


> On Nov 1, 2016, at 8:49 AM, Alexander Shoshin <[email protected]> 
> wrote:
> 
> Julian, thank you for help.
> 
> I had a wrong picture of NULL values processing. So, it looks like there is 
> some problem in my planner rules.
> As for the AST, I was confused by the wrong Flink "explain()" function 
> description :)
> 
> 
> Regards,
> Alexander
> 
> -----Original Message-----
> From: Julian Hyde [mailto:[email protected]] 
> Sent: Monday, October 31, 2016 10:43 PM
> To: [email protected]
> Subject: Re: Problems with abstract syntax tree
> 
> The behavior of NOT IN in SQL is complicated when there are NULL values 
> around. In particular, if one "word" value from the sub-query is null, then 
> the outer query must return 0 rows. (Why? Because "word NOT IN ('foo', 'bar' 
> null)" would evaluate to UNKNOWN for every row.)
> 
> It is valid to deduce that "word" in the sub-query is never null, because of 
> the "WHERE word = 'hello'" condition. I would have hoped that a constant 
> reduction could do that, and then maybe the CASE expression can be simplified.
> 
> By the way, to be pedantic, what we are talking about here is the RelNode 
> tree, the relational algebra, which comes out of the SqlToRelConverter. The 
> AST is the SqlNode tree, which comes out of the parser and goes into the 
> SqlToRelConverter.
> 
> On Mon, Oct 31, 2016 at 8:46 AM, Alexander Shoshin 
> <[email protected]> wrote:
>> Hello, everybody.
>> 
>> Trying to resolve an Apache Flink issue I got some troubles with Calcite. 
>> Can you help me to understand is there a problem in Calcite or just in wrong 
>> settings passed to Calcite functions?
>> 
>> I have a simple table "Words" with one column named "word" and a query with 
>> NOT IN operator:
>> val query = "SELECT word FROM Words WHERE word NOT IN (SELECT word FROM 
>> Words WHERE word = 'hello')"
>> 
>> This query parsed by org.apache.calcite.sql.parser.SqlParser.parseStmt() and 
>> then transformed to a relational tree by 
>> org.apache.calcite.sql2rel.SqlToRelConverter.convertQuery(...).
>> 
>> As a result I see the following abstract syntax tree
>> LogicalProject(word=[$0])
>>  LogicalFilter(condition=[NOT(CASE(=($1, 0), false, IS NOT NULL($5), true, 
>> IS NULL($3), null, <($2, $1), null, false))])
>>    LogicalJoin(condition=[=($3, $4)], joinType=[left])
>>      LogicalProject($f0=[$0], $f1=[$1], $f2=[$2], $f3=[$0])
>>        LogicalJoin(condition=[true], joinType=[inner])
>>          EnumerableTableScan(table=[[Words]])
>>          LogicalAggregate(group=[{}], agg#0=[COUNT()], agg#1=[COUNT($0)])
>>            LogicalProject($f0=[$0], $f1=[true])
>>              LogicalProject(word=[$0])
>>                LogicalFilter(condition=[=($0, 'hello')])
>>                  EnumerableTableScan(table=[[Words]])
>>      LogicalAggregate(group=[{0}], agg#0=[MIN($1)])
>>        LogicalProject($f0=[$0], $f1=[true])
>>          LogicalProject(word=[$0])
>>            LogicalFilter(condition=[=($0, 'hello')])
>>              EnumerableTableScan(table=[[Words]])
>> 
>> which fails later during query plan optimization (while calling 
>> org.apache.calcite.tools.Programs.RuleSetProgram.run()).
>> 
>> I think it might be because of a very complex abstract syntax tree generated 
>> by Calcite. Shouldn't it be more simple? This one looks good for me:
>> LogicalProject(word=[$0])
>>  LogicalFilter(condition=[IS NULL($2)])
>>    LogicalJoin(condition=[=($0, $1)], joinType=[left])
>>      EnumerableTableScan(table=[[Words]])
>>      LogicalProject($f0=[$0], $f1=[true])
>>        LogicalProject(word=[$0])
>>          LogicalFilter(condition=[=($0, 'hello')])
>>            EnumerableTableScan(table=[[Words]])
>> 
>> And when I write a query using LEFT OUTER JOIN to receive this syntax tree - 
>> the optimization works fine. And the query execution result is the same as 
>> must be in case of using NOT IN. So am I wrong with a supposition about bad 
>> abstract syntax tree or not? I will be glad to receive any comments.
>> 
>> Regards,
>> Alexander

Reply via email to