Re: Problems with abstract syntax tree

Vineet Garg Mon, 07 Nov 2016 21:40:28 -0800

I logged in jira CALCITE-1483 for the query in question. I’ll try to find more 
examples.





On 11/7/16, 9:17 PM, "Vineet Garg" <[email protected]> wrote:

>Hi Julian,
>
>Apologies for not responding earlier.
>
>I understand that planner rules sometime produces a plan that is sub-optimal. 
>My concern was about planner rules producing a plan consisting of an 
>expression (literal null constant in this case) with null type i.e. 
>SqlTypeName.NULL. I was wondering if this might be a bug on Calcite side. But 
>it looks like Calcite has a concept of null data type and this seems to be 
>expected. 
>
>Vineet
>
>
>
>On 11/3/16, 12:14 PM, "Julian Hyde" <[email protected]> wrote:
>
>>Vineet,
>>
>>In case you forgot, can you please log that JIRA case? If we have a lengthy 
>>design discussion without creating an action item, we are wasting everyone’s 
>>time.
>>
>>Julian
>>
>>> On Nov 1, 2016, at 11:00 AM, Julian Hyde <[email protected]> wrote:
>>> 
>>> Alexander & Vineet,
>>> 
>>> One further comment about “NOT IN”. SQL in general is fairly close to 
>>> relational algebra, but “NOT IN” is one of the places where the gap is 
>>> widest. “NOT IN” is difficult in general to execute efficiently, because of 
>>> the problem of NULL values (at Oracle, we always recommended to users to 
>>> rewrite as NOT EXISTS if possible). The gap between SQL and relational 
>>> algebra is apparent when a short SQL query becomes a complex RelNode tree.
>>> 
>>> There is a silver lining: the RelNode tree, being relational algebra, has 
>>> well-behaved semantics. Once you’re in RelNode land, you can freely apply 
>>> transformation rules to make it efficient.
>>> 
>>> Vineet,
>>> 
>>> If the planner rules produce a plan that is sub-optimal I wouldn’t call it 
>>> a bug but a missing feature. (It would be a bug if the planner over-reached 
>>> and created a plan that gave wrong results, so I always tend to be 
>>> conservative about adding rules.)
>>> 
>>> Usually it’s OK if we make a mess in SQL-to-RelNode conversion. A few 
>>> redundant projects and filters are no problem, and can be easily removed 
>>> later with rules that reduce constants and propagate predicates throughout 
>>> the tree. But for the general case of NOT IN, we have to add a self-join to 
>>> deal with the possibility that the key has NULL values. After constant 
>>> reduction has kicked in and we have realized that NULL key values are not 
>>> possible, it is not easy to remove that self-join.
>>> 
>>> Here is a very simple query where this happens:
>>> 
>>> sqlline> !connect 
>>> jdbc:calcite:model=core/src/test/resources/hsqldb-model.json sa ""
>>> sqlline> !set outputformat csv
>>> sqlline> explain plan for select * from scott.emp where deptno not in (
>>>> select deptno from scott.dept where deptno = 20);
>>> 'PLAN'
>>> 'EnumerableCalc(expr#0..11=[{inputs}], expr#12=[0], expr#13=[=($t8, $t12)], 
>>> expr#14=[false], expr#15=[IS NOT NULL($t11)], expr#16=[true], expr#17=[IS 
>>> NULL($t7)], expr#18=[null], expr#19=[<($t9, $t8)], expr#20=[CASE($t13, 
>>> $t14, $t15, $t16, $t17, $t18, $t19, $t16, $t14)], expr#21=[NOT($t20)], 
>>> proj#0..7=[{exprs}], $condition=[$t21])
>>>  EnumerableJoin(condition=[=($7, $10)], joinType=[left])
>>>    EnumerableCalc(expr#0..9=[{inputs}], EMPNO=[$t2], ENAME=[$t3], 
>>> JOB=[$t4], MGR=[$t5], HIREDATE=[$t6], SAL=[$t7], COMM=[$t8], DEPTNO=[$t9], 
>>> c=[$t0], ck=[$t1])
>>>      EnumerableJoin(condition=[true], joinType=[inner])
>>>        JdbcToEnumerableConverter
>>>          JdbcAggregate(group=[{}], c=[COUNT()], ck=[COUNT($0)])
>>>            JdbcFilter(condition=[=(CAST($0):INTEGER NOT NULL, 20)])
>>>              JdbcTableScan(table=[[SCOTT, DEPT]])
>>>        JdbcToEnumerableConverter
>>>          JdbcTableScan(table=[[SCOTT, EMP]])
>>>    JdbcToEnumerableConverter
>>>      JdbcAggregate(group=[{0, 1}])
>>>        JdbcProject(DEPTNO=[$0], i=[true])
>>>          JdbcFilter(condition=[=(CAST($0):INTEGER NOT NULL, 20)])
>>>            JdbcTableScan(table=[[SCOTT, DEPT]])
>>> '
>>> 1 row selected (0.067 seconds)
>>> 
>>> Note that there are two scans of DEPT, but one is sufficient because DEPTNO 
>>> is never null. In the JdbcAggregate, c always equals ck, and therefore the 
>>> CASE can be simplified, and therefore the scan of DEPT that produces c and 
>>> ck can be dropped, but Calcite rules cannot deduce that fact.
>>> 
>>> Can you please log a JIRA case for this? See if you can find some other 
>>> queries (maybe using IN rather than NOT IN, or whose key columns are not so 
>>> obviously NOT NULL) and include these in the JIRA case also.
>>> 
>>> I doubt we can fix using a planner rule. The best solution may be to use 
>>> RelMetadataQuery.getPulledUpPredicates() to simplify the CASE before we add 
>>> the join.
>>> 
>>> Julian
>>> 
>>> 
>>>> On Nov 1, 2016, at 8:49 AM, Alexander Shoshin <[email protected]> 
>>>> wrote:
>>>> 
>>>> Julian, thank you for help.
>>>> 
>>>> I had a wrong picture of NULL values processing. So, it looks like there 
>>>> is some problem in my planner rules.
>>>> As for the AST, I was confused by the wrong Flink "explain()" function 
>>>> description :)
>>>> 
>>>> 
>>>> Regards,
>>>> Alexander
>>>> 
>>>> -----Original Message-----
>>>> From: Julian Hyde [mailto:[email protected]] 
>>>> Sent: Monday, October 31, 2016 10:43 PM
>>>> To: [email protected]
>>>> Subject: Re: Problems with abstract syntax tree
>>>> 
>>>> The behavior of NOT IN in SQL is complicated when there are NULL values 
>>>> around. In particular, if one "word" value from the sub-query is null, 
>>>> then the outer query must return 0 rows. (Why? Because "word NOT IN 
>>>> ('foo', 'bar' null)" would evaluate to UNKNOWN for every row.)
>>>> 
>>>> It is valid to deduce that "word" in the sub-query is never null, because 
>>>> of the "WHERE word = 'hello'" condition. I would have hoped that a 
>>>> constant reduction could do that, and then maybe the CASE expression can 
>>>> be simplified.
>>>> 
>>>> By the way, to be pedantic, what we are talking about here is the RelNode 
>>>> tree, the relational algebra, which comes out of the SqlToRelConverter. 
>>>> The AST is the SqlNode tree, which comes out of the parser and goes into 
>>>> the SqlToRelConverter.
>>>> 
>>>> On Mon, Oct 31, 2016 at 8:46 AM, Alexander Shoshin 
>>>> <[email protected]> wrote:
>>>>> Hello, everybody.
>>>>> 
>>>>> Trying to resolve an Apache Flink issue I got some troubles with Calcite. 
>>>>> Can you help me to understand is there a problem in Calcite or just in 
>>>>> wrong settings passed to Calcite functions?
>>>>> 
>>>>> I have a simple table "Words" with one column named "word" and a query 
>>>>> with NOT IN operator:
>>>>> val query = "SELECT word FROM Words WHERE word NOT IN (SELECT word FROM 
>>>>> Words WHERE word = 'hello')"
>>>>> 
>>>>> This query parsed by org.apache.calcite.sql.parser.SqlParser.parseStmt() 
>>>>> and then transformed to a relational tree by 
>>>>> org.apache.calcite.sql2rel.SqlToRelConverter.convertQuery(...).
>>>>> 
>>>>> As a result I see the following abstract syntax tree
>>>>> LogicalProject(word=[$0])
>>>>> LogicalFilter(condition=[NOT(CASE(=($1, 0), false, IS NOT NULL($5), true, 
>>>>> IS NULL($3), null, <($2, $1), null, false))])
>>>>>   LogicalJoin(condition=[=($3, $4)], joinType=[left])
>>>>>     LogicalProject($f0=[$0], $f1=[$1], $f2=[$2], $f3=[$0])
>>>>>       LogicalJoin(condition=[true], joinType=[inner])
>>>>>         EnumerableTableScan(table=[[Words]])
>>>>>         LogicalAggregate(group=[{}], agg#0=[COUNT()], agg#1=[COUNT($0)])
>>>>>           LogicalProject($f0=[$0], $f1=[true])
>>>>>             LogicalProject(word=[$0])
>>>>>               LogicalFilter(condition=[=($0, 'hello')])
>>>>>                 EnumerableTableScan(table=[[Words]])
>>>>>     LogicalAggregate(group=[{0}], agg#0=[MIN($1)])
>>>>>       LogicalProject($f0=[$0], $f1=[true])
>>>>>         LogicalProject(word=[$0])
>>>>>           LogicalFilter(condition=[=($0, 'hello')])
>>>>>             EnumerableTableScan(table=[[Words]])
>>>>> 
>>>>> which fails later during query plan optimization (while calling 
>>>>> org.apache.calcite.tools.Programs.RuleSetProgram.run()).
>>>>> 
>>>>> I think it might be because of a very complex abstract syntax tree 
>>>>> generated by Calcite. Shouldn't it be more simple? This one looks good 
>>>>> for me:
>>>>> LogicalProject(word=[$0])
>>>>> LogicalFilter(condition=[IS NULL($2)])
>>>>>   LogicalJoin(condition=[=($0, $1)], joinType=[left])
>>>>>     EnumerableTableScan(table=[[Words]])
>>>>>     LogicalProject($f0=[$0], $f1=[true])
>>>>>       LogicalProject(word=[$0])
>>>>>         LogicalFilter(condition=[=($0, 'hello')])
>>>>>           EnumerableTableScan(table=[[Words]])
>>>>> 
>>>>> And when I write a query using LEFT OUTER JOIN to receive this syntax 
>>>>> tree - the optimization works fine. And the query execution result is the 
>>>>> same as must be in case of using NOT IN. So am I wrong with a supposition 
>>>>> about bad abstract syntax tree or not? I will be glad to receive any 
>>>>> comments.
>>>>> 
>>>>> Regards,
>>>>> Alexander
>>> 
>>
>>

Re: Problems with abstract syntax tree

Reply via email to