Re: Pushing filters with unsafe operations (CAST/DIVISION) below inner joins

Zoltan Haindrich Wed, 12 Oct 2022 01:51:07 -0700


I think there might be an another approach; by making it possible to define 
safetiness of an operator - going back to the first example and considering:


SELECT e.name
FROM emp e
INNER JOIN dep d ON e.deptno = d.deptno
WHERE (10 / e.empno) = 1

and the pushed down version:

SELECT e.name
FROM (select * from emp WHERE (10 / empno)=1) e
INNER JOIN dep d ON e.deptno = d.deptno

There are systems which handle divison_by_zero as a shortcut to NULL; for those 
the above two queries are conveniently equivalent.

Limiting optimizations by stopping because of unsafety might hurt some system's 
performance - I think an alternate approach could be to:
* mark operators regarding their safety
* optionally provide a way to declare the safety conversion process;
  for DIVISION this could be: (case e.empno when 0 then NULL else e.empno end)
* during push down the condition with unsafe elements should also be kept at 
the top level to ensure that we don't loose correctness

so the pushed down version would be:
---
SELECT e.name
FROM (select * from emp WHERE  (10 / (case empno when 0 then NULL else empno 
end) ) = 1 ) e
INNER JOIN dep d ON e.deptno = d.deptno
WHERE (10 / e.empno) = 1

This approch could possibly also benefit from optimizations based 
constraints/etc more easily:
* say if empno is known to be >0 at compilation time
* the case could be removed
* get back to the original expression
* finally remove the identical conditional from the top level

cheers,
Zoltan




On 10/7/22 18:07, Stamatis Zampetakis wrote:

I just logged https://issues.apache.org/jira/browse/CALCITE-5315 but I am
not planning to work on this sometime soon.

Best,
Stamatis

On Fri, Feb 18, 2022 at 10:56 PM Stamatis Zampetakis <zabe...@gmail.com>
wrote:

Thank you all for the feedback.

I will do a small research about the proper term to use for functions that
might throw / never throw and log a JIRA to move the discussion further.

@Viliam : Note that the problem you mention, although similar, is not
exactly the same as the one I brought up now.

The reordering of predicates in the WHERE clause has been brought up quite
a few times in the dev list and our stance [1] is that since the standard
leaves this decision to the implementor people should not rely on this (or
put the appropriate guards).

The case here is a bit different, at least the way I read the standard,
cause it defines the following:

"If all optional clauses are omitted, then the result of the <table
expression> is the same as the result of the
<from clause>. Otherwise, each specified clause is applied to the result
of the previously specified clause
and the result of the <table expression> is the result of the application
of the last specified clause."

and one of the optional clauses mentioned in the previous paragraph is the
<where clause>. There seems to be a clearly defined order between the <from
clause>, which includes inner joins, and the <where clause>.

Best,
Stamatis

[1] https://lists.apache.org/thread/mq44cnrohz19hh10btms126vbcoxl50w

On Fri, Feb 18, 2022 at 9:58 AM Viliam Durina <vil...@hazelcast.com>
wrote:

I have observed this issue years ago in well-known databases. My case was
much simpler:

data
recordType:int  value:text
-----           ------
0               1
1               a

SELECT *
FROM data
WHERE recordType='1' AND CAST(value AS INT)<10


SQL is declarative, and unlike procedural languages, it doesn't prescribe
short-circuit evaluation of the WHERE clause, or any specific evaluation
order. If it was prescribed, the query would be perfectly safe. But
prescribing the evaluation order would rule out many optimizations, or
make
them much harder, such as this half-null-half-error value.

For example, reordering additions might or might not lead to overflow:
TINYINTs `100 + 100 - 90`, evaluated in this order, overflow, but `100 -
90
+ 100` don't - imagine each value comes from a different table and we
reorder the joins. Perhaps result of TINYINT addition can be SMALLINT, but
what if they are BIGINTs?

My understanding was that any expression must not fail in any plausible
execution plan. Therefore if I parse the column `value`, it must succeed
for every possible scanned row. In my specific case I ended up
implementing
a custom TO_NUMBER function that returns null on parsing error, and that
null never appeared in the result because of the other condition.

Viliam

On Thu, 17 Feb 2022 at 20:08, Julian Hyde <jhyde.apa...@gmail.com> wrote:

You should definitely log a bug, even if there are no plans to fix it.

It

is a violation of the standard, and therefore it is a bug, and

therefore we

should document that it exists.

Can you do some research on the right terminology. You use the term
‘unsafe’. I think the PL community uses the term ’strict’ for

expressions

that must be evaluated (and therefore if they throw, the query is sure

to

abort). We have other related concepts floating around in Calcite -

dynamic

functions, deterministic operators - and standardizing terminology is

key

if we are to keep the menagerie in order.

There might be a way to have our cake and eat it too. We could push down
the division, and if there is a division-by-zero we return a special

value

such as NULL. If the special value is not eliminated by the join then

it is

promoted to a throw. The “cake” here is the performance benefit of

pushing

down a filter (that may never or rarely throw); the “eating it” is

safety

on the occasion that the filter does throw.

Even if that strategy doesn’t work, maybe we could have a flag that says
whether to push down conditions that might throw.

Julian

On Feb 17, 2022, at 8:07 AM, Scott Reynolds <sdrreyno...@gmail.com>

wrote:


Is it feasible to prevent the filter push down in cases like this (

detect

potential division by zero) ? What other exceptions that could happen?

Should it only push down some filters and preserve the complex filter

above

the join?

Regarding a Jira ticket, generating an exception when it should

produce

valid results is gotta be a Jira case IMHO.




On Thu, Feb 17, 2022, 7:17 AM Stamatis Zampetakis <zabe...@gmail.com>

wrote:

Hi all,

Till today, I believed that pushing a filter below an inner join is

not

strictly related to the operators (equals, plus, minus, cast,

division)

present in the filter.

However, the query below will return some results if the filter is

executed

after the join or it will fail with an exception if the filter is

pushed

below the join.

EMP [EMPNO, NAME, DEPTNO]
0   | Alex | 0
10 | Bob | 1

DEP [DEPTNO]
1

SELECT e.name
FROM emp e
INNER JOIN dep d ON e.deptno = d.deptno
WHERE (10 / e.empno) = 1

It seems that SQL standard (Section 7.4 general rule 1) mandates that

WHERE

should be applied to the result of FROM so in the case above pushing
filters below the join seems to violate the standard.

Currently, in Calcite we are going to push the division, cast, and
basically any kind of operator below the join. Many well-known

databases

e.g., Postgres, do that as well.

Should we log it as a bug and try to do something about it or let it

be

as

it is?

Best,
Stamatis


--
This message contains confidential information and is intended only for
the
individuals named. If you are not the named addressee you should not
disseminate, distribute or copy this e-mail. Please notify the sender
immediately by e-mail if you have received this e-mail by mistake and
delete this e-mail from your system. E-mail transmission cannot be
guaranteed to be secure or error-free as information could be
intercepted,
corrupted, lost, destroyed, arrive late or incomplete, or contain
viruses.
The sender therefore does not accept liability for any errors or
omissions
in the contents of this message, which arise as a result of e-mail
transmission. If verification is required, please request a hard-copy
version. -Hazelcast

OpenPGP_signature
Description: OpenPGP digital signature

Re: Pushing filters with unsafe operations (CAST/DIVISION) below inner joins

Reply via email to