Are you saying that we can't fix this by rewriting expressions to translate
from SQL to more natural semantics?

On Fri, Sep 18, 2020 at 3:28 PM Owen O'Malley <owen.omal...@gmail.com>
wrote:

> In the SQL world, the second point isn't right. It is still the case that
> not(equal("col", "x")) is notEqual("col", "x"). Boolean logic (well, three
> valued logic) in SQL is just strange relative to programming languages:
>
>    - null *=* "x" -> null
>    - null *is distinct from* "x" -> true
>    - *not*(null) -> null
>    - null *and* true -> null
>    - null *or* false -> null
>
> We absolutely need a null safe equals function (<=> or "is distinct from")
> also, which is what we currently have as equals. So we really need to have
> the logical operators also treat null as a special value.
>
> .. Owen
>
>
>
> On Fri, Sep 18, 2020 at 5:54 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> It would be nice to avoid the problem by changing the semantics of
>> Iceberg’s notNull, but I don’t think that’s a good idea for 2 main
>> reasons.
>>
>> First, I think that API users creating expressions directly expect the
>> current behavior. It would be surprising to a user if a notEqual
>> expression didn’t return nulls. People integrating Iceberg in SQL engines
>> will be more aware of SQL semantics, especially if the behavior we choose
>> is documented. I think for API uses, the current behavior is better.
>>
>> Second, some evaluations require expressions to be rewritten without not,
>> so we have a utility that pushes not down an expression tree to the leaf
>> predicates. Rewriting not(equal("col", "x")) will result in notEqual("col",
>> "x"). If we were to change the semantics of notEqual, then this rewrite
>> would no longer be valid. If col is null then it is not equal to x, and
>> negating that result is true. But notEqual would give a different answer
>> so we can’t rewrite it.
>>
>> We can work around the rewrite problem by adding Expressions.sqlNotEqual
>> method for engines to call that has the SQL semantics by returning 
>> and(notEqual("col",
>> "x"), notNull("col")).
>>
>> For pushdown, we should add tests for these cases and rewrite expressions
>> to account for the difference. Iceberg should push notEqual("col", "x")
>> to ORC as SQL (col != 'x' or col is null). Presto can similarly
>> translate col != 'x' to and(notEqual("col", "x"), notNull("col").
>>
>> rb
>>
>> On Fri, Sep 18, 2020 at 9:29 AM Owen O'Malley <owen.omal...@gmail.com>
>> wrote:
>>
>>> I think that we should follow the SQL semantics to prevent surprises
>>> when SQL engines integrate with Iceberg.
>>>
>>> .. Owen
>>>
>>> On Thu, Sep 17, 2020 at 9:08 PM Shardul Mahadik <
>>> shardulsmaha...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I noticed that Iceberg's predicates are not compatible with SQL
>>>> predicates when it comes to handling NULL values. In SQL, if any of the
>>>> operands of a scalar comparison predicate is NULL, then the resultant truth
>>>> value of the predicate is UNKNOWN. e.g. `SELECT NULL != 1` will return a
>>>> NULL in SQL and not FALSE. If such predicates are used as filters, the
>>>> resultant output will be different for Iceberg v/s SQL. e.g.
>>>> `.filter(notEqual(column, 'x'))` in Iceberg will return rows excluding ‘x’
>>>> but including NULL. The same thing in Presto SQL `WHERE column != 'x'` will
>>>> return rows excluding both ‘x’ and NULL. So essentially, Iceberg can return
>>>> more rows than required when an engine pushes down these predicates,
>>>> however the engines will filter out these additional rows, so everything
>>>> seems good. But modules like iceberg-data and iceberg-mr which rely solely
>>>> on Iceberg's expression evaluators for filtering will return the additional
>>>> rows. Should we change the behavior of Iceberg expressions to be more
>>>> SQL-like or should we keep this behavior and document the differences when
>>>> compared with SQL?
>>>>
>>>> This also has some implications on predicate pushdown e.g. ORC follows
>>>> SQL semantics and if we try to push down Iceberg predicates, simply
>>>> converting Iceberg's 'NOT EQUAL' to ORC's 'NOT EQUAL' will be insufficient
>>>> as it does not return NULLs contrary to what Iceberg expects.
>>>>
>>>> Thanks,
>>>> Shardul
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix

Reply via email to