Are you saying that we can't fix this by rewriting expressions to translate from SQL to more natural semantics?
On Fri, Sep 18, 2020 at 3:28 PM Owen O'Malley <owen.omal...@gmail.com> wrote: > In the SQL world, the second point isn't right. It is still the case that > not(equal("col", "x")) is notEqual("col", "x"). Boolean logic (well, three > valued logic) in SQL is just strange relative to programming languages: > > - null *=* "x" -> null > - null *is distinct from* "x" -> true > - *not*(null) -> null > - null *and* true -> null > - null *or* false -> null > > We absolutely need a null safe equals function (<=> or "is distinct from") > also, which is what we currently have as equals. So we really need to have > the logical operators also treat null as a special value. > > .. Owen > > > > On Fri, Sep 18, 2020 at 5:54 PM Ryan Blue <rb...@netflix.com.invalid> > wrote: > >> It would be nice to avoid the problem by changing the semantics of >> Iceberg’s notNull, but I don’t think that’s a good idea for 2 main >> reasons. >> >> First, I think that API users creating expressions directly expect the >> current behavior. It would be surprising to a user if a notEqual >> expression didn’t return nulls. People integrating Iceberg in SQL engines >> will be more aware of SQL semantics, especially if the behavior we choose >> is documented. I think for API uses, the current behavior is better. >> >> Second, some evaluations require expressions to be rewritten without not, >> so we have a utility that pushes not down an expression tree to the leaf >> predicates. Rewriting not(equal("col", "x")) will result in notEqual("col", >> "x"). If we were to change the semantics of notEqual, then this rewrite >> would no longer be valid. If col is null then it is not equal to x, and >> negating that result is true. But notEqual would give a different answer >> so we can’t rewrite it. >> >> We can work around the rewrite problem by adding Expressions.sqlNotEqual >> method for engines to call that has the SQL semantics by returning >> and(notEqual("col", >> "x"), notNull("col")). >> >> For pushdown, we should add tests for these cases and rewrite expressions >> to account for the difference. Iceberg should push notEqual("col", "x") >> to ORC as SQL (col != 'x' or col is null). Presto can similarly >> translate col != 'x' to and(notEqual("col", "x"), notNull("col"). >> >> rb >> >> On Fri, Sep 18, 2020 at 9:29 AM Owen O'Malley <owen.omal...@gmail.com> >> wrote: >> >>> I think that we should follow the SQL semantics to prevent surprises >>> when SQL engines integrate with Iceberg. >>> >>> .. Owen >>> >>> On Thu, Sep 17, 2020 at 9:08 PM Shardul Mahadik < >>> shardulsmaha...@gmail.com> wrote: >>> >>>> Hi all, >>>> >>>> I noticed that Iceberg's predicates are not compatible with SQL >>>> predicates when it comes to handling NULL values. In SQL, if any of the >>>> operands of a scalar comparison predicate is NULL, then the resultant truth >>>> value of the predicate is UNKNOWN. e.g. `SELECT NULL != 1` will return a >>>> NULL in SQL and not FALSE. If such predicates are used as filters, the >>>> resultant output will be different for Iceberg v/s SQL. e.g. >>>> `.filter(notEqual(column, 'x'))` in Iceberg will return rows excluding ‘x’ >>>> but including NULL. The same thing in Presto SQL `WHERE column != 'x'` will >>>> return rows excluding both ‘x’ and NULL. So essentially, Iceberg can return >>>> more rows than required when an engine pushes down these predicates, >>>> however the engines will filter out these additional rows, so everything >>>> seems good. But modules like iceberg-data and iceberg-mr which rely solely >>>> on Iceberg's expression evaluators for filtering will return the additional >>>> rows. Should we change the behavior of Iceberg expressions to be more >>>> SQL-like or should we keep this behavior and document the differences when >>>> compared with SQL? >>>> >>>> This also has some implications on predicate pushdown e.g. ORC follows >>>> SQL semantics and if we try to push down Iceberg predicates, simply >>>> converting Iceberg's 'NOT EQUAL' to ORC's 'NOT EQUAL' will be insufficient >>>> as it does not return NULLs contrary to what Iceberg expects. >>>> >>>> Thanks, >>>> Shardul >>>> >>> >> >> -- >> Ryan Blue >> Software Engineer >> Netflix >> > -- Ryan Blue Software Engineer Netflix