Hi all,

I noticed that Iceberg's predicates are not compatible with SQL predicates
when it comes to handling NULL values. In SQL, if any of the operands of a
scalar comparison predicate is NULL, then the resultant truth value of the
predicate is UNKNOWN. e.g. `SELECT NULL != 1` will return a NULL in SQL and
not FALSE. If such predicates are used as filters, the resultant output
will be different for Iceberg v/s SQL. e.g. `.filter(notEqual(column,
'x'))` in Iceberg will return rows excluding ‘x’ but including NULL. The
same thing in Presto SQL `WHERE column != 'x'` will return rows excluding
both ‘x’ and NULL. So essentially, Iceberg can return more rows than
required when an engine pushes down these predicates, however the engines
will filter out these additional rows, so everything seems good. But
modules like iceberg-data and iceberg-mr which rely solely on Iceberg's
expression evaluators for filtering will return the additional rows. Should
we change the behavior of Iceberg expressions to be more SQL-like or should
we keep this behavior and document the differences when compared with SQL?

This also has some implications on predicate pushdown e.g. ORC follows SQL
semantics and if we try to push down Iceberg predicates, simply converting
Iceberg's 'NOT EQUAL' to ORC's 'NOT EQUAL' will be insufficient as it does
not return NULLs contrary to what Iceberg expects.

Thanks,
Shardul

Reply via email to