[
https://issues.apache.org/jira/browse/SPARK-55749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18061755#comment-18061755
]
Matt Butrovich commented on SPARK-55749:
----------------------------------------
I think Spark is adhering closer to three-valued logic and the SQL spec, which
defines array equality comparison element-wise (section 8.2, General Rules,
case ii):
* X = Y is True if N1 = N2 and for all i, Xi = Yi is True
* X = Y is False if N1 ≠ N2 or NOT(Xi = Yi) is True for some i
* X = Y is Unknown if it is neither True nor False
So if an element is NULL, element-wise comparison yields Unknown, which can
make the whole array comparison Unknown (three-valued logic propagation).
array_contains isn't in the spec, but IN is (section 8.4) which translations to
ANY over element-wise comparison. Basically it does an OR of element-wise
comparison results, and if any of those results in Unknown the entire
expression becomes Unknown.
You could argue at a higher level that Spark and DuckDB disagree over whether
array_contains is an aggregate or a scalar expression/predicate. The former
eliminates Unknown while the latter propagates.
I think you could argue this either way, but I'm in favor of not changing
Spark's behavior on this one.
> array_contains built in function result are inconsistent with other engines
> ---------------------------------------------------------------------------
>
> Key: SPARK-55749
> URL: https://issues.apache.org/jira/browse/SPARK-55749
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.2.0
> Reporter: Oleks V.
> Priority: Major
>
> Hi community
>
> I would love to get help on `array_contains` builtin function results when
> NULL is one of the values in the array. The function returns NULL which is
> opposite to other engines like DuckDB/Postresql.
>
> Example:
> {code:java}
> scala> spark.sql("select array_contains(array(1, null, 3), 2)").show(false)
> +------------------------------------+
> |array_contains(array(1, NULL, 3), 2)|
> +------------------------------------+
> |null |
> +------------------------------------+
> {code}
> For instance DuckDB returns false
>
>
> {code:java}
> D select array_has([1, null, 3], 2);
> ┌───────────────────────────────────────────┐
> │ array_has(main.list_value(1, NULL, 3), 2) │
> │ boolean │
> ├───────────────────────────────────────────┤
> │ false │
> └───────────────────────────────────────────┘
> D {code}
>
> Postgres doesn't have *array_contains* in Spark/DuckDB understanding but the
> closest analogue would be
>
> {code:java}
> select array[1, null, 3] @> array[2]
> false {code}
> For this ticket I'd like to understand if this behavior for Apache Spark is
> correct, because the documentation for the function does not mention such
> scenario explicitly
>
> {code:java}
> /**
> * Checks if the array (left) has the element (right)
> */
> @ExpressionDescription(
> usage = "_FUNC_(array, value) - Returns true if the array contains the
> value.",
> examples = """
> Examples:
> > SELECT _FUNC_(array(1, 2, 3), 2);
> true
> """,
> group = "array_funcs",
> since = "1.5.0")
> case class ArrayContains(left: Expression, right: Expression) {code}
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]