[ 
https://issues.apache.org/jira/browse/SPARK-55749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18061755#comment-18061755
 ] 

Matt Butrovich commented on SPARK-55749:
----------------------------------------

I think Spark is adhering closer to three-valued logic and the SQL spec, which 
defines array equality comparison element-wise (section 8.2, General Rules, 
case ii):
 * X = Y is True if N1 = N2 and for all i, Xi = Yi is True
 * X = Y is False if N1 ≠ N2 or NOT(Xi = Yi) is True for some i
 * X = Y is Unknown if it is neither True nor False

So if an element is NULL, element-wise comparison yields Unknown, which can 
make the whole array comparison Unknown (three-valued logic propagation).

array_contains isn't in the spec, but IN is (section 8.4) which translations to 
ANY over element-wise comparison. Basically it does an OR of element-wise 
comparison results, and if any of those results in Unknown the entire 
expression becomes Unknown.

You could argue at a higher level that Spark and DuckDB disagree over whether 
array_contains is an aggregate or a scalar expression/predicate. The former 
eliminates Unknown while the latter propagates.

I think you could argue this either way, but I'm in favor of not changing 
Spark's behavior on this one.

> array_contains built in function result are inconsistent with other engines
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-55749
>                 URL: https://issues.apache.org/jira/browse/SPARK-55749
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 4.2.0
>            Reporter: Oleks V.
>            Priority: Major
>
> Hi community
>  
> I would love to get help on `array_contains` builtin function results when 
> NULL is one of the values in the array. The function returns NULL which is 
> opposite to other engines like DuckDB/Postresql. 
>  
> Example:
> {code:java}
> scala> spark.sql("select array_contains(array(1, null, 3), 2)").show(false)
> +------------------------------------+
> |array_contains(array(1, NULL, 3), 2)|
> +------------------------------------+
> |null                                |
> +------------------------------------+
>  {code}
> For instance DuckDB returns false 
>  
>  
> {code:java}
> D select array_has([1, null, 3], 2);
> ┌───────────────────────────────────────────┐
> │ array_has(main.list_value(1, NULL, 3), 2) │
> │                  boolean                  │
> ├───────────────────────────────────────────┤
> │ false                                     │
> └───────────────────────────────────────────┘
> D {code}
>  
> Postgres doesn't have *array_contains* in Spark/DuckDB understanding but the 
> closest analogue would be
>  
> {code:java}
> select array[1, null, 3] @> array[2]
> false {code}
> For this ticket I'd like to understand if this behavior for Apache Spark is 
> correct, because the documentation for the function does not mention such 
> scenario explicitly 
>  
> {code:java}
> /**
>  * Checks if the array (left) has the element (right)
>  */
> @ExpressionDescription(
>   usage = "_FUNC_(array, value) - Returns true if the array contains the 
> value.",
>   examples = """
>     Examples:
>       > SELECT _FUNC_(array(1, 2, 3), 2);
>        true
>   """,
>   group = "array_funcs",
>   since = "1.5.0")
> case class ArrayContains(left: Expression, right: Expression) {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to