Warrick He created SPARK-51475:
----------------------------------

             Summary: ArrayDistinct Producing Inconsistent Behavior For -0.0 
and +0.0
                 Key: SPARK-51475
                 URL: https://issues.apache.org/jira/browse/SPARK-51475
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.5.5, 3.4.4, 3.5.0
            Reporter: Warrick He


This impacts array_distinct. It was tested on Spark versions 3.5.5, 3.5.0, 
and 3.4.4, but it likely affects all versions.

Problem: inconsistent handling of 0.0 and -0.0. See the repro below (run on 
3.5.5). I'm not sure what the desired behavior is: should Spark follow the 
IEEE 754 standard and treat -0.0 and 0.0 as equal, keeping only one of them, 
or should it treat them as distinct values?
{quote}>>> spark.createDataFrame([([0.0, 6.0, -0.0],)], ['values']).createOrReplaceTempView("tab")

>>> spark.sql("select array_distinct(values) from tab").show()

+----------------------+
|array_distinct(values)|
+----------------------+
|            [0.0, 6.0]|
+----------------------+

>>> spark.createDataFrame([([0.0, -0.0, 6.0],)], ['values']).createOrReplaceTempView("tab")

>>> spark.sql("select array_distinct(values) from tab").show()

+----------------------+
|array_distinct(values)|
+----------------------+
|      [0.0, -0.0, 6.0]|
+----------------------+
{quote}
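For reference, under IEEE 754 semantics (which plain Python floats also follow), -0.0 and 0.0 compare equal even though their bit patterns differ; that is the ambiguity behind this report. A minimal illustration outside Spark:

```python
import math
import struct

# IEEE 754: negative and positive zero compare equal...
assert -0.0 == 0.0

# ...but they are distinct values: the sign bit differs.
assert math.copysign(1.0, -0.0) == -1.0
assert math.copysign(1.0, 0.0) == 1.0

# Accordingly, the raw 64-bit encodings are different.
assert struct.pack('>d', -0.0) != struct.pack('>d', 0.0)
```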

This issue could be related to the implementation of OpenHashSet.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
