Warrick He created SPARK-51475:
----------------------------------

             Summary: ArrayDistinct Producing Inconsistent Behavior For -0.0 and +0.0
                 Key: SPARK-51475
                 URL: https://issues.apache.org/jira/browse/SPARK-51475
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.5.5, 3.4.4, 3.5.0
            Reporter: Warrick He
This impacts array_distinct. It was tested on Spark 3.5.5, 3.5.0, and 3.4.4, but likely affects all versions.

Problem: inconsistent handling of 0.0 and -0.0. See below (run on 3.5.5). I'm not sure what the desired behavior is: should Spark follow the IEEE 754 standard and treat them as equal, returning only one of -0.0 or 0.0, or should it consider them distinct?

{quote}>>> spark.createDataFrame([([0.0, 6.0, -0.0],)], ['values']).createOrReplaceTempView("tab")
>>> spark.sql("select array_distinct(values) from tab").show()
+----------------------+
|array_distinct(values)|
+----------------------+
|            [0.0, 6.0]|
+----------------------+

>>> spark.createDataFrame([([0.0, -0.0, 6.0],)], ['values']).createOrReplaceTempView("tab")
>>> spark.sql("select array_distinct(values) from tab").show()
+----------------------+
|array_distinct(values)|
+----------------------+
|      [0.0, -0.0, 6.0]|
+----------------------+
{quote}

This issue could be related to the implementation of OpenHashSet.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
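As background, a sketch outside Spark of why the two dedup strategies can diverge (this is an illustration, not Spark's actual code): IEEE 754 defines 0.0 == -0.0, but the two values differ in the sign bit, so a hash set keyed on the raw 64-bit pattern (as a bit-based hash such as OpenHashSet's might be) distinguishes them, while an equality-based scan collapses them:

```python
import struct

# IEEE 754 comparison: positive and negative zero compare equal.
assert 0.0 == -0.0

def bits(d):
    # Raw 64-bit representation of a double; only the sign bit
    # differs between 0.0 and -0.0.
    return struct.unpack('>Q', struct.pack('>d', d))[0]

print(hex(bits(0.0)))   # 0x0
print(hex(bits(-0.0)))  # 0x8000000000000000

values = [0.0, -0.0, 6.0]

# Equality-based dedup: -0.0 == 0.0, so it is dropped.
eq_dedup = []
for v in values:
    if not any(v == w for w in eq_dedup):
        eq_dedup.append(v)

# Bit-pattern-based dedup: -0.0 hashes differently, so it survives.
bit_dedup = {bits(v) for v in values}

print(len(eq_dedup))   # 2
print(len(bit_dedup))  # 3
```

Depending on which of these two notions of "distinct" Spark applies on a given code path, the same input yields either two or three elements, matching the inconsistency shown above.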