David Kunzmann created SPARK-51710:
--------------------------------------

             Summary: Using DataFrame.dropDuplicates with an empty array as 
argument behaves unexpectedly
                 Key: SPARK-51710
                 URL: https://issues.apache.org/jira/browse/SPARK-51710
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.5.5
            Reporter: David Kunzmann


When PySpark's DataFrame.dropDuplicates is called with an empty list ([]) as the 
subset argument, the resulting DataFrame contains only a single row (the first 
row). This differs from calling DataFrame.dropDuplicates with no arguments or 
with None as the subset argument, both of which deduplicate on all columns.

 
{code:python}
from pyspark.sql import SparkSession
 
spark = SparkSession.builder.getOrCreate()
data = [
    (1, "Alice"),
    (2, "Bob"),
    (3, "Alice"),
    (3, "Alice"),
    (2, "Bob")
]
df = spark.createDataFrame(data, ["id", "name"])

df_dedup = df.dropDuplicates([])

df_dedup.show()
{code}
The snippet above prints the following output:
{code:java}
+---+-----+
| id| name|
+---+-----+
|  1|Alice|
+---+-----+
{code}
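One way to read this result, offered as an assumption rather than something the report confirms about Spark's internals: if deduplication keeps the first row per distinct key, then an empty key list gives every row the same empty key, so only the first row survives. The plain-Python sketch below models that idea; dedup_by_keys is a hypothetical helper, not a PySpark function.

```python
def dedup_by_keys(rows, key_indexes):
    """Keep the first row seen for each distinct key tuple (illustrative model only)."""
    seen = set()
    kept = []
    for row in rows:
        # With no key indexes, this is the empty tuple () for every row.
        key = tuple(row[i] for i in key_indexes)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


rows = [(1, "Alice"), (2, "Bob"), (3, "Alice"), (3, "Alice"), (2, "Bob")]

all_columns = dedup_by_keys(rows, [0, 1])  # key is the whole row
no_columns = dedup_by_keys(rows, [])       # key is always ()

print(all_columns)
print(no_columns)
```

In this model, keying on every column keeps the three distinct rows, while keying on no columns keeps only the first row, which matches the output reported above.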
I would expect the same behavior as df.dropDuplicates(), which returns:
{code:java}
+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
|  3|Alice|
+---+-----+
{code}
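Until the behavior changes, a caller could guard against the empty list on their own side. The sketch below is a hypothetical wrapper (drop_duplicates_safe is not part of PySpark). It assumes only that the object passed in exposes dropDuplicates as used in this report, and it treats an empty subset the same as a missing one.

```python
def drop_duplicates_safe(df, subset=None):
    """Call df.dropDuplicates, treating an empty subset like None.

    Hypothetical caller-side workaround; `df` is assumed to expose a
    dropDuplicates method that accepts an optional list of column names.
    """
    if not subset:
        # None, [] and () all fall back to deduplicating on all columns.
        return df.dropDuplicates()
    return df.dropDuplicates(subset)
```

With the df from the snippet above, drop_duplicates_safe(df, []) would then be expected to return the same three rows as df.dropDuplicates().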
 



