David Kunzmann created SPARK-51710:
--------------------------------------

             Summary: Using Dataframe.dropDuplicates with an empty array as argument behaves unexpectedly
                 Key: SPARK-51710
                 URL: https://issues.apache.org/jira/browse/SPARK-51710
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.5.5
            Reporter: David Kunzmann

When PySpark's DataFrame.dropDuplicates is called with an empty list as the subset argument, the resulting DataFrame contains a single row (the first row). This differs from calling DataFrame.dropDuplicates without any arguments or with None as the subset argument.

{code:java}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [
    (1, "Alice"),
    (2, "Bob"),
    (3, "Alice"),
    (3, "Alice"),
    (2, "Bob")
]
df = spark.createDataFrame(data, ["id", "name"])

df_dedup = df.dropDuplicates([])
df_dedup.show()
{code}

The above snippet shows the following DataFrame:

{code:java}
+---+-----+
| id| name|
+---+-----+
|  1|Alice|
+---+-----+
{code}

I would expect the behavior to be the same as df.dropDuplicates(), which returns:

{code:java}
+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
|  3|Alice|
+---+-----+
{code}
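
Until the behavior is aligned, a possible workaround is to treat an empty subset as None before calling dropDuplicates. The sketch below is only an illustration: normalize_subset and drop_duplicates_safe are hypothetical helper names, not part of PySpark, and it assumes that passing None as the subset deduplicates on all columns, as described above.

```python
from typing import List, Optional, Sequence


def normalize_subset(subset: Optional[Sequence[str]]) -> Optional[List[str]]:
    """Map an empty subset to None so that it means "all columns"."""
    # Hypothetical helper: None and empty sequences are treated the same way.
    if subset is None or len(subset) == 0:
        return None
    return list(subset)


def drop_duplicates_safe(df, subset: Optional[Sequence[str]] = None):
    """Call df.dropDuplicates, never forwarding an empty subset."""
    # df is expected to be a pyspark.sql.DataFrame; only its
    # dropDuplicates(subset) method is used here.
    return df.dropDuplicates(normalize_subset(subset))
```

With this helper, drop_duplicates_safe(df, []) should return the same rows as df.dropDuplicates(), while a non-empty subset such as ["id"] is passed through unchanged.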