This seems like the correct behavior to me. Every value of the empty set of
columns vacuously matches between any pair of rows, so all rows are
considered duplicates of one another and only the first is kept.
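For illustration, a minimal PySpark sketch of the reported behavior (the
data and column names are made up; assumes a local SparkSession):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "value"])

# No subset: deduplicates on all columns, two distinct rows remain.
df.dropDuplicates().count()    # 2

# Empty subset: every pair of rows vacuously matches on the empty set of
# columns, so only the first row is kept.
df.dropDuplicates([]).count()  # 1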



On Thu, May 8, 2025 at 11:37 AM David Kunzmann <davidkunzm...@gmail.com>
wrote:

> Hello everyone,
>
> Following the creation of this PR
> <https://github.com/apache/spark/pull/50714> and the discussion in the
> thread, what do you think about the behavior described here:
>
> When using PySpark DataFrame.dropDuplicates with an empty array as the
>> subset argument, the resulting DataFrame contains a single row (the
>> first row). This behavior is different from using
>> DataFrame.dropDuplicates without any parameters or with None as the
>> subset argument. I would expect that passing an empty array to
>> dropDuplicates would use all the columns to detect duplicates and remove
>> them.
>>
>
>
> This behavior is the same on the Scala side, where
> df.dropDuplicates(Seq.empty) returns the first row.
>
> Would it make sense to change the behavior of df.dropDuplicates(Seq.empty)
> to be the same as df.dropDuplicates()?
>
> Cheers,
>
> David
>
>
