This seems like the correct behavior to me. Every value of the empty set of
columns vacuously matches between any pair of rows, so all rows are
considered duplicates of one another and only the first is kept.
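For illustration, a minimal PySpark sketch of the reported behavior (the
data and column names are made up; assumes a local SparkSession):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "value"])

# No subset: deduplicates on all columns, two distinct rows remain.
df.dropDuplicates().count()    # 2

# Empty subset: every pair of rows vacuously matches on the empty set of
# columns, so only the first row is kept.
df.dropDuplicates([]).count()  # 1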



On Thu, May 8, 2025 at 11:37 AM David Kunzmann <davidkunzm...@gmail.com>
wrote:

> Hello everyone,
>
> Following the creation of this PR
> <https://github.com/apache/spark/pull/50714> and the discussion in the
> thread, what do you think about the behavior described here:
>
> When using PySpark DataFrame.dropDuplicates with an empty array as the
>> subset argument, the resulting DataFrame contains a single row (the
>> first row). This behavior is different from using
>> DataFrame.dropDuplicates without any parameters or with None as the
>> subset argument. I would expect that passing an empty array to
>> dropDuplicates would use all the columns to detect duplicates and remove
>> them.
>>
>
>
> This behavior is the same on the Scala side, where
> df.dropDuplicates(Seq.empty) returns the first row.
>
> Would it make sense to change the behavior of df.dropDuplicates(Seq.empty)
> to be the same as df.dropDuplicates()?
>
> Cheers,
>
> David
>
>
