Yes, sure, it makes a lot of sense to be able to switch between the two
behaviors.
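
For anyone following along, here is a minimal plain-Python sketch of the
semantics being discussed (no Spark dependency; the function and the
legacy_empty_subset flag are hypothetical illustrations, not an actual Spark
API or config name): an empty subset falls back to deduplicating on all
columns, matching the no-argument call, while the legacy flag reproduces the
current behavior of keeping only the first row.

```python
def drop_duplicates(rows, subset=None, legacy_empty_subset=False):
    """Deduplicate a list of dict rows.

    subset=None or [] -> deduplicate on all columns (the proposed behavior),
    unless legacy_empty_subset is set, in which case an empty subset keeps
    the current behavior: every row compares equal on zero columns, so only
    the first row survives.
    """
    if subset == [] and legacy_empty_subset:
        # Current behavior: zero-column keys are all equal -> first row only.
        return rows[:1]
    if not subset:
        # Proposed behavior: None and [] both mean "all columns".
        subset = list(rows[0]) if rows else []
    seen, out = set(), []
    for row in rows:
        key = tuple(row[c] for c in subset)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out


rows = [{"a": 1, "b": 2}, {"a": 1, "b": 2}, {"a": 1, "b": 3}]
print(drop_duplicates(rows, []))                            # two distinct rows
print(drop_duplicates(rows, [], legacy_empty_subset=True))  # first row only
```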

On Wed, May 14, 2025 at 10:48 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> So you are basically saying df.dropDuplicates(Seq.empty) should be the
> same as df.dropDuplicates(all_columns). I think this is a reasonable
> change, as the previous behavior, which always returns the first row,
> doesn't make sense. For safety, we can add a legacy config for fallback
> and mention it in the migration guide.
>
> On Wed, May 14, 2025 at 9:21 AM David Kunzmann <davidkunzm...@gmail.com>
> wrote:
>
>> Hi James,
>> I see how the behavior makes sense now, but I was wondering why a user
>> would do this intentionally instead of using head() or first().
>> I suspect it would mainly happen by mistake, as there is no benefit to
>> using df.dropDuplicates(Seq.empty).
>>
>> On Fri, May 9, 2025 at 8:50 PM James Willis <ja...@wherobots.com> wrote:
>>
>>> This seems like the correct behavior to me. Every value of the null set
>>> of columns will match between any pair of Rows.
>>>
>>>
>>>
>>> On Thu, May 8, 2025 at 11:37 AM David Kunzmann <davidkunzm...@gmail.com>
>>> wrote:
>>>
>>>> Hello everyone,
>>>>
>>>> Following the creation of this PR
>>>> <https://github.com/apache/spark/pull/50714> and the discussion in that
>>>> thread, what do you think about the behavior described here:
>>>>
>>>> When using PySpark DataFrame.dropDuplicates with an empty array as the
>>>>> subset argument, the resulting DataFrame contains a single row (the
>>>>> first row). This behavior is different from using
>>>>> DataFrame.dropDuplicates without any parameters or with None as the
>>>>> subset argument. I would expect that passing an empty array to
>>>>> dropDuplicates would use all the columns to detect duplicates and
>>>>> remove them.
>>>>>
>>>>
>>>>
>>>> This behavior is the same on the Scala side, where
>>>> df.dropDuplicates(Seq.empty) returns the first row.
>>>>
>>>> Would it make sense to change the behavior of
>>>> df.dropDuplicates(Seq.empty) to be the same as df.dropDuplicates() ?
>>>>
>>>> Cheers,
>>>>
>>>> David
>>>>
>>>>
