Yes, sure, it makes a lot of sense to be able to switch between the two behaviors.
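
For reference, a minimal spark-shell sketch of the behavior discussed below (the data and column names are made up for illustration; spark-shell supplies the SparkSession and implicits):

  // Illustrative only: run in spark-shell.
  val df = Seq((1, "a"), (1, "a"), (2, "b")).toDF("id", "value")

  // Current behavior: an empty subset keeps only the first row of the DataFrame.
  df.dropDuplicates(Seq.empty[String]).count()   // 1

  // No-argument form: deduplicates on all columns.
  df.dropDuplicates().count()                    // 2 -> (1, "a") and (2, "b")

If the change goes in with the fallback you suggested, the old behavior would presumably stay reachable through a legacy SQL config (name still to be decided in the PR and migration guide).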
On Wed, May 14, 2025 at 10:48 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> So you are basically saying df.dropDuplicates(Seq.empty) should be the
> same as df.dropDuplicates(all_columns). I think this is a reasonable
> change, as the previous behavior doesn't make sense: it always returns
> the first row. For safety, we can add a legacy config for fallback and
> mention it in the migration guide.
>
> On Wed, May 14, 2025 at 9:21 AM David Kunzmann <davidkunzm...@gmail.com>
> wrote:
>
>> Hi James,
>> I see how the behavior makes sense now, but I was wondering why a user
>> would do this intentionally instead of using head() or first().
>> I thought it could mainly be done by mistake, as there is no benefit to
>> using df.dropDuplicates(Seq.empty).
>>
>> On Fri, May 9, 2025 at 8:50 PM James Willis <ja...@wherobots.com> wrote:
>>
>>> This seems like the correct behavior to me. Every value of the empty set
>>> of columns will match between any pair of rows.
>>>
>>> On Thu, May 8, 2025 at 11:37 AM David Kunzmann <davidkunzm...@gmail.com>
>>> wrote:
>>>
>>>> Hello everyone,
>>>>
>>>> Following the creation of this PR
>>>> <https://github.com/apache/spark/pull/50714> and the discussion in that
>>>> thread, what do you think about the behavior described here:
>>>>
>>>>> When using PySpark DataFrame.dropDuplicates with an empty array as the
>>>>> subset argument, the resulting DataFrame contains a single row (the
>>>>> first row). This behavior is different from using
>>>>> DataFrame.dropDuplicates without any parameters or with None as the
>>>>> subset argument. I would expect that passing an empty array to
>>>>> dropDuplicates would use all the columns to detect duplicates and
>>>>> remove them.
>>>>
>>>> This behavior is the same on the Scala side, where
>>>> df.dropDuplicates(Seq.empty) returns the first row.
>>>>
>>>> Would it make sense to change the behavior of
>>>> df.dropDuplicates(Seq.empty) to be the same as df.dropDuplicates()?
>>>>
>>>> Cheers,
>>>>
>>>> David