This seems like the correct behavior to me: every column in the empty subset vacuously matches between any pair of Rows, so all rows count as duplicates of one another and only the first survives.
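For concreteness, a minimal sketch of the behavior under discussion (the DataFrame contents here are illustrative, and a local SparkSession is assumed):

    # Assumes a running SparkSession; the sample data is made up for illustration.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "val"])

    df.dropDuplicates().count()    # 2 -- no subset: compares all columns
    df.dropDuplicates([]).count()  # 1 -- empty subset: zero columns compared,
                                   #      so every pair of rows "matches"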
On Thu, May 8, 2025 at 11:37 AM David Kunzmann <davidkunzm...@gmail.com> wrote:

> Hello everyone,
>
> Following the creation of this PR
> <https://github.com/apache/spark/pull/50714> and the discussion in the
> thread, what do you think about the behavior described here:
>
>> When using PySpark DataFrame.dropDuplicates with an empty array as the
>> subset argument, the resulting DataFrame contains a single row (the
>> first row). This behavior is different from using
>> DataFrame.dropDuplicates without any parameters or with None as the
>> subset argument. I would expect that passing an empty array to
>> dropDuplicates would use all the columns to detect duplicates and remove
>> them.
>
> This behavior is the same on the Scala side, where
> df.dropDuplicates(Seq.empty) returns the first row.
>
> Would it make sense to change the behavior of df.dropDuplicates(Seq.empty)
> to be the same as df.dropDuplicates()?
>
> Cheers,
>
> David