Re: [DISCUSS][SPARK SQL] SPARK-51710: Using Dataframe.dropDuplicates with an empty array as argument behaves "unexpectedly"

2025-05-17 Thread David Kunzmann
onable
> change, as the previous behavior doesn't make sense which always returns
> the first row. For safety, we can add a legacy config for fallback and
> mention it in the migration guide.
>
> On Wed, May 14, 2025 at 9:21 AM David Kunzmann wrote:
>
>> Hi James,

Re: [DISCUSS][SPARK SQL] SPARK-51710: Using Dataframe.dropDuplicates with an empty array as argument behaves "unexpectedly"

2025-05-14 Thread David Kunzmann
Willis wrote:
> This seems like the correct behavior to me. Every value of the null set of
> columns will match between any pair of Rows.
>
> On Thu, May 8, 2025 at 11:37 AM David Kunzmann wrote:
>
>> Hello everyone,
>>
>> Following the creation of t

[DISCUSS][SPARK SQL] SPARK-51710: Using Dataframe.dropDuplicates with an empty array as argument behaves "unexpectedly"

2025-05-08 Thread David Kunzmann
Hello everyone,

Following the creation of this PR and the discussion in the thread, what do you think about the behavior described here:

> When using PySpark DataFrame.dropDuplicates with an empty array as the
> subset argument, the resulting DataFrame
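
To make the argument in this thread concrete, here is a minimal pure-Python sketch of first-occurrence deduplication keyed on a subset of columns. It is not Spark's implementation: `drop_duplicates` and the sample rows are hypothetical and only model the reasoning. With an empty subset every row maps to the same empty key, so only one row survives.

```python
def drop_duplicates(rows, subset):
    """Keep the first row seen for each distinct key built from `subset`.

    Hypothetical helper for illustration only; it does not call Spark.
    """
    seen = set()
    kept = []
    for row in rows:
        # With an empty subset the key is always the empty tuple ().
        key = tuple(row[col] for col in subset)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 2, "name": "b"},
]

# Deduplicating on every column removes only the exact duplicate.
by_all = drop_duplicates(rows, ["id", "name"])

# Deduplicating on no columns treats every pair of rows as matching.
by_none = drop_duplicates(rows, [])

print(len(by_all), len(by_none))
```

Under this model the empty-subset result is consistent with the "null set of columns" argument, and it also shows why the outcome can look surprising to someone who expected an empty subset to mean "all columns".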