[ https://issues.apache.org/jira/browse/SPARK-51710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17945137#comment-17945137 ]
Afrin Jaman commented on SPARK-51710:
-------------------------------------

Hello [~siriousjoke], I am willing to work on resolving this issue. Could you give me some pointers to get started? I have 5+ years of experience in data engineering, but I am a new contributor to PySpark.

> Using Dataframe.dropDuplicates with an empty array as argument behaves unexpectedly
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-51710
>                 URL: https://issues.apache.org/jira/browse/SPARK-51710
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.5.5
>            Reporter: David Kunzmann
>            Priority: Major
>
> When using PySpark DataFrame.dropDuplicates with an empty array as the subset argument, the resulting DataFrame contains a single row (the first row). This behavior is different from calling DataFrame.dropDuplicates without any parameters or with None as the subset argument.
>
> {code:python}
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.getOrCreate()
> data = [
>     (1, "Alice"),
>     (2, "Bob"),
>     (3, "Alice"),
>     (3, "Alice"),
>     (2, "Bob")
> ]
> df = spark.createDataFrame(data, ["id", "name"])
> df_dedup = df.dropDuplicates([])
> df_dedup.show()
> {code}
> The above snippet shows the following DataFrame:
> {code:java}
> +---+-----+
> | id| name|
> +---+-----+
> |  1|Alice|
> +---+-----+
> {code}
> I would expect the behavior to be the same as df.dropDuplicates(), which returns:
> {code:java}
> +---+-----+
> | id| name|
> +---+-----+
> |  1|Alice|
> |  2|  Bob|
> |  3|Alice|
> +---+-----+
> {code}
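For anyone hitting this before a fix lands, here is a minimal user-side sketch of a workaround. The helper name drop_duplicates_safe is hypothetical and not part of the PySpark API: it simply normalizes an empty or None subset to a plain dropDuplicates() call, so an empty list deduplicates on all columns instead of collapsing the DataFrame to a single row.

{code:python}
from pyspark.sql import DataFrame, SparkSession


def drop_duplicates_safe(df: DataFrame, subset=None) -> DataFrame:
    """Hypothetical helper (not part of PySpark): treat an empty subset
    the same as no subset, so deduplication always uses all columns."""
    if not subset:
        # None or [] -> deduplicate on the whole row, same as df.dropDuplicates()
        return df.dropDuplicates()
    return df.dropDuplicates(subset)


if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    data = [(1, "Alice"), (2, "Bob"), (3, "Alice"), (3, "Alice"), (2, "Bob")]
    df = spark.createDataFrame(data, ["id", "name"])

    # With the helper, an empty subset behaves like calling dropDuplicates()
    # with no arguments and keeps one row per distinct (id, name) pair.
    drop_duplicates_safe(df, []).show()
{code}

The single-row result itself is presumably consistent with deduplicating over zero key columns, where every row compares equal, so the question for the fix is whether an empty subset should be rejected, treated as "all columns", or left as-is.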