[ 
https://issues.apache.org/jira/browse/SPARK-51710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17945137#comment-17945137
 ] 

Afrin Jaman commented on SPARK-51710:
-------------------------------------

Hello [~siriousjoke], I would like to work on resolving this issue. Could you 
guide me a little? I have 5+ years of experience in data engineering, but I am 
a new contributor to PySpark.
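
My initial thought is that the fix could normalize an empty subset list to 
None so that dropDuplicates([]) matches dropDuplicates(). Here is a minimal 
plain-Python sketch of that normalization; the helper name normalize_subset 
is hypothetical and not part of PySpark's actual internals (an alternative 
would be to raise an error on an empty subset instead):

```python
def normalize_subset(subset):
    """Treat an empty subset list like None, i.e. deduplicate on all columns.

    Hypothetical helper sketching one possible fix; the real PySpark code
    may instead choose to reject an empty subset with an error.
    """
    # An empty list is falsy, so [] and None both normalize to None.
    return None if not subset else subset


print(normalize_subset([]))      # None -> dedupe on all columns
print(normalize_subset(None))    # None -> dedupe on all columns
print(normalize_subset(["id"]))  # ['id'] -> dedupe on the given columns
```

With this normalization in place, dropDuplicates([]) would take the same code 
path as dropDuplicates() and return all three distinct rows from the example 
below. Does that match the direction you had in mind?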

> Using Dataframe.dropDuplicates with an empty array as argument behaves 
> unexpectedly
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-51710
>                 URL: https://issues.apache.org/jira/browse/SPARK-51710
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.5.5
>            Reporter: David Kunzmann
>            Priority: Major
>
> When using PySpark DataFrame.dropDuplicates with an empty array as the subset 
> argument, the resulting DataFrame contains a single row (the first row). This 
> behavior is different than using DataFrame.dropDuplicates without any 
> parameters or with None as the subset argument.
>  
> {code:java}
> from pyspark.sql import SparkSession
>  
> spark = SparkSession.builder.getOrCreate()
> data = [
>     (1, "Alice"),
>     (2, "Bob"),
>     (3, "Alice"),
>     (3, "Alice"),
>     (2, "Bob")
> ]
> df = spark.createDataFrame(data, ["id", "name"])
> df_dedup = df.dropDuplicates([])
> df_dedup.show()
> {code}
> The above snippet will show the following DataFrame:
> {code:java}
> +---+-----+
> | id| name|
> +---+-----+
> |  1|Alice|
> +---+-----+ {code}
> I would expect the behavior to be the same as df.dropDuplicates() which 
> returns:
> {code:java}
> +---+-----+
> | id| name|
> +---+-----+
> |  1|Alice|
> |  2|  Bob|
> |  3|Alice|
> +---+-----+
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
