[ 
https://issues.apache.org/jira/browse/SPARK-55636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pranav Dev updated SPARK-55636:
-------------------------------
    Description: 
When using Spark Connect, calling `dropDuplicates` with a non-existent column 
name throws a generic `INTERNAL_ERROR` with SQLSTATE `XX000` instead of the 
more helpful `UNRESOLVED_COLUMN_AMONG_FIELD_NAMES` error that classic Spark 
throws.

 

Example to reproduce:

 
{code:python}
# Create a sample DataFrame
df1 = spark.createDataFrame([
    (1, "Song A", "Artist A"),
    (2, "Song B", "Artist B"),
    (3, "Song C", "Artist C"),
], ["id", "song_name", "artist_name"])

df1.show()
df1.printSchema()

# Try to deduplicate on 'artist_id', which doesn't exist
df1.dropDuplicates(["artist_id"]).show(){code}
 

Current behavior (Spark Connect):

 
{code:none}
[INTERNAL_ERROR] Invalid deduplicate column artist_id SQLSTATE: XX000{code}
 

Classic Spark:
{code:none}
Cannot resolve column name "artist_id" among (id, song_name, artist_name).{code}
 
Expected behavior (Spark Connect):
{code:none}
[UNRESOLVED_COLUMN_AMONG_FIELD_NAMES] Cannot resolve column name "artist_id"
among (id, song_name, artist_name). SQLSTATE: 42703{code}
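Until the Connect server raises the proper error class, a client-side guard can reproduce the classic message by validating the requested columns against {{df.columns}} before calling {{dropDuplicates}}. A minimal sketch; the helper name and message wording are illustrative, not part of the PySpark API:

```python
def check_dedup_columns(requested, available):
    """Raise a descriptive error if any requested column is missing.

    Mimics the message format of classic Spark's
    UNRESOLVED_COLUMN_AMONG_FIELD_NAMES error condition.
    """
    missing = [c for c in requested if c not in available]
    if missing:
        raise ValueError(
            'Cannot resolve column name "{}" among ({}).'.format(
                missing[0], ", ".join(available)
            )
        )

# With a real DataFrame, df1.columns supplies the available names:
# check_dedup_columns(["artist_id"], df1.columns)
# df1.dropDuplicates(["artist_id"]).show()
```

This only papers over the symptom on the client; the actual fix belongs in the server-side deduplicate resolution so that the SQLSTATE 42703 error reaches all Connect clients.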



> Spark Connect deduplicate throws generic INTERNAL_ERROR instead of 
> UNRESOLVED_COLUMN_AMONG_FIELD_NAMES for invalid column names
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-55636
>                 URL: https://issues.apache.org/jira/browse/SPARK-55636
>             Project: Spark
>          Issue Type: Bug
>          Components: Connect
>    Affects Versions: 4.2.0
>            Reporter: Pranav Dev
>            Priority: Major
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
