Pranav Dev created SPARK-55636:
----------------------------------

             Summary: Spark Connect deduplicate throws generic INTERNAL_ERROR 
instead of UNRESOLVED_COLUMN_AMONG_FIELD_NAMES for invalid column names
                 Key: SPARK-55636
                 URL: https://issues.apache.org/jira/browse/SPARK-55636
             Project: Spark
          Issue Type: Bug
          Components: Connect
    Affects Versions: 4.2.0
            Reporter: Pranav Dev


When using Spark Connect, calling `dropDuplicates` with a non-existent column 
name throws a generic `INTERNAL_ERROR` with SQLSTATE `XX000` instead of the 
more helpful `UNRESOLVED_COLUMN_AMONG_FIELD_NAMES` error that classic Spark 
throws.

 

Steps to reproduce:
```
# Assumes `spark` is an active Spark Connect session
# Create a sample DataFrame
df1 = spark.createDataFrame([
    (1, "Song A", "Artist A"),
    (2, "Song B", "Artist B"),
    (3, "Song C", "Artist C"),
], ["id", "song_name", "artist_name"])

df1.show()
df1.printSchema()

# Try to deduplicate on 'artist_id' which doesn't exist
df1.dropDuplicates(["artist_id"]).show()
```

 

Current behavior (Spark Connect):

```
[INTERNAL_ERROR] Invalid deduplicate column artist_id SQLSTATE: XX000
```

 

Current behavior (classic Spark):
```
Cannot resolve column name "artist_id" among (id, song_name, artist_name).
```
 
Expected behavior (Spark Connect):
```
[UNRESOLVED_COLUMN_AMONG_FIELD_NAMES] Cannot resolve column name "artist_id" 
among (id, song_name, artist_name). SQLSTATE: 42703
```
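
For illustration only, the expected check can be modeled in plain Python. This is a rough sketch of the desired behavior, not Spark's actual implementation: `UnresolvedColumnError` and `check_dedup_columns` are hypothetical names, and the exact-match comparison is a simplification (it ignores Spark's case-sensitivity settings).

```python
class UnresolvedColumnError(Exception):
    """Hypothetical stand-in for a user-facing analysis error (not a PySpark class)."""

    def __init__(self, column, field_names):
        # Error condition and SQLSTATE taken from the expected message above.
        self.condition = "UNRESOLVED_COLUMN_AMONG_FIELD_NAMES"
        self.sqlstate = "42703"
        fields = ", ".join(field_names)
        super().__init__(
            f'[{self.condition}] Cannot resolve column name "{column}" '
            f"among ({fields}). SQLSTATE: {self.sqlstate}"
        )


def check_dedup_columns(field_names, dedup_columns):
    """Raise a descriptive error for the first dedup column missing from the schema.

    Simplification: names are compared exactly, which is an assumption of
    this sketch rather than a statement about Spark's resolution rules.
    """
    for name in dedup_columns:
        if name not in field_names:
            raise UnresolvedColumnError(name, field_names)


if __name__ == "__main__":
    schema = ["id", "song_name", "artist_name"]
    check_dedup_columns(schema, ["id"])  # valid column, no error
    try:
        check_dedup_columns(schema, ["artist_id"])
    except UnresolvedColumnError as e:
        print(e)
```

The point of the sketch is that an invalid user-supplied column name is a user error, so it should surface with the field names and a `42703` SQLSTATE rather than as an internal error.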



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
