Pranav Dev created SPARK-55636:
----------------------------------
Summary: Spark Connect deduplicate throws generic INTERNAL_ERROR
instead of UNRESOLVED_COLUMN_AMONG_FIELD_NAMES for invalid column names
Key: SPARK-55636
URL: https://issues.apache.org/jira/browse/SPARK-55636
Project: Spark
Issue Type: Bug
Components: Connect
Affects Versions: 4.2.0
Reporter: Pranav Dev
When using Spark Connect, calling `dropDuplicates` with a non-existent column
name throws a generic `INTERNAL_ERROR` with SQLSTATE `XX000` instead of the
more helpful `UNRESOLVED_COLUMN_AMONG_FIELD_NAMES` error that classic Spark
throws.
Steps to reproduce:
```
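from pyspark.sql import SparkSession

# Assumption: a Spark Connect server is reachable at this address; adjust as needed
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# Create a sample DataFrame
df1 = spark.createDataFrame([
    (1, "Song A", "Artist A"),
    (2, "Song B", "Artist B"),
    (3, "Song C", "Artist C"),
], ["id", "song_name", "artist_name"])
df1.show()
df1.printSchema()

# Try to deduplicate on 'artist_id', which doesn't exist
df1.dropDuplicates(["artist_id"]).show()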
```
Current behavior (Spark Connect):
```
[INTERNAL_ERROR] Invalid deduplicate column artist_id SQLSTATE: XX000
```
Current behavior (classic Spark):
```
Cannot resolve column name "artist_id" among (id, song_name, artist_name).
```
Expected behavior (Spark Connect):
```
[UNRESOLVED_COLUMN_AMONG_FIELD_NAMES] Cannot resolve column name "artist_id"
among (id, song_name, artist_name). SQLSTATE: 42703
```
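Until the error is improved, a caller could check the column names on the client before calling `dropDuplicates`. The following is only a rough sketch of that workaround, not a proposed fix for Spark itself: the helper names `missing_columns` and `check_dedup_columns` are hypothetical, and the check is an exact, case-sensitive match, which is stricter than Spark's default case-insensitive resolution.

```python
def missing_columns(field_names, subset):
    """Return the names in `subset` that are not in `field_names`, in order."""
    known = set(field_names)
    return [name for name in subset if name not in known]


def check_dedup_columns(field_names, subset):
    """Raise ValueError if any name in `subset` is not a known field name.

    Hypothetical helper: the message only mimics the wording of the
    classic Spark error and carries no error class or SQLSTATE.
    """
    missing = missing_columns(field_names, subset)
    if missing:
        raise ValueError(
            'Cannot resolve column name "{}" among ({}).'.format(
                missing[0], ", ".join(field_names)
            )
        )


# Hypothetical usage with the DataFrame from the report:
#   check_dedup_columns(df1.columns, ["artist_id"])  # raises ValueError
#   df1.dropDuplicates(["artist_id"]).show()
```

The check only reads `df1.columns`, so it fails before the deduplicate plan is ever sent to the server.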