[ 
https://issues.apache.org/jira/browse/SPARK-51327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srikanth Reddy Kumbham updated SPARK-51327:
-------------------------------------------
    Description: 
Spark Connect misinterprets df["*"] in a DataFrame select after a join, 
resolving it to all columns from the join output instead of the specified 
DataFrame's columns. This leads to unexpected columns in the result and 
ambiguous reference errors for duplicate column names.

 
{code:python}
from pyspark.sql import SparkSession

# Connect to the Spark Connect server used in this report
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

employees = spark.createDataFrame(
    [(1, "John", "Doe", 1, "Eng Dept"), (2, "Jane", "Smith", 1, "Eng Dept")],
    ["employee_id", "first_name", "last_name", "department_id", "department_name"],
)
departments = spark.createDataFrame(
    [(1, "Engineering", 1000.0), (2, "Sales", 2000.0)],
    ["department_id", "department_name", "budget"],
)

# Expected: employees["*"] expands only to the columns of employees
emp_dept = employees.join(
    departments, employees.department_id == departments.department_id, "left"
).select(
    employees["*"],
    departments.department_name,
    departments.budget.alias("dept_budget"),
)

emp_dept.printSchema()
emp_dept.show()
emp_dept.select("department_name").show()
{code}
 

This fails with the following error:
{code:none}
[AMBIGUOUS_REFERENCE] Reference `DEPARTMENT_NAME` is ambiguous, could be: [`DEPARTMENT_NAME`, `DEPARTMENT_NAME`]. SQLSTATE: 42704
{code}
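
When triaging this report, it may help to check which output names are duplicated, since a name that appears more than once in the select output is generally ambiguous when referenced by string. The sketch below is plain Python and does not call Spark. `duplicate_names` and `clash_aliases` are hypothetical helpers introduced here for illustration, and the `dept_` prefix is an arbitrary naming choice, not a Spark convention.

```python
from collections import Counter


def duplicate_names(columns):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(columns)
    return [name for name in dict.fromkeys(columns) if counts[name] > 1]


def clash_aliases(left_columns, right_columns, prefix="dept_"):
    """Map each right-side name that also exists on the left to a prefixed alias.

    The prefix is a hypothetical choice made for this sketch.
    """
    left = set(left_columns)
    return {name: prefix + name for name in right_columns if name in left}


# Column list copied from the employees DataFrame in the reproduction.
employee_cols = ["employee_id", "first_name", "last_name",
                 "department_id", "department_name"]

# Names the select would produce if employees["*"] expanded to the
# employees columns only, followed by the two explicitly selected columns.
expected = employee_cols + ["department_name", "dept_budget"]

dupes = duplicate_names(expected)
aliases = clash_aliases(employee_cols, ["department_name"])
print(dupes)
print(aliases)
```

Against a live session, the same check could be run as `duplicate_names(emp_dept.columns)`. Aliasing the clashing column, for example `departments.department_name.alias("dept_department_name")`, is one possible way to keep the output names unique, though it does not address the star expansion itself.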



> [SPARK-CONNECT] unresolved_star in DataFrame select Mishandles Column Scoping 
> After Join
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-51327
>                 URL: https://issues.apache.org/jira/browse/SPARK-51327
>             Project: Spark
>          Issue Type: Bug
>          Components: Connect
>    Affects Versions: 3.5.4
>         Environment: pyspark version: 
> pyspark==3.5.4
> spark-connect==org.apache.spark:spark-connect_2.13:3.5.4
>            Reporter: Srikanth Reddy Kumbham
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
