[ https://issues.apache.org/jira/browse/SPARK-51327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Srikanth Reddy Kumbham updated SPARK-51327:
-------------------------------------------
    Description:
Spark Connect misinterprets df["*"] in a DataFrame select after a join, resolving it to all columns from the join output instead of the specified DataFrame's columns. This leads to unexpected columns in the result and ambiguous reference errors for duplicate column names.

{code:java}
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

employees = spark.createDataFrame([
    (1, "John", "Doe", 1, "Eng Dept"),
    (2, "Jane", "Smith", 1, "Eng Dept")
], ["employee_id", "first_name", "last_name", "department_id", "department_name"])

departments = spark.createDataFrame([
    (1, "Engineering", 1000.0),
    (2, "Sales", 2000.0)
], ["department_id", "department_name", "budget"])

emp_dept = employees.join(departments, employees.department_id == departments.department_id, "left").select(
    employees["*"],
    departments.department_name,
    departments.budget.alias("dept_budget")
)

emp_dept.printSchema()
emp_dept.show()
emp_dept.select("department_name").show()
{code}

fails with error:
{code:java}
[AMBIGUOUS_REFERENCE] Reference `DEPARTMENT_NAME` is ambiguous, could be: [`DEPARTMENT_NAME`, `DEPARTMENT_NAME`]. SQLSTATE: 42704
{code}

  was:
Spark Connect misinterprets df["*"] in a DataFrame select after a join, resolving it to all columns from the join output instead of the specified DataFrame's columns. This leads to unexpected columns in the result and ambiguous reference errors for duplicate column names.
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

employees = spark.createDataFrame([
    (1, "John", "Doe", 1, "Eng Dept"),
    (2, "Jane", "Smith", 1, "Eng Dept")
], ["employee_id", "first_name", "last_name", "department_id", "department_name"])

departments = spark.createDataFrame([
    (1, "Engineering", 1000.0),
    (2, "Sales", 2000.0)
], ["department_id", "department_name", "budget"])

emp_dept = employees.join(departments, employees.department_id == departments.department_id, "left").select(
    employees["*"],
    departments.department_name,
    departments.budget.alias("dept_budget")
)

emp_dept.printSchema()
emp_dept.show()
emp_dept.select("department_name").show()
```

fails with error:

```
[AMBIGUOUS_REFERENCE] Reference `DEPARTMENT_NAME` is ambiguous, could be: [`DEPARTMENT_NAME`, `DEPARTMENT_NAME`]. SQLSTATE: 42704
```

> [SPARK-CONNECT] unresolved_star in DataFrame select Mishandles Column Scoping After Join
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-51327
>                 URL: https://issues.apache.org/jira/browse/SPARK-51327
>             Project: Spark
>          Issue Type: Bug
>          Components: Connect
>    Affects Versions: 3.5.4
>         Environment: pyspark version:
> pyspark==3.5.4
> spark-connect==org.apache.spark:spark-connect_2.13:3.5.4
>            Reporter: Srikanth Reddy Kumbham
>            Priority: Major
>
> Spark Connect misinterprets df["*"] in a DataFrame select after a join,
> resolving it to all columns from the join output instead of the specified
> DataFrame's columns. This leads to unexpected columns in the result and
> ambiguous reference errors for duplicate column names.
>
> {code:java}
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
>
> employees = spark.createDataFrame([
>     (1, "John", "Doe", 1, "Eng Dept"),
>     (2, "Jane", "Smith", 1, "Eng Dept")
> ], ["employee_id", "first_name", "last_name", "department_id", "department_name"])
>
> departments = spark.createDataFrame([
>     (1, "Engineering", 1000.0),
>     (2, "Sales", 2000.0)
> ], ["department_id", "department_name", "budget"])
>
> emp_dept = employees.join(departments, employees.department_id == departments.department_id, "left").select(
>     employees["*"],
>     departments.department_name,
>     departments.budget.alias("dept_budget")
> )
>
> emp_dept.printSchema()
> emp_dept.show()
> emp_dept.select("department_name").show()
> {code}
>
> fails with error:
> {code:java}
> [AMBIGUOUS_REFERENCE] Reference `DEPARTMENT_NAME` is ambiguous, could be:
> [`DEPARTMENT_NAME`, `DEPARTMENT_NAME`]. SQLSTATE: 42704
> {code}
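
To make the two readings of employees["*"] in the report concrete, here is a minimal plain-Python sketch that needs no Spark session. It only models which column names each reading would put into the select output. It is not Spark's resolver, and the helper names `project` and `duplicated` are hypothetical, introduced purely for illustration. The column lists are copied from the reproduction above.

```python
from collections import Counter

# Column lists copied from the reproduction in the report.
EMPLOYEES = ["employee_id", "first_name", "last_name",
             "department_id", "department_name"]
DEPARTMENTS = ["department_id", "department_name", "budget"]


def project(star_columns):
    """Model the output names of
    select(employees["*"], departments.department_name,
           departments.budget.alias("dept_budget"))
    given the columns that the star expands to (hypothetical helper)."""
    return list(star_columns) + ["department_name", "dept_budget"]


def duplicated(columns):
    """Return {name: count} for every name that occurs more than once."""
    counts = Counter(columns)
    return {name: n for name, n in counts.items() if n > 1}


# Reading 1: the star is scoped to the `employees` DataFrame only.
scoped = project(EMPLOYEES)

# Reading 2: the star expands to the whole join output, which is the
# behaviour the report attributes to Spark Connect.
unscoped = project(EMPLOYEES + DEPARTMENTS)

print("scoped:  ", scoped)
print("unscoped:", unscoped)
print("duplicates (scoped):  ", duplicated(scoped))
print("duplicates (unscoped):", duplicated(unscoped))
```

In this model the unscoped reading adds `department_id`, `department_name` and `budget` from `departments` on top of the requested columns. Since `department_name` occurs more than once under either reading, the schema printed by emp_dept.printSchema() is likely the clearer signal of which reading was applied.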