Srikanth Reddy Kumbham created SPARK-51327:
----------------------------------------------
Summary: [SPARK-CONNECT] unresolved_star in DataFrame select Mishandles Column Scoping After Join
Key: SPARK-51327
URL: https://issues.apache.org/jira/browse/SPARK-51327
Project: Spark
Issue Type: Bug
Components: Connect
Affects Versions: 3.5.4
Environment: pyspark==3.5.4, spark-connect==org.apache.spark:spark-connect_2.13:3.5.4
Reporter: Srikanth Reddy Kumbham

Spark Connect misinterprets df["*"] in a DataFrame select after a join, resolving it to all columns of the join output instead of only the named DataFrame's columns. This produces unexpected columns in the result and ambiguous-reference errors for duplicate column names.

Reproduction:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# Sample data
employees = spark.createDataFrame([
    (1, "John", "Doe", 1, "Eng Dept"),
    (2, "Jane", "Smith", 1, "Eng Dept")
], ["employee_id", "first_name", "last_name", "department_id", "department_name"])

departments = spark.createDataFrame([
    (1, "Engineering", 1000.0),
    (2, "Sales", 2000.0)
], ["department_id", "department_name", "budget"])

# Join, then select the employees columns plus two department columns
emp_dept = employees.join(
    departments,
    employees.department_id == departments.department_id,
    "left"
).select(
    employees["*"],
    departments.department_name,
    departments.budget.alias("dept_budget")
)

emp_dept.printSchema()
emp_dept.show()
emp_dept.select("department_name").show()
```

The last select fails with:

```
[AMBIGUOUS_REFERENCE] Reference `DEPARTMENT_NAME` is ambiguous, could be: [`DEPARTMENT_NAME`, `DEPARTMENT_NAME`]. SQLSTATE: 42704
```

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
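For context on the error: the selected schema ends up with two department_name columns, one carried in by employees["*"] (the employees DataFrame already defines department_name) and one added explicitly from departments, so any unqualified reference to that name is ambiguous regardless of how the star is scoped. A minimal, hypothetical Python sketch of that resolution step (the function resolve and the error strings are illustrative, not Spark internals):

```python
# Illustrative sketch (not Spark source) of resolving a column name
# against a flat schema and reporting an ambiguous reference.

def resolve(schema, name):
    """Return the index of `name` in `schema`, matching case-insensitively
    (Spark's default resolver is case-insensitive); raise if the name is
    missing or matches more than one column."""
    matches = [i for i, col in enumerate(schema) if col.lower() == name.lower()]
    if not matches:
        raise KeyError(f"UNRESOLVED_COLUMN: `{name}`")
    if len(matches) > 1:
        raise ValueError(
            f"AMBIGUOUS_REFERENCE: `{name}` is ambiguous, "
            f"could be any of {len(matches)} columns"
        )
    return matches[0]

# Schema produced by the select above: employees["*"] already carries a
# department_name, and departments.department_name adds a second one.
joined = [
    "employee_id", "first_name", "last_name", "department_id",
    "department_name",  # from employees["*"]
    "department_name",  # from departments
    "dept_budget",
]

print(resolve(joined, "employee_id"))  # unique name: resolves to index 0
# resolve(joined, "department_name")   # raises ValueError (ambiguous)
```

This also shows why renaming one side before the join (or aliasing the second department_name, as was already done for budget) sidesteps the error: the duplicate never enters the schema.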