huaxingao commented on code in PR #50246: URL: https://github.com/apache/spark/pull/50246#discussion_r2021813936
########## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteRowLevelCommand.scala:
##########

@@ -273,9 +273,8 @@ trait RewriteRowLevelCommand extends Rule[LogicalPlan] {
      outputs: Seq[Seq[Expression]],
      colOrdinals: Seq[Int],
      attrs: Seq[Attribute]): ProjectingInternalRow = {
-    val schema = StructType(attrs.zipWithIndex.map { case (attr, index) =>
-      val nullable = outputs.exists(output => output(colOrdinals(index)).nullable)
-      StructField(attr.name, attr.dataType, nullable, attr.metadata)
+    val schema = StructType(attrs.zipWithIndex.map { case (attr, _) =>
+      StructField(attr.name, attr.dataType, attr.nullable, attr.metadata)

Review Comment:
   Thanks @amogh-jahagirdar for your comment! I took a closer look at why the test passed with the Spark 3.4 extensions but failed with Spark 4.0.

   In the Spark 3.4 extensions, when building the `metadataProjection`, we use [updateAndDeleteOutputs](https://github.com/apache/iceberg/blob/main/spark/v3.4/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteRowLevelIcebergCommand.scala#L94), which does not include the [INSERT_OPERATION](https://github.com/apache/iceberg/blob/main/spark/v3.4/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteRowLevelIcebergCommand.scala#L83C70-L83C86):

   <img width="499" alt="Screenshot 2025-03-31 at 2 06 21 PM" src="https://github.com/user-attachments/assets/3d14922d-6b22-4ea0-ac6a-16c0cb22e353" />

   Here `_spec_id` has `nullable = false` and `_partition` has `nullable = true`.
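The removed nullability computation can be sketched in isolation. This is a minimal, self-contained model, not the actual Spark API: `Expr` and `unionNullable` are hypothetical stand-ins for the expression outputs and the `outputs.exists(...)` logic in the diff above. It shows how a column becomes nullable as soon as any one operation's output marks it nullable:

```scala
// Hypothetical stand-in for an output expression; only nullability matters here.
case class Expr(nullable: Boolean)

// Mirrors the removed logic: a column is nullable if ANY operation's
// output expression for that column ordinal is nullable.
def unionNullable(outputs: Seq[Seq[Expr]], colOrdinal: Int): Boolean =
  outputs.exists(output => output(colOrdinal).nullable)

// With only an update/delete output, a metadata column like _spec_id
// keeps nullable = false.
val updateDelete = Seq(Seq(Expr(nullable = false)))
assert(!unionNullable(updateDelete, 0))

// Adding a second output that projects null for that column (as a
// re-insert output would) flips the computed nullability to true.
val withReinsert = updateDelete :+ Seq(Expr(nullable = true))
assert(unionNullable(withReinsert, 0))
```

Using `attr.nullable` directly, as in the `+` lines of the diff, sidesteps this union entirely, so extra outputs cannot widen the nullability of the projected schema.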
   In Spark 4.0, when building the `metadataProjection`, we use [outputsWithMetadata](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteRowLevelCommand.scala#L196), which includes [REINSERT_OPERATION](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteRowLevelCommand.scala#L44), so `outputs` contains two rows:

   <img width="540" alt="Screenshot 2025-03-31 at 2 32 31 PM" src="https://github.com/user-attachments/assets/bf7e2a3d-6710-4a3d-bc56-33b15453d7c0" />

   Since the second row has null for both `_spec_id` and `_partition`, the computed nullability for both metadata columns becomes true, which causes the schema verification against [MetadataSchema](https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkPositionDeltaWriteBuilder.java#L109) to fail.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org