Re: [PR] [SPARK-51316][PYTHON] Allow Arrow batches in bytes instead of number of rows [spark]

via GitHub Tue, 25 Feb 2025 23:24:45 -0800


viirya commented on code in PR #50080:
URL: https://github.com/apache/spark/pull/50080#discussion_r1971063958



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala:
##########
@@ -150,6 +160,13 @@ private[arrow] abstract class ArrowFieldWriter {
     valueVector.setValueCount(count)
   }
 
+  def getSizeInBytes(): Int = {
+    valueVector.setValueCount(count)
+    // Before calling getBufferSizeFor, we need to call
+    // `setValueCount`, see https://github.com/apache/arrow/pull/9187#issuecomment-763362710
+    valueVector.getBufferSizeFor(count)
+  }

Review Comment:
   Hmm, based on the API doc https://arrow.apache.org/docs/java/vector.html
   
   ```
   After this step, the vector enters an immutable state. In other words, we
   should no longer mutate it. (Unless we reuse the vector by allocating it
   again. This will be discussed shortly.)
   ```
   
   Once `setValueCount` has been called on a Java Arrow field vector, the
   vector should not be modified. But I think this patch calls
   `getSizeInBytes` while values are still being inserted into the vector.
   
   That might cause unexpected errors.
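   
   To make the concern concrete, here is a minimal sketch of the call
   pattern, assuming the Arrow Java classes `RootAllocator` and `IntVector`
   are on the classpath. The object and method names are hypothetical and
   are not part of this PR. It only reproduces the interleaving of writes
   and `setValueCount`; it does not show whether that is safe for every
   vector type.
   
   ```scala
   import org.apache.arrow.memory.RootAllocator
   import org.apache.arrow.vector.IntVector
   
   // Hypothetical helper, for illustration only (not code from the PR).
   object InterleavedSizeCheck {
     // Writes `rows` ints and probes the buffer size after every write,
     // mirroring what the patch's getSizeInBytes() does.
     def sizesWhileWriting(rows: Int): Seq[Int] = {
       val allocator = new RootAllocator(Long.MaxValue)
       val vector = new IntVector("v", allocator)
       try {
         vector.allocateNew()
         (0 until rows).map { i =>
           vector.setSafe(i, i)            // mutate the vector
           val count = i + 1
           vector.setValueCount(count)     // documented as the final step
           vector.getBufferSizeFor(count)  // size probe; the next iteration writes again
         }
       } finally {
         vector.close()
         allocator.close()
       }
     }
   }
   ```
   
   `InterleavedSizeCheck.sizesWhileWriting(4)` returns one size per row
   written. A clean run on a fixed-width vector would not settle the
   question for variable-width or nested vectors, which is where I would
   look first.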



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

