viirya commented on code in PR #50301:
URL: https://github.com/apache/spark/pull/50301#discussion_r2001863178
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonArrowOutput.scala:
##########
@@ -83,17 +89,37 @@ private[python] trait PythonArrowOutput[OUT <: AnyRef] { self: BasePythonRunner[
         throw writer.exception.get
       }
       try {
-        if (reader != null && batchLoaded) {
+        if (batchLoaded && rowCount > 0 && currentRowIdx < rowCount) {
+          val batchRoot = if (arrowMaxRecordsPerOutputBatch > 0) {
+            val remainingRows = rowCount - currentRowIdx
+            if (remainingRows > arrowMaxRecordsPerOutputBatch) {
+              root.slice(currentRowIdx, arrowMaxRecordsPerOutputBatch)
+            } else {
+              root
+            }
+          } else {
+            root
+          }
+
+          currentRowIdx = currentRowIdx + batchRoot.getRowCount
+
+          vectors = batchRoot.getFieldVectors().asScala.map { vector =>
+            new ArrowColumnVector(vector)
+          }.toArray[ColumnVector]
+
+          val batch = new ColumnarBatch(vectors)
+          batch.setNumRows(batchRoot.getRowCount)
+          deserializeColumnarBatch(batch, schema)

Review Comment:
   Yea, we do want to limit the size by bytes too. Because the slice range of an Arrow batch is specified by row index, a byte-size limit would probably need a try-and-check loop: adjust the slice range incrementally and measure its byte size after each step. For example, slice 100 rows and check whether the slice is over the limit; if yes, shrink the range, if no, grow it, and repeat. I am not sure how much overhead that adds, although slicing doesn't copy memory. We can add it alongside the row-count limit and let users choose between them.
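   Roughly like this, as a sketch of the try-and-check idea (the helper names and `maxBytesPerOutputBatch` are made up for illustration, not from this PR); it binary-searches the largest row count whose slice fits under the byte limit instead of stepping one row at a time:

   ```scala
   import scala.jdk.CollectionConverters._

   import org.apache.arrow.vector.VectorSchemaRoot

   object ArrowByteLimitSketch {

     // Byte size of `length` rows starting at `start`, measured by summing the
     // buffer sizes of a transient slice. The slice shares the underlying data
     // buffers, but each probe still creates (and must close) a temporary root.
     def sliceBytes(root: VectorSchemaRoot, start: Int, length: Int): Long = {
       val slice = root.slice(start, length)
       try {
         slice.getFieldVectors.asScala.map(_.getBufferSize.toLong).sum
       } finally {
         // slice(0, rowCount) returns the root itself; don't close that here.
         if (slice ne root) slice.close()
       }
     }

     // Largest row count in [1, remainingRows] whose slice stays within the
     // byte limit. Always returns at least 1 so the reader makes progress even
     // when a single row exceeds the limit.
     def rowsUnderByteLimit(
         root: VectorSchemaRoot,
         start: Int,
         remainingRows: Int,
         maxBytesPerOutputBatch: Long): Int = {
       var lo = 1
       var hi = remainingRows
       while (lo < hi) {
         val mid = lo + (hi - lo + 1) / 2  // round up so the loop always narrows
         if (sliceBytes(root, start, mid) <= maxBytesPerOutputBatch) lo = mid
         else hi = mid - 1
       }
       lo
     }
   }
   ```

   That bounds the try-and-check overhead to O(log n) probe slices per output batch, which should be cheap relative to growing or shrinking the range one step at a time.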