viirya commented on code in PR #50301:
URL: https://github.com/apache/spark/pull/50301#discussion_r2001863178
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonArrowOutput.scala:
##########
@@ -83,17 +89,37 @@ private[python] trait PythonArrowOutput[OUT <: AnyRef] { self: BasePythonRunner[
         throw writer.exception.get
       }
       try {
-        if (reader != null && batchLoaded) {
+        if (batchLoaded && rowCount > 0 && currentRowIdx < rowCount) {
+          val batchRoot = if (arrowMaxRecordsPerOutputBatch > 0) {
+            val remainingRows = rowCount - currentRowIdx
+            if (remainingRows > arrowMaxRecordsPerOutputBatch) {
+              root.slice(currentRowIdx, arrowMaxRecordsPerOutputBatch)
+            } else {
+              root
+            }
+          } else {
+            root
+          }
+
+          currentRowIdx = currentRowIdx + batchRoot.getRowCount
+
+          vectors = batchRoot.getFieldVectors().asScala.map { vector =>
+            new ArrowColumnVector(vector)
+          }.toArray[ColumnVector]
+
+          val batch = new ColumnarBatch(vectors)
+          batch.setNumRows(batchRoot.getRowCount)
+          deserializeColumnarBatch(batch, schema)

Review Comment:
   Yea, we do want to limit the size by bytes too. Because the slice range of an Arrow batch is specified by row index, a byte-size limit would probably need a try-and-check loop: adjust the slice range incrementally and measure its byte size after each step. For example, slice 100 rows and check whether the slice is over the limit; if yes, shrink the range, if no, grow it, and repeat. I am not sure how much overhead that adds, although slicing doesn't copy memory. We can add it alongside the row-count limit and let users choose between them.
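   Roughly like this, as a sketch of the try-and-check idea (the helper names and `maxBytesPerOutputBatch` are made up for illustration, not from this PR); it binary-searches the largest row count whose slice fits under the byte limit instead of stepping one row at a time:

   ```scala
   import scala.jdk.CollectionConverters._

   import org.apache.arrow.vector.VectorSchemaRoot

   object ArrowByteLimitSketch {

     // Byte size of `length` rows starting at `start`, measured by summing the
     // buffer sizes of a transient slice. The slice shares the underlying data
     // buffers, but each probe still creates (and must close) a temporary root.
     def sliceBytes(root: VectorSchemaRoot, start: Int, length: Int): Long = {
       val slice = root.slice(start, length)
       try {
         slice.getFieldVectors.asScala.map(_.getBufferSize.toLong).sum
       } finally {
         // slice(0, rowCount) returns the root itself; don't close that here.
         if (slice ne root) slice.close()
       }
     }

     // Largest row count in [1, remainingRows] whose slice stays within the
     // byte limit. Always returns at least 1 so the reader makes progress even
     // when a single row exceeds the limit.
     def rowsUnderByteLimit(
         root: VectorSchemaRoot,
         start: Int,
         remainingRows: Int,
         maxBytesPerOutputBatch: Long): Int = {
       var lo = 1
       var hi = remainingRows
       while (lo < hi) {
         val mid = lo + (hi - lo + 1) / 2  // round up so the loop always narrows
         if (sliceBytes(root, start, mid) <= maxBytesPerOutputBatch) lo = mid
         else hi = mid - 1
       }
       lo
     }
   }
   ```

   That bounds the try-and-check overhead to O(log n) probe slices per output batch, which should be cheap relative to growing or shrinking the range one step at a time.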