HyukjinKwon commented on code in PR #50301:
URL: https://github.com/apache/spark/pull/50301#discussion_r2009374157

##########
sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonArrowOutput.scala:
##########

```diff
@@ -83,17 +89,37 @@ private[python] trait PythonArrowOutput[OUT <: AnyRef] { self: BasePythonRunner[
           throw writer.exception.get
         }
         try {
-          if (reader != null && batchLoaded) {
+          if (batchLoaded && rowCount > 0 && currentRowIdx < rowCount) {
+            val batchRoot = if (arrowMaxRecordsPerOutputBatch > 0) {
+              val remainingRows = rowCount - currentRowIdx
+              if (remainingRows > arrowMaxRecordsPerOutputBatch) {
+                root.slice(currentRowIdx, arrowMaxRecordsPerOutputBatch)
+              } else {
+                root
+              }
+            } else {
+              root
+            }
+
+            currentRowIdx = currentRowIdx + batchRoot.getRowCount
+
+            vectors = batchRoot.getFieldVectors().asScala.map { vector =>
+              new ArrowColumnVector(vector)
+            }.toArray[ColumnVector]
+
+            val batch = new ColumnarBatch(vectors)
+            batch.setNumRows(batchRoot.getRowCount)
+            deserializeColumnarBatch(batch, schema)
```

Review Comment:
   I think slicing by bytes is always preferable. But I have to agree that reusing the existing code path is good, and this patch is minimal. I am fine with going ahead with this for now, but I would like to make sure the code change stays isolated and the configuration remains internal, so we can refactor things more easily later to support byte-based output, etc.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
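For reference, the row-count chunking the diff implements (repeatedly slicing a loaded batch of `rowCount` rows into at most `arrowMaxRecordsPerOutputBatch` rows per output batch) can be sketched independently of Arrow. This is a minimal illustration of the arithmetic only; the function name and signature below are hypothetical and not part of the patch:

```python
def output_batch_slices(row_count, max_records_per_output_batch):
    """Hypothetical helper mirroring the patch's slicing loop: return the
    (start_index, length) pairs that successive root.slice(...) calls would
    cover. A non-positive cap means the whole batch is emitted unsliced,
    matching the `arrowMaxRecordsPerOutputBatch > 0` guard in the diff."""
    if row_count <= 0:
        return []
    if max_records_per_output_batch <= 0:
        return [(0, row_count)]
    return [
        # The last slice may be shorter, like the `remainingRows` branch
        # that falls back to the un-sliced root.
        (start, min(max_records_per_output_batch, row_count - start))
        for start in range(0, row_count, max_records_per_output_batch)
    ]

# A 10-row batch capped at 3 records per output batch yields four slices,
# the final one holding the 1-row remainder.
print(output_batch_slices(10, 3))
```

Note that Arrow's `slice` on a batch is zero-copy over the underlying buffers, which is why a record-count cap reuses the existing code path cheaply; the byte-based slicing the review prefers would instead require accounting for variable-width buffer sizes per row.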
########## sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonArrowOutput.scala: ########## @@ -83,17 +89,37 @@ private[python] trait PythonArrowOutput[OUT <: AnyRef] { self: BasePythonRunner[ throw writer.exception.get } try { - if (reader != null && batchLoaded) { + if (batchLoaded && rowCount > 0 && currentRowIdx < rowCount) { + val batchRoot = if (arrowMaxRecordsPerOutputBatch > 0) { + val remainingRows = rowCount - currentRowIdx + if (remainingRows > arrowMaxRecordsPerOutputBatch) { + root.slice(currentRowIdx, arrowMaxRecordsPerOutputBatch) + } else { + root + } + } else { + root + } + + currentRowIdx = currentRowIdx + batchRoot.getRowCount + + vectors = batchRoot.getFieldVectors().asScala.map { vector => + new ArrowColumnVector(vector) + }.toArray[ColumnVector] + + val batch = new ColumnarBatch(vectors) + batch.setNumRows(batchRoot.getRowCount) + deserializeColumnarBatch(batch, schema) Review Comment: I think slicing by bytes is preferred always .. But I have to agree that reusing code path is good, and this patch is minimized. I am fine with going ahead with this for now, but I would like to make sure the code change is isolated, and the configuration is internal so we can refactor things more easily to support byte output, etc .. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org