viirya commented on code in PR #50301:
URL: https://github.com/apache/spark/pull/50301#discussion_r2029710907


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -3391,6 +3391,19 @@ object SQLConf {
       .intConf
       .createWithDefault(10000)

+  val ARROW_EXECUTION_MAX_RECORDS_PER_OUTPUT_BATCH =
+    buildConf("spark.sql.execution.arrow.maxRecordsPerOutputBatch")

Review Comment:
   Vectorized engines usually have a maximum batch size setting, which prevents an overly large batch from being passed as input and possibly causing memory issues. Currently, the Arrow output batch is sent directly as input to downstream operators, which may be custom vectorized operators. If a user is concerned that the output batch might be too large for those operators, they can set this config to limit the output batch size.
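For illustration, a minimal sketch of how a user might set this config. The config key `spark.sql.execution.arrow.maxRecordsPerOutputBatch` is taken from the diff above; the app name and the value `4096` are arbitrary choices for the example, not recommendations.

```scala
import org.apache.spark.sql.SparkSession

object ArrowOutputBatchLimitExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ArrowOutputBatchLimitExample")
      // Cap the number of records per Arrow output batch handed to
      // downstream (possibly custom vectorized) operators. The value
      // 4096 is an arbitrary illustration.
      .config("spark.sql.execution.arrow.maxRecordsPerOutputBatch", "4096")
      .getOrCreate()

    // ... run Arrow-backed queries here; output batches would be
    // split so that none exceeds the configured record count.

    spark.stop()
  }
}
```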
########## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ########## @@ -3391,6 +3391,19 @@ object SQLConf { .intConf .createWithDefault(10000) + val ARROW_EXECUTION_MAX_RECORDS_PER_OUTPUT_BATCH = + buildConf("spark.sql.execution.arrow.maxRecordsPerOutputBatch") Review Comment: Vectorized engines usually have a maximum batch size setting which prevents too big batch as input that possibly causes memory issue. For the downstream operators if they are custom vectorized operators, currently the Arrow output batch is sent to them as input. When user worries that the output batch might be too big for these operators, the user can set this config to limit the output batch size. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org