andygrove opened a new issue, #1186:
URL: https://github.com/apache/datafusion-comet/issues/1186

   ### What is the problem the feature request solves?
   
We use Arrow IPC to write shuffle output. We currently create a new writer for each batch, which means we serialize the schema for every batch.
   
   ```rust
    // A new StreamWriter is created per batch, so the schema is serialized
    // into the zstd-compressed output stream every time
    let mut arrow_writer =
        StreamWriter::try_new(zstd::Encoder::new(output, 1)?, &batch.schema())?;
    arrow_writer.write(batch)?;
    arrow_writer.finish()?;
   ```
   
   The schema is guaranteed to be the same for every batch, so we should be able 
to use a single writer for all batches and avoid the cost of serializing the 
schema each time.
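   A minimal sketch of the single-writer approach (not the actual Comet code): create the `StreamWriter` once from the first batch's schema, write all batches through it, and call `finish()` only at the end, so the schema is serialized into the IPC stream once. The `write_batches` helper and the in-memory `Vec<u8>` output are assumptions for illustration, and the zstd wrapping from the snippet above is omitted for brevity.
   
   ```rust
   use std::io::Write;
   use std::sync::Arc;
   
   use arrow::array::Int32Array;
   use arrow::datatypes::{DataType, Field, Schema};
   use arrow::error::ArrowError;
   use arrow::ipc::writer::StreamWriter;
   use arrow::record_batch::RecordBatch;
   
   /// Hypothetical helper: serialize all batches with one writer so the
   /// schema is written only once. Assumes `batches` is non-empty and all
   /// batches share the same schema.
   fn write_batches<W: Write>(output: W, batches: &[RecordBatch]) -> Result<(), ArrowError> {
       let schema = batches[0].schema();
       let mut writer = StreamWriter::try_new(output, &schema)?;
       for batch in batches {
           writer.write(batch)?;
       }
       // finish() writes the end-of-stream marker
       writer.finish()
   }
   
   fn main() -> Result<(), ArrowError> {
       let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
       let batch = RecordBatch::try_new(
           schema,
           vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
       )?;
       let mut buf = Vec::new();
       write_batches(&mut buf, &[batch.clone(), batch])?;
       println!("wrote {} bytes", buf.len());
       Ok(())
   }
   ```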
   
   Based on one benchmark in 
https://github.com/apache/datafusion-comet/pull/1180, I am seeing a 4x speedup 
in encoding time from re-using the writer.
   
   ### Describe the potential solution
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

