andygrove commented on PR #1190: URL: https://github.com/apache/datafusion-comet/pull/1190#issuecomment-2587446533
> FWIW this encoding format is almost identical to the IPC format AFAICT, with only some minor changes to the metadata encoding. The exact same validation is done when reading an array, and both preserve the buffers as is¹.
>
> I suspect that much of the overhead is fixed overhead in StreamWriter, e.g. encoding the schema, and that this could be optimised and/or eliminated by using the lower-level APIs such as [write_message](https://docs.rs/arrow-ipc/latest/arrow_ipc/writer/fn.write_message.html) and [root_as_message](https://docs.rs/arrow-ipc/latest/arrow_ipc/gen/Message/fn.root_as_message.html). The benchmarks at least appear to agree with this, with a relatively fixed performance delta on the order of tens of microseconds between the two encoders.
>
> Just thought I'd put it out there; there is likely low-hanging fruit in the arrow-rs IPC implementation, and improvements there would definitely be welcome, not just by myself.
>
> 1. _This is actually not entirely true: when dealing with sliced arrays, the arrow IPC implementation has additional logic to avoid encoding data that isn't referenced._

Thanks for the feedback @tustvold. I am following the current efforts to optimize Arrow IPC and will try to help.

For the Comet use case, the batches that we are writing have always been freshly created and are guaranteed not to have been sliced, so we can get away with this approach.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
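The sliced-array caveat in the footnote can be illustrated with a small stdlib-only sketch. The types below (`Int32Buffer`, `Slice`) are hypothetical stand-ins, not arrow-rs APIs: a slice is just an `(offset, len)` view over a shared backing buffer, so a writer that emits buffers verbatim would also encode data the slice never references. A freshly created, never-sliced batch always covers its whole buffer, which is why Comet can skip that check.

```rust
// Hypothetical stand-ins for illustration only; real Arrow arrays use
// typed buffers, validity bitmaps, etc.
struct Int32Buffer {
    data: Vec<i32>, // backing buffer, potentially shared by several slices
}

struct Slice<'a> {
    buf: &'a Int32Buffer,
    offset: usize,
    len: usize,
}

impl<'a> Slice<'a> {
    // A freshly created batch: the slice spans the entire buffer, so
    // writing the buffer verbatim encodes exactly the referenced data.
    fn covers_whole_buffer(&self) -> bool {
        self.offset == 0 && self.len == self.buf.data.len()
    }

    // The values the slice actually references; a naive "buffers as is"
    // writer would instead emit all of `buf.data`.
    fn referenced(&self) -> &[i32] {
        &self.buf.data[self.offset..self.offset + self.len]
    }
}

fn main() {
    let buf = Int32Buffer { data: vec![1, 2, 3, 4, 5, 6, 7, 8] };

    // Fresh batch: verbatim buffer write is safe (the Comet case).
    let fresh = Slice { buf: &buf, offset: 0, len: 8 };
    assert!(fresh.covers_whole_buffer());
    assert_eq!(fresh.referenced().len(), 8);

    // Sliced batch: verbatim write would encode 8 values for 3 referenced
    // ones, which is the extra logic the arrow IPC writer has to handle.
    let sliced = Slice { buf: &buf, offset: 2, len: 3 };
    assert!(!sliced.covers_whole_buffer());
    assert_eq!(sliced.referenced(), &[3, 4, 5]);
}
```

This is only a model of the bookkeeping involved; the real arrow-rs IPC writer additionally has to recompute offsets and truncate validity bitmaps when a sliced array is encoded.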