andygrove commented on PR #1190: URL: https://github.com/apache/datafusion-comet/pull/1190#issuecomment-2587446533
> FWIW this encoding format is almost identical to the IPC format AFAICT, with only some minor changes to the metadata encoding. The exact same validation is done when reading an array, and both preserve the buffers as is¹.
>
> I suspect that much of the overhead is fixed overhead in StreamWriter, e.g. encoding the schema, and that this could be optimised and/or eliminated by using the lower-level APIs such as [write_message](https://docs.rs/arrow-ipc/latest/arrow_ipc/writer/fn.write_message.html) and [root_as_message](https://docs.rs/arrow-ipc/latest/arrow_ipc/gen/Message/fn.root_as_message.html). The benchmarks at least appear to agree with this, with a relatively fixed performance delta on the order of tens of microseconds between the two encoders.
>
> Just thought I'd put it out there; there is likely low-hanging fruit in the arrow-rs IPC implementation, and improvements there would definitely be welcome, not just by myself.
>
> 1. _This is actually not entirely true: when dealing with sliced arrays, the arrow IPC implementation has additional logic to avoid encoding data that isn't referenced._

Thanks for the feedback @tustvold. I am following the current efforts to optimize Arrow IPC and will try to help.

For the Comet use case, the batches that we are writing have always been freshly created and are guaranteed not to have been sliced, so we can get away with this approach.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
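The sliced-array caveat in the footnote can be illustrated with a small stdlib-only sketch. The types below (`Int32Buffer`, `Slice`) are hypothetical stand-ins, not arrow-rs APIs: a slice is just an `(offset, len)` view over a shared backing buffer, so a writer that emits buffers verbatim would also encode data the slice never references. A freshly created, never-sliced batch always covers its whole buffer, which is why Comet can skip that check.

```rust
// Hypothetical stand-ins for illustration only; real Arrow arrays use
// typed buffers, validity bitmaps, etc.
struct Int32Buffer {
    data: Vec<i32>, // backing buffer, potentially shared by several slices
}

struct Slice<'a> {
    buf: &'a Int32Buffer,
    offset: usize,
    len: usize,
}

impl<'a> Slice<'a> {
    // A freshly created batch: the slice spans the entire buffer, so
    // writing the buffer verbatim encodes exactly the referenced data.
    fn covers_whole_buffer(&self) -> bool {
        self.offset == 0 && self.len == self.buf.data.len()
    }

    // The values the slice actually references; a naive "buffers as is"
    // writer would instead emit all of `buf.data`.
    fn referenced(&self) -> &[i32] {
        &self.buf.data[self.offset..self.offset + self.len]
    }
}

fn main() {
    let buf = Int32Buffer { data: vec![1, 2, 3, 4, 5, 6, 7, 8] };

    // Fresh batch: verbatim buffer write is safe (the Comet case).
    let fresh = Slice { buf: &buf, offset: 0, len: 8 };
    assert!(fresh.covers_whole_buffer());
    assert_eq!(fresh.referenced().len(), 8);

    // Sliced batch: verbatim write would encode 8 values for 3 referenced
    // ones, which is the extra logic the arrow IPC writer has to handle.
    let sliced = Slice { buf: &buf, offset: 2, len: 3 };
    assert!(!sliced.covers_whole_buffer());
    assert_eq!(sliced.referenced(), &[3, 4, 5]);
}
```

This is only a model of the bookkeeping involved; the real arrow-rs IPC writer additionally has to recompute offsets and truncate validity bitmaps when a sliced array is encoded.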