Hi Yunhong,

This isn't a Java issue: the Arrow IPC spec only supports per-buffer 
compression [1], though it does mention other designs as a potential future 
improvement. If you think a different scheme would be useful, it could be 
helpful to sketch a proposal and/or bring some benchmarks.

Note that most vectors/arrays only have a data buffer and maybe a validity 
buffer, so I'm not sure bundling them together would matter much. Can you 
share more details about the overhead you're seeing and your use case?
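
To illustrate the buffer-count point, a quick sketch (names are from 
arrow-java's public API; the printed counts reflect how fixed-width and 
variable-width vectors lay out their buffers):

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VarCharVector;

public class BufferCountSketch {
  public static void main(String[] args) {
    try (BufferAllocator allocator = new RootAllocator();
         IntVector ints = new IntVector("ints", allocator);
         VarCharVector strings = new VarCharVector("strings", allocator)) {
      ints.allocateNew(8);
      ints.setValueCount(8);
      strings.allocateNew(8);
      strings.setValueCount(8);

      // Fixed-width vectors carry a validity buffer and a data buffer.
      System.out.println("int buffers: " + ints.getFieldBuffers().size());       // 2
      // Variable-width vectors add an offsets buffer on top of those.
      System.out.println("varchar buffers: " + strings.getFieldBuffers().size()); // 3
    }
  }
}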

[1]: 
https://github.com/apache/arrow/blob/20d8acd89f5ebf87295e08ed10e2f94cb03d57d0/format/Message.fbs#L55-L67

Thanks,
David

On Wed, Feb 19, 2025, at 14:54, yh z wrote:
> Hi all. Currently, when arrow-java compresses an ArrowRecordBatch in
> VectorUnloader, it compresses each ArrowBuf within a FieldVector
> separately instead of compressing at the FieldVector level. From a
> compression-rate perspective, larger inputs generally compress better.
> Additionally, calling compress(BufferAllocator allocator, ArrowBuf
> uncompressedBuffer) multiple times may consume more CPU than calling it
> once.
> Therefore, I would like to ask whether there will be support for
> compression at the FieldVector level, which could improve the compression
> ratio without affecting the ability to read individual columns.
>
> Many thanks,
> Yunhong Zheng
