Re: Help in reconciling how arrow helps with columnar processing?

Wes McKinney Mon, 04 Dec 2017 07:54:11 -0800

hi Matan,

I recommend this presentation for a detailed discussion of these
points: 
https://www.slideshare.net/julienledem/the-columnar-roadmap-apache-parquet-and-apache-arrow

To your questions:

1. Arrow's "fully shredded" columnar representation ensures a few things:

* Reliable data locality for scan operations on all data types (for
example, consecutive strings in Arrow are guaranteed to be next to
each other in memory)
* Contiguous memory plus buffer alignment / padding permits consistent
use of SIMD, if available
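
As a rough illustration of the locality point, here is a minimal sketch of
a variable-length string column stored as one contiguous data buffer plus
an offsets buffer. This is a simplified model written with the Python
standard library only, not the Arrow library itself: the helper names
(build_string_column, get_value) are hypothetical, and the validity bitmap
and buffer padding / alignment are left out.

```python
import array

def build_string_column(values):
    """Pack strings into one contiguous data buffer plus an offsets buffer."""
    # offsets has len(values) + 1 entries; value i lives in data[offsets[i]:offsets[i + 1]]
    offsets = array.array("i", [0])
    data = bytearray()
    for v in values:
        data += v.encode("utf-8")   # consecutive strings end up adjacent in memory
        offsets.append(len(data))
    return offsets, bytes(data)

def get_value(offsets, data, i):
    """Fetch value i with two offset lookups and one slice (no per-string objects to chase)."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")

offsets, data = build_string_column(["arrow", "is", "columnar"])
second = get_value(offsets, data, 1)
```

A scan over the column walks `data` front to back, which is the access
pattern that caches and SIMD units handle well.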

2. We have developed a zero-copy messaging / IPC framework that
enables interacting with arbitrary-size Arrow memory in any virtual
address space without copying or deserialization -- in brief, a
dataset is accompanied by a metadata descriptor (serialized using the
Google Flatbuffers library) that indicates the locations of each
memory block constituting a particular column in a particular table.
So we can locate the memory offset corresponding to a particular cell
in a dataset in O(1) time, and scan data in shared memory / memory
maps without having to materialize copies in RAM.

See http://arrow.apache.org/docs/ipc.html for a discussion of the
messaging protocol. Earlier this year I wrote about how this enables
very fast movement of streaming tabular data in
http://wesmckinney.com/blog/arrow-streaming-columnar/
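
To make the "metadata descriptor plus memory blocks" idea concrete, here is
a minimal sketch under loud assumptions: the file layout below is made up
for illustration and is not the Arrow IPC format, JSON stands in for the
Flatbuffers metadata, only int64 columns are handled, and write_dataset /
open_dataset are hypothetical helpers. It only shows how a descriptor of
buffer locations lets a reader address column data inside a memory map
without copying or deserializing it.

```python
import json
import mmap
import os
import struct
import tempfile

def write_dataset(path, columns):
    """Write [8-byte metadata length][JSON metadata, padded][int64 column buffers]."""
    meta = {}
    body = bytearray()
    for name, values in columns.items():
        meta[name] = {"offset": len(body), "length": len(values)}
        body += struct.pack(f"{len(values)}q", *values)  # native-endian int64
    meta_bytes = json.dumps(meta).encode("utf-8")
    meta_bytes += b" " * (-len(meta_bytes) % 8)  # pad so the body starts on an 8-byte boundary
    with open(path, "wb") as f:
        f.write(struct.pack("q", len(meta_bytes)))
        f.write(meta_bytes)
        f.write(body)

def open_dataset(path):
    """Map the file and return typed views over the column buffers (no copies)."""
    f = open(path, "rb")
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    (meta_len,) = struct.unpack_from("q", mm, 0)
    meta = json.loads(mm[8:8 + meta_len])  # only the small descriptor is parsed
    body_start = 8 + meta_len
    view = memoryview(mm)
    cols = {}
    for name, d in meta.items():
        start = body_start + d["offset"]
        # Slicing and casting a memoryview creates a new view, not a copy of the data
        cols[name] = view[start:start + 8 * d["length"]].cast("q")
    return f, mm, cols

path = os.path.join(tempfile.mkdtemp(), "table.bin")
write_dataset(path, {"x": [1, 2, 3], "y": [10, 20, 30]})
f, mm, cols = open_dataset(path)
cell = cols["y"][2]     # O(1): buffer start + 2 * 8 bytes, read from the mapping
total = sum(cols["x"])  # scan directly over the mapped memory
```

The column views stay valid only while the mapping is open, so a real
reader has to tie buffer lifetimes to the mapping.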

Thanks
Wes

On Sat, Dec 2, 2017 at 11:50 AM, Daniel Lemire <lem...@gmail.com> wrote:
> I don't know the answer per se but my understanding is that
> Arrow enables computational kernels that can be highly optimized.
> I plan to do some work in this direction myself.
>
> - Daniel
>
>
> Hi,
>>
>> I wonder if anyone can comment on how Apache Arrow accomplishes, or helps
>> accomplish, the following, taken from the Apache page
>> <http://arrow.apache.org/>:
>>
>> Apache Arrow™ enables execution engines to take advantage of the latest
>> SIMD (Single instruction, multiple data) operations included in modern processors,
>> for native vectorized optimization of analytical data processing. Columnar
>> layout is optimized for data locality for better performance on modern
>> hardware like CPUs and GPUs.
>>
>> The Arrow memory format supports *zero-copy reads* for lightning-fast data
>> access without serialization overhead.
>> Can anyone provide information concerning how the standard specifically
>> helps with those concerns, in particular the ones highlighted above?
>>
>> Disclaimer: I've not read the source or the source of the related repos.
>>
>> Many thanks!
>> Matan
>>
