Re: Help in reconciling how arrow helps with columnar processing?

Matan Safriel Mon, 04 Dec 2017 22:22:51 -0800

Thanks Wes,

This really makes a lot of sense, and I'll keep the links for future reference!


Matan

> On 4 Dec 2017, at 17:52, Wes McKinney <[email protected]> wrote:
> 
> hi Matan,
> 
> I recommend this presentation for a detailed discussion of these
> points: 
> https://www.slideshare.net/julienledem/the-columnar-roadmap-apache-parquet-and-apache-arrow
> 
> To your questions:
> 
> 1. Arrow's "fully shredded" columnar representation ensures a few things:
> 
> * Reliable data locality for scan operations on all data types (for
> example, consecutive strings in Arrow are guaranteed to be next to
> each other in memory)
> * Contiguous memory plus buffer alignment / padding permits consistent
> use of SIMD, if available
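
To make the locality point in item 1 concrete, here is a minimal sketch, assuming a simplified Arrow-style string column: one buffer of int32 offsets plus one contiguous buffer of UTF-8 bytes. The function names are hypothetical and this is not the Arrow library API; the validity bitmap and the buffer alignment / padding mentioned above are left out to keep it short.

```python
import struct

def encode_string_column(values):
    """Pack strings into two contiguous buffers: n+1 int32 offsets and the UTF-8 bytes."""
    data = bytearray()
    offsets = [0]
    for v in values:
        data += v.encode("utf-8")   # consecutive strings end up adjacent in memory
        offsets.append(len(data))   # offsets[i+1] marks where string i ends
    offsets_buf = struct.pack(f"<{len(offsets)}i", *offsets)
    return offsets_buf, bytes(data)

def get_value(offsets_buf, data_buf, i):
    """O(1) lookup of element i: read two adjacent offsets, then slice the data buffer."""
    start, end = struct.unpack_from("<2i", offsets_buf, 4 * i)
    return data_buf[start:end].decode("utf-8")

offsets_buf, data_buf = encode_string_column(["arrow", "", "columnar"])
first = get_value(offsets_buf, data_buf, 0)
last = get_value(offsets_buf, data_buf, 2)
```

Because the bytes of every string sit back to back in `data_buf`, a scan over the column walks memory sequentially instead of chasing one pointer per string.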
> 
> 2. We have developed a zero-copy messaging / IPC framework that
> enables interacting with arbitrary-size Arrow memory in any virtual
> address space without copying or deserialization -- in brief, a
> dataset is accompanied by a metadata descriptor (serialized using the
> Google Flatbuffers library) that indicates the locations of each
> memory block constituting a particular column in a particular table.
> So we can locate the memory offset corresponding to a particular cell
> in a dataset in O(1) time, and scan data in shared memory / memory
> maps without having to materialize copies in RAM.
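
The descriptor idea in item 2 can be sketched as follows, under loudly stated assumptions: JSON stands in for the Flatbuffers metadata, the file layout is invented for illustration, and `write_columns` / `scan_column` are hypothetical helpers. This is not the Arrow IPC format; it only shows how a descriptor of buffer locations lets a reader view a column through a memory map without copying or deserializing it.

```python
import array
import json
import mmap
import os
import tempfile

def write_columns(path, columns):
    """Write each column as one contiguous buffer; return a JSON descriptor of buffer locations."""
    layout = {}
    with open(path, "wb") as f:
        for name, (typecode, values) in columns.items():
            f.write(b"\x00" * (-f.tell() % 8))  # pad so each buffer starts 8-byte aligned
            buf = array.array(typecode, values)  # native byte order, fixed-width values
            layout[name] = {"typecode": typecode, "offset": f.tell(), "length": len(buf)}
            f.write(buf.tobytes())
    return json.dumps(layout)

def scan_column(path, descriptor, name, func):
    """Memory-map the file and apply func to a typed view of one column, without copying it."""
    meta = json.loads(descriptor)[name]
    start = meta["offset"]
    stop = start + meta["length"] * array.array(meta["typecode"]).itemsize
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # The views below alias the mapped pages; they are released before the map closes.
        with memoryview(mm) as whole:
            with whole[start:stop] as raw, raw.cast(meta["typecode"]) as col:
                return func(col)

path = os.path.join(tempfile.mkdtemp(), "table.bin")
descriptor = write_columns(path, {"id": ("i", [1, 2, 3]), "price": ("d", [9.5, 20.0, 0.25])})
third_price = scan_column(path, descriptor, "price", lambda col: col[2])  # O(1) cell access
total_ids = scan_column(path, descriptor, "id", sum)                      # scan over the view
```

Locating a cell is plain arithmetic on the descriptor (buffer offset plus row index times item size), which is where the O(1) claim comes from in this toy layout.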
> 
> See http://arrow.apache.org/docs/ipc.html for a discussion of the
> messaging protocol. Earlier this year I wrote about how this enables
> very fast movement of streaming tabular data in
> http://wesmckinney.com/blog/arrow-streaming-columnar/
> 
> Thanks
> Wes
> 
>> On Sat, Dec 2, 2017 at 11:50 AM, Daniel Lemire <[email protected]> wrote:
>> I don't know the answer per se but my understanding is that
>> Arrow enables computational kernels that can be highly optimized.
>> I plan to do some work in this direction myself.
>> 
>> - Daniel
>> 
>> 
>>> Hi,
>>> 
>>> I wonder if anyone can comment on how Apache Arrow accomplishes, or helps
>>> accomplish, the following, taken from the Apache page
>>> <http://arrow.apache.org/>:
>>> 
>>> Apache Arrow™ enables execution engines to take advantage of the latest
>>> SIMD (Single Instruction, Multiple Data) operations included in modern processors,
>>> for native vectorized optimization of analytical data processing. Columnar
>>> layout is optimized for data locality for better performance on modern
>>> hardware like CPUs and GPUs.
>>> 
>>> The Arrow memory format supports *zero-copy reads* for lightning-fast data
>>> access without serialization overhead.
>>> 
>>> Can anyone provide information concerning how the standard specifically
>>> helps with those concerns, in particular the ones highlighted above?
>>> 
>>> Disclaimer: I've not read the Arrow source or the source of the related repos.
>>> 
>>> Many thanks!
>>> Matan
>>>
