Hi Matan,

I recommend this presentation for a detailed discussion of these points:
https://www.slideshare.net/julienledem/the-columnar-roadmap-apache-parquet-and-apache-arrow

To your questions:

1. Arrow's "fully shredded" columnar representation ensures a few things:

* Reliable data locality for scan operations on all data types (for
  example, consecutive strings in Arrow are guaranteed to be next to
  each other in memory)
* Contiguous memory plus buffer alignment / padding permits consistent
  use of SIMD, if available

2. We have developed a zero-copy messaging / IPC framework that enables
interacting with arbitrary-size Arrow memory in any virtual address
space without copying or deserialization. In brief, a dataset is
accompanied by a metadata descriptor (serialized using the Google
Flatbuffers library) that indicates the location of each memory block
constituting a particular column in a particular table. So we can locate
the memory offset corresponding to a particular cell in a dataset in
O(1) time, and scan data in shared memory / memory maps without having
to materialize copies in RAM.

See http://arrow.apache.org/docs/ipc.html for a discussion of the
messaging protocol. Earlier this year I wrote about how this enables
very fast movement of streaming tabular data in
http://wesmckinney.com/blog/arrow-streaming-columnar/

Thanks
Wes

On Sat, Dec 2, 2017 at 11:50 AM, Daniel Lemire <lem...@gmail.com> wrote:

> I don't know the answer per se but my understanding is that
> Arrow enables computational kernels that can be highly optimized.
> I plan to do some work in this direction myself.
>
> - Daniel
>
>
> Hi,
>>
>> I wonder if anyone can comment on how does Apache Arrow accomplish, or help
>> accomplish the following, taken from the Apache page
>> <http://arrow.apache.org/>:
>>
>> Apache Arrow™ enables execution engines to take advantage of the latest
>> SIMD (Single input multiple data) operations included in modern processors,
>> for native vectorized optimization of analytical data processing. Columnar
>> layout is optimized for data locality for better performance on modern
>> hardware like CPUs and GPUs.
>>
>> The Arrow memory format supports *zero-copy reads* for lightning-fast data
>> access without serialization overhead.
>>
>> Can anyone provide information concerning how the standard specifically
>> helps with those concerns, in particular the ones highlighted above?
>>
>> Disclaimer: I've not read the source or the source of the related repos.
>>
>> Many thanks!
>> Matan
>>
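
To make points 1 and 2 of the reply a bit more concrete, here is a minimal Python sketch of the idea behind a variable-length string column: one contiguous data buffer plus an int32 offsets buffer, so that any cell can be located with two offset reads and viewed without copying. This is only an illustration using the standard library, not the Arrow library's API; the helper names `build_string_column` and `get_cell` are hypothetical, and the validity bitmap and buffer padding that Arrow also defines are left out for brevity.

```python
import struct


def build_string_column(values):
    """Pack strings into an (offsets, data) buffer pair, Arrow-style.

    Hypothetical helper for illustration only; not part of any Arrow API.
    """
    encoded = [v.encode("utf-8") for v in values]
    offsets = [0]
    for chunk in encoded:
        offsets.append(offsets[-1] + len(chunk))
    # Little-endian int32 offsets: value i occupies data[offsets[i]:offsets[i+1]].
    offsets_buf = struct.pack("<%di" % len(offsets), *offsets)
    # All string bytes sit back to back in one contiguous buffer.
    data_buf = b"".join(encoded)
    return offsets_buf, data_buf


def get_cell(offsets_buf, data_buf, i):
    """Return a view of value i without copying the string bytes."""
    # Two adjacent int32 reads give the start and end of the cell: O(1).
    start, end = struct.unpack_from("<2i", offsets_buf, 4 * i)
    return memoryview(data_buf)[start:end]


offsets_buf, data_buf = build_string_column(["arrow", "is", "columnar"])
cell = get_cell(offsets_buf, data_buf, 2)
print(bytes(cell).decode("utf-8"))
```

The same lookup works unchanged if `data_buf` and `offsets_buf` are backed by a memory map or shared memory instead of `bytes`, which is roughly what the metadata descriptor makes possible: it tells a reader where each buffer starts, and the reader indexes into it in place.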