Thanks Wes,

This really makes a lot of sense, and I'll keep the links for future reference!

Matan

Sent from my iPad

> On 4 Dec 2017, at 17:52, Wes McKinney <wesmck...@gmail.com> wrote:
>
> hi Matan,
>
> I recommend this presentation for a detailed discussion of these
> points:
> https://www.slideshare.net/julienledem/the-columnar-roadmap-apache-parquet-and-apache-arrow
>
> To your questions:
>
> 1. Arrow's "fully shredded" columnar representation ensures a few things:
>
> * Reliable data locality for scan operations on all data types (for
>   example, consecutive strings in Arrow are guaranteed to be next to
>   each other in memory)
> * Contiguous memory plus buffer alignment / padding permits consistent
>   use of SIMD, if available
>
> 2. We have developed a zero-copy messaging / IPC framework that
> enables interacting with arbitrary-size Arrow memory in any virtual
> address space without copying or deserialization -- in brief, a
> dataset is accompanied by a metadata descriptor (serialized using the
> Google Flatbuffers library) that indicates the location of each
> memory block constituting a particular column in a particular table.
> So we can locate the memory offset corresponding to a particular cell
> in a dataset in O(1) time, and scan data in shared memory / memory
> maps without having to materialize copies in RAM.
>
> See http://arrow.apache.org/docs/ipc.html for a discussion of the
> messaging protocol. Earlier this year I wrote about how this enables
> very fast movement of streaming tabular data in
> http://wesmckinney.com/blog/arrow-streaming-columnar/
>
> Thanks
> Wes
>
>> On Sat, Dec 2, 2017 at 11:50 AM, Daniel Lemire <lem...@gmail.com> wrote:
>> I don't know the answer per se, but my understanding is that
>> Arrow enables computational kernels that can be highly optimized.
>> I plan to do some work in this direction myself.
>>
>> - Daniel
>>
>>
>> Hi,
>>>
>>> I wonder if anyone can comment on how Apache Arrow accomplishes, or helps
>>> accomplish, the following, taken from the Apache page
>>> <http://arrow.apache.org/>:
>>>
>>> Apache Arrow™ enables execution engines to take advantage of the latest
>>> SIMD (Single instruction, multiple data) operations included in modern
>>> processors, for native vectorized optimization of analytical data
>>> processing. Columnar layout is optimized for data locality for better
>>> performance on modern hardware like CPUs and GPUs.
>>>
>>> The Arrow memory format supports *zero-copy reads* for lightning-fast data
>>> access without serialization overhead.
>>>
>>> Can anyone provide information concerning how the standard specifically
>>> helps with those concerns, in particular the ones highlighted above?
>>>
>>> Disclaimer: I've not read the source, or the source of the related repos.
>>>
>>> Many thanks!
>>> Matan
>>>
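
For readers of the thread, here is a rough sketch of the ideas in Wes's points 1 and 2: consecutive strings stored back to back in one data buffer, plus an offsets buffer that lets a reader find any cell in O(1) and view it without copying. It uses only the Python standard library, not the Arrow libraries. The function names `encode_strings` and `get_cell` are hypothetical, and the sketch leaves out the validity bitmap, buffer alignment / padding, and the Flatbuffers metadata, so treat it as an illustration of the principle rather than Arrow's actual implementation.

```python
import struct


def encode_strings(values):
    """Pack strings into an int32 offsets buffer and one contiguous data buffer.

    Hypothetical helper: a simplified stand-in for a columnar string layout.
    """
    data = bytearray()
    offsets = [0]
    for v in values:
        data += v.encode("utf-8")
        offsets.append(len(data))  # end of this value == start of the next
    # Little-endian int32 offsets; there is one more offset than values.
    offsets_buf = struct.pack(f"<{len(offsets)}i", *offsets)
    return offsets_buf, bytes(data)


def get_cell(offsets_buf, data_buf, i):
    """Find value i with two offset reads, without scanning earlier values."""
    start, end = struct.unpack_from("<2i", offsets_buf, 4 * i)
    # Slicing a memoryview references data_buf instead of copying the bytes.
    return memoryview(data_buf)[start:end]


offsets_buf, data_buf = encode_strings(["arrow", "parquet", "", "simd"])
cell = get_cell(offsets_buf, data_buf, 1)
print(bytes(cell).decode("utf-8"))
```

Because the values sit next to each other in `data_buf`, a scan over the column walks memory sequentially, and the same `get_cell` logic would work unchanged if `data_buf` were a memory-mapped file or a shared-memory region instead of a `bytes` object.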