Hello,

I think my main concern is how we can prevent the community from fragmenting too much over supported encodings. The more complex the encodings, the less likely they are to be supported by all the main implementations. We see this in Parquet, where the efficient "delta" encodings have only just received support in Parquet C++, and even then, only on the read side.

There is an additional subtlety in that Arrow is not a storage format but an in-memory representation, so every piece of code doing computation has to be adapted to the new encodings, for example the entire library of computation kernels in Arrow C++ (of course, an easy but inefficient adaptation is to always unpack to an already supported layout).
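
To make that fallback concrete, here is a minimal sketch, assuming a hypothetical RLE layout with parallel values/run-lengths vectors (illustration only, not the actual Arrow C++ API): unpack into the flat layout, then reuse an existing kernel unchanged.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hypothetical RLE layout: values[i] repeats run_lengths[i] times.
    struct RleInt32Array {
      std::vector<int32_t> values;
      std::vector<int64_t> run_lengths;
    };

    // Materialize the flat layout that existing kernels understand.
    std::vector<int32_t> UnpackRle(const RleInt32Array& rle) {
      std::vector<int32_t> out;
      for (std::size_t i = 0; i < rle.values.size(); ++i) {
        out.insert(out.end(), rle.run_lengths[i], rle.values[i]);
      }
      return out;
    }

    // An existing kernel written against the flat layout.
    int64_t SumKernel(const std::vector<int32_t>& values) {
      int64_t sum = 0;
      for (int32_t v : values) sum += v;
      return sum;
    }

    // Correct but inefficient: the whole array is materialized, even
    // though an RLE-aware kernel could operate on runs directly.
    int64_t SumOfRle(const RleInt32Array& rle) {
      return SumKernel(UnpackRle(rle));
    }

An RLE-aware kernel would instead sum values[i] * run_lengths[i] per run; writing such variants for every kernel is exactly the adaptation work I am worried about multiplying.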

As an anecdote, the Arrow C++ kernels are supposed to accept a selection vector to filter their physical inputs, but none of them actually supports it. I think we should be wary of adding ambitious new features that might never get an actual implementation.
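
For context, a selection vector is simply a list of row indices into the physical input; a hypothetical kernel supporting one might look like this (a sketch, not the actual Arrow C++ kernel signature):

    #include <cstdint>
    #include <vector>

    // Sum only the rows named by the selection vector, without first
    // materializing a filtered copy of the input.
    int64_t SumSelected(const std::vector<int32_t>& values,
                        const std::vector<int64_t>& selection) {
      int64_t sum = 0;
      for (int64_t idx : selection) {
        sum += values[idx];
      }
      return sum;
    }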


On the details of the proposed encodings:

- I hope we can avoid storing raw pointers instead of offsets into a separate buffer; I understand the flexibility argument for pointers, but they will also make data transfer more complicated, since a pointer is only meaningful within one address space while an offset survives copying or relocating the buffers (see the first sketch after this list)

- Constant arrays are a special case of RLE arrays (a single run spanning the whole array), and I'm not sure doing both is really useful

- I don't really understand the concrete use case for the weird "sequence view" layout; I'll note that non-monotonic offsets can make linear traversal less efficient, since the CPU's hardware prefetcher won't fetch data ahead for you when accesses aren't sequential

- The proposed RLE encoding seems inefficient; usually, RLE encodings try hard to minimize the size overhead of each encoded run, such that the encoding becomes beneficial even for very short repeated runs (see the second sketch after this list)
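
On the pointers-versus-offsets point, a minimal sketch of the data transfer problem, using made-up view types rather than anything from the proposal:

    #include <cstdint>
    #include <vector>

    // A view storing a raw pointer into some buffer: only meaningful
    // inside the address space where it was created.
    struct PointerView {
      const uint8_t* data;
      int32_t length;
    };

    // A view storing an offset relative to a known buffer: still valid
    // after the buffer is copied, memory-mapped elsewhere, or sent
    // over IPC.
    struct OffsetView {
      int64_t offset;
      int32_t length;
    };

    int main() {
      std::vector<uint8_t> buffer = {'h', 'e', 'l', 'l', 'o'};
      OffsetView view{0, 5};

      // Simulate a transfer: the data lands at a different address.
      std::vector<uint8_t> copied = buffer;

      // The offset view resolves correctly against the relocated
      // buffer; a PointerView into `buffer` would have to be rewritten
      // pointer by pointer instead.
      const uint8_t* resolved = copied.data() + view.offset;
      return resolved[0] == 'h' ? 0 : 1;
    }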
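
And on the RLE overhead point, a back-of-the-envelope sketch; the 12 bytes per run below assume an int32 value plus a 64-bit run length, which is an illustrative assumption, not the proposal's actual layout:

    #include <cstdio>

    int main() {
      const int value_size = 4;        // int32 value
      const int run_length_size = 8;   // assumed 64-bit run length
      const int per_run = value_size + run_length_size;

      // With 12 bytes per run, a run only saves space once it is
      // longer than 3 values; a scheme with a 1-byte run header would
      // already break even at 2. That difference decides whether RLE
      // helps at all on data with many short runs.
      for (int run = 1; run <= 4; ++run) {
        std::printf("run of %d values: plain=%2d bytes, rle=%d bytes\n",
                    run, run * value_size, per_run);
      }
      return 0;
    }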

Regards

Antoine.




On 10/12/2021 at 20:28, Wes McKinney wrote:

This topic may provoke some debate, but, given that Arrow is approaching its
6-year anniversary, I think this is an important discussion about how
we can thoughtfully expand the Arrow specifications to support
next-generation columnar data processing. Recently, I have been
motivated by interactions with CWI's DuckDB and Meta's Velox
open source projects and the innovations they've made around data
representation, providing beneficial features above and beyond what we
have already in Arrow. For example, they have a 16-byte "string view"
data type that enables buffer memory reuse, faster "false" comparisons
on strings that are unequal in the first 4 bytes, and inlining of small strings.
Both the Rust and C++ query engine efforts could potentially benefit
from this (I'm not sure about the memory safety implications in Rust;
comments on this would be helpful).

I wrote a document to start a discussion about a few new ways to
represent data that may help with building
Arrow-native/Arrow-compatible query engines:

https://docs.google.com/document/d/12aZi8Inez9L_JCtZ6gi2XDbQpCsHICNy9_EUxj4ILeE/edit#

Each of these potential additions would need to be eventually split
off into independent efforts with associated additions to the columnar
specification, IPC format, C ABI, integration tests, and so on.

The document is open for anyone to comment on, but if you would like
edit access, please feel free to request it. I look forward to the
discussion.

Thanks,
Wes
