Hello,

I think my main concern is how we can prevent the community from fragmenting too much over supported encodings. The more complex the encodings, the less likely they are to be supported by all the main implementations. We see this in Parquet, where the efficient "delta" encodings have only just received support in Parquet C++, and even then, only on the read side.

There is an additional subtlety in that Arrow is not a storage format but an in-memory representation, so every piece of code doing computation has to be adapted to the new encodings, for example the entire library of computation kernels in Arrow C++ (of course, an easy but inefficient adaptation is to always unpack to an already supported layout).
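
To make that fallback concrete, here is a minimal sketch, assuming a hypothetical RLE layout with parallel values/run-lengths vectors (illustration only, not the actual Arrow C++ API): unpack into the flat layout, then reuse an existing kernel unchanged.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hypothetical RLE layout: values[i] repeats run_lengths[i] times.
    struct RleInt32Array {
      std::vector<int32_t> values;
      std::vector<int64_t> run_lengths;
    };

    // Materialize the flat layout that existing kernels understand.
    std::vector<int32_t> UnpackRle(const RleInt32Array& rle) {
      std::vector<int32_t> out;
      for (std::size_t i = 0; i < rle.values.size(); ++i) {
        out.insert(out.end(), rle.run_lengths[i], rle.values[i]);
      }
      return out;
    }

    // An existing kernel written against the flat layout.
    int64_t SumKernel(const std::vector<int32_t>& values) {
      int64_t sum = 0;
      for (int32_t v : values) sum += v;
      return sum;
    }

    // Correct but inefficient: the whole array is materialized, even
    // though an RLE-aware kernel could operate on runs directly.
    int64_t SumOfRle(const RleInt32Array& rle) {
      return SumKernel(UnpackRle(rle));
    }

An RLE-aware kernel would instead sum values[i] * run_lengths[i] per run; writing such variants for every kernel is exactly the adaptation work I am worried about multiplying.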

As an anecdote, the Arrow C++ kernels are supposed to accept a selection vector to filter their physical inputs, but none of them actually supports it. I think we should be wary of adding ambitious new features that might never get an actual implementation.
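
For context, a selection vector is simply a list of row indices into the physical input; a hypothetical kernel supporting one might look like this (a sketch, not the actual Arrow C++ kernel signature):

    #include <cstdint>
    #include <vector>

    // Sum only the rows named by the selection vector, without first
    // materializing a filtered copy of the input.
    int64_t SumSelected(const std::vector<int32_t>& values,
                        const std::vector<int64_t>& selection) {
      int64_t sum = 0;
      for (int64_t idx : selection) {
        sum += values[idx];
      }
      return sum;
    }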


On the details of the proposed encodings:

- I hope we can avoid storing raw pointers instead of offsets into a separate buffer; I understand the flexibility argument for pointers, but they will also make data transfer more complicated, since a pointer is only meaningful within one address space while an offset survives copying or relocating the buffers (see the first sketch after this list)

- Constant arrays are a special case of RLE arrays (a single run spanning the whole array), and I'm not sure doing both is really useful

- I don't really understand the concrete use case for the weird "sequence view" layout; I'll note that non-monotonic offsets can make linear traversal less efficient, since the CPU's hardware prefetcher won't fetch data ahead for you when accesses aren't sequential

- The proposed RLE encoding seems inefficient; usually, RLE encodings try hard to minimize the size overhead of each encoded run, such that the encoding becomes beneficial even for very short repeated runs (see the second sketch after this list)
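
On the pointers-versus-offsets point, a minimal sketch of the data transfer problem, using made-up view types rather than anything from the proposal:

    #include <cstdint>
    #include <vector>

    // A view storing a raw pointer into some buffer: only meaningful
    // inside the address space where it was created.
    struct PointerView {
      const uint8_t* data;
      int32_t length;
    };

    // A view storing an offset relative to a known buffer: still valid
    // after the buffer is copied, memory-mapped elsewhere, or sent
    // over IPC.
    struct OffsetView {
      int64_t offset;
      int32_t length;
    };

    int main() {
      std::vector<uint8_t> buffer = {'h', 'e', 'l', 'l', 'o'};
      OffsetView view{0, 5};

      // Simulate a transfer: the data lands at a different address.
      std::vector<uint8_t> copied = buffer;

      // The offset view resolves correctly against the relocated
      // buffer; a PointerView into `buffer` would have to be rewritten
      // pointer by pointer instead.
      const uint8_t* resolved = copied.data() + view.offset;
      return resolved[0] == 'h' ? 0 : 1;
    }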
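
And on the RLE overhead point, a back-of-the-envelope sketch; the 12 bytes per run below assume an int32 value plus a 64-bit run length, which is an illustrative assumption, not the proposal's actual layout:

    #include <cstdio>

    int main() {
      const int value_size = 4;        // int32 value
      const int run_length_size = 8;   // assumed 64-bit run length
      const int per_run = value_size + run_length_size;

      // With 12 bytes per run, a run only saves space once it is
      // longer than 3 values; a scheme with a 1-byte run header would
      // already break even at 2. That difference decides whether RLE
      // helps at all on data with many short runs.
      for (int run = 1; run <= 4; ++run) {
        std::printf("run of %d values: plain=%2d bytes, rle=%d bytes\n",
                    run, run * value_size, per_run);
      }
      return 0;
    }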

Regards

Antoine.




On 10/12/2021 at 20:28, Wes McKinney wrote:

This topic may provoke some debate, but, given that Arrow is approaching its
6-year anniversary, I think this is an important discussion about how
we can thoughtfully expand the Arrow specifications to support
next-generation columnar data processing. Recently, I have been
motivated by interactions with CWI's DuckDB and Meta's Velox
open source projects and the innovations they've made around data
representation, providing beneficial features above and beyond what we
have already in Arrow. For example, they have a 16-byte "string view"
data type that enables buffer memory reuse, faster "false" comparisons
on strings that are unequal in the first 4 bytes, and inlining of small strings.
Both the Rust and C++ query engine efforts could potentially benefit
from this (I'm not sure about the memory safety implications in Rust;
comments on this would be helpful).

I wrote a document to start a discussion about a few new ways to
represent data that may help with building
Arrow-native/Arrow-compatible query engines:

https://docs.google.com/document/d/12aZi8Inez9L_JCtZ6gi2XDbQpCsHICNy9_EUxj4ILeE/edit#

Each of these potential additions would need to be eventually split
off into independent efforts with associated additions to the columnar
specification, IPC format, C ABI, integration tests, and so on.

The document is open for anyone to comment on, but if you would like
edit access, please feel free to request it. I look forward to the
discussion.

Thanks,
Wes
