I'm strongly in support of much of this. Thanks for bringing this up. It is
long overdue.

On initial read, my thoughts would be:

Strongly inclined:
- String view
- Constant view

Weakly inclined:
- All null
- RLE

Somewhat disinclined:
- Sequence change


With dictionary and string view, I feel like RLE is less important.

I'm not yet seeing a huge benefit for sequence change.
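
For anyone who has not seen the string view idea, here is a rough sketch
of the 16-byte layout described in the quoted message below. It is only
an illustration: the exact field order, the 12-byte inline threshold, and
the names make_view and definitely_unequal are my assumptions, not
something taken from the proposal document or from DuckDB or Velox.

```python
import struct

# Assumed threshold: 16 bytes total minus a 4-byte length field.
INLINE_MAX = 12


def make_view(s: bytes, buffer_index: int = 0, offset: int = 0) -> bytes:
    """Pack a hypothetical 16-byte string view for the byte string s."""
    n = len(s)
    if n <= INLINE_MAX:
        # Small string: 4-byte length, then the data itself, zero padded.
        return struct.pack("<i12s", n, s)
    # Long string: 4-byte length, 4-byte prefix, then where the full
    # value lives (which data buffer, and the offset inside it).
    return struct.pack("<i4sii", n, s[:4], buffer_index, offset)


def definitely_unequal(a: bytes, b: bytes) -> bool:
    """True when the views alone prove the strings differ.

    Bytes 0..8 of a view hold the length and the first 4 data bytes, so
    a mismatch there rules out equality without touching any buffer.
    """
    return a[:8] != b[:8]
```

When definitely_unequal returns False, a full comparison is still
needed; the view only gives the fast "false" path. Because long strings
are referenced by buffer index and offset, several views can point into
the same buffer, which is where the memory reuse comes from.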

On Fri, Dec 10, 2021, 11:29 AM Wes McKinney <wesmck...@gmail.com> wrote:

> hello all,
>
> This topic may provoke debate, but, given that Arrow is approaching its
> 6-year anniversary, I think this is an important discussion about how
> we can thoughtfully expand the Arrow specifications to support
> next-generation columnar data processing. In recent times, I have been
> motivated by recent interactions with CWI's DuckDB and Meta's Velox
> open source projects and the innovations they've made around data
> representation providing beneficial features above and beyond what we
> have already in Arrow. For example, they have a 16-byte "string view"
> data type that enables buffer memory reuse, faster "false" comparisons
> on strings unequal in the first 4 bytes, and inline small strings.
> Both the Rust and C++ query engine efforts could potentially benefit
> from this (not sure about the memory safety implications in Rust,
> comments around this would be helpful).
>
> I wrote a document to start a discussion about a few new ways to
> represent data that may help with building
> Arrow-native/Arrow-compatible query engines:
>
>
> https://docs.google.com/document/d/12aZi8Inez9L_JCtZ6gi2XDbQpCsHICNy9_EUxj4ILeE/edit#
>
> Each of these potential additions would need to be eventually split
> off into independent efforts with associated additions to the columnar
> specification, IPC format, C ABI, integration tests, and so on.
>
> The document is open to anyone to comment but if anyone would like
> edit access please feel free to request and I look forward to the
> discussion.
>
> Thanks,
> Wes
>
