Thank you for writing this down, Wes.

I think my project is very interested in the RLE encoding and constant
view.

The StringView, as written, seems fairly tightly tied to C/C++, though I
may be mistaken. I think allowing Rust to consume such StringViews would be
possible but it seems very unlikely the Rust implementation would be able
to generate the layout with `char*` type pointers with any sort of
reasonable safety.

> With dictionary and string view, I feel like rle is less important.

While dictionaries certainly help, for sorted low cardinality data (e.g. 1
million values of 4 distinct strings) the benefits of RLE for compression
and processing performance are arbitrarily enormous. I say arbitrarily
because one can encode an essentially arbitrary number of rows in a
constant number of RLE runs.
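
To make the comparison concrete, here is a minimal sketch in Python. It is
not Arrow code: `rle_encode` and `dictionary_encode` are hypothetical helpers
written only for this illustration, and the region names and row counts are
made-up sample data. It assumes the column is already sorted.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


def dictionary_encode(values):
    """Map each value to an integer code; one code is still stored per row."""
    dictionary = sorted(set(values))
    index = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [index[value] for value in values]


# Hypothetical sorted, low-cardinality column: 1 million rows, 4 distinct strings.
regions = ["ap-south-1", "eu-west-1", "us-east-1", "us-west-2"]
column = [region for region in regions for _ in range(250_000)]

runs = rle_encode(column)
dictionary, codes = dictionary_encode(column)

print(len(column))  # number of rows in the column
print(len(codes))   # dictionary encoding: one code per row
print(len(runs))    # RLE: one run per distinct value when the data is sorted
```

In this sketch the dictionary encoding still carries one code per row, while
the number of RLE runs stays at four no matter how many rows each region
contributes.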

Low cardinality string datasets appear commonly in time series data (for
example, an "AWS region name" field on monitoring data).

Andrew

On Fri, Dec 10, 2021 at 3:18 PM Jacques Nadeau <jacq...@apache.org> wrote:

> I'm strongly in support of much of this. Thanks for bringing this up. It is
> long overdue.
>
> On initial read, my thoughts would be:
>
> Strongly inclined:
> - String view
> - constant view
>
> Weakly inclined
> - All null
> - rle
>
> Somewhat disinclined
> - Sequence change
>
>
> With dictionary and string view, I feel like rle is less important.
>
> I'm not yet seeing huge benefit for sequence change.
>
> On Fri, Dec 10, 2021, 11:29 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hello all,
> >
> > This topic may provoke debate, but, given that Arrow is approaching its
> > 6-year anniversary, I think this is an important discussion about how
> > we can thoughtfully expand the Arrow specifications to support
> > next-generation columnar data processing. Recently, I have been
> > motivated by interactions with CWI's DuckDB and Meta's Velox
> > open source projects and the innovations they've made around data
> > representation providing beneficial features above and beyond what we
> > have already in Arrow. For example, they have a 16-byte "string view"
> > data type that enables buffer memory reuse, faster "false" comparisons
> > on strings unequal in the first 4 bytes, and inline small strings.
> > Both the Rust and C++ query engine efforts could potentially benefit
> > from this (not sure about the memory safety implications in Rust,
> > comments around this would be helpful).
> >
> > I wrote a document to start a discussion about a few new ways to
> > represent data that may help with building
> > Arrow-native/Arrow-compatible query engines:
> >
> > https://docs.google.com/document/d/12aZi8Inez9L_JCtZ6gi2XDbQpCsHICNy9_EUxj4ILeE/edit#
> >
> > Each of these potential additions would need to be eventually split
> > off into independent efforts with associated additions to the columnar
> > specification, IPC format, C ABI, integration tests, and so on.
> >
> > The document is open for anyone to comment; if anyone would like
> > edit access, please feel free to request it. I look forward to the
> > discussion.
> >
> > Thanks,
> > Wes
> >
>
