Thank you for writing this down, Wes. I think my project is very interested in the RLE encoding and constant view.

The StringView, as written, seems fairly tightly tied to C/C++, though I may be mistaken. I think allowing Rust to consume such StringViews would be possible, but it seems very unlikely that the Rust implementation would be able to generate the layout with `char*` type pointers with any sort of reasonable safety.

> With dictionary and string view, I feel like rle is less important.

While dictionaries certainly help, for sorted low cardinality data (e.g. 1 million values of 4 distinct strings) the benefits of RLE for compression and processing performance are arbitrarily enormous. I say "arbitrarily enormous" because one can encode an essentially arbitrary number of rows in a constant number of RLE runs.

Low cardinality string datasets appear commonly in timeseries data (for example, an "AWS region name" field on monitoring data).

Andrew

On Fri, Dec 10, 2021 at 3:18 PM Jacques Nadeau <jacq...@apache.org> wrote:

> I'm strongly in support of much of this. Thanks for bringing this up. It is
> long overdue.
>
> On initial read, my thoughts would be:
>
> Strongly inclined:
> - String view
> - constant view
>
> Weakly inclined
> - All null
> - rle
>
> Somewhat disinclined
> - Sequence change
>
>
> With dictionary and string view, I feel like rle is less important.
>
> I'm not yet seeing huge benefit for sequence change.
>
> On Fri, Dec 10, 2021, 11:29 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hello all,
> >
> > This topic may provoke , but, given that Arrow is approaching its
> > 6-year anniversary, I think this is an important discussion about how
> > we can thoughtfully expand the Arrow specifications to support
> > next-generation columnar data processing. In recent times, I have been
> > motivated by recent interactions with CWI's DuckDB and Meta's Velox
> > open source projects and the innovations they've made around data
> > representation providing beneficial features above and beyond what we
> > have already in Arrow.
> > For example, they have a 16-byte "string view"
> > data type that enables buffer memory reuse, faster "false" comparisons
> > on strings unequal in the first 4 bytes, and inline small strings.
> > Both the Rust and C++ query engine efforts could potentially benefit
> > from this (not sure about the memory safety implications in Rust,
> > comments around this would be helpful).
> >
> > I wrote a document to start a discussion about a few new ways to
> > represent data that may help with building
> > Arrow-native/Arrow-compatible query engines:
> >
> > https://docs.google.com/document/d/12aZi8Inez9L_JCtZ6gi2XDbQpCsHICNy9_EUxj4ILeE/edit#
> >
> > Each of these potential additions would need to be eventually split
> > off into independent efforts with associated additions to the columnar
> > specification, IPC format, C ABI, integration tests, and so on.
> >
> > The document is open to anyone to comment but if anyone would like
> > edit access please feel free to request and I look forward to the
> > discussion.
> >
> > Thanks,
> > Wes
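
To make the string view safety concern at the top of this message a bit more concrete, here is a rough Rust sketch of a 16-byte view that stores a buffer index and offset instead of a `char*`. This layout is an assumption for illustration only and is not taken from the proposal document; the names `StringView`, `INLINE_CAP`, `definitely_differs`, and `bytes` are all hypothetical.

```rust
/// Strings up to this many bytes live entirely inside the view (assumed limit).
pub const INLINE_CAP: usize = 12;

/// Hypothetical 16-byte string view: 4 bytes of length plus 12 bytes of data.
/// Short strings: `data` holds the whole string, zero padded.
/// Long strings: `data` holds the 4-byte prefix, then a little-endian buffer
/// index and a little-endian offset (an assumed stand-in for a raw pointer).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(C)]
pub struct StringView {
    len: u32,
    data: [u8; 12],
}

impl StringView {
    /// Build a view for `s`. Long strings are appended to `buffer`, which the
    /// caller promises is the buffer stored at position `buffer_index`.
    pub fn new(s: &str, buffer_index: u32, buffer: &mut Vec<u8>) -> Self {
        let bytes = s.as_bytes();
        let mut data = [0u8; 12];
        if bytes.len() <= INLINE_CAP {
            data[..bytes.len()].copy_from_slice(bytes);
        } else {
            // This sketch assumes buffers and strings stay below 4 GiB.
            let offset = buffer.len() as u32;
            buffer.extend_from_slice(bytes);
            data[..4].copy_from_slice(&bytes[..4]);
            data[4..8].copy_from_slice(&buffer_index.to_le_bytes());
            data[8..].copy_from_slice(&offset.to_le_bytes());
        }
        StringView { len: bytes.len() as u32, data }
    }

    /// True when the string is stored inline in the view itself.
    pub fn is_inline(&self) -> bool {
        self.len as usize <= INLINE_CAP
    }

    /// Cheap "false" comparison: returns true when the two strings cannot be
    /// equal, decided from the length and the first 4 bytes alone.
    pub fn definitely_differs(&self, other: &StringView) -> bool {
        self.len != other.len || self.data[..4] != other.data[..4]
    }

    /// Resolve the view to its bytes using only bounds-checked slicing.
    pub fn bytes<'a>(&'a self, buffers: &'a [Vec<u8>]) -> &'a [u8] {
        let len = self.len as usize;
        if self.is_inline() {
            &self.data[..len]
        } else {
            let index =
                u32::from_le_bytes([self.data[4], self.data[5], self.data[6], self.data[7]]) as usize;
            let offset =
                u32::from_le_bytes([self.data[8], self.data[9], self.data[10], self.data[11]]) as usize;
            &buffers[index][offset..offset + len]
        }
    }
}
```

With an index and offset, a bad view can at worst panic on a bounds check rather than dereference an invalid pointer, at the cost of one extra indirection per lookup. Whether that trade-off is acceptable for the C++ engines is a separate question.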
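
As a rough illustration of the RLE point at the top of this message, the sketch below run-length encodes a sorted, low cardinality string column. It is a toy written for this thread, not an Arrow API: `Run`, `rle_encode`, and `sorted_regions` are hypothetical names, and the region strings are made-up sample data.

```rust
/// One run: a value and the number of consecutive rows that hold it.
#[derive(Debug, Clone, PartialEq)]
pub struct Run {
    pub value: String,
    pub length: usize,
}

/// Run-length encode a column of strings (hypothetical helper).
pub fn rle_encode<'a, I>(values: I) -> Vec<Run>
where
    I: IntoIterator<Item = &'a str>,
{
    let mut runs: Vec<Run> = Vec::new();
    for v in values {
        match runs.last_mut() {
            // Same value as the previous row: extend the current run.
            Some(last) if last.value == v => last.length += 1,
            // First row, or the value changed: start a new run.
            _ => runs.push(Run { value: v.to_string(), length: 1 }),
        }
    }
    runs
}

/// Build a sorted, low cardinality column: `rows_per_value` copies of each
/// of 4 sample region names, in sorted order.
pub fn sorted_regions(rows_per_value: usize) -> Vec<&'static str> {
    let regions = ["ap-south-1", "eu-west-1", "us-east-1", "us-west-2"];
    regions
        .iter()
        .flat_map(|r| std::iter::repeat(*r).take(rows_per_value))
        .collect()
}
```

Calling `rle_encode(sorted_regions(250_000).iter().copied())` turns 1,000,000 rows into 4 runs, and the run count stays at 4 however large `rows_per_value` gets. A dictionary encoding of the same column would still carry one index per row.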