I'm strongly in support of much of this. Thanks for bringing this up. It is long overdue.
On initial read, my thoughts would be:

Strongly inclined:
- String view
- constant view

Weakly inclined:
- All null
- rle

Somewhat disinclined:
- Sequence change

With dictionary and string view, I feel like rle is less important. I'm not
yet seeing a huge benefit for sequence change.

On Fri, Dec 10, 2021, 11:29 AM Wes McKinney <wesmck...@gmail.com> wrote:

> hello all,
>
> This topic may provoke , but, given that Arrow is approaching its
> 6-year anniversary, I think this is an important discussion about how
> we can thoughtfully expand the Arrow specifications to support
> next-generation columnar data processing. In recent times, I have been
> motivated by recent interactions with CWI's DuckDB and Meta's Velox
> open source projects and the innovations they've made around data
> representation providing beneficial features above and beyond what we
> have already in Arrow. For example, they have a 16-byte "string view"
> data type that enables buffer memory reuse, faster "false" comparisons
> on strings unequal in the first 4 bytes, and inline small strings.
> Both the Rust and C++ query engine efforts could potentially benefit
> from this (not sure about the memory safety implications in Rust,
> comments around this would be helpful).
>
> I wrote a document to start a discussion about a few new ways to
> represent data that may help with building
> Arrow-native/Arrow-compatible query engines:
>
>
> https://docs.google.com/document/d/12aZi8Inez9L_JCtZ6gi2XDbQpCsHICNy9_EUxj4ILeE/edit#
>
> Each of these potential additions would need to be eventually split
> off into independent efforts with associated additions to the columnar
> specification, IPC format, C ABI, integration tests, and so on.
>
> The document is open to anyone to comment but if anyone would like
> edit access please feel free to request and I look forward to the
> discussion.
>
> Thanks,
> Wes
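
For anyone less familiar with the 16-byte "string view" idea from the quoted message, here is a rough Python sketch of how such a view could behave. The exact field layout is an assumption for illustration (a 4-byte length followed by either 12 inline bytes, or a 4-byte prefix plus a 4-byte buffer index and a 4-byte offset); it is not meant as the DuckDB, Velox, or any proposed Arrow definition. The names `make_view`, `view_bytes`, and `views_equal` are hypothetical helpers.

```python
import struct

INLINE_MAX = 12  # assumption: strings of up to 12 bytes are stored inside the view


def make_view(data: bytes, buffers: list) -> bytes:
    """Pack `data` into a 16-byte view; long strings go into a shared buffer."""
    n = len(data)
    if n <= INLINE_MAX:
        # Short string: 4-byte length + 12 bytes of zero-padded inline data.
        return struct.pack("<I12s", n, data)
    # Long string: append the bytes to the last data buffer and keep only
    # the length, a 4-byte prefix, the buffer index, and the offset.
    if not buffers:
        buffers.append(bytearray())
    buf_index = len(buffers) - 1
    offset = len(buffers[buf_index])
    buffers[buf_index] += data
    return struct.pack("<I4sII", n, data[:4], buf_index, offset)


def view_bytes(view: bytes, buffers: list) -> bytes:
    """Recover the full string that a view refers to."""
    (n,) = struct.unpack_from("<I", view)
    if n <= INLINE_MAX:
        return view[4:4 + n]
    _, _, buf_index, offset = struct.unpack("<I4sII", view)
    return bytes(buffers[buf_index][offset:offset + n])


def views_equal(a: bytes, b: bytes, buffers: list) -> bool:
    """Compare two views, rejecting early on length or 4-byte prefix mismatch."""
    # In both layouts the first 8 bytes hold the length and the first
    # 4 bytes of the string, so a mismatch here needs no buffer access.
    if a[:8] != b[:8]:
        return False
    return view_bytes(a, buffers) == view_bytes(b, buffers)
```

In this sketch, two strings that differ in length or in their first 4 bytes are rejected by comparing 8 bytes of the views alone, which is the fast "false" comparison described above. Short strings never touch the data buffers at all, and several views can point into the same buffer, which is where the memory reuse would come from.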