> > * Should we encode "run lengths" or "run ends"?
I think the project has leaned towards sublinear access, so run ends make sense. The downside is that we run into similar issues with List/LargeList where the total number of elements is limited by bit-width (which can also cause space wastage, e.g. with run ends it might be reasonable to limit bit-width to 16). The values are definitely a child array. However, encoding the run > ends as a buffer (similar to list array for example) makes it > difficult to calculate offsets. Translating an array offset to a > buffer offset takes O(log(N)) time. If the run ends are encoded as a > child array (so the RLE array has no buffers and two child arrays) > then this problem goes away. I'm not sure I understand this, could you provide an example of the problem that the child array solves? On Wed, Sep 14, 2022 at 9:36 AM Weston Pace <weston.p...@gmail.com> wrote: > I'm going to bump this because it would be good to get feedback. In > particular it would be nice to get feedback on the suggested format > change[1]. We are currently moving forward on coming up with an IPC > format proposal which we will share when ready. > > The two interesting points that jump out to me are: > > * Should we encode "run lengths" or "run ends"? > > For example, should 5,5,5,6,6,7,7,7,7 be encoded with "run lengths" 3, > 2, 4 or "run ends" 3, 5, 9. In the proposal the latter is preferred > as that leads to O(log(N)) random access (instead of O(N)) and it's > not clear there are any disadvantages. > > * Should the run ends be encoded as a buffer or a child array? > > The values are definitely a child array. However, encoding the run > ends as a buffer (similar to list array for example) makes it > difficult to calculate offsets. Translating an array offset to a > buffer offset takes O(log(N)) time. If the run ends are encoded as a > child array (so the RLE array has no buffers and two child arrays) > then this problem goes away. > > [1] https://github.com/apache/arrow/pull/13333/files > > On Thu, Aug 25, 2022 at 10:35 AM Tobias Zagorni > <tob...@zagorni.eu.invalid> wrote: > > > > Hello Everyone, > > > > Recently, I have implemented support for run-length encoding in Arrow > > C++. So far my implementation is split into different subtasks of > > ARROW-16771 (https://issues.apache.org/jira/browse/ARROW-16771). > > > > I have (draft) PRs available for: > > - general handling of RLE in arrow C++, Type, Arrow, Builder > > subclasses, etc. > > (subtasks 1-9) > > - encode, decode kernels (fixed size only): > > (https://issues.apache.org/jira/browse/ARROW-16772) > > - filter kernel (fixed size only): > > (https://issues.apache.org/jira/browse/ARROW-16774) > > - simple benchmark for the RLE kernels > > (https://issues.apache.org/jira/browse/ARROW-17026) > > - adding RLE to Arrow Columnar format document > > (https://issues.apache.org/jira/browse/ARROW-16773) > > > > What is not yet implemented: > > - converting RLE to formats like Parquet, JSON, IPC. > > - caching of physical offsets when working with sliced arrays - finding > > these offsets is an O(log(n)) binary search which could be avoided in > > a lot of cases > > > > I'm interested in any feedback on the code and I'm wondering what would > > be the best way to get this merged. > > > > A lot of the PRs depend on earlier ones. I ordered the subtasks in a > > way they could be merged. The first would be: > > 1. Handling of array-only types using VisitTypeInline: > > https://issues.apache.org/jira/browse/ARROW-17258 > > 2. Adding RLE type / array class (only builds on #1): > > https://issues.apache.org/jira/browse/ARROW-17261 > > - also, since it has no dependencies: adding RLE to Arrow Columnar > > format document > > https://issues.apache.org/jira/browse/ARROW-16773 > > > > Best, > > Tobias >