> * Should we encode "run lengths" or "run ends"? In addition to the points mentioned above, this seems the most consistent with the variable-length binary/list layouts
> encoding the run ends as a buffer (similar to list array for example) makes it difficult to calculate offsets I don't have a strong opinion about this, but I also don't understand the logic. Surely the implementation is just generating/reading a buffer of integers and there's some overhead/indirection to wrapping it in an Array (that must then be validated). As a matter of curiosity, was a dictionary approach ever considered? If one new layout was added (one buffer containing the run ends of a RLE 0:N int32 array), the dictionary member could be the values array and perhaps make it easier for implementations that already handle dictionaries. On Wed, Sep 14, 2022 at 2:04 PM Matthew Topol <m...@voltrondata.com.invalid> wrote: > Just wanted to chime in here that I also have several draft PRs for > implementing the RLE arrays in Go as the second implementation (since > we use two implementations as a requirement to vote on > changes/additions to the format). > > They can be found here: > > <https://github.com/apache/arrow/pull/14111> > <https://github.com/apache/arrow/pull/14114> > <https://github.com/apache/arrow/pull/14126> > > --Matt > > On Wed, Sep 14 2022 at 09:44:15 AM -0700, Micah Kornfield > <emkornfi...@gmail.com> wrote: > >> > >> * Should we encode "run lengths" or "run ends"? > > > > > > I think the project has leaned towards sublinear access, so run ends > > make > > sense. The downside is that we run into similar issues with > > List/LargeList > > where the total number of elements is limited by bit-width (which can > > also > > cause space wastage, e.g. with run ends it might be reasonable to > > limit > > bit-width to 16). > > > > The values are definitely a child array. However, encoding the run > >> ends as a buffer (similar to list array for example) makes it > >> difficult to calculate offsets. Translating an array offset to a > >> buffer offset takes O(log(N)) time. If the run ends are encoded as > >> a > >> child array (so the RLE array has no buffers and two child arrays) > >> then this problem goes away. > > > > > > I'm not sure I understand this, could you provide an example of the > > problem > > that the child array solves? > > > > > > > > > > On Wed, Sep 14, 2022 at 9:36 AM Weston Pace <weston.p...@gmail.com > > <mailto:weston.p...@gmail.com>> wrote: > > > >> I'm going to bump this because it would be good to get feedback. In > >> particular it would be nice to get feedback on the suggested format > >> change[1]. We are currently moving forward on coming up with an IPC > >> format proposal which we will share when ready. > >> > >> The two interesting points that jump out to me are: > >> > >> * Should we encode "run lengths" or "run ends"? > >> > >> For example, should 5,5,5,6,6,7,7,7,7 be encoded with "run lengths" > >> 3, > >> 2, 4 or "run ends" 3, 5, 9. In the proposal the latter is preferred > >> as that leads to O(log(N)) random access (instead of O(N)) and it's > >> not clear there are any disadvantages. > >> > >> * Should the run ends be encoded as a buffer or a child array? > >> > >> The values are definitely a child array. However, encoding the run > >> ends as a buffer (similar to list array for example) makes it > >> difficult to calculate offsets. Translating an array offset to a > >> buffer offset takes O(log(N)) time. If the run ends are encoded as > >> a > >> child array (so the RLE array has no buffers and two child arrays) > >> then this problem goes away. > >> > >> [1] <https://github.com/apache/arrow/pull/13333/files> > >> > >> On Thu, Aug 25, 2022 at 10:35 AM Tobias Zagorni > >> <tob...@zagorni.eu.invalid <mailto:tob...@zagorni.eu.invalid>> > >> wrote: > >> > > >> > Hello Everyone, > >> > > >> > Recently, I have implemented support for run-length encoding in > >> Arrow > >> > C++. So far my implementation is split into different subtasks of > >> > ARROW-16771 (<https://issues.apache.org/jira/browse/ARROW-16771>). > >> > > >> > I have (draft) PRs available for: > >> > - general handling of RLE in arrow C++, Type, Arrow, Builder > >> > subclasses, etc. > >> > (subtasks 1-9) > >> > - encode, decode kernels (fixed size only): > >> > (<https://issues.apache.org/jira/browse/ARROW-16772>) > >> > - filter kernel (fixed size only): > >> > (<https://issues.apache.org/jira/browse/ARROW-16774>) > >> > - simple benchmark for the RLE kernels > >> > (<https://issues.apache.org/jira/browse/ARROW-17026>) > >> > - adding RLE to Arrow Columnar format document > >> > (<https://issues.apache.org/jira/browse/ARROW-16773>) > >> > > >> > What is not yet implemented: > >> > - converting RLE to formats like Parquet, JSON, IPC. > >> > - caching of physical offsets when working with sliced arrays - > >> finding > >> > these offsets is an O(log(n)) binary search which could be > >> avoided in > >> > a lot of cases > >> > > >> > I'm interested in any feedback on the code and I'm wondering what > >> would > >> > be the best way to get this merged. > >> > > >> > A lot of the PRs depend on earlier ones. I ordered the subtasks > >> in a > >> > way they could be merged. The first would be: > >> > 1. Handling of array-only types using VisitTypeInline: > >> > <https://issues.apache.org/jira/browse/ARROW-17258> > >> > 2. Adding RLE type / array class (only builds on #1): > >> > <https://issues.apache.org/jira/browse/ARROW-17261> > >> > - also, since it has no dependencies: adding RLE to Arrow > >> Columnar > >> > format document > >> > <https://issues.apache.org/jira/browse/ARROW-16773> > >> > > >> > Best, > >> > Tobias > >> > >