Re: PRs for RLE support

Dewey Dunnington Wed, 14 Sep 2022 10:53:11 -0700

>  * Should we encode "run lengths" or "run ends"?

In addition to the points mentioned above, this seems the most consistent
with the variable-length binary/list layouts


> encoding the run ends as a buffer (similar to list array for example)
makes it difficult to calculate offsets

I don't have a strong opinion about this, but I also don't understand the
logic. Surely the implementation is just generating/reading a buffer of
integers and there's some overhead/indirection to wrapping it in an Array
(that must then be validated).

As a matter of curiosity, was a dictionary approach ever considered? If one
new layout was added (one buffer containing the run ends of a RLE 0:N int32
array), the dictionary member could be the values array and perhaps make it
easier for implementations that already handle dictionaries.





On Wed, Sep 14, 2022 at 2:04 PM Matthew Topol <m...@voltrondata.com.invalid>
wrote:

> Just wanted to chime in here that I also have several draft PRs for
> implementing the RLE arrays in Go as the second implementation (since
> we use two implementations as a requirement to vote on
> changes/additions to the format).
>
> They can be found here:
>
> <https://github.com/apache/arrow/pull/14111>
> <https://github.com/apache/arrow/pull/14114>
> <https://github.com/apache/arrow/pull/14126>
>
> --Matt
>
> On Wed, Sep 14 2022 at 09:44:15 AM -0700, Micah Kornfield
> <emkornfi...@gmail.com> wrote:
> >>
> >>   * Should we encode "run lengths" or "run ends"?
> >
> >
> > I think the project has leaned towards sublinear access, so run ends
> > make
> > sense.  The downside is that we run into similar issues with
> > List/LargeList
> > where the total number of elements is limited by bit-width (which can
> > also
> > cause space wastage, e.g. with run ends it might be reasonable to
> > limit
> > bit-width to 16).
> >
> > The values are definitely a child array.  However, encoding the run
> >>  ends as a buffer (similar to list array for example) makes it
> >>  difficult to calculate offsets.  Translating an array offset to a
> >>  buffer offset takes O(log(N)) time.  If the run ends are encoded as
> >> a
> >>  child array (so the RLE array has no buffers and two child arrays)
> >>  then this problem goes away.
> >
> >
> > I'm not sure I understand this, could you provide an example of the
> > problem
> > that the child array solves?
> >
> >
> >
> >
> > On Wed, Sep 14, 2022 at 9:36 AM Weston Pace <weston.p...@gmail.com
> > <mailto:weston.p...@gmail.com>> wrote:
> >
> >>  I'm going to bump this because it would be good to get feedback.  In
> >>  particular it would be nice to get feedback on the suggested format
> >>  change[1].  We are currently moving forward on coming up with an IPC
> >>  format proposal which we will share when ready.
> >>
> >>  The two interesting points that jump out to me are:
> >>
> >>   * Should we encode "run lengths" or "run ends"?
> >>
> >>  For example, should 5,5,5,6,6,7,7,7,7 be encoded with "run lengths"
> >> 3,
> >>  2, 4 or "run ends" 3, 5, 9.  In the proposal the latter is preferred
> >>  as that leads to O(log(N)) random access (instead of O(N)) and it's
> >>  not clear there are any disadvantages.
> >>
> >>   * Should the run ends be encoded as a buffer or a child array?
> >>
> >>  The values are definitely a child array.  However, encoding the run
> >>  ends as a buffer (similar to list array for example) makes it
> >>  difficult to calculate offsets.  Translating an array offset to a
> >>  buffer offset takes O(log(N)) time.  If the run ends are encoded as
> >> a
> >>  child array (so the RLE array has no buffers and two child arrays)
> >>  then this problem goes away.
> >>
> >>  [1] <https://github.com/apache/arrow/pull/13333/files>
> >>
> >>  On Thu, Aug 25, 2022 at 10:35 AM Tobias Zagorni
> >>  <tob...@zagorni.eu.invalid <mailto:tob...@zagorni.eu.invalid>>
> >> wrote:
> >>  >
> >>  > Hello Everyone,
> >>  >
> >>  > Recently, I have implemented support for run-length encoding in
> >> Arrow
> >>  > C++. So far my implementation is split into different subtasks of
> >>  > ARROW-16771 (<https://issues.apache.org/jira/browse/ARROW-16771>).
> >>  >
> >>  > I have (draft) PRs available for:
> >>  > - general handling of RLE in arrow C++, Type, Arrow, Builder
> >>  > subclasses, etc.
> >>  >   (subtasks 1-9)
> >>  > - encode, decode kernels (fixed size only):
> >>  >   (<https://issues.apache.org/jira/browse/ARROW-16772>)
> >>  > - filter kernel (fixed size only):
> >>  >   (<https://issues.apache.org/jira/browse/ARROW-16774>)
> >>  > - simple benchmark for the RLE kernels
> >>  >   (<https://issues.apache.org/jira/browse/ARROW-17026>)
> >>  > - adding RLE to Arrow Columnar format document
> >>  >   (<https://issues.apache.org/jira/browse/ARROW-16773>)
> >>  >
> >>  > What is not yet implemented:
> >>  > - converting RLE to formats like Parquet, JSON, IPC.
> >>  > - caching of physical offsets when working with sliced arrays -
> >> finding
> >>  > these offsets is an  O(log(n)) binary search which could be
> >> avoided in
> >>  > a lot of cases
> >>  >
> >>  > I'm interested in any feedback on the code and I'm wondering what
> >> would
> >>  > be the best way to get this merged.
> >>  >
> >>  > A lot of the PRs depend on earlier ones. I ordered the subtasks
> >> in a
> >>  > way they could be merged. The first would be:
> >>  > 1. Handling of array-only types using VisitTypeInline:
> >>  >    <https://issues.apache.org/jira/browse/ARROW-17258>
> >>  > 2. Adding RLE type / array class (only builds on #1):
> >>  >    <https://issues.apache.org/jira/browse/ARROW-17261>
> >>  > -  also, since it has no dependencies: adding RLE to Arrow
> >> Columnar
> >>  > format document
> >>  >    <https://issues.apache.org/jira/browse/ARROW-16773>
> >>  >
> >>  > Best,
> >>  > Tobias
> >>
>
>

Re: PRs for RLE support

Reply via email to