Re: PRs for RLE support

Micah Kornfield Wed, 14 Sep 2022 09:44:42 -0700

>
>  * Should we encode "run lengths" or "run ends"?



I think the project has leaned towards sublinear access, so run ends make
sense.  The downside is that we run into similar issues with List/LargeList
where the total number of elements is limited by bit-width (which can also
cause space wastage, e.g. with run ends it might be reasonable to limit
bit-width to 16).

The values are definitely a child array.  However, encoding the run
> ends as a buffer (similar to list array for example) makes it
> difficult to calculate offsets.  Translating an array offset to a
> buffer offset takes O(log(N)) time.  If the run ends are encoded as a
> child array (so the RLE array has no buffers and two child arrays)
> then this problem goes away.


I'm not sure I understand this, could you provide an example of the problem
that the child array solves?




On Wed, Sep 14, 2022 at 9:36 AM Weston Pace <[email protected]> wrote:

> I'm going to bump this because it would be good to get feedback.  In
> particular it would be nice to get feedback on the suggested format
> change[1].  We are currently moving forward on coming up with an IPC
> format proposal which we will share when ready.
>
> The two interesting points that jump out to me are:
>
>  * Should we encode "run lengths" or "run ends"?
>
> For example, should 5,5,5,6,6,7,7,7,7 be encoded with "run lengths" 3,
> 2, 4 or "run ends" 3, 5, 9.  In the proposal the latter is preferred
> as that leads to O(log(N)) random access (instead of O(N)) and it's
> not clear there are any disadvantages.
>
>  * Should the run ends be encoded as a buffer or a child array?
>
> The values are definitely a child array.  However, encoding the run
> ends as a buffer (similar to list array for example) makes it
> difficult to calculate offsets.  Translating an array offset to a
> buffer offset takes O(log(N)) time.  If the run ends are encoded as a
> child array (so the RLE array has no buffers and two child arrays)
> then this problem goes away.
>
> [1] https://github.com/apache/arrow/pull/13333/files
>
> On Thu, Aug 25, 2022 at 10:35 AM Tobias Zagorni
> <[email protected]> wrote:
> >
> > Hello Everyone,
> >
> > Recently, I have implemented support for run-length encoding in Arrow
> > C++. So far my implementation is split into different subtasks of
> > ARROW-16771 (https://issues.apache.org/jira/browse/ARROW-16771).
> >
> > I have (draft) PRs available for:
> > - general handling of RLE in arrow C++, Type, Arrow, Builder
> > subclasses, etc.
> >   (subtasks 1-9)
> > - encode, decode kernels (fixed size only):
> >   (https://issues.apache.org/jira/browse/ARROW-16772)
> > - filter kernel (fixed size only):
> >   (https://issues.apache.org/jira/browse/ARROW-16774)
> > - simple benchmark for the RLE kernels
> >   (https://issues.apache.org/jira/browse/ARROW-17026)
> > - adding RLE to Arrow Columnar format document
> >   (https://issues.apache.org/jira/browse/ARROW-16773)
> >
> > What is not yet implemented:
> > - converting RLE to formats like Parquet, JSON, IPC.
> > - caching of physical offsets when working with sliced arrays - finding
> > these offsets is an  O(log(n)) binary search which could be avoided in
> > a lot of cases
> >
> > I'm interested in any feedback on the code and I'm wondering what would
> > be the best way to get this merged.
> >
> > A lot of the PRs depend on earlier ones. I ordered the subtasks in a
> > way they could be merged. The first would be:
> > 1. Handling of array-only types using VisitTypeInline:
> >    https://issues.apache.org/jira/browse/ARROW-17258
> > 2. Adding RLE type / array class (only builds on #1):
> >    https://issues.apache.org/jira/browse/ARROW-17261
> > -  also, since it has no dependencies: adding RLE to Arrow Columnar
> > format document
> >    https://issues.apache.org/jira/browse/ARROW-16773
> >
> > Best,
> > Tobias
>

Re: PRs for RLE support

Reply via email to