Re: PRs for RLE support

Matthew Topol Wed, 14 Sep 2022 11:34:17 -0700

> On the other hand, if there were two child arrays then an
> implementation, when slicing, could choose to always keep the offset
> of the parent array at 0 and instead put the offsets in the child
> arrays.  Now you have a parent array with offset 0, a run ends (int32)
> array with offset 74 and length 5 and a values (int64) array with
> offset 74 and length 5.  We can clearly say that we want to grab bytes
> 296-316 from the run ends buffer and bytes 592-632 from the values
> buffer.  Of course, other implementations would always be free to use
> offsets in the parent array.  So I think the log(N) approach would
> still be needed as a fallback.

Doesn't this explanation conflate the Logical Offset (the parent's offset) and the Physical Offset (the offset of the run ends / values children)? Or am I missing something? If you have an RLE<int64> array of length 200, and you slice it with {start: 10, length: 100}, I don't understand how you can represent that while keeping the parent's offset as "0" because the offset of "10" in the slice could represent anywhere from 0 - 10 physical values you have to offset by. You'd still keep an offset of 10 in the parent, and potentially cache the *physical* offset too, right?

The benefit I see for having the child arrays over the run-ends being a buffer in the RLE array directly is that you can easily tell how large the buffers *should* be as the length and offset of the run ends (int32) array and the values array will always represent *physical* values while the parent's offset and length can represent the *logical* values of the RLE. If the run ends were a buffer in the RLE array directly, it would be much more difficult to maintain the separation of the "physical" and "logical" offset/lengths.


Does that make sense?

--Matt

On Wed, Sep 14 2022 at 11:18:27 AM -0700, Weston Pace <[email protected]> wrote:

I will clarify the offset problem.  It essentially boils down to "if
you don't have constant access to elements then an array length offset
does not give you constant access to buffer offsets".

We start with an RLE<int64> array of length 200.  We slice it with
(start=10, length=100) to get an RLE<int64> array of length 100 and an
offset of 10.

Now we want to write an IPC file (or access the values for whatever
reason).  The values buffer has 400 bytes and the run ends buffer has
200 bytes (these numbers could be anything less than 1600/800 so I'm
picking these at random).  We need to copy a portion of the "run ends"
buffer into the file.  What bytes are these?  The only way to tell
would be to do a binary search on the 200 bytes run ends buffer.
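
To make that binary search concrete, here is a minimal Python sketch. It is not code from any of the PRs: the helper name `physical_range` is hypothetical, int32 run ends are assumed, and the tiny `3, 5, 9` run ends example from further down the thread stands in for the 200-element array.

```python
from bisect import bisect_right

RUN_END_WIDTH = 4  # bytes per run end, assuming int32 run ends


def physical_range(run_ends, offset, length):
    """Map a logical slice onto [first_run, last_run_exclusive).

    Run i covers logical positions run_ends[i-1] .. run_ends[i]-1, so the
    run holding logical position k is the first one whose end is > k.
    """
    first = bisect_right(run_ends, offset)  # O(log N) search
    if length == 0:
        return first, first
    last = bisect_right(run_ends, offset + length - 1)
    return first, last + 1


# 5,5,5,6,6,7,7,7,7 encoded as run ends 3, 5, 9
run_ends = [3, 5, 9]
first, stop = physical_range(run_ends, offset=4, length=3)
byte_range = (first * RUN_END_WIDTH, stop * RUN_END_WIDTH)
print(first, stop, byte_range)  # prints: 1 3 (4, 12)
```

The logical slice (offset 4, length 3) touches runs 1 and 2, so only bytes 4-12 of the run ends buffer are needed, but that is only known after the search.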

On the other hand, if there were two child arrays then an
implementation, when slicing, could choose to always keep the offset
of the parent array at 0 and instead put the offsets in the child
arrays.  Now you have a parent array with offset 0, a run ends (int32)
array with offset 74 and length 5 and a values (int64) array with
offset 74 and length 5.  We can clearly say that we want to grab bytes
296-316 from the run ends buffer and bytes 592-632 from the values
buffer.  Of course, other implementations would always be free to use
offsets in the parent array.  So I think the log(N) approach would
still be needed as a fallback.
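
As a sanity check on those byte ranges, a small Python sketch of the arithmetic. The function name `buffer_byte_range` is hypothetical; the only inputs are the child array's offset, its length, and the element width.

```python
def buffer_byte_range(offset, length, byte_width):
    """Byte range [start, stop) of a fixed-width child array slice."""
    start = offset * byte_width
    return start, start + length * byte_width


print(buffer_byte_range(74, 5, 4))  # run ends (int32): (296, 316)
print(buffer_byte_range(74, 5, 8))  # values (int64):   (592, 632)
```

No search is involved here, since the child offsets are already physical.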

Other options:

 * Just eat the log(n) cost, it's not that expensive and any
application doing excessive slicing could cache the offsets
themselves.
 * Add an optional buffer offset to the spec that can be populated in
cases where random array access is not possible.
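
The first option (caching on the application side) could look roughly like the Python sketch below. `CachedRleSlice` and its fields are hypothetical names for illustration only, not part of Arrow or of the proposal; the `searches` counter exists just to show that the binary search runs once.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CachedRleSlice:
    run_ends: List[int]  # shared, unsliced run ends
    offset: int          # logical offset of the slice
    length: int          # logical length of the slice
    searches: int = 0    # how many binary searches were performed
    _physical_offset: Optional[int] = field(default=None, repr=False)

    def physical_offset(self) -> int:
        """Index of the first run touched by the slice, computed lazily."""
        if self._physical_offset is None:
            # First run whose end is greater than the logical offset.
            self._physical_offset = bisect_right(self.run_ends, self.offset)
            self.searches += 1
        return self._physical_offset


view = CachedRleSlice(run_ends=[3, 5, 9], offset=4, length=3)
view.physical_offset()
view.physical_offset()
print(view.physical_offset(), view.searches)  # prints: 1 1
```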

On Wed, Sep 14, 2022 at 10:53 AM Dewey Dunnington
<[email protected]> wrote:

 >  * Should we encode "run lengths" or "run ends"?

In addition to the points mentioned above, this seems the most consistent
with the variable-length binary/list layouts.

 > encoding the run ends as a buffer (similar to list array for example)
 > makes it difficult to calculate offsets

I don't have a strong opinion about this, but I also don't understand the
logic. Surely the implementation is just generating/reading a buffer of
integers and there's some overhead/indirection to wrapping it in an Array
(that must then be validated).

As a matter of curiosity, was a dictionary approach ever considered? If one
new layout was added (one buffer containing the run ends of a RLE0:N int32
array), the dictionary member could be the values array and perhaps make it
easier for implementations that already handle dictionaries.

On Wed, Sep 14, 2022 at 2:04 PM Matthew Topol <[email protected]>
wrote:
 > Just wanted to chime in here that I also have several draft PRs for
 > implementing the RLE arrays in Go as the second implementation (since
 > we use two implementations as a requirement to vote on
 > changes/additions to the format).
 >
 > They can be found here:
 >
 > <<https://github.com/apache/arrow/pull/14111>>
 > <<https://github.com/apache/arrow/pull/14114>>
 > <<https://github.com/apache/arrow/pull/14126>>
 >
 > --Matt
 >
 > On Wed, Sep 14 2022 at 09:44:15 AM -0700, Micah Kornfield
 > <[email protected]> wrote:
 > >>
 > >>   * Should we encode "run lengths" or "run ends"?
 > >
 > >
 > > I think the project has leaned towards sublinear access, so run ends
 > > make sense.  The downside is that we run into similar issues with
 > > List/LargeList where the total number of elements is limited by
 > > bit-width (which can also cause space wastage, e.g. with run ends it
 > > might be reasonable to limit bit-width to 16).
 > >
 > >>  The values are definitely a child array. However, encoding the run
 > >>  ends as a buffer (similar to list array for example) makes it
 > >>  difficult to calculate offsets. Translating an array offset to a
 > >>  buffer offset takes O(log(N)) time. If the run ends are encoded as a
 > >>  child array (so the RLE array has no buffers and two child arrays)
 > >>  then this problem goes away.
 > >
 > >
 > > I'm not sure I understand this, could you provide an example of the
 > > problem that the child array solves?
 > >
 > > On Wed, Sep 14, 2022 at 9:36 AM Weston Pace <[email protected]> wrote:
 > >
 > >>  I'm going to bump this because it would be good to get feedback. In
 > >>  particular it would be nice to get feedback on the suggested format
 > >>  change[1]. We are currently moving forward on coming up with an IPC
 > >>  format proposal which we will share when ready.
 > >>
 > >>  The two interesting points that jump out to me are:
 > >>
 > >>   * Should we encode "run lengths" or "run ends"?
 > >>
 > >>  For example, should 5,5,5,6,6,7,7,7,7 be encoded with "run lengths"
 > >>  3, 2, 4 or "run ends" 3, 5, 9? In the proposal the latter is preferred
 > >>  as that leads to O(log(N)) random access (instead of O(N)) and it's
 > >>  not clear there are any disadvantages.
 > >>
 > >>   * Should the run ends be encoded as a buffer or a child array?
 > >>
 > >>  The values are definitely a child array. However, encoding the run
 > >>  ends as a buffer (similar to list array for example) makes it
 > >>  difficult to calculate offsets. Translating an array offset to a
 > >>  buffer offset takes O(log(N)) time. If the run ends are encoded as a
 > >>  child array (so the RLE array has no buffers and two child arrays)
 > >>  then this problem goes away.
 > >>
 > >>  [1] <<https://github.com/apache/arrow/pull/13333/files>>
 > >>
 > >>  On Thu, Aug 25, 2022 at 10:35 AM Tobias Zagorni
 > >>  <[email protected]> wrote:
 > >>  >
 > >>  > Hello Everyone,
 > >>  >
 > >>  > Recently, I have implemented support for run-length encoding in
 > >>  > Arrow C++. So far my implementation is split into different subtasks of
 > >>  > ARROW-16771 (<<https://issues.apache.org/jira/browse/ARROW-16771>>).
 > >>  >
 > >>  > I have (draft) PRs available for:
 > >>  > - general handling of RLE in arrow C++, Type, Arrow, Builder
 > >>  > subclasses, etc.
 > >>  >   (subtasks 1-9)
 > >>  > - encode, decode kernels (fixed size only):
 > >>  >   (<<https://issues.apache.org/jira/browse/ARROW-16772>>)
 > >>  > - filter kernel (fixed size only):
 > >>  >   (<<https://issues.apache.org/jira/browse/ARROW-16774>>)
 > >>  > - simple benchmark for the RLE kernels
 > >>  >   (<<https://issues.apache.org/jira/browse/ARROW-17026>>)
 > >>  > - adding RLE to Arrow Columnar format document
 > >>  >   (<<https://issues.apache.org/jira/browse/ARROW-16773>>)
 > >>  >
 > >>  > What is not yet implemented:
 > >>  > - converting RLE to formats like Parquet, JSON, IPC.
 > >>  > - caching of physical offsets when working with sliced arrays -
 > >>  > finding these offsets is an O(log(n)) binary search which could be
 > >>  > avoided in a lot of cases
 > >>  >
 > >>  > I'm interested in any feedback on the code and I'm wondering what
 > >>  > would be the best way to get this merged.
 > >>  >
 > >>  > A lot of the PRs depend on earlier ones. I ordered the subtasks
 > >>  > in a way they could be merged. The first would be:
 > >>  > 1. Handling of array-only types using VisitTypeInline:
 > >>  >    <<https://issues.apache.org/jira/browse/ARROW-17258>>
 > >>  > 2. Adding RLE type / array class (only builds on #1):
 > >>  >    <<https://issues.apache.org/jira/browse/ARROW-17261>>
 > >>  > -  also, since it has no dependencies: adding RLE to Arrow
 > >>  > Columnar format document
 > >>  >    <<https://issues.apache.org/jira/browse/ARROW-16773>>
 > >>  >
 > >>  > Best,
 > >>  > Tobias
 > >>
 >
 >
