Re: [VOTE] Add RLE Arrays to Arrow Format

Matt Topol Mon, 19 Dec 2022 09:56:56 -0800

That leaves us with a total vote of +1.5, so the vote carries with the
caveat of changing the name to Run End Encoded rather than Run Length
Encoded. (Unless this means I need to do a new vote with the changed name?
This is my first time doing one of these, so please correct me if I need
to do a new vote!)


Thanks everyone for your feedback and comments!

I'm going to go update the Go- and Format-specific PRs to make them regular
PRs (instead of drafts) and get this all moving. Thanks in advance to
anyone who reviews the upcoming PRs!

--Matt

On Fri, Dec 16, 2022 at 8:24 PM Weston Pace <[email protected]> wrote:

> +1
>
> I agree that run-end encoding makes more sense but also don't see it
> as a deal breaker.
>
> The most compelling counter-argument I've seen for new types is to
> avoid a schism where some implementations do not support the newer
> types.  However, for the type proposed here I think the risk is low
> because data can be losslessly converted to existing formats for
> compatibility with any system that doesn't support the type.
>
> Another argument I've seen is that we should introduce a more formal
> distinction between "layouts" and "types" (with dictionary and
> run-end-encoding being layouts).  However, this seems like an
> impractical change at this point.  In addition, given that we have
> dictionary as an array type the cat is already out of the bag.
> Furthermore, systems and implementations are still welcome to make
> this distinction themselves.  The spec only needs to specify what the
> buffer layouts should be.  If a particular library chooses to group
> those layouts into two different categories I think that would still
> be feasible.
>
> -Weston
>
> On Fri, Dec 16, 2022 at 1:42 PM Andrew Lamb <[email protected]> wrote:
> >
> > +1 on the proposal as written
> >
> > I think it makes sense and offers exciting opportunities for faster
> > computation (especially for cases where Parquet files can be decoded
> > directly into such an array, avoiding unpacking; RLE-encoded
> > dictionaries are quite compelling).
> >
> > I would prefer to use the term Run-End-Encoding (which would also follow
> > the naming of the internal fields) but I don't view that as a deal
> > blocker.
> >
> > Thank you for all your work in this matter,
> > Andrew
> >
> > On Wed, Dec 14, 2022 at 5:08 PM Matt Topol <[email protected]> wrote:
> >
> > > I'm not at all opposed to renaming it as `Run-End-Encoding` if that
> > > would be preferable. Hopefully others will chime in with their
> > > feedback.
> > >
> > > --Matt
> > >
> > > On Wed, Dec 14, 2022 at 12:09 PM Ian Cook <[email protected]> wrote:
> > >
> > > > Thank you Matt, Tobias, and others for the great work on this.
> > > >
> > > > I am -0.5 on this proposal in its current form because (pardon the
> > > > pedantry) what we have implemented here is not run-length encoding;
> > > > it is run-end encoding. Based on community input, the choice was made
> > > > to store run ends instead of run lengths because this enables
> > > > O(log(N)) random access as opposed to O(N). This is a sensible choice,
> > > > but it comes with some trade-offs, including limitations in array
> > > > length (which may not really be a problem in practice) and lack of
> > > > bit-for-bit equivalence with RLE encodings that use run lengths, like
> > > > Velox's SequenceVector encoding (which I think is a more serious
> > > > problem in practice).
> > > >
> > > > I believe that we should either:
> > > > (a) rename this to "run-end encoding"
> > > > (b) change this to a parameterized type called "run encoding" that
> > > > takes a Boolean parameter specifying whether run lengths or run ends
> > > > are stored.
> > > >
> > > > Ian
> > > >
> > > > On Wed, Dec 14, 2022 at 11:27 AM Matt Topol <[email protected]>
> > > > wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > I'd like to propose adding the RLE type based on earlier
> > > > > discussions[1][2] to the Arrow format:
> > > > > - Columnar Format description:
> > > > >   https://github.com/apache/arrow/pull/13333/files#diff-8b68cf6859e881f2357f5df64bb073135d7ff6eeb51f116418660b3856564c60
> > > > > - Flatbuffers changes:
> > > > >   https://github.com/apache/arrow/pull/14176/files#diff-e54b4f5d2d279acc5d1df5df9a7636f0142a8041fe02f07034e0d8be48444b07
> > > > >
> > > > > There is a proposed implementation available in both C++ (written by
> > > > > Tobias Zagorni) and Go[3][4]. Both implementations have mostly the
> > > > > same tests implemented and were tested to be compatible over IPC with
> > > > > an archery test. In both cases, the implementations are split out
> > > > > among several Draft PRs so that they can be easily reviewed piecemeal
> > > > > if the vote is approved, with each Draft PR including the changes of
> > > > > the one before it. The links provided are the Draft PRs with the
> > > > > entirety of the changes included.
> > > > >
> > > > > The vote will be open for at least 72 hours.
> > > > >
> > > > > [ ] +1 add the proposed RLE type to the Apache Arrow format
> > > > > [ ] -1 do not add the proposed RLE type to the Apache Arrow format
> > > > > because...
> > > > >
> > > > > Thanks much, and please let me know if any more information or links
> > > > > are needed (I've never proposed a vote before on here!)
> > > > >
> > > > > --Matt
> > > > >
> > > > > [1] https://lists.apache.org/thread/bfz3m5nyf7flq7n6q9b1bx3jhcn4wq29
> > > > > [2] https://lists.apache.org/thread/xb7c723csrtwt0md3m4p56bt0193n7jq
> > > > > [3] https://github.com/apache/arrow/pull/14179
> > > > > [4] https://github.com/apache/arrow/pull/14223
> > > >
> > >
>
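
The run-end versus run-length trade-off raised in this thread (O(log(N))
random access versus O(N), and lossless conversion back to a plain array)
can be illustrated with a rough sketch. This is not the Arrow C++ or Go
implementation. It assumes each run end is the exclusive, cumulative
logical index at which a run stops, and every function name below is
hypothetical, chosen only for illustration.

```python
from bisect import bisect_right
from itertools import accumulate


def run_lengths_to_run_ends(run_lengths):
    """Convert run lengths to cumulative, exclusive run ends (assumed layout)."""
    return list(accumulate(run_lengths))


def find_run_by_run_ends(run_ends, logical_index):
    """Locate the run holding logical_index by binary search: O(log N)."""
    total = run_ends[-1] if run_ends else 0
    if not 0 <= logical_index < total:
        raise IndexError(logical_index)
    # The first run whose end is strictly greater than the index contains it.
    return bisect_right(run_ends, logical_index)


def find_run_by_run_lengths(run_lengths, logical_index):
    """Locate the run holding logical_index by a linear scan: O(N)."""
    if logical_index < 0:
        raise IndexError(logical_index)
    end = 0
    for run, length in enumerate(run_lengths):
        end += length
        if logical_index < end:
            return run
    raise IndexError(logical_index)


def decode(run_ends, values):
    """Expand a run-end encoded pair back into a plain list, losslessly."""
    out = []
    start = 0
    for end, value in zip(run_ends, values):
        out.extend([value] * (end - start))
        start = end
    return out
```

With run lengths the scan has to sum lengths from the start of the array,
while stored run ends are already sorted and can be searched directly. The
`decode` helper shows the kind of lossless fallback to an existing layout
that a system without support for the new type could rely on.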
