Huzzah!

That brings us to 3 +1 (binding) votes, and 1 +1 (non-binding) vote!

The vote passes! I've updated the PR for the format changes (on their own)
here: https://github.com/apache/arrow/pull/14176 and will follow it up with
updating the other PRs as I can. If anyone could comment / approve that PR,
I'll merge it to kick this off and start getting the other PRs ready for
review.

Thanks everyone!

On Mon, Dec 19, 2022 at 4:59 PM Ian Cook <i...@ursacomputing.com> wrote:

> @Matt Topol: Yes, a change of the name to "run-end encoding" changes
> my (non-binding) vote to a +1.
>
> On Mon, Dec 19, 2022 at 3:32 PM Matthew Topol
> <m...@voltrondata.com.invalid> wrote:
> >
> > Okay, slight edit to my previous email: It was brought to my attention
> > that we need at least 3 +1 binding votes, so this vote is still open
> > for the moment.
> >
> > @IanCook: With the change of the name to RunEndEncoding, is that
> > sufficient to change your vote to a +1?
> >
> > On Mon, Dec 19, 2022 at 12:57 PM Matt Topol <zotthewiz...@gmail.com>
> > wrote:
> >
> > > That leaves us with a total vote of +1.5 so the vote carries with the
> > > caveat of changing the name to be Run End Encoded rather than Run
> > > Length Encoded (unless this means I need to do a new vote with the
> > > changed name? This is my first time doing one of these so please
> > > correct me if I need to do a new vote!)
> > >
> > > Thanks everyone for your feedback and comments!
> > >
> > > I'm going to go update the Go and Format specific PRs to make them
> > > regular PRs (instead of drafts) and get this all moving. Thanks in
> > > advance to anyone who reviews the upcoming PRs!
> > >
> > > --Matt
> > >
> > > On Fri, Dec 16, 2022 at 8:24 PM Weston Pace <weston.p...@gmail.com>
> > > wrote:
> > >
> > > > +1
> > > >
> > > > I agree that run-end encoding makes more sense but also don't see it
> > > > as a deal breaker.
> > > >
> > > > The most compelling counter-argument I've seen for new types is to
> > > > avoid a schism where some implementations do not support the newer
> > > > types.  However, for the type proposed here I think the risk is low
> > > > because data can be losslessly converted to existing formats for
> > > > compatibility with any system that doesn't support the type.
> > > >
> > > > Another argument I've seen is that we should introduce a more formal
> > > > distinction between "layouts" and "types" (with dictionary and
> > > > run-end-encoding being layouts).  However, this seems like an
> > > > impractical change at this point.  In addition, given that we have
> > > > dictionary as an array type the cat is already out of the bag.
> > > > Furthermore, systems and implementations are still welcome to make
> > > > this distinction themselves.  The spec only needs to specify what the
> > > > buffer layouts should be.  If a particular library chooses to group
> > > > those layouts into two different categories I think that would still
> > > > be feasible.
> > > >
> > > > -Weston
> > > >
> > > > On Fri, Dec 16, 2022 at 1:42 PM Andrew Lamb <al...@influxdata.com>
> > > > wrote:
> > > > >
> > > > > +1 on the proposal as written
> > > > >
> > > > > I think it makes sense and offers exciting opportunities for faster
> > > > > computation (especially for cases where Parquet files can be
> > > > > decoded directly into such an array, avoiding unpacking;
> > > > > RLE-encoded dictionaries are quite compelling).
> > > > >
> > > > > I would prefer to use the term Run-End-Encoding (which would also
> > > > > follow the naming of the internal fields) but I don't view that as
> > > > > a deal blocker.
> > > > >
> > > > > Thank you for all your work in this matter,
> > > > > Andrew
> > > > >
> > > > > On Wed, Dec 14, 2022 at 5:08 PM Matt Topol
> > > > > <zotthewiz...@gmail.com> wrote:
> > > > >
> > > > > > I'm not at all opposed to renaming it as `Run-End-Encoding` if
> > > > > > that would be preferable. Hopefully others will chime in with
> > > > > > their feedback.
> > > > > >
> > > > > > --Matt
> > > > > >
> > > > > > On Wed, Dec 14, 2022 at 12:09 PM Ian Cook
> > > > > > <i...@ursacomputing.com> wrote:
> > > > > >
> > > > > > > Thank you Matt, Tobias, and others for the great work on this.
> > > > > > >
> > > > > > > I am -0.5 on this proposal in its current form because (pardon
> > > > > > > the pedantry) what we have implemented here is not run-length
> > > > > > > encoding; it is run-end encoding. Based on community input, the
> > > > > > > choice was made to store run ends instead of run lengths because
> > > > > > > this enables O(log(N)) random access as opposed to O(N). This is
> > > > > > > a sensible choice, but it comes with some trade-offs, including
> > > > > > > limitations in array length (which may not really be a problem
> > > > > > > in practice) and lack of bit-for-bit equivalence with RLE
> > > > > > > encodings that use run lengths, like Velox's SequenceVector
> > > > > > > encoding (which I think is a more serious problem in practice).
> > > > > > >
> > > > > > > I believe that we should either:
> > > > > > > (a) rename this to "run-end encoding"
> > > > > > > (b) change this to a parameterized type called "run encoding"
> > > > > > > that takes a Boolean parameter specifying whether run lengths or
> > > > > > > run ends are stored.
> > > > > > >
> > > > > > > Ian
> > > > > > >
> > > > > > > On Wed, Dec 14, 2022 at 11:27 AM Matt Topol
> > > > > > > <zotthewiz...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > I'd like to propose adding the RLE type based on earlier
> > > > > > > > discussions[1][2] to the Arrow format:
> > > > > > > > - Columnar Format description:
> > > > > > > > https://github.com/apache/arrow/pull/13333/files#diff-8b68cf6859e881f2357f5df64bb073135d7ff6eeb51f116418660b3856564c60
> > > > > > > > - Flatbuffers changes:
> > > > > > > > https://github.com/apache/arrow/pull/14176/files#diff-e54b4f5d2d279acc5d1df5df9a7636f0142a8041fe02f07034e0d8be48444b07
> > > > > > > >
> > > > > > > > There is a proposed implementation available in both C++
> > > > > > > > (written by Tobias Zagorni) and Go[3][4]. Both implementations
> > > > > > > > have mostly the same tests implemented and were tested to be
> > > > > > > > compatible over IPC with an archery test. In both cases, the
> > > > > > > > implementations are split out among several Draft PRs so that
> > > > > > > > they can be easily reviewed piecemeal if the vote is approved,
> > > > > > > > with each Draft PR including the changes of the one before it.
> > > > > > > > The links provided are the Draft PRs with the entirety of the
> > > > > > > > changes included.
> > > > > > > >
> > > > > > > > The vote will be open for at least 72 hours.
> > > > > > > >
> > > > > > > > [ ] +1 add the proposed RLE type to the Apache Arrow format
> > > > > > > > [ ] -1 do not add the proposed RLE type to the Apache Arrow
> > > > > > > > format because...
> > > > > > > >
> > > > > > > > Thanks much, and please let me know if any more information or
> > > > > > > > links are needed (I've never proposed a vote before on here!)
> > > > > > > >
> > > > > > > > --Matt
> > > > > > > >
> > > > > > > > [1] https://lists.apache.org/thread/bfz3m5nyf7flq7n6q9b1bx3jhcn4wq29
> > > > > > > > [2] https://lists.apache.org/thread/xb7c723csrtwt0md3m4p56bt0193n7jq
> > > > > > > > [3] https://github.com/apache/arrow/pull/14179
> > > > > > > > [4] https://github.com/apache/arrow/pull/14223
> > > > > > >
> > > > > >
> > > >
> > >
>
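For anyone skimming the thread, here is a rough sketch of the distinction Ian raised between run ends and run lengths, and of Weston's point that the data converts losslessly back to a plain array. It is plain Python, not the Arrow C++ or Go implementation; the function names (`run_end_encode`, `value_at`, `decode`, `run_lengths`) are made up for illustration, and it assumes each run end is the exclusive logical index at which its run stops (i.e. a cumulative length).

```python
from bisect import bisect_right
from itertools import groupby


def run_end_encode(values):
    """Return (run_ends, run_values) for a sequence.

    Assumption for this sketch: run_ends[i] is the exclusive logical index
    at which run i stops, so the last run end equals len(values).
    """
    run_ends, run_values = [], []
    end = 0
    for value, group in groupby(values):
        end += sum(1 for _ in group)  # length of this run
        run_ends.append(end)
        run_values.append(value)
    return run_ends, run_values


def value_at(run_ends, run_values, i):
    """Random access via binary search over the run ends (O(log N) in runs)."""
    length = run_ends[-1] if run_ends else 0
    if not 0 <= i < length:
        raise IndexError(i)
    # The run containing logical index i is the first run whose end is > i.
    return run_values[bisect_right(run_ends, i)]


def decode(run_ends, run_values):
    """Losslessly expand back to a plain list."""
    out, start = [], 0
    for end, value in zip(run_ends, run_values):
        out.extend([value] * (end - start))
        start = end
    return out


def run_lengths(run_ends):
    """Recover run lengths by differencing adjacent run ends."""
    return [end - start for start, end in zip([0] + run_ends[:-1], run_ends)]


data = ["a", "a", "a", "b", "c", "c"]
ends, vals = run_end_encode(data)   # ends == [3, 4, 6], vals == ["a", "b", "c"]
assert value_at(ends, vals, 3) == "b"
assert decode(ends, vals) == data
assert run_lengths(ends) == [3, 1, 2]
```

With run lengths stored instead, `value_at` would have to accumulate lengths from the start of the array to find the right run, which is the O(N) case mentioned in the thread. The two representations hold the same information but are not bit-for-bit identical, as `run_lengths` shows.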
