Huzzah! That brings us to 3 +1 (binding) votes, and 1 +1 (non-binding) vote!
The vote passes! I've updated the PR for the format changes (on their own)
here: https://github.com/apache/arrow/pull/14176 and will follow it up with
updating the other PRs as I can. If anyone could comment / approve that PR,
I'll merge it to kick this off and start getting the other PRs ready for
review.

Thanks everyone!

On Mon, Dec 19, 2022 at 4:59 PM Ian Cook <i...@ursacomputing.com> wrote:

> @Matt Topol: Yes, a change of the name to "run-end encoding" changes
> my (non-binding) vote to a +1.
>
> On Mon, Dec 19, 2022 at 3:32 PM Matthew Topol
> <m...@voltrondata.com.invalid> wrote:
> >
> > Okay, slight edit to my previous email: It was brought to my attention
> > that we need at least 3 +1 binding votes, so this vote is still open
> > for the moment.
> >
> > @IanCook: With the change of the name to RunEndEncoding is that
> > sufficient to change your vote to a +1?
> >
> > On Mon, Dec 19, 2022 at 12:57 PM Matt Topol <zotthewiz...@gmail.com>
> > wrote:
> >
> > > That leaves us with a total vote of +1.5 so the vote carries with the
> > > caveat of changing the name to be Run End Encoded rather than Run
> > > Length Encoded (unless this means I need to do a new vote with the
> > > changed name? This is my first time doing one of these so please
> > > correct me if I need to do a new vote!)
> > >
> > > Thanks everyone for your feedback and comments!
> > >
> > > I'm going to go update the Go and Format specific PRs to make them
> > > regular PR's (instead of drafts) and get this all moving. Thanks in
> > > advance to anyone who reviews the upcoming PRs!
> > >
> > > --Matt
> > >
> > > On Fri, Dec 16, 2022 at 8:24 PM Weston Pace <weston.p...@gmail.com>
> > > wrote:
> > >
> > > > +1
> > > >
> > > > I agree that run-end encoding makes more sense but also don't see it
> > > > as a deal breaker.
> > > >
> > > > The most compelling counter-argument I've seen for new types is to
> > > > avoid a schism where some implementations do not support the newer
> > > > types. However, for the type proposed here I think the risk is low
> > > > because data can be losslessly converted to existing formats for
> > > > compatibility with any system that doesn't support the type.
> > > >
> > > > Another argument I've seen is that we should introduce a more formal
> > > > distinction between "layouts" and "types" (with dictionary and
> > > > run-end-encoding being layouts). However, this seems like an
> > > > impractical change at this point. In addition, given that we have
> > > > dictionary as an array type the cat is already out of the bag.
> > > > Furthermore, systems and implementations are still welcome to make
> > > > this distinction themselves. The spec only needs to specify what the
> > > > buffer layouts should be. If a particular library chooses to group
> > > > those layouts into two different categories I think that would still
> > > > be feasible.
> > > >
> > > > -Weston
> > > >
> > > > On Fri, Dec 16, 2022 at 1:42 PM Andrew Lamb <al...@influxdata.com>
> > > > wrote:
> > > > >
> > > > > +1 on the proposal as written
> > > > >
> > > > > I think it makes sense and offers exciting opportunities for faster
> > > > > computation (especially for cases where parquet files can be
> > > > > decoded directly into such an array and avoid unpacking. RLE
> > > > > encoded dictionary are quite compelling)
> > > > >
> > > > > I would prefer to use the term Run-End-Encoding (which would also
> > > > > follow the naming of the internal fields) but I don't view that as
> > > > > a deal blocker.
> > > > >
> > > > > Thank you for all your work in this matter,
> > > > > Andrew
> > > > >
> > > > > On Wed, Dec 14, 2022 at 5:08 PM Matt Topol <zotthewiz...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > I'm not at all opposed to renaming it as `Run-End-Encoding` if
> > > > > > that would be preferable. Hopefully others will chime in with
> > > > > > their feedback.
> > > > > >
> > > > > > --Matt
> > > > > >
> > > > > > On Wed, Dec 14, 2022 at 12:09 PM Ian Cook
> > > > > > <i...@ursacomputing.com> wrote:
> > > > > >
> > > > > > > Thank you Matt, Tobias, and others for the great work on this.
> > > > > > >
> > > > > > > I am -0.5 on this proposal in its current form because (pardon
> > > > > > > the pedantry) what we have implemented here is not run-length
> > > > > > > encoding; it is run-end encoding. Based on community input, the
> > > > > > > choice was made to store run ends instead of run lengths
> > > > > > > because this enables O(log(N)) random access as opposed to
> > > > > > > O(N). This is a sensible choice, but it comes with some
> > > > > > > trade-offs including limitations in array length (which maybe
> > > > > > > not really a problem in practice) and lack of bit-for-bit
> > > > > > > equivalence with RLE encodings that use run lengths like
> > > > > > > Velox's SequenceVector encoding (which I think is a more
> > > > > > > serious problem in practice).
> > > > > > >
> > > > > > > I believe that we should either:
> > > > > > > (a) rename this to "run-end encoding"
> > > > > > > (b) change this to a parameterized type called "run encoding"
> > > > > > > that takes a Boolean parameter specifying whether run lengths
> > > > > > > or run ends are stored.
> > > > > > >
> > > > > > > Ian
> > > > > > >
> > > > > > > On Wed, Dec 14, 2022 at 11:27 AM Matt Topol
> > > > > > > <zotthewiz...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > I'd like to propose adding the RLE type based on earlier
> > > > > > > > discussions[1][2] to the Arrow format:
> > > > > > > > - Columnar Format description:
> > > > > > > > https://github.com/apache/arrow/pull/13333/files#diff-8b68cf6859e881f2357f5df64bb073135d7ff6eeb51f116418660b3856564c60
> > > > > > > > - Flatbuffers changes:
> > > > > > > > https://github.com/apache/arrow/pull/14176/files#diff-e54b4f5d2d279acc5d1df5df9a7636f0142a8041fe02f07034e0d8be48444b07
> > > > > > > >
> > > > > > > > There is a proposed implementation available in both C++
> > > > > > > > (written by Tobias Zagorni) and Go[3][4]. Both
> > > > > > > > implementations have mostly the same tests implemented and
> > > > > > > > were tested to be compatible over IPC with an archery test.
> > > > > > > > In both cases, the implementations are split out among
> > > > > > > > several Draft PRs so that they can be easily reviewed
> > > > > > > > piecemeal if the vote is approved, with each Draft PR
> > > > > > > > including the changes of the one before it. The links
> > > > > > > > provided are the Draft PRs with the entirety of the changes
> > > > > > > > included.
> > > > > > > >
> > > > > > > > The vote will be open for at least 72 hours.
> > > > > > > >
> > > > > > > > [ ] +1 add the proposed RLE type to the Apache Arrow format
> > > > > > > > [ ] -1 do not add the proposed RLE type to the Apache Arrow
> > > > > > > > format because...
> > > > > > > >
> > > > > > > > Thanks much, and please let me know if any more information
> > > > > > > > or links are needed (I've never proposed a vote before on
> > > > > > > > here!)
> > > > > > > >
> > > > > > > > --Matt
> > > > > > > >
> > > > > > > > [1] https://lists.apache.org/thread/bfz3m5nyf7flq7n6q9b1bx3jhcn4wq29
> > > > > > > > [2] https://lists.apache.org/thread/xb7c723csrtwt0md3m4p56bt0193n7jq
> > > > > > > > [3] https://github.com/apache/arrow/pull/14179
> > > > > > > > [4] https://github.com/apache/arrow/pull/14223
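
For anyone reading the naming discussion without the format PR open, here is a rough Python sketch of the distinction Ian draws in the thread. Storing run ends (cumulative, exclusive end positions) instead of run lengths lets a lookup binary-search the run ends, which is the O(log(N)) random access he mentions, and the decode function illustrates the lossless conversion back to a plain array that Weston refers to. The function names (`run_end_encode`, `run_end_decode`, `value_at`) and the use of plain Python lists are assumptions made for illustration only; this is not the Arrow C++ or Go API, and it does not reproduce the buffer layout from the format PR.

```python
from bisect import bisect_right
from itertools import accumulate, groupby


def run_end_encode(values):
    """Return (run_ends, run_values).

    run_ends[i] is the exclusive end index of run i, i.e. the cumulative
    sum of run lengths. (Hypothetical helper, not an Arrow function.)
    """
    run_values, lengths = [], []
    for value, group in groupby(values):
        run_values.append(value)
        lengths.append(sum(1 for _ in group))
    return list(accumulate(lengths)), run_values


def run_end_decode(run_ends, run_values):
    """Expand the runs back into a plain list (lossless round trip)."""
    out, start = [], 0
    for end, value in zip(run_ends, run_values):
        out.extend([value] * (end - start))
        start = end
    return out


def value_at(run_ends, run_values, i):
    """Random access by binary search over run ends: O(log(number of runs))."""
    length = run_ends[-1] if run_ends else 0
    if not 0 <= i < length:
        raise IndexError(i)
    # The first run whose end is strictly greater than i contains index i.
    return run_values[bisect_right(run_ends, i)]


data = ["a", "a", "a", "b", "c", "c"]
run_ends, run_values = run_end_encode(data)
# run_ends is [3, 4, 6]; a run-length encoding would store [3, 1, 2] instead.
assert run_end_decode(run_ends, run_values) == data
assert value_at(run_ends, run_values, 3) == "b"
```

With run lengths, the same lookup would have to sum lengths from the start until it passes the index, which is the O(N) cost mentioned in the thread.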