I put the draft up here: https://github.com/apache/arrow/pull/11646
Thanks.

On Mon, Nov 8, 2021 at 1:57 PM David Li <lidav...@apache.org> wrote:

> Hey Nate,
>
> Thanks for doing this! Would you be interested in putting that commit up
> as a draft PR for discussion? I think we can discuss there.
>
> I'm not sure anyone is actively working on RLE or other encoding schemes
> at the moment.
>
> -David
>
> On Mon, Nov 8, 2021, at 13:19, Nate Bauernfeind wrote:
> > I've written up the ColumnBag proposal addressing items 1 and 2 on the
> > list. I'm open to any and all feedback/suggestions.
> >
> > I'd be happy to add item 3 (binary metadata) to the proposed change set.
> > Let me know if you want me to whip up the initial suggestion for that
> > version (and whether or not to keep it separate from ColumnBag).
> >
> > Would RLE related efforts change the structure of RecordBatch or ColumnBag
> > (if accepted)?
> >
> > Here is the brief history-discussion around why ColumnBag:
> > https://docs.google.com/document/d/1jsmmqLTyJkU8fx0sUGIqd6yu72N4v9uHFsuGSgB_DfE/
> >
> > Here is a brief commit doctoring up the flatbuffer to support this version
> > of the proposed change:
> > https://github.com/nbauernfeind/arrow/tree/column_bag_demo_v1
> >
> > I don't know if it's better to comment in the document or bring comments
> > back to the list. If it ends up being document heavy, then I'll summarize
> > the main points back on the list.
> >
> > I think I'll get started on a Java impl just to learn more even if it ends
> > up being extra work.
> >
> > Looking forward to your feedback,
> > Nate
> >
> > On Mon, Aug 9, 2021 at 10:06 PM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> > > I'm still interested in RLE related effort, but not sure about my available
> > > bandwidth (which is why I haven't made more of an effort there).
> > >
> > > On Tue, Aug 3, 2021 at 6:00 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > > > Another Flatbuffers/Message.fbs project we should rekindle soon, in
> > > > addition to the schema evolution/replacement question which has been
> > > > raised with Flight, is that of sparse/compressed data (e.g. RLE). I
> > > > have a vacation plus some travel coming up so won't be able to devote
> > > > meaningful attention to this until the last part of August, but would
> > > > like to help it move forward.
> > > >
> > > > On Tue, Jul 27, 2021 at 1:40 PM David Li <lidav...@apache.org> wrote:
> > > > >
> > > > > Hey Nate,
> > > > >
> > > > > For the first two points, semantically I'm tempted to think of it more
> > > > > like the ability to send a "bag of columns" according to some schema (and
> > > > > hence columns could have differing lengths or even be absent). This could
> > > > > be a new structure alongside a record batch, which is semantically like a
> > > > > "slice of a table" (and hence rectangular and complete), instead of
> > > > > exposing existing users of RecordBatch to rather different behavior.
> > > > >
> > > > > For #3, a different thread was discussing some of the points there - it
> > > > > sounds like it may be possible to relax from map<string, string> to
> > > > > map<string, binary>.
> > > > >
> > > > > -David
> > > > >
> > > > > On Mon, Jul 26, 2021, at 11:01, Nate Bauernfeind wrote:
> > > > > > Wes suggested that maybe there are enough new ideas that it may make sense
> > > > > > to evolve-past the existing structures rather than to bolt-on new
> > > > > > functionality. I would like to learn what requirements exist should new
> > > > > > structures be adopted, and if applicable, would like to turn this into a
> > > > > > full POC proposal.
> > > > > >
> > > > > > These are the features that I feel are missing from the existing design:
> > > > > > - the ability to notify that the columns are not consistent in length (e.g.
> > > > > >   setting RecordBatch.length to -1; and give the arrow/flight user the true
> > > > > >   FieldNode lengths).
> > > > > > - the ability to skip top-level field nodes that have length 0 at a small
> > > > > >   cost (such as in a bitset)
> > > > > > - the ability to embed binary payload in the Message flatbuffer wrapper
> > > > > >   (instead of String payload only)
> > > > > > - the ability to concurrently use more than one schema (the most likely API
> > > > > >   will look like how one identifies a dictionary. ideally dictionaries could
> > > > > >   be shared across field nodes in a schema or across schemas in the same
> > > > > >   flight)
> > > > > >
> > > > > > What other features, or improvements, could/should be considered? Any
> > > > > > strong opinions against the ideas above? (Remember, that a goal of mine is
> > > > > > to be able to send a RecordBatch of rows that were modified intersected
> > > > > > only by the field-nodes that have changed (including those with only inner
> > > > > > node changes); thus the columns are a subset of the full schema and that
> > > > > > the length of each node is independent of the other).
> > > > > >
> > > > > > On Fri, Jul 9, 2021 at 9:26 AM Wes McKinney <wesmck...@gmail.com> wrote:
> > > > > > > It sounds like we may want to discuss some potential evolutions of the
> > > > > > > Arrow binary protocol (for example: new Message types).
> > > > > > > Certainly a can of worms but rather than trying to bolt some new
> > > > > > > functionality onto the existing structures, it might be better to support
> > > > > > > the new use cases through some new structures which will be more clear cut
> > > > > > > from a forward compatibility standpoint.
> > > > > >
> > > > > > Nate
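
The first two missing features in Nate's list (columns whose lengths differ, and skipping zero-length top-level columns at the cost of one bit each) can be sketched roughly as follows. This is a toy Python model only, not the Flatbuffers layout from the draft PR or any Arrow API: the function names encode_bag and decode_bag, the use of plain Python lists as column payloads, and the least-significant-bit-first bit order are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def encode_bag(
    schema: Sequence[str], columns: Dict[str, list]
) -> Tuple[bytes, List[int], List[list]]:
    """Encode only the non-empty columns of a hypothetical "bag of columns".

    Returns (bitset, lengths, payloads). Bit i of the bitset is set when the
    i-th schema field is present; lengths and payloads hold one entry per
    present field, in schema order.
    """
    bitset = bytearray((len(schema) + 7) // 8)
    lengths: List[int] = []
    payloads: List[list] = []
    for i, name in enumerate(schema):
        values = columns.get(name, [])
        if len(values) == 0:
            continue  # absent or empty column: its bit simply stays 0
        bitset[i // 8] |= 1 << (i % 8)
        lengths.append(len(values))  # each column carries its own length
        payloads.append(list(values))
    return bytes(bitset), lengths, payloads


def decode_bag(
    schema: Sequence[str],
    bitset: bytes,
    lengths: Sequence[int],
    payloads: Sequence[list],
) -> Dict[str, list]:
    """Rebuild a name -> values mapping; skipped fields come back empty."""
    present = iter(zip(lengths, payloads))
    out: Dict[str, list] = {}
    for i, name in enumerate(schema):
        if bitset[i // 8] & (1 << (i % 8)):
            length, values = next(present)
            if length != len(values):
                raise ValueError(f"length mismatch for column {name!r}")
            out[name] = list(values)
        else:
            out[name] = []
    return out


schema = ["a", "b", "c"]
bitset, lengths, payloads = encode_bag(schema, {"a": [1, 2, 3], "c": [9]})
decoded = decode_bag(schema, bitset, lengths, payloads)
```

In this sketch column "b" is never written, and "a" and "c" keep their own independent lengths, which is the non-rectangular behavior that distinguishes the "bag of columns" idea from a RecordBatch.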