Just following up here - what's the status? It looks like there are some unaddressed comments on the PR?
On Tue, Nov 23, 2021, at 13:54, Micah Kornfield wrote:
> Sorry I just took a closer look and left some comments. I think the one
> substantive issue, is the document linked talks about different
> length columns in the Bag, and this isn't mentioned in the flatbuffers?
> Could you comment/update the documentations in flatbuffers accordingly?
>
> Thanks,
> Micah
>
> On Tue, Nov 23, 2021 at 10:41 AM David Li <lidav...@apache.org> wrote:
>
>> Thanks for putting that up.
>>
>> It doesn't look like there's been too much discussion here. If people
>> agree it's useful, maybe the next step is to draft an implementation in
>> Java or C++ for feedback? There was some discussion on the use cases in the
>> document, do we feel like we need to clarify that better?
>>
>> -David
>>
>> On Mon, Nov 8, 2021, at 16:46, Nate Bauernfeind wrote:
>> > I put the draft up here: https://github.com/apache/arrow/pull/11646
>> >
>> > Thanks.
>> >
>> > On Mon, Nov 8, 2021 at 1:57 PM David Li <lidav...@apache.org> wrote:
>> >
>> > > Hey Nate,
>> > >
>> > > Thanks for doing this! Would you be interested in putting that commit up
>> > > as a draft PR for discussion? I think we can discuss there.
>> > >
>> > > I'm not sure anyone is actively working on RLE or other encoding schemes
>> > > at the moment.
>> > >
>> > > -David
>> > >
>> > > On Mon, Nov 8, 2021, at 13:19, Nate Bauernfeind wrote:
>> > > > I've written up the ColumnBag proposal addressing items 1 and 2 on the
>> > > > list. I'm open to any and all feedback/suggestions.
>> > > >
>> > > > I'd be happy to add item 3 (binary metadata) to the proposed change set.
>> > > > Let me know if you want me to whip up the initial suggestion for that
>> > > > version (and whether or not to keep it separate from ColumnBag).
>> > > >
>> > > > Would RLE related efforts change the structure of RecordBatch or ColumnBag
>> > > > (if accepted)?
>> > > >
>> > > > Here is the brief history-discussion around why ColumnBag:
>> > > > https://docs.google.com/document/d/1jsmmqLTyJkU8fx0sUGIqd6yu72N4v9uHFsuGSgB_DfE/
>> > > >
>> > > > Here is a brief commit doctoring up the flatbuffer to support this version
>> > > > of the proposed change:
>> > > > https://github.com/nbauernfeind/arrow/tree/column_bag_demo_v1
>> > > >
>> > > > I don't know if it's better to comment in the document or bring comments
>> > > > back to the list. If it ends up being document heavy, then I'll summarize
>> > > > the main points back on the list.
>> > > >
>> > > > I think I'll get started on a Java impl just to learn more even if it ends
>> > > > up being extra work.
>> > > >
>> > > > Looking forward to your feedback,
>> > > > Nate
>> > > >
>> > > > On Mon, Aug 9, 2021 at 10:06 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> > > >
>> > > > > I'm still interested in RLE related effort, but not sure about my available
>> > > > > bandwidth (which is why I haven't made more of an effort there).
>> > > > >
>> > > > > On Tue, Aug 3, 2021 at 6:00 PM Wes McKinney <wesmck...@gmail.com> wrote:
>> > > > >
>> > > > > > Another Flatbuffers/Message.fbs project we should rekindle soon, in
>> > > > > > addition to the schema evolution/replacement question which has been
>> > > > > > raised with Flight, is that of sparse/compressed data (e.g. RLE). I
>> > > > > > have a vacation plus some travel coming up so won't be able to devote
>> > > > > > meaningful attention to this until the last part of August, but would
>> > > > > > like to help it move forward.
>> > > > > >
>> > > > > >
>> > > > > > On Tue, Jul 27, 2021 at 1:40 PM David Li <lidav...@apache.org> wrote:
>> > > > > > >
>> > > > > > > Hey Nate,
>> > > > > > >
>> > > > > > > For the first two points, semantically I'm tempted to think of it more
>> > > > > > > like the ability to send a "bag of columns" according to some schema (and
>> > > > > > > hence columns could have differing lengths or even be absent). This could
>> > > > > > > be a new structure alongside a record batch, which is semantically like a
>> > > > > > > "slice of a table" (and hence rectangular and complete), instead of
>> > > > > > > exposing existing users of RecordBatch to rather different behavior.
>> > > > > > >
>> > > > > > > For #3, a different thread was discussing some of the points there - it
>> > > > > > > sounds like it may be possible to relax from map<string, string> to
>> > > > > > > map<string, binary>.
>> > > > > > >
>> > > > > > > -David
>> > > > > > >
>> > > > > > > On Mon, Jul 26, 2021, at 11:01, Nate Bauernfeind wrote:
>> > > > > > > > Wes suggested that maybe there are enough new ideas that it may make sense
>> > > > > > > > to evolve-past the existing structures rather than to bolt-on new
>> > > > > > > > functionality. I would like to learn what requirements exist should new
>> > > > > > > > structures be adopted, and if applicable, would like to turn this into a
>> > > > > > > > full POC proposal.
>> > > > > > > >
>> > > > > > > > These are the features that I feel are missing from the existing design:
>> > > > > > > > - the ability to notify that the columns are not consistent in length (e.g.
>> > > > > > > > setting RecordBatch.length to -1; and give the arrow/flight user the true
>> > > > > > > > FieldNode lengths).
>> > > > > > > > - the ability to skip top-level field nodes that have length 0 at a small
>> > > > > > > > cost (such as in a bitset)
>> > > > > > > > - the ability to embed binary payload in the Message flatbuffer wrapper
>> > > > > > > > (instead of String payload only)
>> > > > > > > > - the ability to concurrently use more than one schema (the most likely API
>> > > > > > > > will look like how one identifies a dictionary. ideally dictionaries could
>> > > > > > > > be shared across field nodes in a schema or across schemas in the same
>> > > > > > > > flight)
>> > > > > > > >
>> > > > > > > > What other features, or improvements, could/should be considered? Any
>> > > > > > > > strong opinions against the ideas above? (Remember, that a goal of mine is
>> > > > > > > > to be able to send a RecordBatch of rows that were modified intersected
>> > > > > > > > only by the field-nodes that have changed (including those with only inner
>> > > > > > > > node changes); thus the columns are a subset of the full schema and that
>> > > > > > > > the length of each node is independent of the other).
>> > > > > > > >
>> > > > > > > > On Fri, Jul 9, 2021 at 9:26 AM Wes McKinney <wesmck...@gmail.com> wrote:
>> > > > > > > > > It sounds like we may want to discuss some potential evolutions of the
>> > > > > > > > > Arrow binary protocol (for example: new Message types). Certainly a
>> > > > > > > > > can of worms but rather than trying to bolt some new functionality
>> > > > > > > > > onto the existing structures, it might be better to support the new
>> > > > > > > > > use cases through some new structures which will be more clear cut
>> > > > > > > > > from a forward compatibility standpoint.
>> > > > > > > >
>> > > > > > > > Nate