Great, yes, please go ahead and open JIRA issues. That would be the appropriate place to specify the development work more clearly.
Thanks

On Sun, Mar 3, 2019 at 7:36 PM Edmon Begoli <ebeg...@berkeley.edu> wrote:
>
> Thanks, Wes.
>
> _contrib_ could indeed be a good option for this.
>
> Unless the community objects, I suggest that I create a JIRA issue for this.
> We could use that issue for tracking and documentation of the intended
> purpose, design thinking, and also add as many details as possible.
>
> My team and I have every intention to implement this functionality, and
> within next six months, so it would be indeed good to stay coordinated, and
> integrate it into Arrow code base in some non-obtrusive way.
>
> Thank you,
> Edmon
>
> On Sun, Mar 3, 2019 at 7:57 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hi Edmon,
> >
> > Since we've just added a C++ API for "extension types" this might be a
> > place to try these out to define custom container types for text:
> >
> > https://github.com/apache/arrow/commit/a79cc809883192417920b501e41a0e8b63cd0ad1
> >
> > I don't have a sense of where such code should go in the project and
> > how many users it might have. It seems from my perspective better to
> > build something inside the Arrow community from the outset rather than
> > deal with a code donation at some point later in time.
> >
> > It seems we might want to create a "contrib" directory (either
> > cpp/src/arrow/contrib or cpp/contrib) for new things where we aren't
> > sure what is to become of the code.
> >
> > - Wes
> >
> > On Sat, Mar 2, 2019 at 10:33 PM Edmon Begoli <ebeg...@berkeley.edu> wrote:
> > >
> > > Hi Micah,
> > >
> > > In short, we recognize that storing text as arrow is possible and easy if
> > > we are to store text as array of bytes representing characters.
> > >
> > > What we are trying to do is to use arrow as the format/carrier between high
> > > performance text processing steps which like to operate on binary data
> > > structures (e.g. tries or DAFSAs).
> > >
> > > We have a working/draft approach where we would use arrow as the data
> > > structure carrier, and we would use encoders/decoders for how these
> > > structures are laid out into arrow layout.
> > >
> > > so, it could be something like:
> > >
> > > text.to_arrow(infer=true|dafsa|trie|b-trie) : arrow // writes arrow as
> > > format for the specified encoding. This could be implicit if we could store
> > > encoding in some kind of manifest
> > > arrow.to_text(infer=true|dafsa|trie|b-trie) : string // restores text from
> > > the arrow format, and from a specified encoding, same as above.
> > >
> > > Let me know what you think.
> > >
> > > Thank you,
> > > Edmon
> > >
> > > On Sat, Mar 2, 2019 at 10:50 PM Micah Kornfield <emkornfi...@gmail.com>
> > > wrote:
> > >
> > > > Hi Edmon,
> > > > This sound interesting, I'm not aware of any optimized text memory layout
> > > > beyond our standard string layout. Are there more details about the work
> > > > you are doing? It is a little bit hard to tell if this is a good fit for
> > > > Arrow from your description.
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > > On Sat, Mar 2, 2019 at 7:39 PM Edmon Begoli <ebeg...@berkeley.edu> wrote:
> > > >
> > > > > Colleagues:
> > > > >
> > > > > A colleague and I are working on optimized structures for memory and disk
> > > > > layout for raw and pre-processed text using specialized data structures,
> > > > > and with a goal of efficient I/O, inter-process transmissions, and
> > > > > media/memory storage of text-oriented data (e.g. clinical narratives,
> > > > > radiology and pathology reports, etc.)
> > > > >
> > > > > Has anyone on the Arrow dev team tackled this problem of efficient text
> > > > > storage yet?
> > > > > (not just plain text, but storing data structures in an arrow format)
> > > > >
> > > > > If not, would you welcome a contribution?
> > > > >
> > > > > Thank you,
> > > > > Edmon
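
For readers following the encoder/decoder idea in the thread (text.to_arrow / arrow.to_text), here is a minimal sketch of how a trie might be flattened into parallel columns and restored. It is an illustration only, not the design Edmon's team has in mind: the function names trie_to_columns and columns_to_words, the column names, and the "encoding" manifest key are all hypothetical, and plain Python lists stand in for Arrow arrays so the example has no dependencies.

```python
from collections import deque


def trie_to_columns(words):
    """Flatten a trie built from `words` into parallel columns, one row per node.

    Hypothetical layout: `label` holds each node's character, `parent` holds
    the row index of its parent (-1 for the root), and `terminal` marks nodes
    that end a word. Plain lists stand in for Arrow arrays.
    """
    # Build a nested-dict trie; the key None marks "a word ends here".
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True

    # Breadth-first traversal assigns row indices and fills the columns.
    labels, parents, terminal = [""], [-1], [None in root]
    queue = deque([(root, 0)])
    while queue:
        node, idx = queue.popleft()
        for ch in sorted(k for k in node if k is not None):
            child = node[ch]
            labels.append(ch)
            parents.append(idx)
            terminal.append(None in child)
            queue.append((child, len(labels) - 1))

    # The "encoding" entry plays the role of the manifest mentioned in the thread.
    return {"encoding": "trie", "label": labels, "parent": parents, "terminal": terminal}


def columns_to_words(columns):
    """Restore the sorted set of words from the columns produced above."""
    if columns["encoding"] != "trie":
        raise ValueError("unsupported encoding: %r" % columns["encoding"])
    words = []
    for idx, is_end in enumerate(columns["terminal"]):
        if not is_end:
            continue
        # Walk parent pointers up to the root, collecting characters in reverse.
        chars = []
        cur = idx
        while cur != -1:
            chars.append(columns["label"][cur])
            cur = columns["parent"][cur]
        words.append("".join(reversed(chars)))
    return sorted(words)


columns = trie_to_columns(["car", "card", "care", "cat"])
restored = columns_to_words(columns)
```

Note that this round trip recovers the distinct words in sorted order, not the original text with its ordering and duplicates; a real encoder would need additional columns for that.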