Great, yes, please go ahead and open JIRA issues. That would be the
appropriate place to specify the development work more clearly.

Thanks

On Sun, Mar 3, 2019 at 7:36 PM Edmon Begoli <ebeg...@berkeley.edu> wrote:
>
> Thanks, Wes.
>
> _contrib_ could indeed be a good option for this.
>
> Unless the community objects, I suggest that I create a JIRA issue for this.
> We could use that issue for tracking and documentation of the intended
> purpose, design thinking, and also add as many details as possible.
>
> My team and I have every intention of implementing this functionality
> within the next six months, so it would indeed be good to stay coordinated
> and integrate it into the Arrow code base in some unobtrusive way.
>
> Thank you,
> Edmon
>
>
>
>
> On Sun, Mar 3, 2019 at 7:57 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hi Edmon,
> >
> > Since we've just added a C++ API for "extension types" this might be a
> > place to try these out to define custom container types for text:
> >
> >
> > https://github.com/apache/arrow/commit/a79cc809883192417920b501e41a0e8b63cd0ad1
> >
> > I don't have a sense of where such code should go in the project and
> > how many users it might have. It seems from my perspective better to
> > build something inside the Arrow community from the outset rather than
> > deal with a code donation at some point later in time.
> >
> > It seems we might want to create a "contrib" directory (either
> > cpp/src/arrow/contrib or cpp/contrib) for new things where we aren't
> > sure what is to become of the code.
> >
> > - Wes
> >
> > On Sat, Mar 2, 2019 at 10:33 PM Edmon Begoli <ebeg...@berkeley.edu> wrote:
> > >
> > > Hi Micah,
> > >
> > > In short, we recognize that storing text in Arrow is possible and easy
> > > if we store the text as an array of bytes representing characters.
> > >
> > > What we are trying to do is to use Arrow as the format/carrier between
> > > high-performance text processing steps, which like to operate on binary
> > > data structures (e.g. tries or DAFSAs).
> > >
> > > We have a working/draft approach where we would use Arrow as the data
> > > structure carrier, and we would use encoders/decoders to define how
> > > these structures are laid out in the Arrow layout.
> > >
> > > So, it could be something like:
> > >
> > > text.to_arrow(infer=true|dafsa|trie|b-trie) : arrow
> > >   // writes arrow as the format for the specified encoding. This could
> > >   // be implicit if we could store the encoding in some kind of manifest
> > > arrow.to_text(infer=true|dafsa|trie|b-trie) : string
> > >   // restores text from the arrow format, and from a specified
> > >   // encoding, same as above.
> > >
> > > Let me know what you think.
> > >
> > > Thank you,
> > > Edmon
> > >
> > > On Sat, Mar 2, 2019 at 10:50 PM Micah Kornfield <emkornfi...@gmail.com>
> > > wrote:
> > >
> > > > Hi Edmon,
> > > > This sounds interesting. I'm not aware of any optimized text memory
> > > > layout beyond our standard string layout. Are there more details
> > > > about the work you are doing? It is a little bit hard to tell from
> > > > your description whether this is a good fit for Arrow.
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > > > On Sat, Mar 2, 2019 at 7:39 PM Edmon Begoli <ebeg...@berkeley.edu>
> > > > > wrote:
> > > >
> > > > > Colleagues:
> > > > >
> > > > > A colleague and I are working on optimized structures for memory
> > > > > and disk layout for raw and pre-processed text using specialized
> > > > > data structures, and with a goal of efficient I/O, inter-process
> > > > > transmissions, and media/memory storage of text-oriented data
> > > > > (e.g. clinical narratives, radiology and pathology reports, etc.)
> > > > >
> > > > > Has anyone on the Arrow dev team tackled this problem of efficient
> > > > > text storage yet?
> > > > > (not just plain text, but storing data structures in an arrow format)
> > > > >
> > > > > If not, would you welcome a contribution?
> > > > >
> > > > > Thank you,
> > > > > Edmon
> > > > >
> > > >
> >
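
Editorial sketch, not part of the original thread: the to_arrow / to_text idea quoted above amounts to flattening a text data structure into flat, fixed-type columns and restoring it afterwards. The Python below is a minimal illustration of that round trip for a trie, under stated assumptions. The function names and the column layout (offsets, labels, targets, terminal) are hypothetical and are not an Arrow API or the authors' design; plain Python lists stand in for Arrow arrays. It restores the set of stored strings, not the original text order or duplicates.

```python
def trie_to_columns(words):
    """Flatten a trie built from `words` into parallel columns (hypothetical layout)."""
    # nodes[i] maps an edge character to a child node index; node 0 is the root.
    nodes = [{}]
    terminal = [False]
    for word in words:
        cur = 0
        for ch in word:
            if ch not in nodes[cur]:
                nodes.append({})
                terminal.append(False)
                nodes[cur][ch] = len(nodes) - 1
            cur = nodes[cur][ch]
        terminal[cur] = True

    # Edges of node i occupy positions offsets[i] .. offsets[i + 1] - 1
    # in the `labels` and `targets` columns (a list-style offsets encoding).
    offsets, labels, targets = [0], [], []
    for edges in nodes:
        for ch in sorted(edges):
            labels.append(ch)
            targets.append(edges[ch])
        offsets.append(len(labels))
    return {
        "offsets": offsets,
        "labels": labels,
        "targets": targets,
        "terminal": terminal,
    }


def columns_to_words(cols):
    """Walk the flattened trie and return the stored words in sorted order."""
    words = []
    stack = [(0, "")]  # (node index, prefix spelled so far)
    while stack:
        node, prefix = stack.pop()
        if cols["terminal"][node]:
            words.append(prefix)
        for e in range(cols["offsets"][node], cols["offsets"][node + 1]):
            stack.append((cols["targets"][e], prefix + cols["labels"][e]))
    return sorted(words)


if __name__ == "__main__":
    vocabulary = ["car", "card", "care", "cat", "dog"]
    columns = trie_to_columns(vocabulary)
    print(columns_to_words(columns))
```

Each column holds values of a single type (integers, single characters, booleans), which is the property that would let such a structure be carried in a columnar format; how it would map onto actual Arrow types or extension types is left open here.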
