Re: Text data structures-optimized layout in Arrow

2019-03-03 Thread Wes McKinney
Great, yes, please go ahead and open JIRA issues. That would be the appropriate place to make the development work more clearly specified. Thanks On Sun, Mar 3, 2019 at 7:36 PM Edmon Begoli wrote: > > Thanks, Wes. > > _contrib_ could indeed be a good option for this. > > Unless the community obje

Re: Text data structures-optimized layout in Arrow

2019-03-03 Thread Edmon Begoli
Thanks, Wes. _contrib_ could indeed be a good option for this. Unless the community objects, I suggest that I create a JIRA issue for this. We could use that issue for tracking and documentation of the intended purpose, design thinking, and also add as many details as possible. My team and I hav

Re: Text data structures-optimized layout in Arrow

2019-03-03 Thread Wes McKinney
hi Edmon, Since we've just added a C++ API for "extension types" this might be a place to try these out to define custom container types for text: https://github.com/apache/arrow/commit/a79cc809883192417920b501e41a0e8b63cd0ad1 I don't have a sense of where such code should go in the project and

Re: Text data structures-optimized layout in Arrow

2019-03-02 Thread Edmon Begoli
Hi Micah, In short, we recognize that storing text in Arrow is possible and easy if we store text as an array of bytes representing characters. What we are trying to do is to use Arrow as the format/carrier between high-performance text processing steps which like to operate on binary data st

Re: Text data structures-optimized layout in Arrow

2019-03-02 Thread Micah Kornfield
Hi Edmon, This sounds interesting, I'm not aware of any optimized text memory layout beyond our standard string layout. Are there more details about the work you are doing? It is a little bit hard to tell if this is a good fit for Arrow from your description. Thanks, Micah On Sat, Mar 2, 2019 a
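
For readers unfamiliar with the "standard string layout" mentioned above: Arrow's variable-size string arrays pair a buffer of 32-bit offsets with one contiguous buffer of UTF-8 bytes. The following is only a rough pure-Python sketch of that idea, not Arrow's implementation. The function names `encode_strings` and `get_string` are hypothetical, and the validity bitmap that Arrow uses for nulls is left out for brevity.

```python
import struct


def encode_strings(values):
    """Pack a list of str into (offsets, data) buffers.

    Illustrative only: mimics the offsets-plus-bytes idea behind Arrow's
    string layout. Null handling (validity bitmap) is omitted.
    """
    offsets = [0]
    data = bytearray()
    for value in values:
        data += value.encode("utf-8")
        # Each offset marks where the next string starts in the data buffer.
        offsets.append(len(data))
    # Offsets are stored as little-endian signed 32-bit integers.
    offsets_buf = struct.pack(f"<{len(offsets)}i", *offsets)
    return offsets_buf, bytes(data)


def get_string(offsets_buf, data_buf, i):
    """Read the i-th string by slicing the data buffer between two offsets."""
    start, end = struct.unpack_from("<2i", offsets_buf, 4 * i)
    return data_buf[start:end].decode("utf-8")


offsets_buf, data_buf = encode_strings(["clinical", "", "narrative"])
print(get_string(offsets_buf, data_buf, 2))
```

Element access is a constant-time slice between two adjacent offsets, which is why this layout works well for plain strings. It does not by itself describe richer pre-processed text structures of the kind discussed in this thread.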

Text data structures-optimized layout in Arrow

2019-03-02 Thread Edmon Begoli
Colleagues: A colleague and I are working on optimized structures for memory and disk layout for raw and pre-processed text using specialized data structures, and with a goal of efficient I/O, inter-process transmissions, and media/memory storage of text-oriented data (e.g. clinical narratives, ra