Hi Micah,

In short, we recognize that storing text in Arrow is possible and easy if the text is stored as an array of bytes representing characters.
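
To make that baseline concrete, here is a minimal sketch in plain Python, not using any Arrow library. It packs strings into one contiguous UTF-8 byte buffer plus a list of offsets, which is the general shape of a variable-length string column. The function names encode_strings and decode_strings are hypothetical, and the Python list and bytes objects only stand in for real Arrow buffers.

```python
from typing import List, Tuple


def encode_strings(values: List[str]) -> Tuple[List[int], bytes]:
    """Pack strings into an offsets list plus one contiguous UTF-8 buffer.

    Hypothetical helper: offsets[i] and offsets[i + 1] delimit the bytes
    of values[i] inside the returned data buffer.
    """
    offsets = [0]
    chunks = []
    for value in values:
        raw = value.encode("utf-8")
        chunks.append(raw)
        offsets.append(offsets[-1] + len(raw))
    return offsets, b"".join(chunks)


def decode_strings(offsets: List[int], data: bytes) -> List[str]:
    """Restore the original strings by slicing the buffer at each offset pair."""
    return [
        data[start:end].decode("utf-8")
        for start, end in zip(offsets, offsets[1:])
    ]


# Example input invented for illustration only.
notes = ["chest x-ray", "no acute findings"]
offsets, data = encode_strings(notes)
restored = decode_strings(offsets, data)
```

The round trip is lossless, but the buffer carries only raw characters; it says nothing about any index structure built over the text.
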

What we are trying to do is use Arrow as the format/carrier between high-performance text processing steps that operate on binary data structures (e.g. tries or DAFSAs).

We have a working draft of an approach in which Arrow is the carrier for these data structures, and encoders/decoders define how each structure is laid out in the Arrow layout. So it could be something like:

    text.to_arrow(infer=true|dafsa|trie|b-trie) : arrow
    // writes Arrow as the format for the specified encoding.
    // This could be implicit if we could store the encoding in some kind of manifest.

    arrow.to_text(infer=true|dafsa|trie|b-trie) : string
    // restores text from the Arrow format and from a specified encoding, same as above.

Let me know what you think.

Thank you,
Edmon

On Sat, Mar 2, 2019 at 10:50 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Hi Edmon,
> This sounds interesting, I'm not aware of any optimized text memory layout
> beyond our standard string layout. Are there more details about the work
> you are doing? It is a little bit hard to tell if this is a good fit for
> Arrow from your description.
>
> Thanks,
> Micah
>
> On Sat, Mar 2, 2019 at 7:39 PM Edmon Begoli <ebeg...@berkeley.edu> wrote:
>
> > Colleagues:
> >
> > A colleague and I are working on optimized structures for memory and disk
> > layout for raw and pre-processed text using specialized data structures,
> > and with a goal of efficient I/O, inter-process transmissions, and
> > media/memory storage of text-oriented data (e.g. clinical narratives,
> > radiology and pathology reports, etc.)
> >
> > Has anyone on the Arrow dev team tackled this problem of efficient text
> > storage yet?
> > (not just plain text, but storing data structures in an arrow format)
> >
> > If not, would you welcome a contribution?
> >
> > Thank you,
> > Edmon
> >