Hi folks,

I put some more thought into the "IO problem" as it relates to Arrow in C++ (and transitively, Python) and wrote a short Google document with my thoughts on it:
https://docs.google.com/document/d/16y-eyIgSVL8m5Q7Mmh-jIDRwlh-r0bYatYuDl4sbMIk/edit#

Feedback greatly appreciated! This will be on my critical path in the near future, so I would like to know whether I'm approaching the problem right and we are in alignment (then we can break things down into a bunch of JIRAs).

(I could also post this doc directly to the mailing list; I thought the initial discussion would be simpler in a GDoc.)

Thank you,
Wes

On Wed, Jun 8, 2016 at 4:11 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> On Fri, Jun 3, 2016 at 10:16 AM, Micah Kornfield <emkornfi...@gmail.com> wrote:
>> Hi Wes,
>>
>> At what level do you imagine the "opt-in" happening? Right now it
>> seems like it would be fairly straightforward at build time. However,
>> when we start packaging pyarrow for distribution, how do you imagine
>> it will work? (If [1] already answers this, please let me know; I've
>> been meaning to take a look at it.)
>>
>
> Where packaging and distribution are concerned, it'd be easiest to
> provide non-picky users with a kitchen-sink build, but otherwise
> developers could create precisely the build they want with CMake
> flags, I guess. If certain libraries aren't found, then we wouldn't
> fail the build by default, for example.
>
>> I need to grok the Python code base a little bit more to understand
>> the implications of the scope creep and the pain around taking a more
>> fine-grained component approach. But in general my experience has
>> been that packaging things together while maintaining clear internal
>> code boundaries for later separation is a good pragmatic approach.
>>
>
> I'd propose creating an `arrow_io` leaf shared library where we can
> create a small IO subsystem for reuse among different data
> connectors. We can leave things fairly coarse-grained for the time
> being and break them up later if it becomes onerous for other Arrow
> developer-users.
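As a rough sketch of the opt-in, don't-fail-if-missing behavior discussed above, CMake options could gate each connector. (This is only an illustration; the `ARROW_HDFS`/`ARROW_IO` option names and the `FindHDFS` module are hypothetical, not actual Arrow build flags.)

```cmake
# Hypothetical opt-in component flags -- names are illustrative only.
option(ARROW_IO   "Build the arrow_io subsystem"          ON)
option(ARROW_HDFS "Build HDFS support (requires libhdfs)" OFF)

if(ARROW_HDFS)
  # Assumes a FindHDFS.cmake module is available on CMAKE_MODULE_PATH.
  find_package(HDFS)
  if(NOT HDFS_FOUND)
    # Soft failure: warn and disable the component instead of
    # failing the whole build when the library isn't found.
    message(WARNING "libhdfs not found; building without HDFS support")
    set(ARROW_HDFS OFF)
  endif()
endif()
```

Packagers could then produce the kitchen-sink build by enabling everything, while developers flip individual flags off.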
>
>> As a side note, hopefully we'll be able to re-use some existing
>> projects to do the heavy lifting for blob store integration. SFrame
>> is one option [2] and [3] might be worth investigating as well (both
>> appear to be Apache 2.0 licensed).
>
> While requiring Java + $HADOOP_HOME for HDFS connectivity (a wrapper
> around libhdfs) doesn't excite me that much, the prospect of bugs (or
> secure-cluster issues) creeping in from a third-party HDFS client
> without the ability to escalate problems to the Apache Hadoop team
> worries me even more. There is a new official C++ HDFS client in the
> works after the libhdfs3 patch was not accepted
> (https://issues.apache.org/jira/browse/HDFS-8707), so this may be
> worth pursuing once it matures.
>
> Thoughts on this welcome.
>
> - Wes
>
>>
>> Thanks,
>> -Micah
>>
>> [1] https://github.com/apache/arrow/pull/79/files
>> [2] https://github.com/apache/incubator-hawq/tree/master/depends/libhdfs3
>> [3] https://github.com/aws/aws-sdk-cpp