Hi folks,

I put some more thought into the "IO problem" as it relates to Arrow in C++ (and transitively, Python) and wrote a short Google document with my thoughts on it:
https://docs.google.com/document/d/16y-eyIgSVL8m5Q7Mmh-jIDRwlh-r0bYatYuDl4sbMIk/edit#

Feedback greatly appreciated! This will be on my critical path in the near future, so I would like to know whether I'm approaching the problem right and we are in alignment (then we can break things down into a bunch of JIRAs).

(I could also post this doc directly to the mailing list; I thought the initial discussion would be simpler in a GDoc.)

Thank you,
Wes

On Wed, Jun 8, 2016 at 4:11 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> On Fri, Jun 3, 2016 at 10:16 AM, Micah Kornfield <emkornfi...@gmail.com> wrote:
>> Hi Wes,
>>
>> At what level do you imagine the "opt-in" happening? Right now it
>> seems like it would be fairly straightforward at build time. However,
>> when we start packaging pyarrow for distribution, how do you imagine
>> it will work? (If [1] already answers this, please let me know; I've
>> been meaning to take a look at it.)
>>
>
> Where packaging and distribution are concerned, it'd be easiest to
> provide non-picky users with a kitchen-sink build, but otherwise
> developers could create precisely the build they want with CMake
> flags, I guess. If certain libraries aren't found, then we wouldn't
> fail the build by default, for example.
>
>> I need to grok the Python code base a little bit more to understand
>> the implications of the scope creep and the pain around taking a more
>> fine-grained component approach. But in general my experience has
>> been that packaging things together while maintaining clear internal
>> code boundaries for later separation is a good pragmatic approach.
>>
>
> I'd propose creating an `arrow_io` leaf shared library where we can
> create a small IO subsystem for reuse among different data
> connectors. We can leave things fairly coarse-grained for the time
> being and break them up later if it becomes onerous for other Arrow
> developer-users.
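As a rough sketch of the opt-in, don't-fail-if-missing behavior discussed above, CMake options could gate each connector. (This is only an illustration; the `ARROW_HDFS`/`ARROW_IO` option names and the `FindHDFS` module are hypothetical, not actual Arrow build flags.)

```cmake
# Hypothetical opt-in component flags -- names are illustrative only.
option(ARROW_IO   "Build the arrow_io subsystem"          ON)
option(ARROW_HDFS "Build HDFS support (requires libhdfs)" OFF)

if(ARROW_HDFS)
  # Assumes a FindHDFS.cmake module is available on CMAKE_MODULE_PATH.
  find_package(HDFS)
  if(NOT HDFS_FOUND)
    # Soft failure: warn and disable the component instead of
    # failing the whole build when the library isn't found.
    message(WARNING "libhdfs not found; building without HDFS support")
    set(ARROW_HDFS OFF)
  endif()
endif()
```

Packagers could then produce the kitchen-sink build by enabling everything, while developers flip individual flags off.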
>
>> As a side note, hopefully we'll be able to re-use some existing
>> projects to do the heavy lifting for blob store integration. SFrame
>> is one option [2] and [3] might be worth investigating as well (both
>> appear to be Apache 2.0 licensed).
>
> While requiring Java + $HADOOP_HOME for HDFS connectivity (a wrapper
> around libhdfs) doesn't excite me that much, the prospect of bugs (or
> secure-cluster issues) creeping in from a third-party HDFS client
> without the ability to escalate problems to the Apache Hadoop team
> worries me even more. There is a new official C++ HDFS client in the
> works after the libhdfs3 patch was not accepted
> (https://issues.apache.org/jira/browse/HDFS-8707), so this may be
> worth pursuing once it matures.
>
> Thoughts on this welcome.
>
> - Wes
>
>>
>> Thanks,
>> -Micah
>>
>> [1] https://github.com/apache/arrow/pull/79/files
>> [2] https://github.com/apache/incubator-hawq/tree/master/depends/libhdfs3
>> [3] https://github.com/aws/aws-sdk-cpp