Let me throw out a thought: what about supporting access to different
systems (including Alluxio) through a common POSIX interface such as
FUSE?

Would there be a significant performance impact or a loss of control
over the layout?

On Fri, Jun 3, 2016 at 9:26 AM, Uwe Korn <uw...@xhochy.com> wrote:

> Hello,
>
> I would also embrace the scope creep. As we will deal with a lot of
> data, cross-language I/O will matter significantly for end-to-end
> performance. We definitely have to be careful to make the dependencies
> toggleable in the build system: it should be easy to get a build with
> all dependencies, but also possible to be very selective about which
> ones are included.
>
> For HDFS and S3 support, I'm not sure whether arrow-cpp, pyarrow, or
> parquet-cpp is the right place for the C++ implementation. For
> arrow-cpp it would be the same scope creep as for PyArrow, but the
> code could already be used by C++ Arrow users, whereas in parquet-cpp
> these I/O classes would also be helpful to non-Arrow users. For the
> moment I would put the C++ implementations into arrow-cpp, as this
> keeps the scope creep in Arrow itself but already provides value to
> C++ users and to other languages building on that layer.
>
> Cheers,
>
> Uwe
>
>
> On 01.06.16 02:44, Wes McKinney wrote:
>
>> hi folks,
>>
>> I wanted to bring up what is likely to become an issue very soon in
>> the context of our work to provide an Arrow-based Parquet interface
>> for Python Arrow users.
>>
>> https://github.com/apache/arrow/pull/83
>>
>> At the moment, parquet-cpp features an API that enables reading a file
>> from local disk (using C standard library calls):
>>
>>
>> https://github.com/apache/parquet-cpp/blob/master/src/parquet/file/reader.h#L111
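>>
>> For illustration, usage is roughly along these lines (a sketch only;
>> the exact signatures in reader.h may differ):
>>
>>     // Hypothetical sketch of the current local-file-only entry point;
>>     // exact parquet-cpp signatures may differ.
>>     #include "parquet/file/reader.h"
>>
>>     auto reader =
>>         parquet::ParquetFileReader::OpenFile("/path/to/data.parquet");
>>     // Bytes come from local disk via C standard library calls; there
>>     // is no hook for plugging in other byte sources.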
>>
>> This is fine for now, however we will quickly need to deal with a few
>> additional sources of data:
>>
>> 1) File-like Python objects (i.e. an object that has `seek`, `tell`,
>> and `read` methods)
>> 2) Remote blob stores: HDFS and S3
>>
>> Implementing #1 at present is a routine exercise in using the Python C
>> API. #2 is less so -- one approach others have taken is to create
>> separate Python wrapper classes around remote storage so that it can
>> be treated like a local file. This has multiple downsides:
>>
>> - read/seek/tell calls must cross up into the Python interpreter and
>> back down into the C++ layer
>> - bytes buffered by read calls get copied into Python bytes objects
>> (see PyBytes_FromStringAndSize)
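>>
>> Roughly, each read through such a wrapper looks like the sketch below
>> (the function is hypothetical; the Python C API calls are real, and
>> the extra copy is the memcpy at the end):
>>
>>     // Adapting a Python file-like object from C++ via the Python C
>>     // API. Every call crosses into the interpreter (holding the GIL)
>>     // and materializes a Python bytes object before copying it out.
>>     #include <Python.h>
>>     #include <cstdint>
>>     #include <cstring>
>>
>>     int64_t ReadFromPyFile(PyObject* file, uint8_t* out,
>>                            int64_t nbytes) {
>>       PyObject* result = PyObject_CallMethod(
>>           file, "read", "n", static_cast<Py_ssize_t>(nbytes));
>>       if (result == nullptr) return -1;
>>       char* data;
>>       Py_ssize_t length;
>>       if (PyBytes_AsStringAndSize(result, &data, &length) == -1) {
>>         Py_DECREF(result);
>>         return -1;
>>       }
>>       std::memcpy(out, data, length);  // the extra copy
>>       Py_DECREF(result);
>>       return length;
>>     }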
>>
>> Beyond the GIL / concurrency issues, there is an efficiency loss that
>> can be remedied by instead implementing:
>>
>> - Direct C/C++-level interface (independent of Python interpreter)
>> with remote blob stores. These can then buffer bytes directly in the
>> form requested by other C++ consumer libraries (like parquet-cpp)
>>
>> - Implement a Python file-like interface, so that users can still get
>> at the bytes in pure Python if they want (for example: some functions,
>> like pandas.read_csv, primarily deal with file-like things)
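>>
>> As a sketch of what I have in mind (names are placeholders, not a
>> committed API):
>>
>>     // Hypothetical abstraction; HDFS, S3, local-file, and Python-file
>>     // implementations would all buffer bytes directly in the form C++
>>     // consumers like parquet-cpp want, with no Python objects in the
>>     // hot path.
>>     #include <cstdint>
>>
>>     class RandomAccessSource {
>>      public:
>>       virtual ~RandomAccessSource() = default;
>>       virtual int64_t Size() = 0;
>>       virtual int64_t Tell() = 0;
>>       virtual void Seek(int64_t position) = 0;
>>       // Read up to nbytes into out, returning the number actually read
>>       virtual int64_t Read(int64_t nbytes, uint8_t* out) = 0;
>>     };
>>
>>     // e.g. class HdfsSource : public RandomAccessSource { ... };
>>     //      class S3Source   : public RandomAccessSource { ... };
>>     // A thin Python file-like wrapper can then be layered on top for
>>     // pure-Python consumers such as pandas.read_csv.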
>>
>> This is a clearly superior solution, and has been notably pursued in
>> recent times by Dato's SFrame library (BSD 3-clause):
>>
>> https://github.com/dato-code/SFrame/tree/master/oss_src/fileio
>>
>> The problem, however, is the inevitable scope creep for the PyArrow
>> Python package. Unlike in some other programming languages, Python
>> programmers face a substantial development-complexity burden if they
>> break libraries containing C extensions into smaller components,
>> because the libraries must define "internal" C APIs for one another
>> to connect to. A notable example is NumPy
>> (http://docs.scipy.org/doc/numpy-1.10.1/reference/c-api.html), whose C
>> API is already being used in PyArrow.
>>
>> I've been thinking about this problem for several weeks, and my net
>> recommendation is that we embrace the scope creep in PyArrow (as long
>> as we try to make optional features, e.g. low-level S3 / libhdfs
>> integration, "opt-in" versus required for all users). I'd like to hear
>> from some others, though (e.g. Uwe, Micah, etc.).
>>
>> thanks,
>> Wes
>>
>
>
