Hi Wes,

At what level do you imagine the "opt-in" happening? Right now it seems like it would be fairly straightforward at build time. However, when we start packaging pyarrow for distribution, how do you imagine it will work? (If [1] already answers this, please let me know; I've been meaning to take a look at it.)

I need to grok the Python code base a little bit more to understand the implications of the scope creep and the pain around taking a more fine-grained component approach. But in general, my experience has been that packaging things together while maintaining clear internal code boundaries for later separation is a good pragmatic approach.

As a side note, hopefully we'll be able to re-use some existing projects to do the heavy lifting for blob store integration. SFrame is one option, and [2] and [3] might be worth investigating as well (both appear to be Apache 2.0 licensed).

Thanks,
-Micah

[1] https://github.com/apache/arrow/pull/79/files
[2] https://github.com/apache/incubator-hawq/tree/master/depends/libhdfs3
[3] https://github.com/aws/aws-sdk-cpp

On Tue, May 31, 2016 at 5:44 PM, Wes McKinney <[email protected]> wrote:
> hi folks,
>
> I wanted to bring up what is likely to become an issue very soon in
> the context of our work to provide an Arrow-based Parquet interface
> for Python Arrow users.
>
> https://github.com/apache/arrow/pull/83
>
> At the moment, parquet-cpp features an API that enables reading a file
> from local disk (using C standard library calls):
>
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/file/reader.h#L111
>
> This is fine for now, however we will quickly need to deal with a few
> additional sources of data:
>
> 1) File-like Python objects (i.e. an object that has `seek`, `tell`,
> and `read` methods)
> 2) Remote blob stores: HDFS and S3
>
> Implementing #1 at present is a routine exercise in using the Python C
> API. #2 is less so -- one of the approaches that has been taken by
> others is to create separate Python file-like wrapper classes for
> remote storage to make them seem file like.
> This has multiple downsides:
>
> - read/seek/tell calls must cross up into the Python interpreter and
> back down into the C++ layer
> - bytes buffered by read calls get copied into Python bytes objects
> (see PyBytes_FromStringAndSize)
>
> Outside of the GIL / concurrency issues, there's efficiency loss that
> can be remedied by implementing instead:
>
> - Direct C/C++-level interface (independent of Python interpreter)
> with remote blob stores. These can then buffer bytes directly in the
> form requested by other C++ consumer libraries (like parquet-cpp)
>
> - Implement a Python file-like interface, so that users can still get
> at the bytes in pure Python if they want (for example: some functions,
> like pandas.read_csv, primarily deal with file-like things)
>
> This is a clearly superior solution, and has been notably pursued in
> recent times by Dato's SFrame library (BSD 3-clause):
>
> https://github.com/dato-code/SFrame/tree/master/oss_src/fileio
>
> The problem, however, is the inevitable scope creep for the PyArrow
> Python package. Unlike some other programming languages, Python
> programmers face a substantial development complexity burden if they
> choose to break libraries containing C extensions into smaller
> components, as libraries must define "internal" C APIs for each other
> to connect together. Notable example is NumPy
> (http://docs.scipy.org/doc/numpy-1.10.1/reference/c-api.html), whose C
> API is already being used in PyArrow.
>
> I've been thinking about this problem for several weeks, and my net
> recommendation is that we embrace the scope creep in PyArrow (as long
> as we try to make optional features, e.g. low-level S3 / libhdfs
> integration, "opt-in" versus required for all users). I'd like to hear
> from some others, though (e.g. Uwe, Micah, etc.).
>
> thanks,
> Wes
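
For reference, the pure-Python wrapper approach described in the quoted message (an object with `seek`, `tell`, and `read` that fronts a remote store) might look roughly like the sketch below. The names `InMemoryBlobStore`, `BlobFile`, and `read_range` are hypothetical and used only for illustration; they are not part of Arrow, parquet-cpp, or SFrame, and the in-memory dictionary merely stands in for HDFS or S3. Note that every `read` hands back a new Python bytes object, which is the copy the quoted message is concerned about.

```python
import io


class InMemoryBlobStore:
    """Hypothetical stand-in for a remote store that supports ranged reads."""

    def __init__(self, blobs):
        self._blobs = dict(blobs)

    def size(self, key):
        return len(self._blobs[key])

    def read_range(self, key, offset, length):
        # A real client would issue a network request here.
        return self._blobs[key][offset:offset + length]


class BlobFile:
    """Minimal file-like object: read/seek/tell over a blob store."""

    def __init__(self, store, key):
        self._store = store
        self._key = key
        self._size = store.size(key)
        self._pos = 0

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            pos = offset
        elif whence == io.SEEK_CUR:
            pos = self._pos + offset
        elif whence == io.SEEK_END:
            pos = self._size + offset
        else:
            raise ValueError("invalid whence: %r" % (whence,))
        if pos < 0:
            raise ValueError("negative seek position")
        self._pos = pos
        return pos

    def read(self, size=-1):
        # A negative or missing size means "read to the end of the blob".
        if size is None or size < 0:
            size = max(self._size - self._pos, 0)
        # The result is a fresh Python bytes object on every call.
        data = self._store.read_range(self._key, self._pos, size)
        self._pos += len(data)
        return data


# Usage: read the trailing 4 bytes, as a Parquet reader does for the footer magic.
store = InMemoryBlobStore({"data.parquet": b"PAR1" + b"column data" + b"PAR1"})
f = BlobFile(store, "data.parquet")
f.seek(-4, io.SEEK_END)
magic = f.read()  # b"PAR1"
```

A consumer written in C++ would have to call back into the interpreter for each of these methods, which is the overhead a direct C/C++-level interface would avoid.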
