Hi folks,

I wanted to bring up what is likely to become an issue very soon in the context of our work to provide an Arrow-based Parquet interface for Python Arrow users:
https://github.com/apache/arrow/pull/83

At the moment, parquet-cpp features an API that enables reading a file from local disk (using C standard library calls):

https://github.com/apache/parquet-cpp/blob/master/src/parquet/file/reader.h#L111

This is fine for now, but we will quickly need to deal with a few additional sources of data:

1) File-like Python objects (i.e. objects that have `seek`, `tell`, and `read` methods)

2) Remote blob stores: HDFS and S3

Implementing #1 is, at present, a routine exercise in using the Python C API. #2 is less so -- one approach that others have taken is to create separate Python file-like wrapper classes for remote storage, to make it seem file-like. This has multiple downsides:

- read/seek/tell calls must cross up into the Python interpreter and back down into the C++ layer

- bytes buffered by read calls get copied into Python bytes objects (see PyBytes_FromStringAndSize)

(The first sketch at the end of this mail shows these costs concretely.) Setting the GIL / concurrency issues aside, this efficiency loss can be remedied by instead implementing:

- a direct C/C++-level interface (independent of the Python interpreter) to remote blob stores, which can buffer bytes directly in the form requested by other C++ consumer libraries (like parquet-cpp)

- a Python file-like interface on top of that, so that users can still get at the bytes in pure Python if they want (for example, some functions, like pandas.read_csv, primarily deal with file-like things)

This is a clearly superior solution (the second sketch below illustrates it), and it has notably been pursued recently by Dato's SFrame library (BSD 3-clause):

https://github.com/dato-code/SFrame/tree/master/oss_src/fileio

The problem, however, is the inevitable scope creep for the PyArrow Python package. Unlike in some other programming languages, Python programmers face a substantial development-complexity burden if they choose to break libraries containing C extensions into smaller components, because those libraries must define "internal" C APIs for each other to connect to. A notable example is NumPy (http://docs.scipy.org/doc/numpy-1.10.1/reference/c-api.html), whose C API is already being used in PyArrow.

I've been thinking about this problem for several weeks, and my net recommendation is that we embrace the scope creep in PyArrow, so long as we make optional features (e.g. low-level S3 / libhdfs integration) "opt-in" rather than required for all users.

I'd like to hear from some others, though (e.g. Uwe, Micah, etc.).
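To make the downsides above concrete, here is a rough sketch of what adapting a Python file-like object behind a C++ reader interface would look like. The RandomAccessSource / PyReadableFile names are hypothetical (this is not the actual parquet-cpp API), though the Python C API calls are real. Note how every method takes the GIL and round-trips through the interpreter, and how Read must copy out of a temporary bytes object:

#include <Python.h>

#include <cstdint>
#include <cstring>

// Hypothetical interface, for illustration only -- not the actual
// parquet-cpp reader API
class RandomAccessSource {
 public:
  virtual ~RandomAccessSource() {}
  virtual void Seek(int64_t pos) = 0;
  virtual int64_t Tell() = 0;
  // Read up to nbytes into out; return the number of bytes read, or -1
  virtual int64_t Read(int64_t nbytes, uint8_t* out) = 0;
};

// Adapts any Python object having read/seek/tell to the interface above
class PyReadableFile : public RandomAccessSource {
 public:
  explicit PyReadableFile(PyObject* file) : file_(file) { Py_INCREF(file_); }

  ~PyReadableFile() {
    PyGILState_STATE state = PyGILState_Ensure();
    Py_DECREF(file_);
    PyGILState_Release(state);
  }

  void Seek(int64_t pos) {
    PyGILState_STATE state = PyGILState_Ensure();
    PyObject* result = PyObject_CallMethod(file_, "seek", "L",
                                           static_cast<long long>(pos));
    Py_XDECREF(result);
    PyGILState_Release(state);
  }

  int64_t Tell() {
    PyGILState_STATE state = PyGILState_Ensure();
    PyObject* result = PyObject_CallMethod(file_, "tell", NULL);
    int64_t pos = (result == NULL) ? -1 : PyLong_AsLongLong(result);
    Py_XDECREF(result);
    PyGILState_Release(state);
    return pos;
  }

  int64_t Read(int64_t nbytes, uint8_t* out) {
    PyGILState_STATE state = PyGILState_Ensure();
    int64_t bytes_read = -1;
    // file.read() allocates a fresh bytes object in the interpreter...
    PyObject* bytes = PyObject_CallMethod(file_, "read", "L",
                                          static_cast<long long>(nbytes));
    if (bytes != NULL) {
      char* data;
      Py_ssize_t length;
      if (PyBytes_AsStringAndSize(bytes, &data, &length) == 0) {
        // ...whose contents we must copy out before releasing it
        std::memcpy(out, data, static_cast<size_t>(length));
        bytes_read = length;
      }
      Py_DECREF(bytes);
    }
    PyGILState_Release(state);
    return bytes_read;
  }

 private:
  PyObject* file_;
};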
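By contrast, a direct libhdfs-backed implementation of the same hypothetical interface (reusing RandomAccessSource from the first sketch) keeps the whole read path in C++. The hdfsOpenFile / hdfsSeek / hdfsTell / hdfsRead calls below are the real libhdfs C API; the class itself is illustrative:

#include <fcntl.h>  // O_RDONLY
#include <hdfs.h>   // libhdfs C API (ships with Hadoop)

#include <cstdint>

class HdfsReadableFile : public RandomAccessSource {
 public:
  // fs would come from e.g. hdfsConnect("default", 0)
  HdfsReadableFile(hdfsFS fs, const char* path) : fs_(fs) {
    file_ = hdfsOpenFile(fs_, path, O_RDONLY, /*bufferSize=*/0,
                         /*replication=*/0, /*blocksize=*/0);
  }

  ~HdfsReadableFile() { hdfsCloseFile(fs_, file_); }

  void Seek(int64_t pos) { hdfsSeek(fs_, file_, pos); }

  int64_t Tell() { return hdfsTell(fs_, file_); }

  // Bytes land directly in the caller-provided buffer: no interpreter
  // round-trip, no temporary bytes object, no GIL
  int64_t Read(int64_t nbytes, uint8_t* out) {
    return hdfsRead(fs_, file_, out, static_cast<tSize>(nbytes));
  }

 private:
  hdfsFS fs_;
  hdfsFile file_;
};

A thin Python file-like wrapper exposing read/seek/tell on top of such an object (the second bullet above) would then let pure-Python consumers like pandas.read_csv get at the same bytes, without paying the per-call interpreter overhead on the hot path.

thanks,
Wes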