Let me throw out a thought: what about supporting access to different systems (including Alluxio) through a common POSIX interface such as FUSE?
Will there be a significant performance impact or a loss of control over the layout?

On Fri, Jun 3, 2016 at 9:26 AM, Uwe Korn <uw...@xhochy.com> wrote:
> Hello,
>
> I would also embrace the scope creep. As we will deal with a lot of data,
> the cross-language I/O impact will significantly matter for performance in
> the end. We definitely have to be careful to make the dependencies
> toggleable in the build system. You should be able to easily get a build
> with all dependencies, but it should also be possible to be very selective
> about which ones are included in a build.
>
> For HDFS and S3 support, I'm not sure whether arrow-cpp, pyarrow, or
> parquet-cpp is the right place for their C++ implementation. For arrow-cpp
> it would be the same scope creep as for PyArrow, and it could already be
> used by C++ Arrow users; but in parquet-cpp these I/O classes would also be
> helpful for non-Arrow users. For the moment I would put the C++
> implementations into arrow-cpp, as this keeps the scope creep in Arrow
> itself but already provides value to C++ users and to other languages
> building on that layer.
>
> Cheers,
>
> Uwe
>
>
> On 01.06.16 02:44, Wes McKinney wrote:
>
>> hi folks,
>>
>> I wanted to bring up what is likely to become an issue very soon in
>> the context of our work to provide an Arrow-based Parquet interface
>> for Python Arrow users.
>>
>> https://github.com/apache/arrow/pull/83
>>
>> At the moment, parquet-cpp features an API that enables reading a file
>> from local disk (using C standard library calls):
>>
>> https://github.com/apache/parquet-cpp/blob/master/src/parquet/file/reader.h#L111
>>
>> This is fine for now; however, we will quickly need to deal with a few
>> additional sources of data:
>>
>> 1) File-like Python objects (i.e. an object that has `seek`, `tell`,
>> and `read` methods)
>> 2) Remote blob stores: HDFS and S3
>>
>> Implementing #1 at present is a routine exercise in using the Python C
>> API.
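For context, the file-like protocol in #1 amounts to the three methods below. This is a minimal, illustrative sketch (the class name and the in-memory stand-in for a remote store are hypothetical, not from any Arrow API), showing the interface a C++ reader would have to call back into:

```python
import io


class RemoteFileWrapper:
    """Illustrative Python file-like wrapper over a remote blob store.

    Only read/seek/tell are needed by a consumer like parquet-cpp.
    The remote fetch is faked here with an in-memory buffer; a real
    wrapper would issue HDFS/S3 range requests instead.
    """

    def __init__(self, data: bytes):
        self._buf = io.BytesIO(data)

    def read(self, nbytes=-1):
        # Each call crosses into the Python interpreter and copies the
        # result into a new Python bytes object.
        return self._buf.read(nbytes)

    def seek(self, offset, whence=0):
        return self._buf.seek(offset, whence)

    def tell(self):
        return self._buf.tell()


f = RemoteFileWrapper(b"PAR1....footer....PAR1")
f.seek(4)
print(f.read(4))  # -> b'....'
```

Every `read` here allocates and copies a fresh `bytes` object, which is exactly the per-call overhead discussed below for remote-store wrappers.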
>> #2 is less so -- one approach that has been taken by others is to
>> create separate Python file-like wrapper classes for remote storage
>> to make it seem file-like. This has multiple downsides:
>>
>> - read/seek/tell calls must cross up into the Python interpreter and
>> back down into the C++ layer
>> - bytes buffered by read calls get copied into Python bytes objects
>> (see PyBytes_FromStringAndSize)
>>
>> Outside of the GIL / concurrency issues, there is an efficiency loss
>> that can be remedied by instead implementing:
>>
>> - a direct C/C++-level interface (independent of the Python
>> interpreter) to remote blob stores. These can then buffer bytes
>> directly in the form requested by other C++ consumer libraries
>> (like parquet-cpp)
>>
>> - a Python file-like interface on top, so that users can still get
>> at the bytes in pure Python if they want (for example, some
>> functions, like pandas.read_csv, primarily deal with file-like
>> things)
>>
>> This is a clearly superior solution, and has notably been pursued in
>> recent times by Dato's SFrame library (BSD 3-clause):
>>
>> https://github.com/dato-code/SFrame/tree/master/oss_src/fileio
>>
>> The problem, however, is the inevitable scope creep for the PyArrow
>> Python package. Unlike programmers in some other languages, Python
>> programmers face a substantial development complexity burden if they
>> choose to break libraries containing C extensions into smaller
>> components, as the libraries must define "internal" C APIs for each
>> other to connect together. A notable example is NumPy
>> (http://docs.scipy.org/doc/numpy-1.10.1/reference/c-api.html), whose
>> C API is already being used in PyArrow.
>>
>> I've been thinking about this problem for several weeks, and my net
>> recommendation is that we embrace the scope creep in PyArrow (as long
>> as we try to make optional features, e.g. low-level S3 / libhdfs
>> integration, "opt-in" rather than required for all users).
>> I'd like to hear from some others, though (e.g. Uwe, Micah, etc.).
>>
>> thanks,
>> Wes