hi folks,

We've spent a good amount of energy so far implementing interfaces
for reading different kinds of file formats in C++, such as Parquet,
ORC, CSV, and JSON. There are some higher-level layers still missing,
though, which are necessary if we want to use these file formats in
the context of an in-memory query engine. These include:

* Scanning multiple files as a single logical dataset
* Schema normalization and evolution
* Handling partitioned datasets, and datasets consisting of
heterogeneous storage (a mix of file formats)
* Predicate pushdown: taking row filtering and column selection into
account while reading a file (a rough sketch of how this could look
follows this list)
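
To make the first and last points more concrete, here is a very rough
sketch of the kind of C++ interface I have in mind. All of the names
below (ScanOptions, Dataset) are invented for the sake of discussion;
nothing like this exists in the codebase yet:

    // Hypothetical API sketch only -- names are illustrative.
    #include <memory>
    #include <string>
    #include <vector>

    #include <arrow/record_batch.h>
    #include <arrow/status.h>
    #include <arrow/type.h>

    // What a single scan request could carry: the columns the consumer
    // needs, and a filter expression that readers may use to skip rows,
    // row groups, or whole files.
    struct ScanOptions {
      std::vector<std::string> columns;  // column selection (projection)
      std::string filter;                // e.g. "year = 2019 AND country = 'US'"
    };

    // One logical dataset spanning many files/partitions, possibly in a
    // mix of formats. Scan() yields batches that already reflect the
    // projection and whatever filtering the underlying readers could do.
    class Dataset {
     public:
      virtual ~Dataset() = default;
      virtual std::shared_ptr<arrow::Schema> schema() const = 0;
      virtual arrow::Status Scan(
          const ScanOptions& options,
          std::shared_ptr<arrow::RecordBatchReader>* out) = 0;
    };

In this picture the filter is only a hint: a reader that cannot
evaluate it would return unfiltered batches, and the layer above would
apply the predicate itself.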

We have already implemented some of this in limited form for Python
users in the pyarrow.parquet module. This is problematic because a)
it is implemented in Python and so cannot be used from Ruby or R, for
example, and b) it is specific to a single file format.

Since this is a large topic, I tried to write up a summary of what I
believe to be the important problems that need to be solved:

https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit?usp=sharing

This project will also allow for "user-defined" data sources, so that
others in the Arrow ecosystem can contribute new data interfaces for
interacting with different kinds of storage systems through a common
API. Anyone who wants to "plug in" to the computation layers available
in Apache Arrow would then have a reasonably straightforward path to
do so.
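
As a strawman, the extension point for a user-defined source could be
as small as "enumerate your fragments (files, partitions, chunks of an
external table) and know how to open each one as a stream of record
batches". Again, these names are hypothetical and only meant to show
the rough shape of the API:

    // Hypothetical interfaces only -- not existing Arrow APIs.
    #include <memory>
    #include <vector>

    #include <arrow/record_batch.h>
    #include <arrow/status.h>
    #include <arrow/type.h>

    // One scannable unit of a source: a file, a Hive-style partition,
    // a region of a table in an external database, etc.
    class DataFragment {
     public:
      virtual ~DataFragment() = default;
      virtual arrow::Status Open(
          std::shared_ptr<arrow::RecordBatchReader>* out) = 0;
    };

    // A storage system integration implements this once and then gets
    // the dataset/scanning machinery (and any future computation
    // layers) for free.
    class DataSource {
     public:
      virtual ~DataSource() = default;
      virtual std::shared_ptr<arrow::Schema> schema() const = 0;
      virtual arrow::Status GetFragments(
          std::vector<std::shared_ptr<DataFragment>>* out) = 0;
    };

A Parquet directory, an ORC directory, or a table in an external
database would then just be different DataSource implementations
sitting behind the same scan interface.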

Your comments and ideas on this project would be appreciated.

Thank you,
Wes
