hi folks,

We've spent a good amount of energy up until now implementing interfaces for reading different kinds of file formats in C++, such as Parquet, ORC, CSV, and JSON. There are some higher-level layers still missing, though, which are necessary if we want to make use of these file formats in the context of an in-memory query engine. This includes:
* Scanning multiple files as a single logical dataset
* Schema normalization and evolution
* Handling partitioned datasets, and datasets consisting of heterogeneous storage (a mix of file formats)
* Predicate pushdown: taking row filtering and column selection into account while reading a file

We have already implemented some parts of this in limited form for Python users in the pyarrow.parquet module. This is problematic since a) it is implemented in Python and cannot be used by Ruby or R, for example, and b) it is specific to a single file format.

Since this is a large topic, I tried to write up a summary of what I believe to be the important problems that need to be solved:

https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit?usp=sharing

This project will also allow for "user-defined" data sources, so that other people in the Arrow ecosystem can contribute new data interfaces for interacting with different kinds of storage systems through a common API. That way, anyone who wants to "plug in" to the computation layers available in Apache Arrow has a reasonably straightforward path to do so.

Your comments and ideas on this project would be appreciated.

Thank you,
Wes
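P.S. For anyone less familiar with the term, here is a rough pure-Python sketch of what "predicate pushdown" means. This is only an illustration, not the proposed Arrow API; the `scan` function and the toy records are made up for the example. The idea is that the reader applies the row filter and column projection while scanning, rather than materializing all rows and columns first and filtering afterwards:

```python
# Illustrative sketch only -- not the proposed Arrow API.
# A toy "reader" that pushes the predicate and column selection
# down into the scan itself.

def scan(rows, columns, predicate):
    """Yield only matching rows, restricted to the selected columns."""
    for row in rows:
        if predicate(row):                      # row filter applied during the read
            yield {c: row[c] for c in columns}  # only selected columns survive

# A toy "file" of records
rows = [
    {"year": 2017, "region": "EU", "sales": 10},
    {"year": 2018, "region": "US", "sales": 25},
    {"year": 2018, "region": "EU", "sales": 7},
]

result = list(scan(rows,
                   columns=["region", "sales"],
                   predicate=lambda r: r["year"] == 2018))
# result == [{"region": "US", "sales": 25}, {"region": "EU", "sales": 7}]
```

In a real dataset implementation, the same predicate could also be used to skip entire files or row groups based on partition values and file statistics, so that much of the data is never read at all.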