We’re working on finalizing a few types and writing the integration tests that go with them.
At this point we have a solid foundation in the Arrow project. As a next step I'm going to look into adding an Arrow RPC/REST interface dedicated to data retrieval. We had several discussions about this, and I'm going to formalize a spec and ask for review.

This Arrow-based data access interface is intended to be used by systems that need access to data for processing (SQL engines, processing frameworks, …) and implemented by storage layers or really anything that can produce data (including processing frameworks returning result sets, for example). That will greatly simplify integration between the many actors in each category. The basic premise is to be able to fetch data in Arrow format, benefiting from no-overhead serialization/deserialization and getting the data in columnar format.

Some obvious topics that come to mind:
- How do we identify a dataset?
- How do we specify projections?
- What about predicate pushdowns, or parameters in general?
- What underlying protocol should we use? HTTP/2?
- Push vs. pull?
- Building a reference implementation (suggestions?)

Potential candidates for using this:
- to consume data or to expose result sets: Drill, Hive, Presto, Impala, Spark, RecordService, …
- as a server: Kudu, HBase, Cassandra, …

-- Julien
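To make the first three open questions concrete, here is a rough, purely hypothetical sketch of what a fetch request might carry. Nothing here is part of any spec: the endpoint path, parameter names, and predicate syntax are all made up for illustration, just to show that dataset identity, projection, and predicate pushdown can ride in the request itself.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical request shape for the proposed interface. None of these
# names ("/arrow/fetch", "dataset", "columns", "filter") are decided;
# they only make the open questions concrete.
def build_fetch_url(base, dataset, columns, predicate=None):
    """Encode dataset identity, projection, and an optional predicate."""
    params = {"dataset": dataset, "columns": ",".join(columns)}
    if predicate:
        # Predicate is passed through for the producer to push down.
        params["filter"] = predicate
    return f"{base}/arrow/fetch?{urlencode(params)}"

def parse_fetch_url(url):
    """What a server would recover from such a request."""
    qs = parse_qs(urlparse(url).query)
    return {
        "dataset": qs["dataset"][0],
        "columns": qs["columns"][0].split(","),
        "filter": qs.get("filter", [None])[0],
    }

url = build_fetch_url("http://example.org", "warehouse.events",
                      ["user_id", "ts"], predicate="ts >= '2016-01-01'")
req = parse_fetch_url(url)
```

The response body would then be the Arrow columnar data itself, which is where the no-overhead serialization/deserialization benefit comes in; how batches are framed on the wire is exactly the protocol question (HTTP/2, push vs. pull) left open above.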