We’re working on finalizing a few types and writing the integration tests that go with them.
At this point we have a solid foundation in the Arrow project. As a next step I'm going to look into adding an Arrow RPC/REST interface dedicated to data retrieval. We had several discussions about this, and I'm going to formalize a spec and ask for review.

This Arrow-based data access interface is intended to be used by systems that need access to data for processing (SQL engines, processing frameworks, …) and implemented by storage layers or really anything that can produce data (including processing frameworks returning result sets, for example). That will greatly simplify integration between the many actors in each category. The basic premise is to be able to fetch data in Arrow format, benefiting from no-overhead serialization/deserialization and getting the data in columnar format.

Some obvious topics that come to mind:
- How do we identify a dataset?
- How do we specify projections?
- What about predicate pushdowns, or parameters in general?
- What underlying protocol should we use? HTTP/2?
- Push vs. pull?
- Building a reference implementation (suggestions?)

Potential candidates for using this:
- to consume data or to expose result sets: Drill, Hive, Presto, Impala, Spark, RecordService, …
- as a server: Kudu, HBase, Cassandra, …

-- Julien
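To make the first three open questions concrete, here is a rough, purely hypothetical sketch of what a fetch request might carry. Nothing here is part of any spec: the endpoint path, parameter names, and predicate syntax are all made up for illustration, just to show that dataset identity, projection, and predicate pushdown can ride in the request itself.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical request shape for the proposed interface. None of these
# names ("/arrow/fetch", "dataset", "columns", "filter") are decided;
# they only make the open questions concrete.
def build_fetch_url(base, dataset, columns, predicate=None):
    """Encode dataset identity, projection, and an optional predicate."""
    params = {"dataset": dataset, "columns": ",".join(columns)}
    if predicate:
        # Predicate is passed through for the producer to push down.
        params["filter"] = predicate
    return f"{base}/arrow/fetch?{urlencode(params)}"

def parse_fetch_url(url):
    """What a server would recover from such a request."""
    qs = parse_qs(urlparse(url).query)
    return {
        "dataset": qs["dataset"][0],
        "columns": qs["columns"][0].split(","),
        "filter": qs.get("filter", [None])[0],
    }

url = build_fetch_url("http://example.org", "warehouse.events",
                      ["user_id", "ts"], predicate="ts >= '2016-01-01'")
req = parse_fetch_url(url)
```

The response body would then be the Arrow columnar data itself, which is where the no-overhead serialization/deserialization benefit comes in; how batches are framed on the wire is exactly the protocol question (HTTP/2, push vs. pull) left open above.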