Hi folks,

I'm not sure Kudu integration is a high priority yet for most of us working
on the project, but we'd be happy to have contributions from the Arrow
community.

My worry about the "standard network interface" approach is that it will
be hard to keep it efficient while also providing the same functionality
as our rather thick client.

To take a particular example, imagine you have a Kudu table created as:

CREATE TABLE metrics (
  host string,
  metric string,
  ts unix_microtime,
  value double,
  PRIMARY KEY (ts, metric, host)
) PARTITION BY HASH(host) PARTITIONS 100,
  RANGE(ts) PARTITIONS (...);

and then someone using Python wants to extract all of the information
pertaining to WHERE ts BETWEEN '2017-01-01' AND '2017-01-30' AND host =
'foo.com';

When specifying these predicates in the Kudu client API, the thick client
is smart enough to do hash-partition pruning based on the equality
predicate, and range partition pruning based on the range predicate. It
also pushes both predicates down to the relevant backend servers so they
can evaluate them and return the minimum amount of data.
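
For concreteness, with today's C++ client that scan looks roughly like the
following (a sketch from memory, modulo exact signatures, with error
handling omitted; start_ts_us and end_ts_us stand in for the microsecond
bounds of the ts range):

KuduScanner scanner(table);
// The equality predicate on 'host' lets the client prune hash partitions.
scanner.AddConjunctPredicate(table->NewComparisonPredicate(
    "host", KuduPredicate::EQUAL, KuduValue::CopyString("foo.com")));
// The range predicates on 'ts' let it prune range partitions; the
// predicates are also pushed down to the relevant tablet servers.
scanner.AddConjunctPredicate(table->NewComparisonPredicate(
    "ts", KuduPredicate::GREATER_EQUAL, KuduValue::FromInt(start_ts_us)));
scanner.AddConjunctPredicate(table->NewComparisonPredicate(
    "ts", KuduPredicate::LESS_EQUAL, KuduValue::FromInt(end_ts_us)));
scanner.Open();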

If we're trying to get all of this logic into a REST (or other
standardized) API, it seems like the only answer is to have a proxy, which
introduces an extra hop of communication and harms some of the "as few
copies as possible" magic of Arrow.

Is the idea that this standardized RPC interface would be mostly used for
simpler single-node systems, and that multi-node storage systems like
Kudu/HBase/etc would continue to use thick clients?

Aside from the issue of extra copies and deployment complexity of a proxy,
my other fear is that we'll end up with least-common-denominator APIs
available to users. For example, if we add a new predicate type for Kudu
which isn't supported in other storage engines, would we allow it to be
added to the Arrow RPC spec? The same question applies to other features,
such as returning storage-engine-specific statistics along with scanner
results (e.g. cache hit/miss ratios per query).

My viewpoint has always been that we'd be adding new functionality to our
existing client APIs like:

KuduScanner scanner(table);
scanner.AddPredicate(table.NewEqualityPredicate("host", "foo.com"));
scanner.OpenWithFormat(ARROW);
ArrowBatch batch;
while (scanner.ReadBatch(&batch)) {
  //... use Arrow APIs to play with batch
}

We're already doing something similar (albeit with a row-based format) to
send raw batches of tuple data to Impala.
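
As for the "use Arrow APIs to play with batch" step, here's a minimal
sketch of the kind of thing I mean, assuming the hypothetical ArrowBatch
above can hand out a std::shared_ptr<arrow::RecordBatch> (the
record_batch() accessor below is made up):

std::shared_ptr<arrow::RecordBatch> rb = batch.record_batch();  // hypothetical accessor
// Grab the 'value' column (index 3 in the schema above) as a typed Arrow
// array, with no extra copy.
auto values = std::static_pointer_cast<arrow::DoubleArray>(rb->column(3));
for (int64_t i = 0; i < values->length(); i++) {
  DoSomething(values->Value(i));  // placeholder for user code
}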

This seems to provide the best combination of (a) minimal copies, (b)
maximal flexibility of APIs, and (c) smallest overlap in scope (and
duplicate work) between Arrow and other projects.

The downside, of course, is that it limits the languages Kudu can
interface with to those that have existing bindings. However, given Kudu's
C++ client, it's not too tough to add wrappers in other languages.

-Todd

On Sun, Mar 19, 2017 at 11:46 AM, Wes McKinney <wesmck...@gmail.com> wrote:

> hi Julien,
>
> Having standard RPC/REST messaging protocols for systems to implement
> sounds like a great idea to me. Some systems might choose to pack
> Arrow files or streams into a Protocol Buffer or Thrift message, but
> it would be good to have a "native" protocol for the streaming file
> format in particular.
>
> I will be happy to provide feedback on a spec for this and to help
> solicit input from other projects which may use the spec.
>
> Thanks,
> Wes
>
> On Wed, Mar 15, 2017 at 11:02 PM, Julien Le Dem <jul...@dremio.com> wrote:
> > We’re working on finalizing a few types and writing the integration tests
> > that go with them.
> >
> > At this point we have a solid foundation in the Arrow project.
> >
> > As a next step I’m going to look into adding an Arrow RPC/REST interface
> > dedicated to data retrieval.
> >
> > We had several discussions about this and I’m going to formalize a spec
> > and ask for review.
> >
> > This Arrow-based data access interface is intended to be used by systems
> > that need access to data for processing (SQL engines, processing
> > frameworks, …) and implemented by storage layers or really anything that
> > can produce data (including processing frameworks returning result sets,
> > for example). That will greatly simplify integration between the many
> > actors in each category.
> >
> > The basic premise is to be able to fetch data in Arrow format while
> > benefiting from no-overhead serialization/deserialization and getting
> > the data in columnar format.
> >
> > Some obvious topics that come to mind:
> >
> > - How do we identify a dataset?
> >
> > - How do we specify projections?
> >
> > - What about predicate push downs or in general parameters?
> >
> > - What underlying protocol to use? HTTP2?
> >
> > - push vs pull?
> >
> > - build a reference implementation (Suggestions?)
> >
> > Potential candidates for using this:
> >
> > - to consume data or to expose result sets: Drill, Hive, Presto, Impala,
> > Spark, RecordService...
> > - as a server: Kudu, HBase, Cassandra, …
> >
> > --
> > Julien
>



-- 
Todd Lipcon
Software Engineer, Cloudera
