Julien, this is a great idea. With the chicken-and-egg nature of Arrow,
having an easy way to connect systems generically should help people make
progress. I agree with Todd that there is some danger of this becoming a
lowest common denominator. That being said, I think it is still a great
place to start. One of the things people need in order to understand how
to apply Arrow to their own use case is a working version of things. They
can then pick and choose the components that they find useful and create
their own recipe. I saw this happen many times with Drill, which served as
an example of how to integrate Calcite into a separate system. We need to
start bootstrapping these integrations and a REST-like Arrow flow. As a
starting point, I would think there would need to be something that
supported the following pattern (rough sketch after the list):


1. Client -> Server: Give me an Arrow stream's schema.
2. Client <- Server: Here is an Arrow schema.
3. Client -> Server: Send me the Arrow stream.
4. Client <- Server: Here is an Arrow stream.
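
To make that pattern concrete, here is a rough sketch of what a consumer
might look like. Everything here is assumed (the endpoint paths, the idea
that a separate schema endpoint returns raw Arrow schema bytes), and it
leans on pyarrow and requests purely for illustration:

import pyarrow as pa
import requests

BASE = "http://localhost:8080/arrow"  # hypothetical Arrow REST server

# 1/2: ask the server for the stream's schema
resp = requests.get(BASE + "/streams/my.co/sales/schema")
schema = pa.ipc.read_schema(pa.py_buffer(resp.content))

# 3/4: ask for the stream itself and read record batches as they arrive
resp = requests.get(BASE + "/streams/my.co/sales", stream=True)
reader = pa.ipc.open_stream(resp.raw)  # the stream also carries the schema
for batch in reader:
    pass  # consume each Arrow record batch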

I think the "endpoint" of the Arrow stream should allow arbitrary custom
properties so it can be configured for a particular system. For example,
Kudu may want to have endpoints include predicates/projections. Some other
system might simply have named endpoints such as my.co/marketing and
my.co/sales, with no need for additional properties. For a streaming
system, one could imagine a recency filter or a last-read token.
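
To make the endpoint idea concrete, here is what those descriptions might
look like; the names and property keys below are all made up, and only the
server would give meaning to the property map:

# Kudu-style endpoint: pushes a projection and predicates via properties
kudu_endpoint = {
    "endpoint": "kudu://metrics",
    "properties": {
        "projection": ["host", "ts", "value"],
        "predicate": "ts >= '2017-01-01' AND host = 'foo.com'",
    },
}

# Named endpoint: no additional properties needed
marketing_endpoint = {"endpoint": "my.co/marketing", "properties": {}}

# Streaming endpoint: recency filter / last-read token
events_endpoint = {
    "endpoint": "events/clicks",
    "properties": {
        "last_read_token": "opaque-token-from-previous-read",
        "max_age_seconds": 3600,
    },
}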

One of our goals is for the "next Kudu" to be able to build on top of this
framework. Supplanting or supplementing the existing Kudu paradigm has the
same challenge as any library: if you already have a custom approach that
works well, why rip and replace it for a common library? (There are clear
answers to this, but it's a cost/benefit question that will always exist.)

Using Kudu as a thought experiment: one way this might work is
implementing a routing layer on top of the generic Arrow REST framework.
For a given user request, the client implementor would route portions of
the request to a set of separate Arrow REST endpoints for different tablet
servers and then concatenate the resulting streams as necessary.
Basically, the Arrow REST framework could work kind of like gRPC plus the
semantics of large column-structured record streams.
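
A very rough sketch of that routing layer follows; everything in it (the
URLs, how tablet servers are discovered, the request shape) is assumed
rather than real:

import pyarrow as pa
import requests

def scan(tablet_endpoints, properties):
    # Fan the request out to each tablet server's Arrow REST endpoint and
    # concatenate the resulting streams into one logical stream of batches.
    for url in tablet_endpoints:
        resp = requests.post(url, json=properties, stream=True)
        reader = pa.ipc.open_stream(resp.raw)
        for batch in reader:
            yield batch  # ordering across tablet servers is not guaranteed

# The routing layer would first prune tablets using the predicate, then:
batches = scan(
    ["http://ts1:8080/arrow/metrics", "http://ts2:8080/arrow/metrics"],
    {"predicate": "host = 'foo.com'"},
)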

But I don't think that makes sense as a first version of the integration.
We should focus on something simpler. For example, an Arrow REST endpoint
that serves our columnar JSON representation from disk as an Arrow record
stream, plus a js/browser or Python consumer. Even with this very simple
initial implementation, we can figure out things like REST v1 versus v2
integration, multiple language bindings, and backpressure/batching
approaches.
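
Here is a rough sketch of the server half of that, assuming the columnar
JSON file has already been parsed into a pyarrow Table (a tiny inline
table stands in for it below):

from http.server import BaseHTTPRequestHandler, HTTPServer
import pyarrow as pa

# Stand-in for loading our columnar JSON representation from disk.
table = pa.table({"host": ["a", "b"], "value": [1.0, 2.0]})

class ArrowStreamHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serialize the table as an Arrow record batch stream.
        sink = pa.BufferOutputStream()
        with pa.ipc.new_stream(sink, table.schema) as writer:
            for batch in table.to_batches():
                writer.write_batch(batch)
        body = sink.getvalue().to_pybytes()
        self.send_response(200)
        self.send_header("Content-Type", "application/vnd.apache.arrow.stream")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("", 8080), ArrowStreamHandler).serve_forever()

A js/browser or Python consumer then only needs something like the first
sketch above: fetch the bytes and hand them to that language's Arrow
stream reader.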

Once we have an initial skeleton, we can look at how to enhance it for a
first real use case. How about a Spark or Impala query endpoint? Or
something else?

On Sun, Mar 19, 2017 at 1:06 PM, Todd Lipcon <t...@cloudera.com> wrote:

> Hi folks,
>
> I'm not sure Kudu integration is a high priority yet for most of us working
> on the project, but we'd be happy to have contributions from the Arrow
> community.
>
> My worry about the "standard network interface" approach is that it's going
> to be rather hard to keep it efficient while also giving the same
> functionality as our rather-thick client.
>
> To take a particular example, imagine you have a Kudu table created as:
>
> CREATE TABLE metrics (
>   host string,
>   metric string,
>   ts unix_microtime,
>   value double,
>   PRIMARY KEY (ts, metric, host)
> ) PARTITION BY HASH(host) PARTITIONS 100,
>   RANGE(ts) PARTITIONS (...);
>
> and then someone using Python wants to extract all of the information
> pertaining to WHERE ts BETWEEN '2017-01-01' AND '2017-01-30' AND
> host = 'foo.com';
>
> When specifying these predicates in the Kudu client API, the thick client
> is smart enough to do hash-partition pruning based on the equality
> predicate, and range partition pruning based on the range predicate. It
> also pushes both predicates down to the relevant backend servers so they
> can evaluate them and return the minimum amount of data.
>
> If we're trying to get all of this logic into a REST (or other
> standardized) API, it seems like the only answer is to have a proxy, which
> introduces an extra hop of communication and harms some of the "as few
> copies as possible" magic of Arrow.
>
> Is the idea that this standardized RPC interface would be mostly used for
> simpler single-node systems, and that multi-node storage systems like
> Kudu/HBase/etc would continue to use thick clients?
>
> Aside from the issue of extra copies and deployment complexity of a proxy,
> my other fear is that we'll end up with least-common-denominator APIs
> available to users. For example, if we add a new predicate type for Kudu
> which isn't supported in other storage engines, would we allow it to be
> added to the Arrow RPC spec? Similarly for other features such as getting
> storage-engine-specific statistics along with scanner results (eg cache
> hit/miss ratios per query).
>
> My viewpoint has always been that we'd be adding new functionality to our
> existing client APIs like:
>
> KuduScanner scanner(table);
> scanner.AddPredicate(table.NewEqualityPredicate("host", "foo.com"));
> scanner.OpenWithFormat(ARROW);
> ArrowBatch batch;
> while (scanner.ReadBatch(&batch)) {
>   //... use Arrow APIs to play with batch
> }
>
> We're already doing something similar (albeit with a row-based format) to
> send raw batches of tuple data to Impala.
>
> This seems to provide the best combination of (a) minimal copies, (b)
> maximal flexibility of APIs, and (c) smallest overlap in scope (and
> duplicate work) between Arrow and other projects.
>
> The downside of course is that it limits the number of languages that Kudu
> can interface with to the set of languages with existing bindings. However,
> given C++ client bindings in Kudu it's not too tough to add wrappers in
> other languages.
>
> -Todd
>
> On Sun, Mar 19, 2017 at 11:46 AM, Wes McKinney <wesmck...@gmail.com>
> wrote:
>
> > hi Julien,
> >
> > Having standard RPC/REST messaging protocols for systems to implement
> > sounds like a great idea to me. Some systems might choose to pack
> > Arrow files or streams into a Protocol Buffer or Thrift message, but
> > it would be good to have a "native" protocol for the streaming file
> > format in particular.
> >
> > I will be happy to provide feedback on a spec for this and to help
> > solicit input from other projects which may use the spec.
> >
> > Thanks,
> > Wes
> >
> > On Wed, Mar 15, 2017 at 11:02 PM, Julien Le Dem <jul...@dremio.com>
> > wrote:
> > > We’re working on finalizing a few types and writing the integration
> > > tests that go with them.
> > >
> > > At this point we have a solid foundation in the Arrow project.
> > >
> > > As a next step I’m going to look into adding an Arrow RPC/REST
> > > interface dedicated to data retrieval.
> > >
> > > We had several discussions about this and I’m going to formalize a
> > > spec and ask for review.
> > >
> > > This Arrow based data access interface is intended to be used by
> > > systems that need access to data for processing (SQL engines,
> > > processing frameworks, …) and implemented by storage layers or really
> > > anything that can produce data (including processing frameworks
> > > returning result sets, for example). That will greatly simplify
> > > integration between the many actors in each category.
> > >
> > > The basic premise is to be able to fetch data in Arrow format while
> > > benefitting from the no-overhead serialization/deserialization and
> > > getting the data in columnar format.
> > >
> > > Some obvious topics that come to mind:
> > >
> > > - How do we identify a dataset?
> > >
> > > - How do we specify projections?
> > >
> > > - What about predicate push downs or in general parameters?
> > >
> > > - What underlying protocol to use? HTTP2?
> > >
> > > - push vs pull?
> > >
> > > - build a reference implementation (Suggestions?)
> > >
> > > Potential candidates for using this:
> > >
> > > - to consume data or to expose result sets: Drill, Hive, Presto,
> > >   Impala, Spark, RecordService...
> > > - as a server: Kudu, HBase, Cassandra, …
> > >
> > > --
> > > Julien
> >
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
