Hi folks, I'm not sure Kudu integration is a high priority yet for most of us working on the project, but we'd be happy to have contributions from the Arrow community.
My worry about the "standard network interface" approach is that it's going to be rather hard to keep it efficient while also giving the same functionality as our rather-thick client. To take a particular example, imagine you have a Kudu table created as:

  CREATE TABLE metrics (
    host string,
    metric string,
    ts unix_microtime,
    value double,
    PRIMARY KEY (ts, metric, host)
  ) PARTITION BY HASH(host) PARTITIONS 100,
    RANGE(ts) PARTITIONS (...);

and then someone using Python wants to extract all of the information pertaining to:

  WHERE ts BETWEEN '2017-01-01' AND '2017-01-30' AND host = 'foo.com';

When specifying these predicates in the Kudu client API, the thick client is smart enough to do hash-partition pruning based on the equality predicate and range-partition pruning based on the range predicate. It also pushes both predicates down to the relevant backend servers so they can evaluate them and return the minimum amount of data.
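To make that concrete, here's a rough sketch of what those predicates look like against today's C++ client API (client/table setup and Status checks are elided for brevity, the timestamp literals are just the dates above expressed as unix micros, and the Arrow read path at the end is the hypothetical part of this proposal, so it's left commented out):

  #include <kudu/client/client.h>

  using kudu::client::KuduPredicate;
  using kudu::client::KuduScanner;
  using kudu::client::KuduTable;
  using kudu::client::KuduValue;

  void ScanMetrics(KuduTable* table) {
    KuduScanner scanner(table);

    // host = 'foo.com': equality predicate, lets the client prune down to a
    // single hash partition.
    scanner.AddConjunctPredicate(table->NewComparisonPredicate(
        "host", KuduPredicate::EQUAL, KuduValue::CopyString("foo.com")));

    // ts BETWEEN '2017-01-01' AND '2017-01-30': range predicates, let the
    // client prune range partitions; all predicates are also pushed down to
    // the tablet servers so they return only matching data.
    scanner.AddConjunctPredicate(table->NewComparisonPredicate(
        "ts", KuduPredicate::GREATER_EQUAL, KuduValue::FromInt(1483228800000000LL)));
    scanner.AddConjunctPredicate(table->NewComparisonPredicate(
        "ts", KuduPredicate::LESS_EQUAL, KuduValue::FromInt(1485734400000000LL)));

    // Hypothetical Arrow-format additions (see the snippet below):
    // scanner.OpenWithFormat(ARROW);
    // ArrowBatch batch;
    // while (scanner.ReadBatch(&batch)) { /* use Arrow APIs on batch */ }
  }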
If we're trying to get all of this logic into a REST (or other standardized) API, it seems like the only answer is to have a proxy, which introduces an extra hop of communication and harms some of the "as few copies as possible" magic of Arrow. Is the idea that this standardized RPC interface would be mostly used for simpler single-node systems, and that multi-node storage systems like Kudu/HBase/etc. would continue to use thick clients?

Aside from the issue of extra copies and the deployment complexity of a proxy, my other fear is that we'll end up with least-common-denominator APIs available to users. For example, if we add a new predicate type for Kudu which isn't supported in other storage engines, would we allow it to be added to the Arrow RPC spec? Similarly for other features, such as getting storage-engine-specific statistics along with scanner results (e.g. cache hit/miss ratios per query).

My viewpoint has always been that we'd be adding new functionality to our existing client APIs, like:

  KuduScanner scanner(table);
  scanner.AddPredicate(table.NewEqualityPredicate("host", "foo.com"));
  scanner.OpenWithFormat(ARROW);
  ArrowBatch batch;
  while (scanner.ReadBatch(&batch)) {
    // ... use Arrow APIs to play with batch
  }

We're already doing something similar (albeit with a row-based format) to send raw batches of tuple data to Impala. This seems to provide the best combination of (a) minimal copies, (b) maximal flexibility of APIs, and (c) smallest overlap in scope (and duplicate work) between Arrow and other projects.

The downside, of course, is that it limits the number of languages that Kudu can interface with to the set of languages with existing bindings. However, given the C++ client bindings in Kudu, it's not too tough to add wrappers in other languages.

-Todd

On Sun, Mar 19, 2017 at 11:46 AM, Wes McKinney <wesmck...@gmail.com> wrote:
> hi Julien,
>
> Having standard RPC/REST messaging protocols for systems to implement
> sounds like a great idea to me. Some systems might choose to pack Arrow
> files or streams into a Protocol Buffer or Thrift message, but it would
> be good to have a "native" protocol for the streaming file format in
> particular.
>
> I will be happy to provide feedback on a spec for this and to help
> solicit input from other projects which may use the spec.
>
> Thanks,
> Wes
>
> On Wed, Mar 15, 2017 at 11:02 PM, Julien Le Dem <jul...@dremio.com> wrote:
> > We’re working on finalizing a few types and writing the integration
> > tests that go with them.
> >
> > At this point we have a solid foundation in the Arrow project.
> >
> > As a next step I’m going to look into adding an Arrow RPC/REST
> > interface dedicated to data retrieval.
> >
> > We had several discussions about this and I’m going to formalize a
> > spec and ask for review.
> >
> > This Arrow-based data access interface is intended to be used by
> > systems that need access to data for processing (SQL engines,
> > processing frameworks, …) and implemented by storage layers or really
> > anything that can produce data (including processing frameworks
> > returning result sets, for example). That will greatly simplify
> > integration between the many actors in each category.
> >
> > The basic premise is to be able to fetch data in Arrow format while
> > benefitting from the no-overhead serialization/deserialization and
> > getting the data in columnar format.
> >
> > Some obvious topics that come to mind:
> >
> > - How do we identify a dataset?
> > - How do we specify projections?
> > - What about predicate pushdown, or parameters in general?
> > - What underlying protocol to use? HTTP/2?
> > - Push vs. pull?
> > - Build a reference implementation (suggestions?)
> >
> > Potential candidates for using this:
> >
> > - to consume data or to expose result sets: Drill, Hive, Presto,
> >   Impala, Spark, RecordService...
> > - as a server: Kudu, HBase, Cassandra, …
> >
> > --
> > Julien

--
Todd Lipcon
Software Engineer, Cloudera