Re: [Gandiva] Representing logical query plans in protobuf

Neville Dipale Sat, 05 Jan 2019 12:37:38 -0800

Hi Andy & Wes,

Apologies if I go a bit off-topic; I hope my thoughts are still related.

I'm a new contributor to Arrow, but I've been using and following it since
the Feather days. I'm interested in contributing to the Rust implementation,
as that aligns more with my day job(s).

I think we (rather, the Rust contributors) can benefit from direction via a
roadmap for a few releases (or for 2019), so that new contributors can find
it easier to add value.

What I've observed so far (in Rust and sometimes other languages) is that
although there is a grand goal (e.g. Ursa Labs' intentions), a lot of the
work scheduling is haphazard. For example, many JIRAs are opened by
developers who then submit a PR not long after; bug reports are the
exception. This pattern, even if minor, makes it difficult for someone
else to pick up work and contribute.

I would propose the following for Rust and other less-maturely-supported
languages like C#:

1. We look at feature gaps relative to Python/C++ and create JIRAs
for functionality that doesn't yet exist. For example, Rust doesn't have
date/time support (I created a JIRA a few weeks ago).
2. For some of these features where more effort is required, provide some
rough outline of what needs to be done.
3. For components/features that are common across languages (e.g. CSV),
agree on an overall design that languages can abide by as far as possible.
C++ might be the template, but it's already likely that Go and Rust are
doing their own thing, which might lead to an inconsistent UX for Arrow
users down the line. Such a design might already exist, but I haven't seen
anything yet. This can include creating common test data, as is being done
with Parquet.

Beyond the in-memory representation, compute kernels are the hot thing in
Arrow right now, with Gandiva being the crown jewel. We currently have
*array_ops* in Rust, where Andy's been adding some operations (sum, add,
mul, etc.).
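
To make the kind of operation concrete, here is a minimal sketch of
element-wise add/mul and a sum reduction over nullable values. It is plain
Rust and does not use the arrow crate: `NullableArray`, `binary_op`, `add`,
`mul` and `sum` are hypothetical names chosen for illustration, and a
`Vec<Option<T>>` only stands in for an Arrow array with a validity bitmap.

```rust
use std::ops::{Add, Mul};

/// Hypothetical stand-in for a nullable primitive array: `None` marks a
/// null slot. This is NOT the arrow crate's array type.
type NullableArray<T> = Vec<Option<T>>;

/// Apply a binary operation element-wise. A null on either side yields a
/// null in the output; mismatched lengths are reported as an error.
fn binary_op<T: Copy, F: Fn(T, T) -> T>(
    left: &[Option<T>],
    right: &[Option<T>],
    op: F,
) -> Result<NullableArray<T>, String> {
    if left.len() != right.len() {
        return Err(format!(
            "length mismatch: {} vs {}",
            left.len(),
            right.len()
        ));
    }
    Ok(left
        .iter()
        .zip(right.iter())
        .map(|(l, r)| match (l, r) {
            (Some(a), Some(b)) => Some(op(*a, *b)),
            _ => None,
        })
        .collect())
}

/// Element-wise addition of two nullable arrays.
fn add<T: Copy + Add<Output = T>>(
    left: &[Option<T>],
    right: &[Option<T>],
) -> Result<NullableArray<T>, String> {
    binary_op(left, right, |a, b| a + b)
}

/// Element-wise multiplication of two nullable arrays.
fn mul<T: Copy + Mul<Output = T>>(
    left: &[Option<T>],
    right: &[Option<T>],
) -> Result<NullableArray<T>, String> {
    binary_op(left, right, |a, b| a * b)
}

/// Sum of the non-null values; `None` if every slot is null (or the
/// input is empty).
fn sum<T: Copy + Add<Output = T>>(values: &[Option<T>]) -> Option<T> {
    values.iter().flatten().copied().fold(None, |acc, v| match acc {
        Some(a) => Some(a + v),
        None => Some(v),
    })
}
```

Routing add and mul through one generic helper is just one possible way to
keep a growing list of ops small; real kernels would work on Arrow buffers
rather than on vectors of options.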

4. I think we need an explicit decision on whether to continue down
this route, which might not be mutually exclusive with future Gandiva
bindings (based on Wes' comments on what Gandiva's role is).
5. If *array_ops* is the way to go, we could define the types of ops we
want to support. This could be as easy as looking at what Pandas or Spark
support. Having a growing suite of functions could encourage users to build
on Arrow like DataFusion is doing. This would also help Andy push the SQL
parser and DF's query engine (per goals and roadmap) while other people do
the grunt-work of various functions.

I believe the above could make it easier for newbies like me to
contribute more to Arrow, and give us a better sense of what we can and
can't do with Arrow in our daily applications.

Thanks
Neville

On Sat, 5 Jan 2019 at 20:29, Andy Grove <andygrov...@gmail.com> wrote:

> Wes,
>
> That makes sense.
>
> I'll create a fresh PR to add a new protobuf under the Rust module for now
> (even though this won't be Rust specific).
>
> Thanks,
>
> Andy.
>
>
> On Sat, Jan 5, 2019 at 9:19 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hey Andy,
> >
> > I replied on GitHub and then saw your e-mail thread.
> >
> > The Gandiva library as it stands right now is not a query engine or an
> > execution engine, properly speaking. It is a subgraph compiler for
> > creating accelerated expressions for use inside another execution or
> > query engine, like it is being used now in Dremio.
> >
> > For this reason I am -1 on adding logical query plan definitions to
> > Gandiva until a more rigorous design effort takes place to decide
> > where to build an actual query/execution engine (which includes file /
> > dataset scanners, projections, joins, aggregates, filters, etc.) in
> > C++. My preference is to start building a from-the-ground-up system
> > that will depend on Gandiva to compile expressions during execution.
> > Among other things, I don't think it is necessarily a good idea to
> > require a query engine to depend on LLVM, so tight coupling to an
> > LLVM-based component may not be desirable.
> >
> > In the meantime, if you want to start creating an (experimental)
> > Protobuf / Flatbuffer definition to define a general query execution
> > plan (that lives outside Gandiva for the time being) to assist with
> > building a query engine in Rust, I think that is fine, but I want to
> > make sure we are being deliberate and layering the project components
> > in a good way.
> >
> > - Wes
> >
> > On Sat, Jan 5, 2019 at 8:15 AM Andy Grove <andygrov...@gmail.com> wrote:
> > >
> > > I have created a PR to start a discussion around representing
> > > logical query plans in Gandiva (ARROW-4163).
> > >
> > > https://github.com/apache/arrow/pull/3319
> > >
> > > I think that adding the various steps such as projection, selection,
> > > sort, and so on are fairly simple and not contentious. The harder
> > > part is how we represent data sources since this likely has different
> > > meanings to different use cases. My thought is that we can register
> > > data sources by name (similar to CREATE EXTERNAL TABLE in Hadoop) or
> > > tie this into the IPC meta-data somehow so we can pass memory
> > > addresses and schema information.
> > >
> > > I would love to hear others' thoughts on this.
> > >
> > > Thanks,
> > >
> > > Andy.
> >
>
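
For reference, the plan steps Andy lists (projection, selection, sort) and
the idea of registering data sources by name could be sketched in plain
Rust roughly as follows. This is not the protobuf definition from the PR:
`Expr`, `LogicalPlan`, `DataSource`, `Registry` and their fields are
hypothetical names used only to illustrate the shape of such a plan.

```rust
use std::collections::HashMap;

/// Hypothetical, deliberately tiny expression type.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    LiteralInt(i64),
    GreaterThan(Box<Expr>, Box<Expr>),
}

/// Hypothetical logical plan: each step wraps its input step, and the
/// leaf is a scan of a data source referred to by name only.
#[derive(Debug, Clone, PartialEq)]
enum LogicalPlan {
    TableScan { table_name: String },
    Projection { exprs: Vec<Expr>, input: Box<LogicalPlan> },
    Selection { predicate: Expr, input: Box<LogicalPlan> },
    Sort { keys: Vec<Expr>, input: Box<LogicalPlan> },
}

/// Hypothetical description of a registered source (assumed fields).
#[derive(Debug, Clone, PartialEq)]
struct DataSource {
    location: String,
    columns: Vec<String>,
}

/// Name-to-source registry, loosely in the spirit of CREATE EXTERNAL TABLE.
struct Registry {
    sources: HashMap<String, DataSource>,
}

impl Registry {
    fn new() -> Self {
        Registry { sources: HashMap::new() }
    }

    fn register(&mut self, name: &str, source: DataSource) {
        self.sources.insert(name.to_string(), source);
    }

    /// Walk down to the leaf scan and look its name up in the registry.
    fn resolve(&self, plan: &LogicalPlan) -> Result<&DataSource, String> {
        match plan {
            LogicalPlan::TableScan { table_name } => self
                .sources
                .get(table_name)
                .ok_or_else(|| format!("unknown data source: {}", table_name)),
            LogicalPlan::Projection { input, .. }
            | LogicalPlan::Selection { input, .. }
            | LogicalPlan::Sort { input, .. } => self.resolve(input),
        }
    }
}

/// Roughly: SELECT name FROM <table> WHERE age > 21 ORDER BY name
fn example_plan(table: &str) -> LogicalPlan {
    let scan = LogicalPlan::TableScan { table_name: table.to_string() };
    let filtered = LogicalPlan::Selection {
        predicate: Expr::GreaterThan(
            Box::new(Expr::Column("age".to_string())),
            Box::new(Expr::LiteralInt(21)),
        ),
        input: Box::new(scan),
    };
    let projected = LogicalPlan::Projection {
        exprs: vec![Expr::Column("name".to_string())],
        input: Box::new(filtered),
    };
    LogicalPlan::Sort {
        keys: vec![Expr::Column("name".to_string())],
        input: Box::new(projected),
    }
}
```

Keeping only a name in the scan node leaves the plan independent of how a
source is actually backed (file, IPC buffer, memory address), which is the
part of the design that is still open in the discussion above.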
