hi Andy,

On Sat, Jan 5, 2019 at 3:59 PM Andy Grove <andygrov...@gmail.com> wrote:
>
> Thanks Neville for starting this discussion.
>
> The next set of things I am interested in, now that we have some
> primitive operators in place, is performing aggregate queries over a
> sequence of RecordBatches (in fact I just got that working in DataFusion
> this morning), then moving on to other SQL features such as ORDER BY,
> and then adding support for scalar and array UDFs (supporting native
> Rust functions and/or calling shared objects that can be authored in
> C/C++ or Rust).
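For concreteness, a native-Rust scalar UDF along those lines could be as simple as a function applied elementwise over a nullable column. A minimal sketch — the names `ScalarUdf` and `apply_udf` are hypothetical, not an actual DataFusion API, and `Vec<Option<f64>>` stands in for a nullable Arrow array:

```rust
// Hypothetical sketch of a native-Rust scalar UDF applied elementwise
// over a nullable column; Vec<Option<f64>> stands in for a nullable
// Arrow array. Not an actual DataFusion API.
type ScalarUdf = fn(f64) -> f64;

fn apply_udf(column: &[Option<f64>], udf: ScalarUdf) -> Vec<Option<f64>> {
    // Null inputs stay null; valid inputs are passed through the UDF.
    column.iter().map(|v| v.map(udf)).collect()
}

fn main() {
    let column = vec![Some(4.0), None, Some(9.0)];
    let result = apply_udf(&column, f64::sqrt);
    assert_eq!(result, vec![Some(2.0), None, Some(3.0)]);
}
```

An array UDF would instead receive the whole column at once, which gives the implementation a chance to vectorize rather than work value by value.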
>
> I am also interested in the arrow reader for Parquet that is in progress,
> and then introducing a common data source trait to wrap CSV, Parquet, and
> future data sources.
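Such a common data source trait could be quite small. A rough sketch, with stand-in `Schema`/`RecordBatch` types where the real Rust Arrow types would go (all names here are assumptions for illustration):

```rust
// Stand-in types; the real Arrow Rust Schema/RecordBatch would be used.
#[derive(Debug, Clone, PartialEq)]
struct Schema { fields: Vec<String> }

struct RecordBatch { num_rows: usize }

// A data source exposes its schema plus a stream of record batches, so
// CSV, Parquet, and future formats share one interface.
trait DataSource {
    fn schema(&self) -> &Schema;
    fn next_batch(&mut self) -> Option<RecordBatch>;
}

// Toy in-memory source to show the trait in use.
struct MemorySource { schema: Schema, batches: Vec<RecordBatch> }

impl DataSource for MemorySource {
    fn schema(&self) -> &Schema { &self.schema }
    fn next_batch(&mut self) -> Option<RecordBatch> { self.batches.pop() }
}

fn main() {
    let mut src = MemorySource {
        schema: Schema { fields: vec!["id".to_string(), "name".to_string()] },
        batches: vec![RecordBatch { num_rows: 100 }],
    };
    assert_eq!(src.schema().fields.len(), 2);
    let total: usize = std::iter::from_fn(|| src.next_batch())
        .map(|b| b.num_rows)
        .sum();
    assert_eq!(total, 100);
}
```

CSV and Parquet readers would then each implement the trait, and downstream code (e.g. a query engine) would only ever see a `dyn DataSource`.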
>
> I like the idea of being able to define logical query plans in the Arrow
> project, but maybe that only makes sense if there is an execution engine
> to use them.
>
> I am open to donating the current DataFusion code as a Rust-native
> execution engine for Arrow if that makes sense (currently supports
> projection, selection, and simple aggregates). Maybe having SQL directly in
> the Arrow project is going too far? However, the logical query plan +
> execution might make good sense.

This sounds pretty interesting.

If you have a SQL front end for the query engine and there is a
willingness from the Rust developers to maintain the code, it doesn't
sound unreasonable.

>
> Thanks,
>
> Andy.
>
> On Sat, Jan 5, 2019 at 2:20 PM Neville Dipale <nevilled...@gmail.com> wrote:
>
> > Hi Wes,
> >
> > I'm aware of what you've expressed about the amount of work that
> > leadership on OSS projects takes, and as for the time aspect, one only
> > has to look at another contributor's local timezone to see the hours
> > and days they put in.
> >
> > To be proactive, I'll hash together a rough roadmap for Rust, and
> > share it on the mailing list when it's ready. Where I need guidance on
> > features, I'll put in enough research so I don't spend other
> > contributors' time on open-ended problems.
> >
> > Thanks for responding, really appreciate it
> >
> > Neville
> >
> > On Sat, 5 Jan 2019 at 23:03, Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > hi Neville,
> > >
> > > On Sat, Jan 5, 2019 at 2:37 PM Neville Dipale <nevilled...@gmail.com>
> > > wrote:
> > > >
> > > > Hi Andy & Wes,
> > > >
> > > > Apologies if I go off-topic a bit, I hope my thoughts are related
> > > > though.
> > > >
> > > > I'm a new contributor to Arrow, but I've been using and following
> > > > it since the feather days. I'm interested in contributing to Rust,
> > > > as that aligns more with my day job(s).
> > > >
> > > > I think we (rather, the Rust contributors) can benefit from
> > > > direction via a roadmap for a few releases (or for 2019), so that
> > > > new contributors can find it easier to add value.
> > > >
> > > > What I've observed so far (on Rust and sometimes other languages)
> > > > is that although there's a grand goal (e.g. Ursa Labs' intentions),
> > > > a lot of work scheduling is haphazard. For example, a lot of JIRAs
> > > > are opened by developers, and then a PR is submitted not long
> > > > after. Bug reports are the exception. This phenomenon, even if
> > > > minor, makes it difficult for someone to pick up work and
> > > > contribute.
> > > >
> > > > I would propose the following for Rust and other
> > > > less-maturely-supported languages like C#:
> > > >
> > > > 1. We look at gaps relative to Python/C++ feature-wise, and create
> > > > JIRAs for functionality that doesn't yet exist. For example, Rust
> > > > doesn't have date/time support (I created a JIRA a few weeks ago).
> > > > 2. For features where more effort is required, provide a rough
> > > > outline of what needs to be done.
> > > > 3. For components/features that are common across languages (e.g.
> > > > CSV), agree on an overall design which languages can adhere to as
> > > > far as possible. C++ might be the template, but it's already likely
> > > > that Go and Rust are doing their own thing, which might lead to an
> > > > inconsistent UX for Arrow users down the line. Such a design might
> > > > already exist, but I haven't seen anything yet. This can include
> > > > creating common test data, as is being done with Parquet.
> > > >
> > >
> > > I don't mean to dismiss this concern, but leadership (what you are
> > > asking for) in any kind of software project (whether open source or
> > > not) is a _lot_ of work. If there is not an individual with the time
> > > and space to effectively be a "product manager" then this work
> > > generally does not happen, and development gets done on an ad hoc
> > > basis based on what features people need to build the applications
> > > they are working on.
> > >
> > > One of the roles I've played over the last ~3 years in the project is
> > > the chief JIRA wrangler for the C++ and Python implementations.
> > > According to
> > >
> > > https://cwiki.apache.org/confluence/display/ARROW/JIRA+Health+Dashboard
> > >
> > > I have created almost 1500 JIRAs. If you want this work to happen for
> > > Rust or one of the other implementations, generally either you will
> > > have to do it, or you will have to find someone to compensate to do
> > > it. Otherwise there may be a volunteer who will step up, but there's
> > > no guarantee.
> > >
> > > > Beyond in-memory rep, computing kernels are the hot thing on Arrow
> > > > right now, with Gandiva being the crown jewel. We currently have
> > > > *array_ops* in Rust, where Andy's been adding some operations (sum,
> > > > add, mul, etc.).
> > > >
> > > > 4. I think we need some explicit decision-making on whether to continue
> > > > this route, which might not be mutually exclusive to future Gandiva
> > > > bindings (based on Wes' comments on what Gandiva's role is).
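For reference, the null-propagating semantics those kernels follow can be sketched in a few lines. This uses `Vec<Option<i64>>` as a stand-in for a nullable Arrow array and is not the actual array_ops code:

```rust
// Sketch of a null-aware elementwise add kernel in the spirit of
// *array_ops*; Vec<Option<i64>> stands in for a nullable Arrow array.
fn add(left: &[Option<i64>], right: &[Option<i64>]) -> Result<Vec<Option<i64>>, String> {
    if left.len() != right.len() {
        return Err("arrays must have the same length".to_string());
    }
    // If either side is null, the result is null (Arrow kernel semantics).
    Ok(left
        .iter()
        .zip(right.iter())
        .map(|(l, r)| match (l, r) {
            (Some(a), Some(b)) => Some(a + b),
            _ => None,
        })
        .collect())
}

fn main() {
    let a = vec![Some(1), Some(2), None];
    let b = vec![Some(10), None, Some(30)];
    assert_eq!(add(&a, &b).unwrap(), vec![Some(11), None, None]);
}
```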
> > >
> > > Apache projects are effectively do-ocracies that operate on the basis
> > > of consensus. It is up to the contributors of the subcomponents to
> > > self-manage what gets built and what does not get built. Much
> > > consensus may be "lazy" (where absence of opinions implies consent).
> > > It is a good idea to discuss objectives and requirements on the
> > > mailing list so there is a written record of what the consensus was.
> > > In the absence of "yay" arguments from contributors, ultimately the
> > > decisions get made by the people doing the work.
> > >
> > > > 5. If *array_ops* is the way to go, we could define the types of
> > > > ops we want to support. This could be as easy as looking at what
> > > > Pandas or Spark support. Having a growing suite of functions could
> > > > encourage users to build on Arrow like DataFusion is doing. This
> > > > would also help Andy push the SQL parser and DF's query engine (per
> > > > goals and roadmap) while other people do the grunt-work of various
> > > > functions.
> > > >
> > > > I believe the above could make it clearer for newbies like me, to
> > > > contribute more to Arrow, and give us a better sense of what we can and
> > > > can't do with Arrow in our daily applications.
> > >
> > > It would be helpful to have a development roadmap for Rust, or for the
> > > other language implementations. With luck someone will be able to
> > > volunteer to take on this work.
> > >
> > > - Wes
> > >
> > > >
> > > > Thanks
> > > > Neville
> > > >
> > > >
> > > > On Sat, 5 Jan 2019 at 20:29, Andy Grove <andygrov...@gmail.com> wrote:
> > > >
> > > > > Wes,
> > > > >
> > > > > That makes sense.
> > > > >
> > > > > I'll create a fresh PR to add a new protobuf under the Rust
> > > > > module for now (even though this won't be Rust specific).
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Andy.
> > > > >
> > > > >
> > > > > On Sat, Jan 5, 2019 at 9:19 AM Wes McKinney <wesmck...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > hey Andy,
> > > > > >
> > > > > > I replied on GitHub and then saw your e-mail thread.
> > > > > >
> > > > > > The Gandiva library as it stands right now is not a query
> > > > > > engine or an execution engine, properly speaking. It is a
> > > > > > subgraph compiler for creating accelerated expressions for use
> > > > > > inside another execution or query engine, like it is being used
> > > > > > now in Dremio.
> > > > > >
> > > > > > For this reason I am -1 on adding logical query plan
> > > > > > definitions to Gandiva until a more rigorous design effort
> > > > > > takes place to decide where to build an actual query/execution
> > > > > > engine (which includes file / dataset scanners, projections,
> > > > > > joins, aggregates, filters, etc.) in C++. My preference is to
> > > > > > start building a from-the-ground-up system that will depend on
> > > > > > Gandiva to compile expressions during execution. Among other
> > > > > > things, I don't think it is necessarily a good idea to require
> > > > > > a query engine to depend on LLVM, so tight coupling to an
> > > > > > LLVM-based component may not be desirable.
> > > > > >
> > > > > > In the meantime, if you want to start creating an
> > > > > > (experimental) Protobuf / Flatbuffer definition to define a
> > > > > > general query execution plan (that lives outside Gandiva for
> > > > > > the time being) to assist with building a query engine in Rust,
> > > > > > I think that is fine, but I want to make sure we are being
> > > > > > deliberate and layering the project components in a good way.
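As a starting point for that discussion: a logical plan is essentially a small tree type. A hypothetical sketch in Rust (a Protobuf/Flatbuffer schema would mirror the same shape; all names and fields here are made up for illustration):

```rust
// Hypothetical sketch of a logical query plan as a tree of nodes; a
// Protobuf/Flatbuffer definition would mirror this shape.
#[derive(Debug)]
enum LogicalPlan {
    // Named data source plus the columns to read.
    Scan { table: String, projection: Vec<String> },
    // Row filter over an input plan (a boolean expression in practice;
    // a string placeholder here).
    Selection { expr: String, input: Box<LogicalPlan> },
    // Column projection over an input plan.
    Projection { exprs: Vec<String>, input: Box<LogicalPlan> },
}

// Depth of the plan tree, just to show traversal over the enum.
fn depth(plan: &LogicalPlan) -> usize {
    match plan {
        LogicalPlan::Scan { .. } => 1,
        LogicalPlan::Selection { input, .. }
        | LogicalPlan::Projection { input, .. } => 1 + depth(input),
    }
}

fn main() {
    // Roughly: SELECT id FROM t WHERE id > 1
    let plan = LogicalPlan::Projection {
        exprs: vec!["id".to_string()],
        input: Box::new(LogicalPlan::Selection {
            expr: "id > 1".to_string(),
            input: Box::new(LogicalPlan::Scan {
                table: "t".to_string(),
                projection: vec!["id".to_string()],
            }),
        }),
    };
    assert_eq!(depth(&plan), 3);
}
```

Keeping the plan definition language-neutral is what would let a Rust engine, a future C++ engine, and Gandiva-compiled expressions all consume the same structure.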
> > > > > >
> > > > > > - Wes
> > > > > >
> > > > > > On Sat, Jan 5, 2019 at 8:15 AM Andy Grove <andygrov...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > I have created a PR to start a discussion around representing
> > > > > > > logical query plans in Gandiva (ARROW-4163).
> > > > > > >
> > > > > > > https://github.com/apache/arrow/pull/3319
> > > > > > >
> > > > > > > I think that adding the various steps such as projection,
> > > > > > > selection, sort, and so on are fairly simple and not
> > > > > > > contentious. The harder part is how we represent data
> > > > > > > sources, since this likely has different meanings to
> > > > > > > different use cases. My thought is that we can register data
> > > > > > > sources by name (similar to CREATE EXTERNAL TABLE in Hadoop)
> > > > > > > or tie this into the IPC meta-data somehow so we can pass
> > > > > > > memory addresses and schema information.
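The register-by-name idea could amount to a small catalog keyed by table name. A hypothetical sketch (`TableDef` and `Catalog` are made-up names standing in for whatever schema/location metadata a real catalog would hold):

```rust
use std::collections::HashMap;

// Sketch of name-based data source registration, in the spirit of
// CREATE EXTERNAL TABLE; TableDef is a hypothetical stand-in.
#[derive(Debug, Clone, PartialEq)]
struct TableDef { path: String, format: String }

#[derive(Default)]
struct Catalog { tables: HashMap<String, TableDef> }

impl Catalog {
    fn register(&mut self, name: &str, def: TableDef) {
        self.tables.insert(name.to_string(), def);
    }
    // A query plan's scan node would resolve its table name here.
    fn resolve(&self, name: &str) -> Option<&TableDef> {
        self.tables.get(name)
    }
}

fn main() {
    let mut catalog = Catalog::default();
    catalog.register("orders", TableDef {
        path: "/data/orders.csv".to_string(),
        format: "csv".to_string(),
    });
    assert!(catalog.resolve("orders").is_some());
    assert!(catalog.resolve("missing").is_none());
}
```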
> > > > > > >
> > > > > > > I would love to hear others' thoughts on this.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Andy.
> > > > > >
> > > > >
> > >
> >
