hi Andy,

On Sat, Jan 5, 2019 at 3:59 PM Andy Grove <andygrov...@gmail.com> wrote:
>
> Thanks Neville for starting this discussion.
>
> The next set of things I am interested in, now that we have some
> primitive operators in place, is performing aggregate queries over a
> sequence of RecordBatches (in fact I just got that working in DataFusion
> this morning), then moving on to other SQL features such as ORDER BY, and
> then adding support for scalar and array UDFs (supporting native Rust
> functions and/or calling shared objects that can be authored in C/C++ or
> Rust).
>
> I am also interested in the Arrow reader for Parquet that is in progress,
> and then introducing a common data source trait to wrap CSV, Parquet, and
> future data sources.
>
> I like the idea of being able to define logical query plans in the Arrow
> project, but maybe that only makes sense if there is an execution engine
> to use them.
>
> I am open to donating the current DataFusion code as a Rust-native
> execution engine for Arrow if that makes sense (it currently supports
> projection, selection, and simple aggregates). Maybe having SQL directly
> in the Arrow project is going too far? However, the logical query plan +
> execution might make good sense.
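For concreteness, here is a minimal Rust sketch of the kind of common data source trait and batch-at-a-time aggregation Andy describes above. All names here (`DataSource`, `MemorySource`, the single-column `RecordBatch`) are illustrative stand-ins, not the actual Arrow or DataFusion APIs:

```rust
/// A minimal stand-in for an Arrow RecordBatch: a single i64 column.
/// (The real RecordBatch holds a schema and multiple typed arrays.)
struct RecordBatch {
    column: Vec<i64>,
}

/// The kind of trait a CSV or Parquet reader could implement, so the
/// engine can consume any source batch by batch.
trait DataSource {
    /// Return the next batch, or None when the source is exhausted.
    fn next_batch(&mut self) -> Option<RecordBatch>;
}

/// An in-memory source, useful for tests and illustration.
struct MemorySource {
    batches: Vec<RecordBatch>,
}

impl DataSource for MemorySource {
    fn next_batch(&mut self) -> Option<RecordBatch> {
        if self.batches.is_empty() {
            None
        } else {
            Some(self.batches.remove(0))
        }
    }
}

/// A SUM aggregate computed incrementally over a sequence of batches,
/// so the whole dataset never has to be materialized at once.
fn sum(source: &mut dyn DataSource) -> i64 {
    let mut total = 0;
    while let Some(batch) = source.next_batch() {
        total += batch.column.iter().sum::<i64>();
    }
    total
}

fn main() {
    let mut source = MemorySource {
        batches: vec![
            RecordBatch { column: vec![1, 2, 3] },
            RecordBatch { column: vec![4, 5] },
        ],
    };
    println!("{}", sum(&mut source)); // 15
}
```

The point of the trait boundary is that the aggregate logic is written once against `DataSource` and works unchanged for any future source.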
This sounds pretty interesting. If you have a SQL front end for the query
engine and there is a willingness from the Rust developers to maintain the
code, it doesn't sound unreasonable.

> Thanks,
>
> Andy.
>
> On Sat, Jan 5, 2019 at 2:20 PM Neville Dipale <nevilled...@gmail.com> wrote:
>
> > Hi Wes,
> >
> > I'm aware of what you've expressed re. the amount of work that
> > leadership on OSS projects takes, and as for the time aspect, one only
> > has to look at another contributor's local timezone to see the hours
> > and days that they work.
> >
> > To be proactive, I'll hash together a rough roadmap for Rust and share
> > it on the mailing list when it's ready. Where I need guidance on
> > features, I'll put in enough research so I don't spend other
> > contributors' time on open-ended problems.
> >
> > Thanks for responding, I really appreciate it.
> >
> > Neville
> >
> > On Sat, 5 Jan 2019 at 23:03, Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > hi Neville,
> > >
> > > On Sat, Jan 5, 2019 at 2:37 PM Neville Dipale <nevilled...@gmail.com> wrote:
> > > >
> > > > Hi Andy & Wes,
> > > >
> > > > Apologies if I go off-topic a bit; I hope my thoughts are related,
> > > > though.
> > > >
> > > > I'm a new contributor to Arrow, but I've been using and following
> > > > it since the feather days. I'm interested in contributing to Rust,
> > > > as that aligns more with my day job(s).
> > > >
> > > > I think we (rather, the Rust contributors) can benefit from
> > > > direction via a roadmap for a few releases (or for 2019), so that
> > > > new contributors can find it easier to add value.
> > > >
> > > > What I've observed so far (on Rust and sometimes other languages)
> > > > is that although there's a grand goal (e.g. Ursa Labs' intentions),
> > > > a lot of work scheduling is haphazard. For example, a lot of JIRAs
> > > > are opened by developers, and then a PR is submitted not long after.
> > > > Bug reports are the exception. This phenomenon, even if minor,
> > > > makes it difficult for someone to pick up work and contribute.
> > > >
> > > > I would propose the following for Rust and other less-maturely-
> > > > supported languages like C#:
> > > >
> > > > 1. We look at gaps relative to Python/C++ feature-wise, and create
> > > > JIRAs for functionality that doesn't yet exist. For example, Rust
> > > > doesn't have date/time support (I created a JIRA a few weeks ago).
> > > > 2. For features where more effort is required, provide a rough
> > > > outline of what needs to be done.
> > > > 3. For components/features that are common across languages (e.g.
> > > > CSV), agree on an overall design which the languages can adhere to
> > > > as far as possible. C++ might be the template, but it's already
> > > > likely that Go and Rust are doing their own thing, which might lead
> > > > to an inconsistent UX for Arrow users down the line. Such a design
> > > > might already exist, but I haven't seen anything yet. This can
> > > > include creating common test data, as is being done with Parquet.
> > >
> > > I don't mean to dismiss this concern, but leadership (what you are
> > > asking for) in any kind of software project (whether open source or
> > > not) is a _lot_ of work. If there is not an individual with the time
> > > and space to effectively be a "product manager", then this work
> > > generally does not happen, and development gets done on an ad hoc
> > > basis based on what features people need to build the applications
> > > they are working on.
> > >
> > > One of the roles I've played over the last ~3 years in the project
> > > is chief JIRA wrangler for the C++ and Python implementations.
> > > According to
> > >
> > > https://cwiki.apache.org/confluence/display/ARROW/JIRA+Health+Dashboard
> > >
> > > I have created almost 1500 JIRAs.
> > > If you want this work to happen for Rust or one of the other
> > > implementations, generally either you will have to do it, or you
> > > will have to find someone to compensate to do it. Otherwise there
> > > may be a volunteer who will step up, but there's no guarantee.
> > >
> > > > Beyond the in-memory representation, compute kernels are the hot
> > > > thing in Arrow right now, with Gandiva being the crown jewel. We
> > > > currently have *array_ops* in Rust, where Andy's been adding some
> > > > operations (sum, add, mul, etc.).
> > > >
> > > > 4. I think we need some explicit decision-making on whether to
> > > > continue down this route, which might not be mutually exclusive
> > > > with future Gandiva bindings (based on Wes' comments on what
> > > > Gandiva's role is).
> > >
> > > Apache projects are effectively do-ocracies that operate on the
> > > basis of consensus. It is up to the contributors of the
> > > subcomponents to self-manage what gets built and what does not get
> > > built. Much consensus may be "lazy" (where the absence of opinions
> > > implies consent). It is a good idea to discuss objectives and
> > > requirements on the mailing list so there is a written record of
> > > what the consensus was. In the absence of "yay" arguments from
> > > contributors, ultimately the decisions get made by the people doing
> > > the work.
> > >
> > > > 5. If *array_ops* is the way to go, we could define the types of
> > > > ops we want to support. This could be as easy as looking at what
> > > > Pandas or Spark support. Having a growing suite of functions could
> > > > encourage users to build on Arrow like DataFusion is doing. This
> > > > would also help Andy push the SQL parser and DataFusion's query
> > > > engine (per the goals and roadmap) while other people do the grunt
> > > > work of the various functions.
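As a rough illustration of the kind of kernels under discussion, here is a self-contained Rust sketch. It mirrors the shape of *array_ops*-style functions (a length-checked elementwise add, and a sum reduction) but operates on plain slices rather than Arrow arrays, and omits null/validity-bitmap handling entirely; the function names and error type are illustrative, not the actual crate API:

```rust
/// Elementwise addition of two equal-length i64 arrays, returning an
/// error on a length mismatch rather than panicking.
fn add(left: &[i64], right: &[i64]) -> Result<Vec<i64>, String> {
    if left.len() != right.len() {
        return Err("cannot add arrays of different lengths".to_string());
    }
    // Zip the two arrays and add pairwise.
    Ok(left.iter().zip(right.iter()).map(|(l, r)| l + r).collect())
}

/// Sum reduction over one array, the building block for aggregate
/// queries (e.g. SQL SUM).
fn sum(values: &[i64]) -> i64 {
    values.iter().sum()
}

fn main() {
    println!("{:?}", add(&[1, 2, 3], &[10, 20, 30])); // Ok([11, 22, 33])
    println!("{}", sum(&[1, 2, 3, 4])); // 10
}
```

Defining a suite of such kernels up front (as point 5 suggests, perhaps mirroring Pandas or Spark) would let multiple contributors fill them in independently.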
> > > >
> > > > I believe the above could make it clearer for newbies like me to
> > > > contribute more to Arrow, and give us a better sense of what we
> > > > can and can't do with Arrow in our daily applications.
> > >
> > > It would be helpful to have a development roadmap for Rust, or for
> > > the other language implementations. With luck someone will be able
> > > to volunteer to take on this work.
> > >
> > > - Wes
> > >
> > > > Thanks
> > > > Neville
> > > >
> > > > On Sat, 5 Jan 2019 at 20:29, Andy Grove <andygrov...@gmail.com> wrote:
> > > >
> > > > > Wes,
> > > > >
> > > > > That makes sense.
> > > > >
> > > > > I'll create a fresh PR to add a new protobuf under the Rust
> > > > > module for now (even though this won't be Rust-specific).
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Andy.
> > > > >
> > > > > On Sat, Jan 5, 2019 at 9:19 AM Wes McKinney <wesmck...@gmail.com> wrote:
> > > > >
> > > > > > hey Andy,
> > > > > >
> > > > > > I replied on GitHub and then saw your e-mail thread.
> > > > > >
> > > > > > The Gandiva library as it stands right now is not a query
> > > > > > engine or an execution engine, properly speaking. It is a
> > > > > > subgraph compiler for creating accelerated expressions for use
> > > > > > inside another execution or query engine, as it is being used
> > > > > > now in Dremio.
> > > > > >
> > > > > > For this reason I am -1 on adding logical query plan
> > > > > > definitions to Gandiva until a more rigorous design effort
> > > > > > takes place to decide where to build an actual query/execution
> > > > > > engine (which includes file/dataset scanners, projections,
> > > > > > joins, aggregates, filters, etc.) in C++. My preference is to
> > > > > > start building a from-the-ground-up system that will depend on
> > > > > > Gandiva to compile expressions during execution.
> > > > > > Among other things, I don't think it is necessarily a good
> > > > > > idea to require a query engine to depend on LLVM, so tight
> > > > > > coupling to an LLVM-based component may not be desirable.
> > > > > >
> > > > > > In the meantime, if you want to start creating an
> > > > > > (experimental) Protobuf / Flatbuffer definition of a general
> > > > > > query execution plan (one that lives outside Gandiva for the
> > > > > > time being) to assist with building a query engine in Rust, I
> > > > > > think that is fine, but I want to make sure we are being
> > > > > > deliberate and layering the project components in a good way.
> > > > > >
> > > > > > - Wes
> > > > > >
> > > > > > On Sat, Jan 5, 2019 at 8:15 AM Andy Grove <andygrov...@gmail.com> wrote:
> > > > > >
> > > > > > > I have created a PR to start a discussion around representing
> > > > > > > logical query plans in Gandiva (ARROW-4163):
> > > > > > >
> > > > > > > https://github.com/apache/arrow/pull/3319
> > > > > > >
> > > > > > > I think that adding the various steps such as projection,
> > > > > > > selection, sort, and so on is fairly simple and not
> > > > > > > contentious. The harder part is how we represent data
> > > > > > > sources, since this likely means different things in
> > > > > > > different use cases. My thought is that we can register data
> > > > > > > sources by name (similar to CREATE EXTERNAL TABLE in Hadoop)
> > > > > > > or tie this into the IPC metadata somehow so we can pass
> > > > > > > memory addresses and schema information.
> > > > > > >
> > > > > > > I would love to hear others' thoughts on this.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Andy.
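To illustrate the shape a logical query plan representation could take, here is a hypothetical Rust sketch. The variant and field names are invented for illustration and are not the ARROW-4163 definitions; expressions are simplified to strings. Data sources are referenced by registered name, in line with the CREATE EXTERNAL TABLE idea, which keeps the plan itself serializable (e.g. to Protobuf):

```rust
/// A tree of relational operators. Each operator except Scan wraps the
/// plan node that produces its input.
enum LogicalPlan {
    /// Scan a data source registered under a name.
    Scan { table_name: String },
    /// Projection: select/compute columns from the input.
    Projection { columns: Vec<String>, input: Box<LogicalPlan> },
    /// Selection: filter rows by a predicate expression.
    Selection { predicate: String, input: Box<LogicalPlan> },
    /// Sort the input by the named columns.
    Sort { sort_columns: Vec<String>, input: Box<LogicalPlan> },
}

/// Render a plan as an indented tree, EXPLAIN-style.
fn format_plan(plan: &LogicalPlan, indent: usize) -> String {
    let pad = "  ".repeat(indent);
    match plan {
        LogicalPlan::Scan { table_name } => format!("{}Scan: {}", pad, table_name),
        LogicalPlan::Projection { columns, input } => format!(
            "{}Projection: {}\n{}",
            pad, columns.join(", "), format_plan(input, indent + 1)
        ),
        LogicalPlan::Selection { predicate, input } => format!(
            "{}Selection: {}\n{}",
            pad, predicate, format_plan(input, indent + 1)
        ),
        LogicalPlan::Sort { sort_columns, input } => format!(
            "{}Sort: {}\n{}",
            pad, sort_columns.join(", "), format_plan(input, indent + 1)
        ),
    }
}

fn main() {
    // Roughly: SELECT id, name FROM people WHERE id > 10
    let plan = LogicalPlan::Projection {
        columns: vec!["id".to_string(), "name".to_string()],
        input: Box::new(LogicalPlan::Selection {
            predicate: "id > 10".to_string(),
            input: Box::new(LogicalPlan::Scan { table_name: "people".to_string() }),
        }),
    };
    println!("{}", format_plan(&plan, 0));
}
```

Because the tree carries only names and expressions, not memory addresses, the same definition could be produced by a SQL front end and consumed by an execution engine in any language.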