Hi Andy & Wes,

Apologies if I go off-topic a bit; I hope my thoughts are related, though.

I'm a new contributor to Arrow, but I've been using and following it since the Feather days. I'm interested in contributing to Rust, as that aligns more with my day job(s).

I think we (rather, the Rust contributors) can benefit from direction via a roadmap for a few releases (or for 2019), so that new contributors can find it easier to add value.

What I've observed so far (on Rust and sometimes other languages) is that although the grand goal exists (e.g. Ursa Labs' intentions), a lot of work scheduling is haphazard. For example, a lot of JIRAs are opened by developers, and then a PR is submitted not long after. Bug reports are exceptions. This pattern, even if minor, makes it difficult for someone to pick up work and contribute.

I would propose the following for Rust and other less mature language implementations like C#:

1. We look at feature gaps relative to Python/C++, and create JIRAs for functionality that doesn't yet exist. For example, Rust doesn't have date/time support (I created a JIRA a few weeks ago).
2. For some of these features where more effort is required, provide a rough outline of what needs to be done.
3. For components/features that are common across languages (e.g. CSV), agree on an overall design which languages can adhere to as far as possible. C++ might be the template, but it's already likely that Go and Rust are doing their own thing, which might lead to an inconsistent UX for Arrow users down the line. Such a design might already exist, but I haven't seen anything yet. This can include creating common test data, as is being done with Parquet.

Beyond the in-memory representation, computing kernels are the hot thing on Arrow right now, with Gandiva being the crown jewel. We currently have *array_ops* in Rust, where Andy's been adding some operations (sum, add, mul, etc.).

4. I think we need some explicit decision-making on whether to continue this route, which might not be mutually exclusive with future Gandiva bindings (based on Wes' comments on what Gandiva's role is).
5. If *array_ops* is the way to go, we could define the types of ops we want to support. This could be as easy as looking at what Pandas or Spark support. Having a growing suite of functions could encourage users to build on Arrow like DataFusion is doing. This would also help Andy push the SQL parser and DF's query engine (per goals and roadmap) while other people do the grunt work of implementing various functions.

I believe the above could make it clearer for newbies like me how to contribute more to Arrow, and give us a better sense of what we can and can't do with Arrow in our daily applications.

Thanks,
Neville

On Sat, 5 Jan 2019 at 20:29, Andy Grove <andygrov...@gmail.com> wrote:

> Wes,
>
> That makes sense.
>
> I'll create a fresh PR to add a new protobuf under the Rust module for now
> (even though this won't be Rust specific).
>
> Thanks,
>
> Andy.
>
>
> On Sat, Jan 5, 2019 at 9:19 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hey Andy,
> >
> > I replied on GitHub and then saw your e-mail thread.
> >
> > The Gandiva library as it stands right now is not a query engine or an
> > execution engine, properly speaking. It is a subgraph compiler for
> > creating accelerated expressions for use inside another execution or
> > query engine, like it is being used now in Dremio.
> >
> > For this reason I am -1 on adding logical query plan definitions to
> > Gandiva until a more rigorous design effort takes place to decide
> > where to build an actual query/execution engine (which includes file /
> > dataset scanners, projections, joins, aggregates, filters, etc.) in
> > C++. My preference is to start building a from-the-ground-up system
> > that will depend on Gandiva to compile expressions during execution.
> > Among other things, I don't think it is necessarily a good idea to
> > require a query engine to depend on LLVM, so tight coupling to an
> > LLVM-based component may not be desirable.
> >
> > In the meantime, if you want to start creating an (experimental)
> > Protobuf / Flatbuffer definition to define a general query execution
> > plan (that lives outside Gandiva for the time being) to assist with
> > building a query engine in Rust, I think that is fine, but I want to
> > make sure we are being deliberate and layering the project components
> > in a good way
> >
> > - Wes
> >
> > On Sat, Jan 5, 2019 at 8:15 AM Andy Grove <andygrov...@gmail.com> wrote:
> > >
> > > I have created a PR to start a discussion around representing logical
> > > query plans in Gandiva (ARROW-4163).
> > >
> > > https://github.com/apache/arrow/pull/3319
> > >
> > > I think that adding the various steps such as projection, selection,
> > > sort, and so on are fairly simple and not contentious. The harder part
> > > is how we represent data sources since this likely has different
> > > meanings to different use cases. My thought is that we can register
> > > data sources by name (similar to CREATE EXTERNAL TABLE in Hadoop) or
> > > tie this into the IPC meta-data somehow so we can pass memory
> > > addresses and schema information.
> > >
> > > I would love to hear others thoughts on this.
> > >
> > > Thanks,
> > >
> > > Andy.