Hi Ravindra,

On Wed, Feb 13, 2019 at 1:34 AM Ravindra Pindikura <ravin...@dremio.com> wrote:
>
> Hi,
>
> I was looking at the recent checkin for arrow kernels, and started to think 
> of how they would work alongside Gandiva.
>
> Here are my thoughts :
>
> 1. Gandiva already has two high-level operators, namely project and filter, 
> with runtime code generation.
>
> * It already supports hundreds of functions (e.g. a+b, a > b), which can be 
> combined into expressions (e.g. a+b > c && a+b < d) for each of the operators, 
> and we’ll likely continue to add more of them.
> * It works on one record batch at a time - consumes a record batch, and 
> produces a record batch.
> * The operators can be inter-linked (e.g. project -> filter -> project) to 
> build a pipeline.
> * We may build additional operators in the future which could benefit from 
> code generation (e.g. Impala uses code generation when parsing Avro files).
>
> 2. Arrow Kernels
>
> a. Support for project/filter operators
>
> Useful for functions where there is no benefit from code generation, or where 
> code generation can be skipped (eager evaluation).
>
> b. Support for additional operators like aggregates
>
>
> How do we combine and link the Gandiva operators and the kernels? For example, 
> it would be nice to have a pipeline with scan (read from source), project 
> (expression on a column), filter (extract rows), and aggregate (sum over the 
> extracted column).
>

I'm just starting on a project to enable logical expressions, formed
from a superset of relational algebra, to be built with a C++ API. I
have already built a fairly deep system for this in the past
(https://github.com/ibis-project/ibis) that captures SQL semantics
while also permitting other kinds of analytical functionality that
spills outside of SQL. As soon as I cobble together a prototype I'll
be interested in lots of feedback, since this is a system that we'll
use for many years.
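
To give a rough idea of what I mean by "logical expressions", here is
a purely hypothetical sketch of the kind of data structures involved
(none of these types exist in Arrow yet, and the names are
placeholders). The point is that the user builds a symbolic tree of
relational operators over typed expressions, and lowering/execution
happens later:

    // Hypothetical sketch only; none of these types exist in Arrow yet.
    #include <memory>
    #include <string>
    #include <vector>

    #include <arrow/type.h>

    struct LogicalExpr {
      // e.g. "column", "literal", "add", "log", "greater", "udf"
      std::string op;
      std::vector<std::shared_ptr<LogicalExpr>> args;
      std::shared_ptr<arrow::DataType> type;  // output type of this node
    };

    struct LogicalOperator {
      // e.g. "scan", "project", "filter", "aggregate", "join", "sort"
      std::string kind;
      std::vector<std::shared_ptr<LogicalOperator>> inputs;
      std::vector<std::shared_ptr<LogicalExpr>> exprs;
      std::shared_ptr<arrow::Schema> output_schema;
    };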

My idea of the interplay with Gandiva is that expressions will be
"lowered" to physical execution operators such as:

* Projection (with/without filters)
* Join
* Hash aggregate (with/without filters)
* Sort
* Time series operations (resampling / binning, etc.)

Gandiva therefore serves as a subgraph compiler for constituent parts
of these operations. During the lowering process, operator subgraphs
will need to be recognized as "able to be compiled with Gandiva". For
example:

* log(table.a + table.b) + 1 can be compiled with Gandiva
* udf(table.a), where "udf" is a user-defined function written in
Python, say, cannot
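
For the first case, the lowering step could build a Gandiva expression
tree and compile it into a Projector, roughly along these lines. This
is only a sketch: it assumes the Gandiva function registry exposes
"add" and "log" for float64, and error handling is elided:

    // Sketch: compile log(a + b) + 1 for a schema with two float64 columns.
    #include <arrow/api.h>
    #include <gandiva/projector.h>
    #include <gandiva/tree_expr_builder.h>

    arrow::Status MakeExampleProjector(
        std::shared_ptr<gandiva::Projector>* projector) {
      auto field_a = arrow::field("a", arrow::float64());
      auto field_b = arrow::field("b", arrow::float64());
      auto schema = arrow::schema({field_a, field_b});
      auto out_field = arrow::field("out", arrow::float64());

      // Build the expression tree for log(a + b) + 1
      auto a = gandiva::TreeExprBuilder::MakeField(field_a);
      auto b = gandiva::TreeExprBuilder::MakeField(field_b);
      auto sum = gandiva::TreeExprBuilder::MakeFunction(
          "add", {a, b}, arrow::float64());
      auto log_sum = gandiva::TreeExprBuilder::MakeFunction(
          "log", {sum}, arrow::float64());
      auto one = gandiva::TreeExprBuilder::MakeLiteral(1.0);
      auto result = gandiva::TreeExprBuilder::MakeFunction(
          "add", {log_sum, one}, arrow::float64());
      auto expr = gandiva::TreeExprBuilder::MakeExpression(result, out_field);

      // Compile the expression; the projector is then applied to each
      // incoming record batch:
      //   arrow::ArrayVector outputs;
      //   projector->Evaluate(*batch, arrow::default_memory_pool(), &outputs);
      return gandiva::Projector::Make(schema, {expr}, projector);
    }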

> To do this, I think we would need to be able to build a pipeline with 
> high-level operators that move data along one record batch at a time:
> - a source operator, which only produces record batches (maybe a CSV reader)
> - intermediate operators that can produce/consume record batches (maybe the 
> Gandiva project operator)
> - terminal operators that emit the final output (from the end of the 
> pipeline) when there is nothing left to consume (maybe a SumKernel)
>
> Are we thinking along these lines ?

Yes, that sounds right.
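
As a very rough sketch of the kind of interfaces that description
implies (hypothetical; none of these classes exist in Arrow today, and
the names are placeholders):

    // Hypothetical operator interfaces for a record-batch pipeline.
    #include <memory>

    #include <arrow/api.h>

    // A source operator (e.g. a CSV reader) produces record batches until
    // it sets *out to nullptr.
    class SourceOperator {
     public:
      virtual ~SourceOperator() = default;
      virtual arrow::Status Next(std::shared_ptr<arrow::RecordBatch>* out) = 0;
    };

    // An intermediate operator (e.g. a Gandiva project or filter) maps one
    // input batch to one output batch.
    class BatchOperator {
     public:
      virtual ~BatchOperator() = default;
      virtual arrow::Status Process(const arrow::RecordBatch& input,
                                    std::shared_ptr<arrow::RecordBatch>* out) = 0;
    };

    // A terminal operator (e.g. a sum kernel) consumes batches and emits
    // its result once the source is exhausted.
    class SinkOperator {
     public:
      virtual ~SinkOperator() = default;
      virtual arrow::Status Consume(const arrow::RecordBatch& input) = 0;
      virtual arrow::Status Finalize(std::shared_ptr<arrow::Array>* out) = 0;
    };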

We don't have a "dataset" framework yet for source operators (CSV,
JSON, and Parquet will be the first ones). I plan to write a
requirements document for this project, since we have the machinery in
place to e.g. read Parquet and CSV files, but without a unified
abstract dataset API.

- Wes

>
> Thanks & regards,
> Ravindra.
