Hi,

I was looking at the recent check-in for Arrow kernels, and started to think 
about how they would work alongside Gandiva.

Here are my thoughts:

1. Gandiva already has two high-level operators, namely project and filter, 
with runtime code generation.

* It already supports hundreds of functions (e.g. a + b, a > b), which can be 
combined into expressions (e.g. a + b > c && a + b < d) for each of the 
operators, and we'll likely continue to add more of them. A sketch of building 
one such expression follows this list.
* It works on one record batch at a time - consumes a record batch, and 
produces a record batch.
* The operators can be interlinked (e.g. project -> filter -> project) to 
build a pipeline.
* We may build additional operators in the future that could benefit from 
code generation (e.g. Impala uses code generation when parsing Avro files).

2. Arrow Kernels 

a. Support project/filter operators

Useful for functions where there is no benefit from code generation, or where 
code generation can be skipped (eager evaluation).

b. Support for additional operators like aggregates
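As a minimal sketch of the eager-evaluation case (using the Sum kernel from 
the arrow::compute headers; exact signatures may differ across versions):

#include <arrow/api.h>
#include <arrow/compute/api.h>

// Sum an int32 column of a record batch with a pre-compiled kernel -
// no code generation involved; one call produces the materialized result.
arrow::Result<std::shared_ptr<arrow::Scalar>> SumColumn(
    const std::shared_ptr<arrow::RecordBatch>& batch, int column) {
  ARROW_ASSIGN_OR_RAISE(arrow::Datum sum,
                        arrow::compute::Sum(batch->column(column)));
  return sum.scalar();
}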


How do we combine and link the Gandiva operators and the kernels? For 
example, it would be nice to have a pipeline with scan (read from source), 
project (expression on column), filter (extract rows), and aggregate (sum on 
the extracted column).

To do this, I think we would need to be able to build a pipeline with 
high-level operators that move data along one record batch at a time (a rough 
interface sketch follows this list):
- source operators that only produce record batches (maybe a CSV reader)
- intermediate operators that can both consume and produce record batches 
(maybe the Gandiva project operator)
- terminal operators that emit the final output (from the end of the pipeline) 
when there is nothing left to consume (maybe SumKernel)
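A purely illustrative interface for such operators could look like the 
following - PipelineOp, Consume, and Finish are hypothetical names I am making 
up here, not an existing Arrow API:

#include <arrow/api.h>

class PipelineOp {
 public:
  virtual ~PipelineOp() = default;

  // Called once per record batch produced by the upstream operator.
  virtual arrow::Status Consume(
      const std::shared_ptr<arrow::RecordBatch>& batch) = 0;

  // Called when the upstream is exhausted; terminal operators (e.g. a
  // sum kernel wrapper) emit their final output here.
  virtual arrow::Status Finish() = 0;
};

// An intermediate operator (e.g. Gandiva project) would wrap a compiled
// projector, evaluate each incoming batch, and push the result downstream.
class ProjectOp : public PipelineOp {
 public:
  explicit ProjectOp(std::shared_ptr<PipelineOp> downstream)
      : downstream_(std::move(downstream)) {}

  arrow::Status Consume(
      const std::shared_ptr<arrow::RecordBatch>& batch) override {
    // ... evaluate the compiled expressions on `batch` and build an
    // output batch; passing the input through here as a placeholder.
    return downstream_->Consume(batch);
  }

  arrow::Status Finish() override { return downstream_->Finish(); }

 private:
  std::shared_ptr<PipelineOp> downstream_;
};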

Are we thinking along these lines?

Thanks & regards,
Ravindra.
