Hi, I was looking at the recent check-in for Arrow kernels, and started to think about how they would work alongside Gandiva.
Here are my thoughts:

1. Gandiva already has two high-level operators, namely project and filter, with runtime code generation.
   * It already supports hundreds of functions (e.g. a + b, a > b), which can be combined into expressions (e.g. a + b > c && a + b < d) for each of the operators, and we’ll likely continue to add more of them.
   * It works on one record batch at a time: it consumes a record batch and produces a record batch. (A rough sketch of this API is in the P.S. below.)
   * The operators can be inter-linked (e.g. project -> filter -> project) to build a pipeline.
   * We may build additional operators in the future that could benefit from code generation (e.g. Impala uses code generation when parsing Avro files).

2. Arrow kernels
   a. Support for the project/filter operators. Useful for functions where there is no benefit from code generation, or where code generation can be skipped (eager evaluation).
   b. Support for additional operators like aggregates.

How do we combine and link the Gandiva operators and the kernels? For example, it would be nice to have a pipeline with scan (read from a source), project (evaluate an expression on a column), filter (extract matching rows), and aggregate (sum over the extracted column).

To do this, I think we would need to be able to build a pipeline of high-level operators that move data along one record batch at a time (see the second sketch in the P.S.):
- a source operator that only produces record batches (maybe a CSV reader)
- intermediate operators that both consume and produce record batches (maybe the Gandiva project operator)
- terminal operators that emit the final output (from the end of the pipeline) when there is nothing left to consume (maybe SumKernel)

Are we thinking along these lines?

Thanks & regards,
Ravindra.
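P.S. To make point 1 concrete, here is a minimal sketch of using the Gandiva project operator, written from memory, so please treat the exact signatures as approximate and check the gandiva headers. It builds a projector for "sum = a + b" (the LLVM code generation happens in Make()) and then evaluates one record batch; in a real pipeline the projector would be built once and reused across batches:

// Sketch only: signatures from memory, verify against gandiva/*.h.
#include <memory>
#include <arrow/api.h>
#include <gandiva/projector.h>
#include <gandiva/tree_expr_builder.h>

arrow::Status ProjectSum(const std::shared_ptr<arrow::RecordBatch>& batch,
                         arrow::ArrayVector* out) {
  auto field_a = arrow::field("a", arrow::int32());
  auto field_b = arrow::field("b", arrow::int32());
  auto schema = arrow::schema({field_a, field_b});
  auto sum = arrow::field("sum", arrow::int32());

  // Expression tree for sum = a + b; code generation happens inside Make().
  auto expr =
      gandiva::TreeExprBuilder::MakeExpression("add", {field_a, field_b}, sum);

  std::shared_ptr<gandiva::Projector> projector;
  ARROW_RETURN_NOT_OK(gandiva::Projector::Make(schema, {expr}, &projector));

  // Consume one record batch, produce the projected column(s).
  return projector->Evaluate(*batch, arrow::default_memory_pool(), out);
}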
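P.P.S. And for the pipeline part, a sketch of what the shared operator interface might look like, plus a driver loop. Nothing below exists in Arrow or Gandiva today; all the names (PipelineOp, Consume, Finish, RunPipeline) are made up for illustration:

// Hypothetical interface: neither PipelineOp nor RunPipeline exists yet.
#include <memory>
#include <vector>
#include <arrow/api.h>

class PipelineOp {
 public:
  virtual ~PipelineOp() = default;

  // Intermediate operators (e.g. a Gandiva project) turn one record batch
  // into another; a filter may return fewer rows, or nullptr if none match.
  virtual arrow::Status Consume(const std::shared_ptr<arrow::RecordBatch>& in,
                                std::shared_ptr<arrow::RecordBatch>* out) = 0;

  // Called when upstream is exhausted. Terminal operators (e.g. a sum
  // kernel) emit their accumulated result here; others may return nullptr.
  virtual arrow::Status Finish(std::shared_ptr<arrow::RecordBatch>* out) = 0;
};

// Drive a source -> ... -> terminal chain one record batch at a time.
arrow::Status RunPipeline(arrow::RecordBatchReader* source,
                          const std::vector<std::shared_ptr<PipelineOp>>& ops,
                          std::shared_ptr<arrow::RecordBatch>* result) {
  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(source->ReadNext(&batch));
    if (batch == nullptr) break;  // source is exhausted
    for (const auto& op : ops) {
      std::shared_ptr<arrow::RecordBatch> next;
      ARROW_RETURN_NOT_OK(op->Consume(batch, &next));
      batch = std::move(next);
      if (batch == nullptr) break;  // nothing left for downstream operators
    }
  }
  // Let the terminal operator emit its final output.
  return ops.back()->Finish(result);
}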