Hi Ravindra,
On Wed, Feb 13, 2019 at 1:34 AM Ravindra Pindikura <ravin...@dremio.com> wrote:
>
> Hi,
>
> I was looking at the recent check-in for Arrow kernels, and started to
> think about how they would work alongside Gandiva.
>
> Here are my thoughts:
>
> 1. Gandiva already has two high-level operators, namely project and
> filter, with runtime code generation.
>
> * It already supports hundreds of functions (e.g. a+b, a > b), which can
> be combined into expressions (e.g. a+b > c && a+b < d) for each of the
> operators, and we'll likely continue to add more of them.
> * It works on one record batch at a time - consumes a record batch, and
> produces a record batch.
> * The operators can be inter-linked (e.g. project -> filter -> project)
> to build a pipeline.
> * We may build additional operators in the future which could benefit
> from code generation (e.g. Impala uses code generation when parsing Avro
> files).
>
> 2. Arrow kernels
>
> a. Support project/filter operators
>
> Useful for functions where there is no benefit from code generation, or
> where code generation can be skipped (eager evaluation).
>
> b. Support for additional operators like aggregates
>
> How do we combine and link the Gandiva operators and the kernels? For
> example, it would be nice to have a pipeline with scan (read from
> source), project (expression on a column), filter (extract rows), and
> aggregate (sum on the extracted column).

I'm just starting now on a project to enable logical expressions, formed
from a superset of relational algebra, to be built with a C++ API. In the
past I built a fairly deep system for this
(https://github.com/ibis-project/ibis) that captures SQL semantics while
also permitting other kinds of analytical functionality that spills
outside of SQL. As soon as I cobble together a prototype I'll be
interested in lots of feedback, since this is a system that we'll use for
many years.

My idea of the interplay with Gandiva is that expressions will be
"lowered" to physical execution operators such as:

* Projection (with/without filters)
* Join
* Hash aggregate (with/without filters)
* Sort
* Time series operations (resampling / binning, etc.)

Gandiva therefore serves as a subgraph compiler for constituent parts of
these operations. During the lowering process, operator subgraphs will
need to be recognized as "able to be compiled with Gandiva". For example:

* log(table.a + table.b) + 1 can be compiled with Gandiva
* udf(table.a), where "udf" is a user-defined function written in Python,
say, cannot

> To do this, I think we would need to be able to build a pipeline with
> high-level operators that move data along one record batch at a time:
> - source operators which only produce record batches (maybe, a CSV
> reader)
> - intermediate operators that can produce/consume record batches
> (maybe, the Gandiva project operator)
> - terminal operators that emit the final output (from the end of the
> pipeline) when there is nothing left to consume (maybe, SumKernel)
>
> Are we thinking along these lines?

Yes, that sounds right. We don't have a "dataset" framework yet for
source operators (CSV, JSON, and Parquet will be the first ones). I plan
to write a requirements document for this project, since we have the
machinery in place to e.g. read Parquet and CSV files, but no unified
abstract dataset API.

I've appended a couple of rough sketches at the end of this message to
make both of these ideas a bit more concrete.

- Wes

>
> Thanks & regards,
> Ravindra.
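
First, to make the "compiled with Gandiva" case concrete: here is roughly
what compiling and running a single projection expression (a + b) looks
like with Gandiva's TreeExprBuilder / Projector API. This is a minimal,
untested sketch; the field names, the int32 types, and the "add" function
name are just placeholders:

#include <arrow/api.h>
#include <gandiva/projector.h>
#include <gandiva/tree_expr_builder.h>

// Compile and run the projection expression a + b on one record batch.
arrow::Status ProjectSum(const std::shared_ptr<arrow::RecordBatch>& batch,
                         arrow::ArrayVector* out) {
  auto field_a = arrow::field("a", arrow::int32());
  auto field_b = arrow::field("b", arrow::int32());
  auto schema = arrow::schema({field_a, field_b});

  // Expression tree: add(a, b), written to an output field "a_plus_b".
  auto result_field = arrow::field("a_plus_b", arrow::int32());
  auto expr = gandiva::TreeExprBuilder::MakeExpression(
      "add", {field_a, field_b}, result_field);

  // JIT-compile the expression into a Projector (the codegen step).
  std::shared_ptr<gandiva::Projector> projector;
  auto status = gandiva::Projector::Make(schema, {expr}, &projector);
  if (!status.ok()) {
    return status;
  }

  // Evaluate one record batch; produces one output array per expression.
  return projector->Evaluate(*batch, arrow::default_memory_pool(), out);
}

Feeding record batches through this one at a time is exactly the
"consumes a record batch, produces a record batch" shape you describe.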
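
Second, here is the kind of push-based operator interface I have in mind
for the source / intermediate / terminal pipeline you describe. This is
purely hypothetical - none of these class or method names exist anywhere
yet:

#include <memory>
#include <arrow/api.h>

// Hypothetical push-based operator interface (names made up for this
// sketch). A source operator calls Consume() on its downstream operator
// for every record batch it produces, and Finish() once its input is
// exhausted; terminal operators (e.g. a sum aggregate) emit their final
// result from Finish().
class BatchOperator {
 public:
  virtual ~BatchOperator() = default;

  // Process one record batch. Intermediate operators (e.g. a Gandiva
  // project or filter) would evaluate the batch and push the result to
  // the downstream operator.
  virtual arrow::Status Consume(
      const std::shared_ptr<arrow::RecordBatch>& batch) = 0;

  // No more input will arrive; flush buffered state and emit final output.
  virtual arrow::Status Finish() = 0;
};

A Gandiva-backed project or filter operator would just wrap a compiled
gandiva::Projector or gandiva::Filter and implement Consume() by
evaluating it on each incoming batch, while a SumKernel-style terminal
operator would accumulate in Consume() and emit from Finish().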