Thanks, Wes and Francois.
> On Feb 13, 2019, at 10:16 PM, Francois Saint-Jacques
> <fsaintjacq...@gmail.com> wrote:
>
> Hi,
>
> I also agree that we should follow a model similar to what you propose. I
> think the plan is, correct me if I'm wrong, Wes, to write the logical plan
> operators, then write a small execution engine prototype and produce a
> proper design document out of this experiment. There's also a placeholder
> ticket: https://issues.apache.org/jira/browse/ARROW-4333
>
> François
>
> On Wed, Feb 13, 2019 at 9:54 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> hi Ravindra,
>>
>>
>> On Wed, Feb 13, 2019 at 1:34 AM Ravindra Pindikura <ravin...@dremio.com>
>> wrote:
>>>
>>> Hi,
>>>
>>> I was looking at the recent check-in for Arrow kernels, and started
>>> thinking about how they would work alongside Gandiva.
>>>
>>> Here are my thoughts:
>>>
>>> 1. Gandiva already has two high-level operators, namely project and
>>> filter, with runtime code generation.
>>>
>>> * It already supports hundreds of functions (e.g. a + b, a > b), which
>>> can be combined into expressions (e.g. a + b > c && a + b < d) for each
>>> of the operators, and we'll likely continue to add more of them.
>>> * It works on one record batch at a time - it consumes a record batch and
>>> produces a record batch.
>>> * The operators can be interlinked (e.g. project -> filter -> project) to
>>> build a pipeline.
>>> * We may build additional operators in the future that could benefit from
>>> code generation (e.g. Impala uses code generation when parsing Avro files).
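
To make that project operator concrete, here is a minimal sketch of building
and running it for "a + b" with the Gandiva C++ API (field names and types
are just for illustration, and batch is assumed to be a
std::shared_ptr<arrow::RecordBatch> produced upstream):

    #include <arrow/api.h>
    #include <gandiva/projector.h>
    #include <gandiva/tree_expr_builder.h>

    // Schema of the incoming record batches (illustrative).
    auto field_a = arrow::field("a", arrow::int32());
    auto field_b = arrow::field("b", arrow::int32());
    auto schema = arrow::schema({field_a, field_b});
    auto field_sum = arrow::field("a_plus_b", arrow::int32());

    // Build the expression tree for a + b and compile it once.
    auto node_a = gandiva::TreeExprBuilder::MakeField(field_a);
    auto node_b = gandiva::TreeExprBuilder::MakeField(field_b);
    auto add = gandiva::TreeExprBuilder::MakeFunction(
        "add", {node_a, node_b}, arrow::int32());
    auto expr = gandiva::TreeExprBuilder::MakeExpression(add, field_sum);

    std::shared_ptr<gandiva::Projector> projector;
    auto status = gandiva::Projector::Make(schema, {expr}, &projector);
    // check status.ok() ...

    // Consume one record batch, produce the projected column(s).
    arrow::ArrayVector outputs;
    status = projector->Evaluate(*batch, arrow::default_memory_pool(), &outputs);

The compiled projector is then reused for every incoming batch, which is
where the code-generation cost pays off.
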
>>>
>>> 2. Arrow Kernels
>>>
>>> a. Support for project/filter operators
>>>
>>> Useful for functions where code generation provides no benefit, or where
>>> code generation can be skipped (eager evaluation).
>>>
>>> b. Support for additional operators like aggregates
>>>
>>>
>>> How do we combine and link the Gandiva operators and the kernels? For
>>> example, it would be nice to have a pipeline with scan (read from source),
>>> project (expression on a column), filter (extract rows), and aggregate
>>> (sum on the extracted column).
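
Strung together by hand, that pipeline is essentially the loop below. This
is only a sketch: reader is assumed to be an arrow::RecordBatchReader over
the source, projector a compiled gandiva::Projector, and the terminal step
assumes a Sum kernel entry point shaped roughly like
Status Sum(FunctionContext*, const Datum&, Datum*) - the exact compute API
may differ from what has landed. The open question is which layer owns this
loop:

    #include <arrow/compute/api.h>  // FunctionContext, Datum, aggregate kernels

    // (Inside a function returning arrow::Status.)
    arrow::compute::FunctionContext ctx(arrow::default_memory_pool());
    arrow::compute::Datum partial;
    std::shared_ptr<arrow::RecordBatch> batch;
    while (true) {
      // Source operator: produces record batches until the stream is exhausted.
      ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
      if (batch == nullptr) break;

      // Gandiva project operator: one record batch in, projected arrays out.
      arrow::ArrayVector projected;
      ARROW_RETURN_NOT_OK(
          projector->Evaluate(*batch, arrow::default_memory_pool(), &projected));

      // (A Gandiva filter would slot in here, emitting a selection vector of
      // the rows that pass, which the downstream steps would have to respect.)

      // Terminal operator: an aggregate kernel folds each batch into the result.
      ARROW_RETURN_NOT_OK(
          arrow::compute::Sum(&ctx, arrow::compute::Datum(projected[0]), &partial));
      // ... combine partial with the running total ...
    }
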
>>>
>>
>> I'm just starting on a project now to enable logical expressions, formed
>> from a superset of relational algebra, to be built with a C++ API. In the
>> past, I have already built a fairly deep system for this
>> (https://github.com/ibis-project/ibis) that captures SQL semantics while
>> also permitting other kinds of analytical functionality that spills
>> outside of SQL. As soon as I cobble together a prototype I'll be
>> interested in lots of feedback, since this is a system that we'll use for
>> many years.
>>
>> My idea of the interplay with Gandiva is that expressions will be
>> "lowered" to physical execution operators such as:
>>
>> * Projection (with/without filters)
>> * Join
>> * Hash aggregate (with/without filters)
>> * Sort
>> * Time series operations (resampling / binning, etc.)
>>
>> Gandiva therefore serves as a subgraph compiler for constituent parts
>> of these operations. During the lowering process, operator subgraphs
>> will need to be recognized as "able to be compiled with Gandiva". For
>> example:
>>
>> * log(table.a + table.b) + 1 can be compiled with Gandiva
>> * udf(table.a), where "udf" is a user-defined function written in
>> Python, say, cannot
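
As a concrete illustration of the first case, that expression maps directly
onto a Gandiva expression tree. A sketch, assuming float64 columns and that
"add"/"log" are registered Gandiva functions for that type:

    using gandiva::TreeExprBuilder;

    // log(table.a + table.b) + 1, built as a Gandiva expression tree.
    auto a = TreeExprBuilder::MakeField(arrow::field("a", arrow::float64()));
    auto b = TreeExprBuilder::MakeField(arrow::field("b", arrow::float64()));
    auto a_plus_b = TreeExprBuilder::MakeFunction("add", {a, b}, arrow::float64());
    auto log_ab = TreeExprBuilder::MakeFunction("log", {a_plus_b}, arrow::float64());
    auto one = TreeExprBuilder::MakeLiteral(1.0);
    auto root = TreeExprBuilder::MakeFunction("add", {log_ab, one}, arrow::float64());
    // The udf(table.a) case has no such tree; the Python UDF would have to
    // run as a regular (non-codegen) kernel instead.
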
>>
>>> To do this, I think we would need to be able to build a pipeline with
>>> high-level operators that move data along one record batch at a time:
>>> - a source operator that only produces record batches (e.g. a CSV reader)
>>> - intermediate operators that can produce/consume record batches (e.g.
>>> the Gandiva project operator)
>>> - terminal operators that emit the final output (from the end of the
>>> pipeline) when there is nothing left to consume (e.g. SumKernel)
>>>
>>> Are we thinking along these lines?
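
One way to picture those three roles is a minimal pull-style operator
interface. This is purely hypothetical (nothing like it exists in the
codebase today), just to make the shape of the pipeline concrete:

    // Hypothetical interface only - not an existing Arrow API.
    class Operator {
     public:
      virtual ~Operator() = default;
      // Produces the next record batch, or sets *out to nullptr when the
      // stream is exhausted. Sources produce batches, intermediate operators
      // transform batches pulled from their child, and terminal operators
      // emit their final result (e.g. a one-row sum) only once the child is
      // exhausted.
      virtual arrow::Status Next(std::shared_ptr<arrow::RecordBatch>* out) = 0;
    };

    // A pipeline is then just a chain, e.g. Sum(Filter(Project(CsvSource))),
    // driven by repeatedly calling Next() on the terminal operator.
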
>>
>> Yes, that sounds right.
>>
>> We don't have a "dataset" framework yet for source operators (CSV,
>> JSON, and Parquet will be the first ones). I have planned to write a
>> requirements document about this project as we have the machinery in
>> place to e.g. read Parquet and CSV files but without a unified
>> abstract dataset API
>>
>> - Wes
>>
>>>
>>> Thanks & regards,
>>> Ravindra.
>>