hi Martin,

These projects are very different. Many analytic databases feature
code generation on their hot paths for expression evaluation (e.g.
evaluating the expressions in the SELECT list or the WHERE clause);
recently a lot of them use LLVM for this -- see Hyper, Apache Impala,
and others. The reason people are excited about Gandiva is that it
makes this type of functionality available as a library running atop
an open standard memory format (the Arrow columnar format), so it can
be used from any programming language, assuming suitable bindings can
be developed. This is very much in line with our vision of a
"deconstructed database" (see a talk that Julien gave on this topic:
https://www.slideshare.net/julienledem/from-flat-files-to-deconstructed-database).
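
To make that concrete, here is a rough sketch of what building and
running a single projection expression against an Arrow record batch
looks like with the Gandiva C++ API. I'm writing this from memory and
the API is still young, so treat the class and function names
(TreeExprBuilder, Projector, and so on) as approximate and check the
headers for the current signatures:

  // Sketch only: builds and runs the expression "a + b" over two float64
  // columns of a RecordBatch. Names/signatures are approximate.
  #include <arrow/api.h>
  #include <gandiva/projector.h>
  #include <gandiva/tree_expr_builder.h>

  arrow::Status ProjectSum(const std::shared_ptr<arrow::Schema>& schema,
                           const std::shared_ptr<arrow::RecordBatch>& batch,
                           arrow::MemoryPool* pool,
                           std::shared_ptr<arrow::Array>* out) {
    // Describe the expression tree: sum = a + b.
    auto node_a = gandiva::TreeExprBuilder::MakeField(schema->field(0));
    auto node_b = gandiva::TreeExprBuilder::MakeField(schema->field(1));
    auto add = gandiva::TreeExprBuilder::MakeFunction(
        "add", {node_a, node_b}, arrow::float64());
    auto expr = gandiva::TreeExprBuilder::MakeExpression(
        add, arrow::field("sum", arrow::float64()));

    // Compile once (this is where the LLVM code generation happens)...
    std::shared_ptr<gandiva::Projector> projector;
    ARROW_RETURN_NOT_OK(gandiva::Projector::Make(schema, {expr}, &projector));

    // ...then evaluate against any number of record batches.
    arrow::ArrayVector outputs;
    ARROW_RETURN_NOT_OK(projector->Evaluate(*batch, pool, &outputs));
    *out = outputs[0];
    return arrow::Status::OK();
  }

The important point is that the inputs and outputs are plain Arrow
arrays, so any runtime that can produce Arrow buffers can hand batches
to the compiled kernel without copying or converting.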

I have not looked at oamap in great depth, but AFAIK it does not use
the Arrow columnar format. It is written in Python and presumes the
use of some other technologies (like the ROOT format).

So to summarize:

Gandiva
* Compiles analytical expressions to execute against Arrow columnar format
* Is written in C++ and can be embedded in other systems (Dremio is
using it from Java)

oamap
* Does not use the Arrow columnar format
* Presumes other technologies in use (ROOT)
* Is written in Python, and would be challenging to use as an embedded
system component

I'm certain these projects can learn from each other -- I have spoken
with Jim (one of the developers of oamap) in the past, and I welcome
further discussion here on the mailing list.

Thanks,
Wes

On Mon, Jun 25, 2018 at 1:27 PM, Martin Durant
<martin.dur...@utoronto.ca> wrote:
> I am a little surprised by the very positive reception to Gandiva (which
> doubtless is very useful - I know very little about it) versus the response
> when I brought up the prospect of using oamap
> (https://github.com/diana-hep/oamap) on this mailing list.
>
> oamap uses numba to compile *python* functions at run-time and can walk
> complex nested schemas down to their leaf nodes in native python syntax
> (for-loops and attribute/item lookup) but at full machine speed, and without
> materialising any objects along the way. It was written for the ROOT format,
> but has implementations for simple types in parquet and arrow, which each
> handle nested lists and dicts in similar but not identical ways.
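>
> To illustrate the access pattern (this is not oamap's API, just the shape of
> what the compiled code reduces to), walking a columnar list<double> column
> means looping over an offsets buffer and reading the values buffer in place,
> so no per-row objects are ever created:
>
>   #include <cstddef>
>   #include <cstdint>
>
>   // Conceptual sketch only: the names are made up. A list<double> column is
>   // stored as one offsets buffer plus one flat values buffer; offsets[i] and
>   // offsets[i + 1] bound the i-th nested list.
>   double SumLeaves(const int32_t* offsets, const double* values,
>                    size_t num_lists) {
>     double total = 0.0;
>     for (size_t i = 0; i < num_lists; ++i) {
>       for (int32_t j = offsets[i]; j < offsets[i + 1]; ++j) {
>         total += values[j];  // read the leaf values directly
>       }
>     }
>     return total;
>   }
>
> In oamap you write the equivalent as ordinary python for-loops over proxy
> objects and numba compiles it down to roughly this.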
>
> Would someone care to explain the silence over oamap?
>
>> On 25 Jun 2018, at 02:06, Praveen Kumar <prav...@dremio.com> wrote:
>>
>> Hi Everyone,
>>
>> I am Praveen, another engineer working on Gandiva. The interest and speed of
>> engagement around this are great!! Excited to engage with you folks on this.
>>
>> Thx.
>>
>> On 2018/06/22 18:09:42, Julian Hyde <j...@apache.org> wrote:
>>> This is exciting. We have wanted to build an Arrow adapter in Calcite for
>>> some time and have a prototype (see
>>> https://issues.apache.org/jira/browse/CALCITE-2173) but I hope that we
>>> can use Gandiva. I know that Gandiva has Java bindings, but will these
>>> allow queries to be compiled and executed from a pure Java process?
>>>
>>> Can you describe Gandiva’s governance model? Without an open governance
>>> model, companies that compete with Dremio may be wary about contributing.
>>>
>>> Can you compare and contrast your approach to Hyper[1]? Hyper is also
>>> concerned with efficient use of the bus, and also uses LLVM, but it has a
>>> different memory format and places much emphasis on lock-free data
>>> structures.
>>>
>>> I just attended SIGMOD and there were interesting industry papers from
>>> MemSQL[2][3] and Oracle RAPID[4]. I was impressed with some of the tricks
>>> MemSQL uses to achieve SIMD parallelism on queries such as “select k4,
>>> sum(x) from t group by k4” (where k4 has 4 values).
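>>>
>>> Roughly, the trick (my paraphrase of the blog post, not their code) is
>>> that with only a handful of distinct key values you can keep one
>>> accumulator per group and turn the aggregation into branch-free masked
>>> sums, which map directly onto SIMD compares and adds. A scalar sketch:
>>>
>>>   #include <cstddef>
>>>   #include <cstdint>
>>>
>>>   // Sketch of the low-cardinality group-by: one pass per group with a
>>>   // branch-free masked sum. The real kernel fuses the group loop so the
>>>   // data is read once, with one SIMD accumulator per group.
>>>   void GroupBySum(const int8_t* k4, const double* x, size_t n,
>>>                   double acc[4]) {
>>>     for (int g = 0; g < 4; ++g) {
>>>       double sum = 0.0;
>>>       for (size_t i = 0; i < n; ++i) {
>>>         sum += (k4[i] == g) ? x[i] : 0.0;  // compare + select, no branch
>>>       }
>>>       acc[g] = sum;
>>>     }
>>>   }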
>>>
>>> I missed part of the RAPID talk, but I got the impression that they are
>>> using disk-based algorithms (e.g. hybrid hash join) to handle data spread
>>> between fast and slow memory.
>>>
>>> MemSQL uses TPC-H query 1 as a motivating benchmark and I think this would
>>> be a good target for Gandiva also. It is a table scan with a range filter
>>> (returning 98% of rows), a low-cardinality aggregate (grouping by two
>>> fields with 3 values each), and several aggregate functions, the arguments
>>> of which contain common sub-expressions.
>>>
>>>  SELECT
>>>    l_returnflag,
>>>    l_linestatus,
>>>    sum(l_quantity),
>>>    sum(l_extendedprice),
>>>    sum(l_extendedprice * (1 - l_discount)),
>>>    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)),
>>>    avg(l_quantity),
>>>    avg(l_extendedprice),
>>>    avg(l_discount),
>>>    count(*)
>>>  FROM lineitem
>>>  WHERE l_shipdate <= date '1998-12-01' - interval '90' day
>>>  GROUP BY
>>>    l_returnflag,
>>>    l_linestatus
>>>  ORDER BY
>>>    l_returnflag,
>>>    l_linestatus;
>>>
>>> Julian
>>>
>>> [1] http://www.vldb.org/pvldb/vol4/p539-neumann.pdf
>>>
>>> [2] http://blog.memsql.com/how-careful-engineering-lead-to-processing-over-a-trillion-rows-per-second/
>>>
>>> [3] https://dl.acm.org/citation.cfm?id=3183713.3190658
>>>
>>> [4] https://dl.acm.org/citation.cfm?id=3183713.3190655
>>>
>>>> On Jun 22, 2018, at 7:22 AM, ravind...@gmail.com wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> I'm Ravindra and I'm a developer on the Gandiva project. I do believe that
>>>> the combination of Arrow and LLVM for efficient expression evaluation is
>>>> powerful, and has a broad range of use cases. We've just started and hope
>>>> to finesse and add a lot of functionality over the next few months.
>>>>
>>>> We welcome your feedback and participation in Gandiva!!
>>>>
>>>> thanks & regards,
>>>> ravindra.
>>>>
>>>>> On 2018/06/21 19:15:20, Jacques Nadeau <ja...@apache.org> wrote:
>>>>> Hey Guys,
>>>>>
>>>>> Dremio just open sourced a new framework for processing data in Arrow
>>>>> data structures [1], built on top of the Apache Arrow C++ APIs and
>>>>> leveraging LLVM (Apache licensed). It also includes Java APIs that
>>>>> leverage the Apache Arrow Java libraries. I expect the developers who
>>>>> have been working on this will introduce themselves soon. To read more
>>>>> about it, take a look at Ravindra's blog post (he's the lead developer
>>>>> driving this work): [2]. Hopefully people will find this
>>>>> interesting/useful.
>>>>>
>>>>> Let us know what you all think!
>>>>>
>>>>> thanks,
>>>>> Jacques
>>>>>
>>>>>
>>>>> [1] https://github.com/dremio/gandiva
>>>>> [2] https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/
>>>>>
>>>
>
> —
> Martin Durant
> martin.dur...@utoronto.ca
>
>
>
