This is exciting. We have wanted to build an Arrow adapter in Calcite for some
time and have a prototype (see
https://issues.apache.org/jira/browse/CALCITE-2173), but I hope that we can
use Gandiva. I know that Gandiva has Java bindings, but will these allow
queries to be compiled and executed from a pure Java process?
Can you describe Gandiva’s governance model? Without an open governance model,
companies that compete with Dremio may be wary of contributing.
Can you compare and contrast your approach to Hyper[1]? Hyper is also concerned
with efficient use of the bus, and also uses LLVM, but it has a different
memory format and places much emphasis on lock-free data structures.
I just attended SIGMOD and there were interesting industry papers from
MemSQL[2][3] and Oracle RAPID[4]. I was impressed with some of the tricks
MemSQL uses to achieve SIMD parallelism on queries such as “select k4, sum(x)
from t group by k4” (where k4 has 4 values).
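My rough understanding of the trick, for those who weren't there: with only 4
key values the aggregate needs no hash table at all; one accumulator per value
and a branch-free compare-and-masked-add let the compiler emit SIMD
compare/blend/add. A hand-written C++ sketch of the idea (my illustration, not
MemSQL's actual code):

#include <cstddef>
#include <cstdint>

// Sum x grouped by k4, where k4 holds only the values 0..3.
// One accumulator per key value; the masked add has no hash probe and no
// data-dependent branch, so the inner loop vectorizes cleanly.
void sum_x_group_by_k4(const uint8_t* k4, const int64_t* x, size_t n,
                       int64_t sums[4]) {
  for (size_t i = 0; i < n; ++i) {
    for (int64_t v = 0; v < 4; ++v) {
      int64_t mask = (k4[i] == v) ? 1 : 0;
      sums[v] += mask * x[i];
    }
  }
}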
I missed part of the RAPID talk, but I got the impression that they are using
disk-based algorithms (e.g. hybrid hash join) to handle data spread between
fast and slow memory.
MemSQL uses TPC-H query 1 as a motivating benchmark, and I think it would be a
good target for Gandiva also. It is a table scan with a range filter (returning
98% of rows), a low-cardinality aggregate (grouping by two fields with 3 values
each), and several aggregate functions, the arguments of which contain common
sub-expressions.
SELECT
l_returnflag,
l_linestatus,
sum(l_quantity),
sum(l_extendedprice),
sum(l_extendedprice * (1 - l_discount)),
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)),
avg(l_quantity),
avg(l_extendedprice),
avg(l_discount),
count(*)
FROM lineitem
WHERE l_shipdate <= date '1998-12-01' - interval '90' day
GROUP BY
l_returnflag,
l_linestatus
ORDER BY
l_returnflag,
l_linestatus;
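To make the common-sub-expression point concrete, the kind of fused per-batch
kernel I would hope Gandiva can generate for the two discounted-price
aggregates looks roughly like the hand-written C++ sketch below (my
illustration over plain arrays with dictionary-encoded keys, not actual
Gandiva output):

#include <cstddef>

struct Q1Accum {
  double sum_disc_price = 0;  // sum(l_extendedprice * (1 - l_discount))
  double sum_charge = 0;      // sum(... * (1 + l_tax))
};

// price/discount/tax are the decimal columns; returnflag/linestatus are
// dictionary-encoded to 0..2; pass_filter is 1 where the l_shipdate predicate
// holds. The common sub-expression is computed once per row, and the two
// 3-valued keys pack into at most 9 groups, so the accumulators stay in cache.
void q1_kernel(const double* price, const double* discount, const double* tax,
               const unsigned char* returnflag, const unsigned char* linestatus,
               const unsigned char* pass_filter, size_t n, Q1Accum groups[9]) {
  for (size_t i = 0; i < n; ++i) {
    double disc_price = price[i] * (1.0 - discount[i]);  // computed once
    double charge = disc_price * (1.0 + tax[i]);
    size_t g = returnflag[i] * 3 + linestatus[i];
    groups[g].sum_disc_price += pass_filter[i] * disc_price;
    groups[g].sum_charge += pass_filter[i] * charge;
  }
}

Since the filter passes ~98% of rows, a masked add is probably cheaper than
branching on it; the remaining sums, averages and the count would follow the
same pattern.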
Julian
[1] http://www.vldb.org/pvldb/vol4/p539-neumann.pdf
[2] http://blog.memsql.com/how-careful-engineering-lead-to-processing-over-a-trillion-rows-per-second/
[3] https://dl.acm.org/citation.cfm?id=3183713.3190658
[4] https://dl.acm.org/citation.cfm?id=3183713.3190655
> On Jun 22, 2018, at 7:22 AM, [email protected] wrote:
>
> Hi everyone,
>
> I'm Ravindra and I'm a developer on the Gandiva project. I do believe that
> the combination of arrow and llvm for efficient expression evaluation is
> powerful, and has a broad range of use-cases. We've just started and hope to
> finesse and add a lot of functionality over the next few months.
>
> We welcome your feedback and participation in Gandiva!
>
> thanks & regards,
> ravindra.
>
> On 2018/06/21 19:15:20, Jacques Nadeau <[email protected]> wrote:
>> Hey Guys,
>>
>> Dremio just open sourced a new framework for processing data in Arrow data
>> structures [1], built on top of the Apache Arrow C++ APIs and leveraging
>> LLVM (Apache licensed). It also includes Java APIs that leverage the Apache
>> Arrow Java libraries. I expect the developers who have been working on this
>> will introduce themselves soon. To read more about it, take a look at
>> Ravindra's blog post (he's the lead developer driving this work): [2].
>> Hopefully people will find this interesting/useful.
>>
>> Let us know what you all think!
>>
>> thanks,
>> Jacques
>>
>>
>> [1] https://github.com/dremio/gandiva
>> [2] https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/
>>