hi folks,

This e-mail comes in the context of two C++ data processing
subprojects we have discussed in the past:

* Data Frame API
https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit
* In-memory Query Engine
https://docs.google.com/document/d/10RoUZmiMQRi_J1FcPeVAUAMJ6d_ZuiEbaM2Y33sNPu4/edit

One important area for both of these projects is the evaluation of
array expressions that are found inside projections (i.e. "SELECT" --
without any aggregate functions), filters (i.e. "WHERE"), and join
clauses. In summary, these are mostly elementwise functions whose
output length is the same as that of their input arrays (for example,
"a + b" or "x > 5"). For conciseness I'll call these kinds of
expressions "array-exprs".

Gandiva does LLVM code generation for evaluating array-exprs, but I
think it makes sense to have both codegen'd expressions as well as
X100-style [1] interpreted array-expr evaluation. These modes of query
evaluation should ideally share as much common implementation code as
possible (taking into account Gandiva's need to compile to LLVM IR).

I have reviewed the contents of our arrow/compute/kernels directory
and made the following spreadsheet:

https://docs.google.com/spreadsheets/d/1oXVy29GT1apM4e3isNpzk497Re3OqM8APvlBZCDc8B8/edit?usp=sharing

There are some problems with our current collection of kernels in the
context of array-expr evaluation in query processing:

* For efficiency, kernels used for array-expr evaluation should write
into preallocated memory as their default mode. This enables the
interpreter to avoid temporary memory allocations and to improve CPU
cache utilization. Almost none of our kernels are implemented this way
currently (see the sketch after this list).
* The current approach for expr-type kernels, in which each kernel
exposes its own top-level memory-allocating function, is not scalable
for binding developers. I believe instead that kernels should be
selected and invoked generically by using the string name of the
kernel.
On the second point, what I am suggesting is that we do something more like

ASSIGN_OR_RAISE(auto kernel, compute::GetKernel("greater", {type0, type1}));
ArrayData* out = ... ;
RETURN_NOT_OK(kernel->Call({arg0, arg1}, &out));

In particular, when we reason that successive kernel invocations can
reuse memory, we can have code that is doing in essence

k1->Call({arg0, arg1}, &out)
k2->Call({out}, &out)
k3->Call({arg2, out}, &out)

Of course, if you want the kernel to allocate memory, then you can do
something like

ASSIGN_OR_RAISE(auto out, kernel->Call({arg0, arg1}));

And if you want to avoid the "GetKernel" business, you should be able to do

ASSIGN_OR_RAISE(auto result, ExecKernel(name, {arg0, arg1}));

I think this "ExecKernel" function should likely replace the public
APIs for running specific kernels that we currently have.
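
For the sake of discussion, ExecKernel could then be little more than
a thin convenience layer over the registry lookup plus an output
allocation step. GetKernel, DescrFromArgs, and AllocateOutput below
are all hypothetical helpers, not existing functions:

// Sketch only: the allocating convenience wrapper layered on top of
// the preallocating kernel interface.
Result<std::shared_ptr<ArrayData>> ExecKernel(
    const std::string& name,
    const std::vector<std::shared_ptr<ArrayData>>& args) {
  // Resolve the kernel from the argument types (DescrFromArgs is a
  // hypothetical helper that extracts the input types).
  ASSIGN_OR_RAISE(auto kernel, GetKernel(name, DescrFromArgs(args)));
  // Allocate the output up front (AllocateOutput is hypothetical),
  // then run the kernel in its write-into-preallocated-memory mode.
  ASSIGN_OR_RAISE(auto out, AllocateOutput(*kernel, args));
  RETURN_NOT_OK(kernel->Call(args, out.get()));
  return out;
}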

One detail that isn't addressed above is what to do with
kernel-specific configuration options. One way to address that is to
have a common base type for all options so that we can do

struct MyKernelOptions : public KernelOptions {
  ...
};

and then

MyKernelOptions options = ...;
out = ExecKernel(name, args, options);
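
To sketch how that could hang together end to end (every name below is
a placeholder for discussion, and the cast-based approach is just one
option):

// Sketch only: a polymorphic base class for kernel options so that
// kernels can be invoked generically by name.
struct KernelOptions {
  virtual ~KernelOptions() = default;
};

struct MatchOptions : public KernelOptions {
  bool ignore_nulls = false;
};

// The generic entry point takes the options by const reference to the
// base type, so one signature works for every kernel's options.
Result<std::shared_ptr<ArrayData>> ExecKernel(
    const std::string& name,
    const std::vector<std::shared_ptr<ArrayData>>& args,
    const KernelOptions& options);

// Inside the hypothetical "match" kernel, cast back to the concrete
// options type this kernel documents itself as expecting.
Status MatchKernelImpl(const std::vector<std::shared_ptr<ArrayData>>& args,
                       const KernelOptions& options, ArrayData* out) {
  const auto& match_options = static_cast<const MatchOptions&>(options);
  // ... evaluate into *out, honoring match_options.ignore_nulls ...
  return Status::OK();
}

Passing the options by reference to the base type (rather than by
value) avoids slicing and means bindings only need to construct the
right concrete options struct and hand it to one generic entry point.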

Maybe there are some other ideas.

At some point we will need to have an implementation blitz where we go
from our current 20 or so non-codegen'd kernels for array-exprs to
several hundred, so I wanted to discuss these issues now so we can all
get on the same page. I'd like to take a crack at an initial iteration
of the above API proposal with a centralized kernel registry (which
will also need to support UDFs). That would let us begin porting the
existing array-expr kernels to the new API and would give us a clear
path forward for expanding our set of supported functions, ideally
sharing implementation code with Gandiva (particularly its
"precompiled" directory) along the way.

Thanks,
Wes

[1]: http://cidrdb.org/cidr2005/papers/P19.pdf
