Hi Wes,

I think reducing temporary memory allocation is a worthwhile effort and should show a clear benefit in compute-intensive scenarios. As we mainly work with the Rust and DataFusion parts of the Arrow project, I was wondering how we could best align the concepts and implementations at that level. Do the compute kernels have to be implemented separately for every language, as is the case for the format (which is my current understanding), or do you see options for reuse and a common, shared implementation? Happy to get your thoughts on that.
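To check that I am reading the proposal correctly, here is a minimal, self-contained C++ sketch of the pattern as I understand it: kernels are looked up by string name and write into a buffer the caller has already allocated, so a chain of elementwise calls needs only one allocation. Every name in it (Column, Kernel, KernelRegistry, MakeDefaultRegistry, EvaluateExample) is a hypothetical stand-in made up for illustration, not the Arrow API, and a plain std::vector<double> stands in for an Arrow array.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an Arrow array: a flat buffer of doubles.
using Column = std::vector<double>;

// A kernel reads its inputs and writes into a caller-provided output buffer.
// It never allocates; the output must already have the right length.
using Kernel = std::function<void(const std::vector<const Column*>&, Column*)>;

// Hypothetical registry that maps a kernel name to its implementation.
class KernelRegistry {
 public:
  void Register(const std::string& name, Kernel kernel) {
    kernels_[name] = std::move(kernel);
  }

  const Kernel& Get(const std::string& name) const {
    auto it = kernels_.find(name);
    if (it == kernels_.end()) {
      throw std::invalid_argument("unknown kernel: " + name);
    }
    return it->second;
  }

 private:
  std::map<std::string, Kernel> kernels_;
};

// Registers three elementwise kernels. Each one reads element i before
// writing element i, so the output may alias one of the inputs.
KernelRegistry MakeDefaultRegistry() {
  KernelRegistry registry;
  registry.Register("add", [](const std::vector<const Column*>& args, Column* out) {
    for (std::size_t i = 0; i < out->size(); ++i) {
      (*out)[i] = (*args[0])[i] + (*args[1])[i];
    }
  });
  registry.Register("negate", [](const std::vector<const Column*>& args, Column* out) {
    for (std::size_t i = 0; i < out->size(); ++i) {
      (*out)[i] = -(*args[0])[i];
    }
  });
  registry.Register("multiply", [](const std::vector<const Column*>& args, Column* out) {
    for (std::size_t i = 0; i < out->size(); ++i) {
      (*out)[i] = (*args[0])[i] * (*args[1])[i];
    }
  });
  return registry;
}

// Computes c * -(a + b) with a single output allocation that all three
// kernel calls reuse, mirroring the k1/k2/k3 chain in the quoted mail.
// Assumes a, b and c have the same length.
Column EvaluateExample(const Column& a, const Column& b, const Column& c) {
  KernelRegistry registry = MakeDefaultRegistry();
  Column out(a.size());                        // the only allocation
  registry.Get("add")({&a, &b}, &out);         // out = a + b
  registry.Get("negate")({&out}, &out);        // out = -out
  registry.Get("multiply")({&c, &out}, &out);  // out = c * out
  return out;
}
```

Calling EvaluateExample on three columns of equal length evaluates the whole expression without any intermediate arrays. If the registry and the call convention were specified at this level, I could imagine the Rust side mirroring the same shape even if the kernel bodies are not shared, but that is only my assumption.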
Thanks,
Sven

On Sun, Apr 19, 2020 at 2:14 AM Wes McKinney <wesmck...@gmail.com> wrote:

> I started a brain dump of some issues that come to mind around kernel
> implementation and array expression evaluation. I'll try to fill this
> out, and it would be helpful to add supporting citations to other
> projects about what kinds of issues come up and what implementation
> strategies may be helpful. If anyone would like edit access please let
> me know, otherwise feel free to add comments
>
> https://docs.google.com/document/d/1LFk3WRfWGQbJ9uitWwucjiJsZMqLh8lC1vAUOscLtj8/edit?usp=sharing
>
> On Sat, Apr 18, 2020 at 4:41 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > hi folks,
> >
> > This e-mail comes in the context of two C++ data processing
> > subprojects we have discussed in the past
> >
> > * Data Frame API
> > https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit
> > * In-memory Query Engine
> > https://docs.google.com/document/d/10RoUZmiMQRi_J1FcPeVAUAMJ6d_ZuiEbaM2Y33sNPu4/edit
> >
> > One important area for both of these projects is the evaluation of
> > array expressions that are found inside projections (i.e. "SELECT" --
> > without any aggregate functions), filters (i.e. "WHERE"), and join
> > clauses. In summary, these are mostly elementwise functions that do
> > not alter the size of their input arrays. For conciseness I'll call
> > these kinds of expressions "array-exprs"
> >
> > Gandiva does LLVM code generation for evaluating array-exprs, but I
> > think it makes sense to have both codegen'd expressions as well as
> > X100-style [1] interpreted array-expr evaluation.
> > These modes of query
> > evaluation ideally should share as much common implementation code as
> > possible (taking into account Gandiva's need to compile to LLVM IR)
> >
> > I have reviewed the contents of our arrow/compute/kernels directory
> > and made the following spreadsheet
> >
> > https://docs.google.com/spreadsheets/d/1oXVy29GT1apM4e3isNpzk497Re3OqM8APvlBZCDc8B8/edit?usp=sharing
> >
> > There are some problems with our current collection of kernels in the
> > context of array-expr evaluation in query processing:
> >
> > * For efficiency, kernels used for array-expr evaluation should write
> > into preallocated memory as their default mode. This enables the
> > interpreter to avoid temporary memory allocations and improve CPU
> > cache utilization. Almost none of our kernels are implemented this way
> > currently.
> > * The current approach for expr-type kernels of having a top-level
> > memory-allocating function is not scalable for binding developers. I
> > believe instead that kernels should be selected and invoked
> > generically by using the string name of the kernel
> >
> > On this last point, what I am suggesting is that we do something more like
> >
> > ASSIGN_OR_RAISE(auto kernel, compute::GetKernel("greater", {type0, type1}));
> > ArrayData* out = ... ;
> > RETURN_NOT_OK(kernel->Call({arg0, arg1}, &out));
> >
> > In particular, when we reason that successive kernel invocations can
> > reuse memory, we can have code that is doing in essence
> >
> > k1->Call({arg0, arg1}, &out)
> > k2->Call({out}, &out))
> > k3->Call({arg2, out}, &out)
> >
> > Of course, if you want the kernel to allocate memory then you can do
> > something like
> >
> > ASSIGN_OR_RAISE(out = kernel->Call({arg0, arg1}));
> >
> > And if you want to avoid the "GetKernel" business you should be able to do
> >
> > ASSIGN_OR_RAISE(auto result, ExecKernel(name, {arg0, arg1}));
> >
> > I think this "ExecKernel" function should likely replace the public
> > APIs for running specific kernels that we currently have.
> >
> > One detail that isn't addressed above is what to do with
> > kernel-specific configuration options. One way to address that is to
> > have a common base type for all options so that we can do
> >
> > struct MyKernelOptions : public KernelOptions {
> >   ...
> > };
> >
> > and then
> >
> > MyKernelOptions options = ...;
> > out = ExecKernel(name, args, options);
> >
> > Maybe there are some other ideas.
> >
> > At some point we will need to have an implementation blitz where we go
> > from our current 20 or so non-codegen'd kernels for array-exprs to
> > several hundred, so I wanted to discuss these issues so we can all get
> > on the same page.
> > I'd like to take a crack at an initial iteration of
> > the above API proposal with a centralized kernel registry (that will
> > also need to support having UDFs) so we can begin porting the existing
> > array-expr kernels to use the new API and then have a clear path
> > forward for expanding our set of supported functions (which will also
> > ideally involve sharing implementation code with Gandiva, particularly
> > its "precompiled" directory)
> >
> > Thanks,
> > Wes
> >
> > [1]: http://cidrdb.org/cidr2005/papers/P19.pdf
>

--
Sven Wagner-Boysen | Lead Software Engineer
www.signavio.com