It seems there are two different concerns here: - the kernels' public API - the ease with which kernels can be implemented.
If we need a different public API, then IMO we should change it sooner rather than later. As for the implementation, perhaps we should start by drawing the main categories of kernels. For example: - item-wise kernels (and/or vector-wise) - aggregate kernels Some kernels will not fall in these categories (e.g. filter, join...), but many of them do. Regards Antoine. On Wed, 8 Apr 2020 16:04:30 -0500 Wes McKinney <wesmck...@gmail.com> wrote: > On Wed, Apr 8, 2020 at 4:01 PM Wes McKinney <wesmck...@gmail.com> wrote: > > > > Another idea would be to have a variant with const-references instead > > of shared_ptr. One potential issue with our Datum is that it plays the > > dual role of transporting both input and output arguments. With > > outputs it's necessary to be able to convey ownership while with > > inputs this is less important. > > > > In general, I would say that some additional thought is needed about > > our kernel APIs. Some scattered thoughts about this: > > > > * Kernel "binding" (selecting a kernel given input types) is a bit ad > > hoc. Many kernels do not have the ability to tell you up front what > > type of data will be returned (this is very important in the context > > of a query engine runtime) > > Another observation is that many of our kernels are unable to write > into preallocated memory. This is also a design issue that must be > addressed (kernels having the same output type invoked in succession > should be able to elide temporary allocations by writing into the same > output memory) > > > * It's unclear that having a large collection of > > $FUNC(FunctionContext*, Datum $arg0, ..., Datum* out) is sustainable. > > > > Rather, it might be better to have something like: > > > > // We know the "schema" of each argument but don't have any data yet > > std::string kernel_name = "$name"; > > auto bound_kernel = GetKernel(kernel_name, {arg0_shape, arg1_shape}, > > options); > > auto out = bound_kernel->Call({arg0, arg1}); > > > > So here it would be > > > > ARROW_ASSIGN_OR_RAISE(auto take_kernel, > > GetKernel("take", {shapes::chunked_array(int8()), > > shapes::chunked_array(int8())})); > > > > So now take_kernel->shape() should tell you that the expected output > > is shapes::chunked_array(int8()) > > > > And Call should preferably accept reference arguments for its input > > parameters. > > > > As far as kernel-specific options, we could create a variant that > > includes the different kernel-specific options structures, so that > > it's not necessary to have different dispatch functions for different > > kernels. > > > > Anyway, just some out loud thoughts, but while we are in this > > intermediate experimental state I'm supportive of you making the > > decision that is most convenient for the immediate problem you want to > > solve with the caveat that things may be refactored significantly in > > the proximate future. > > > > - Wes > > > > On Tue, Apr 7, 2020 at 10:42 AM Uwe L. Korn <uw...@xhochy.com> wrote: > > > > > > I did a bit more research on JIRA and we seem to have this open topic > > > there also in https://issues.apache.org/jira/browse/ARROW-6959 which is > > > the similar topic as my mail is about and in > > > https://issues.apache.org/jira/browse/ARROW-7009 we wanted to remove some > > > of the interfaces with reference-types. > > > > > > On Tue, Apr 7, 2020, at 1:00 PM, Uwe L. Korn wrote: > > > > Hello all, > > > > > > > > I'm in the progress of changing the implementation of the Take kernel > > > > to work on ChunkedArrays without concatenating them into a single Array > > > > first. While working on the implementation, I realised that we switch > > > > often between Datum and the specific-typed parameters. This works quite > > > > well for the combination of Array& and Datum(shared_ptr<ArrayData>) as > > > > here the reference object with type Array& always carries a shared > > > > reference with it, so switching between Array& and its Datum is quite > > > > easy. > > > > > > > > In contrast, we cannot do this with ChunkedArrays as here the Datum > > > > requires a shared_ptr<ChunkedArray> which cannot be constructed from > > > > the reference type. Thus to allow interfaces like `Status > > > > Take(FunctionContext* ctx, const ChunkedArray& values, const Array& > > > > indices,` to pass successfully their arguments to the Kernel > > > > implementation, we have to do: > > > > > > > > a) Remove the references from the interface of the Take() function and > > > > use `shared_ptr` instances everywhere. > > > > b) Add interfaces to kernels like the TakeKernel that allow calling > > > > with specific references instead of Datum instances > > > > > > > > Personally I would prefer b) as this allow us to make more use of the > > > > C++ type system and would also avoid the shared_ptr overhead where not > > > > necessary. > > > > > > > > Cheers, > > > > Uwe > > > > >