Re: [C++] Compute: Datum and "ChunkedArray&" inputs

Antoine Pitrou Thu, 09 Apr 2020 03:25:22 -0700


It seems there are two different concerns here:
- the kernels' public API
- the ease with which kernels can be implemented.


If we need a different public API, then IMO we should change it sooner
rather than later.

As for the implementation, perhaps we should start by drawing
the main categories of kernels.  For example:
- item-wise kernels (and/or vector-wise)
- aggregate kernels

Some kernels will not fall in these categories (e.g. filter, join...),
but many of them do.

Regards

Antoine.


On Wed, 8 Apr 2020 16:04:30 -0500
Wes McKinney <wesmck...@gmail.com> wrote:
> On Wed, Apr 8, 2020 at 4:01 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > Another idea would be to have a variant with const-references instead
> > of shared_ptr. One potential issue with our Datum is that it plays the
> > dual role of transporting both input and output arguments. With
> > outputs it's necessary to be able to convey ownership while with
> > inputs this is less important.
> >
> > In general, I would say that some additional thought is needed about
> > our kernel APIs. Some scattered thoughts about this:
> >
> > * Kernel "binding" (selecting a kernel given input types) is a bit ad
> > hoc. Many kernels do not have the ability to tell you up front what
> > type of data will be returned (this is very important in the context
> > of a query engine runtime)  
> 
> Another observation is that many of our kernels are unable to write
> into preallocated memory. This is also a design issue that must be
> addressed (kernels having the same output type invoked in succession
> should be able to elide temporary allocations by writing into the same
> output memory)
> 
> > * It's unclear that having a large collection of
> > $FUNC(FunctionContext*, Datum $arg0, ..., Datum* out) is sustainable.
> >
> > Rather, it might be better to have something like:
> >
> > // We know the "schema" of each argument but don't have any data yet
> > std::string kernel_name = "$name";
> > auto bound_kernel = GetKernel(kernel_name, {arg0_shape, arg1_shape}, 
> > options);
> > auto out = bound_kernel->Call({arg0, arg1});
> >
> > So here it would be
> >
> > ARROW_ASSIGN_OR_RAISE(auto take_kernel,
> >                       GetKernel("take", {shapes::chunked_array(int8()),
> >                                          shapes::chunked_array(int8())}));
> >
> > So now take_kernel->shape() should tell you that the expected output
> > is shapes::chunked_array(int8())
> >
> > And Call should preferably accept reference arguments for its input 
> > parameters.
> >
> > As far as kernel-specific options, we could create a variant that
> > includes the different kernel-specific options structures, so that
> > it's not necessary to have different dispatch functions for different
> > kernels.
> >
> > Anyway, just some out loud thoughts, but while we are in this
> > intermediate experimental state I'm supportive of you making the
> > decision that is most convenient for the immediate problem you want to
> > solve with the caveat that things may be refactored significantly in
> > the proximate future.
> >
> > - Wes
> >
> > On Tue, Apr 7, 2020 at 10:42 AM Uwe L. Korn <uw...@xhochy.com> wrote:  
> > >
> > > I did a bit more research on JIRA and we seem to have this open topic 
> > > there also in https://issues.apache.org/jira/browse/ARROW-6959 which is 
> > > the similar topic as my mail is about and in 
> > > https://issues.apache.org/jira/browse/ARROW-7009 we wanted to remove some 
> > > of the interfaces with reference-types.
> > >
> > > On Tue, Apr 7, 2020, at 1:00 PM, Uwe L. Korn wrote:  
> > > > Hello all,
> > > >
> > > > I'm in the progress of changing the implementation of the Take kernel
> > > > to work on ChunkedArrays without concatenating them into a single Array
> > > > first. While working on the implementation, I realised that we switch
> > > > often between Datum and the specific-typed parameters. This works quite
> > > > well for the combination of Array& and Datum(shared_ptr<ArrayData>) as
> > > > here the reference object with type Array& always carries a shared
> > > > reference with it, so switching between Array& and its Datum is quite
> > > > easy.
> > > >
> > > > In contrast, we cannot do this with ChunkedArrays as here the Datum
> > > > requires a shared_ptr<ChunkedArray> which cannot be constructed from
> > > > the reference type. Thus to allow interfaces like `Status
> > > > Take(FunctionContext* ctx, const ChunkedArray& values, const Array&
> > > > indices,` to pass successfully their arguments to the Kernel
> > > > implementation, we have to do:
> > > >
> > > > a) Remove the references from the interface of the Take() function and
> > > > use `shared_ptr` instances everywhere.
> > > > b) Add interfaces to kernels like the TakeKernel that allow calling
> > > > with specific references instead of Datum instances
> > > >
> > > > Personally I would prefer b) as this allow us to make more use of the
> > > > C++ type system and would also avoid the shared_ptr overhead where not
> > > > necessary.
> > > >
> > > > Cheers,
> > > > Uwe
> > > >  
>

Re: [C++] Compute: Datum and "ChunkedArray&" inputs

Reply via email to