On Wed, Apr 22, 2020 at 12:41 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Hi Wes,
> I haven't had time to read the doc, but wanted to ask some questions on
> points raised on the thread.
>
> * For efficiency, kernels used for array-expr evaluation should write
> > into preallocated memory as their default mode. This enables the
> > interpreter to avoid temporary memory allocations and improve CPU
> > cache utilization. Almost none of our kernels are implemented this way
> > currently.
>
> Did something change, I was pretty sure I submitted a patch a while ago for
> boolean kernels, that separated out memory allocation from computation.
> Which should allow for writing to the same memory.  Is this a concern with
> the public Function APIs for the Kernel APIs themselves, or a lower level
> implementation concern?

Yes, you did in the internal implementation [1]. The concern is the
public API and the general approach to implementing new kernels.

I'm working on this right now (it's a large project so it will take me
a little while to produce something to be reviewed) so bear with me =)

[1]: 
https://github.com/apache/arrow/commit/4910fbf4fda05b864daaba820db08291e4afdcb6#diff-561ea05d36150eb15842f452e3f07c76

> * Sorting is generally handled by different data processing nodes from
> > Projections, Aggregations / Hash Aggregations, Filters, and Joins.
> > Projections and Filters use expressions, they do not sort.
>
> Would sorting the list-column elements per row be an array-expr?

Yes, as that's an element-wise function. When I said sorting I was
referring to ORDER BY. The functions we have that do sorting do so in
the context of a single array [2].

A query engine must be able to sort a (potentially very large) stream
of record batches. One approach is for the Sort operator to exhaust
its child input, accumulating all of the record batches in memory
(spilling to disk as needed) and then sorting and emitting record
batches from the sorted records/tuples. See e.g. Impala's sorting code
[3] [4]

[2]: 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sort_to_indices.h#L34
[3]: https://github.com/apache/impala/blob/master/be/src/runtime/sorter.h
[4]: https://github.com/apache/impala/blob/master/be/src/exec/sort-node.h

>
> On Tue, Apr 21, 2020 at 5:35 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > On Tue, Apr 21, 2020 at 7:32 AM Antoine Pitrou <anto...@python.org> wrote:
> > >
> > >
> > > Le 21/04/2020 à 13:53, Wes McKinney a écrit :
> > > >>
> > > >> That said, in the SortToIndices case, this wouldn't be a problem,
> > since
> > > >> only the second pass writes to the output.
> > > >
> > > > This kernel is not valid for normal array-exprs (see the spreadsheet I
> > > > linked), such as what you can write in SQL
> > > >
> > > > Kernels like SortToIndices are a different type of function (in other
> > > > words, "not a SQL function") and so if we choose to allow such a
> > > > "non-SQL-like" functions in the expression evaluator then different
> > > > logic must be used.
> > >
> > > Hmm, I think that maybe I'm misunderstanding at which level we're
> > > talking here.  SortToIndices() may not be a "SQL function", but it looks
> > > like an important basic block for a query engine (since, after all,
> > > sorting results is an often used feature in SQL and other languages).
> > > So it should be usable *inside* the expression engine, even though it's
> > > not part of the exposed vocabulary, no?
> >
> > No, not as part of "expressions" as they are defined in the context of
> > SQL engines.
> >
> > Sorting is generally handled by different data processing nodes from
> > Projections, Aggregations / Hash Aggregations, Filters, and Joins.
> > Projections and Filters use expressions, they do not sort.
> >
> > > Regards
> > >
> > > Antoine.
> >

Reply via email to