Hi Wes,
Will we still be able to comment on the PR once it is closed?

If we want to be inclusive on feedback it might pay to wait until Tuesday
evening US time to merge since it is a long weekend here.

Thanks,
Micah

On Saturday, May 23, 2020, Wes McKinney <wesmck...@gmail.com> wrote:

> Hi folks -- I've addressed a good deal of feedback and added a lot of
> comments and with Kou's help have got the build passing, It would be
> great if this could be merged soon to unblock follow up PRs
>
> On Wed, May 20, 2020 at 11:55 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > I just opened the PR https://github.com/apache/arrow/pull/7240
> >
> > I'm sorry it's so big. I really think this is the best way. The only
> > further work I plan to do on it is to get the CI passing.
> >
> > On Wed, May 20, 2020 at 12:26 PM Wes McKinney <wesmck...@gmail.com>
> wrote:
> > >
> > > I'd guess I'm < 24 hours away from putting up my initial PR for this
> > > work. Since the work is large and (for all practical purposes) nearly
> > > impossible to separate into separately merge-ready PRs, I'll start a
> > > new e-mail thread describing what I've done in more detail and
> > > proposing a path for merging the PR (so as to unblock PRs into
> > > arrow/compute and avoid rebasing headaches) and addressing review
> > > feedback that will likely take several weeks to fully accumulate.
> > >
> > > On Mon, May 11, 2020 at 5:17 PM Wes McKinney <wesmck...@gmail.com>
> wrote:
> > > >
> > > > I'm working actively on this but perhaps as expected it has ballooned
> > > > into a very large project -- it's unclear at the moment whether I'll
> > > > be able to break the work into smaller patches that are easier to
> > > > digest. I'm working as fast as I can to have an initial
> > > > feature-preserving PR up, but the changes to the src/arrow/compute
> > > > directory are extensive, so I would please ask that we do not merge
> > > > non-essential patches into cpp/src/arrow/compute until I get the PR
> up
> > > > (estimated later this week at present rate) so we can see where
> things
> > > > stand.
> > > >
> > > > On Wed, Apr 22, 2020 at 8:06 AM Wes McKinney <wesmck...@gmail.com>
> wrote:
> > > > >
> > > > > On Wed, Apr 22, 2020 at 12:41 AM Micah Kornfield <
> emkornfi...@gmail.com> wrote:
> > > > > >
> > > > > > Hi Wes,
> > > > > > I haven't had time to read the doc, but wanted to ask some
> questions on
> > > > > > points raised on the thread.
> > > > > >
> > > > > > * For efficiency, kernels used for array-expr evaluation should
> write
> > > > > > > into preallocated memory as their default mode. This enables
> the
> > > > > > > interpreter to avoid temporary memory allocations and improve
> CPU
> > > > > > > cache utilization. Almost none of our kernels are implemented
> this way
> > > > > > > currently.
> > > > > >
> > > > > > Did something change, I was pretty sure I submitted a patch a
> while ago for
> > > > > > boolean kernels, that separated out memory allocation from
> computation.
> > > > > > Which should allow for writing to the same memory.  Is this a
> concern with
> > > > > > the public Function APIs for the Kernel APIs themselves, or a
> lower level
> > > > > > implementation concern?
> > > > >
> > > > > Yes, you did in the internal implementation [1]. The concern is the
> > > > > public API and the general approach to implementing new kernels.
> > > > >
> > > > > I'm working on this right now (it's a large project so it will
> take me
> > > > > a little while to produce something to be reviewed) so bear with
> me =)
> > > > >
> > > > > [1]: https://github.com/apache/arrow/commit/
> 4910fbf4fda05b864daaba820db08291e4afdcb6#diff-
> 561ea05d36150eb15842f452e3f07c76
> > > > >
> > > > > > * Sorting is generally handled by different data processing
> nodes from
> > > > > > > Projections, Aggregations / Hash Aggregations, Filters, and
> Joins.
> > > > > > > Projections and Filters use expressions, they do not sort.
> > > > > >
> > > > > > Would sorting the list-column elements per row be an array-expr?
> > > > >
> > > > > Yes, as that's an element-wise function. When I said sorting I was
> > > > > referring to ORDER BY. The functions we have that do sorting do so
> in
> > > > > the context of a single array [2].
> > > > >
> > > > > A query engine must be able to sort a (potentially very large)
> stream
> > > > > of record batches. One approach is for the Sort operator to exhaust
> > > > > its child input, accumulating all of the record batches in memory
> > > > > (spilling to disk as needed) and then sorting and emitting record
> > > > > batches from the sorted records/tuples. See e.g. Impala's sorting
> code
> > > > > [3] [4]
> > > > >
> > > > > [2]: https://github.com/apache/arrow/blob/master/cpp/src/
> arrow/compute/kernels/sort_to_indices.h#L34
> > > > > [3]: https://github.com/apache/impala/blob/master/be/src/
> runtime/sorter.h
> > > > > [4]: https://github.com/apache/impala/blob/master/be/src/
> exec/sort-node.h
> > > > >
> > > > > >
> > > > > > On Tue, Apr 21, 2020 at 5:35 AM Wes McKinney <
> wesmck...@gmail.com> wrote:
> > > > > >
> > > > > > > On Tue, Apr 21, 2020 at 7:32 AM Antoine Pitrou <
> anto...@python.org> wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > Le 21/04/2020 à 13:53, Wes McKinney a écrit :
> > > > > > > > >>
> > > > > > > > >> That said, in the SortToIndices case, this wouldn't be a
> problem,
> > > > > > > since
> > > > > > > > >> only the second pass writes to the output.
> > > > > > > > >
> > > > > > > > > This kernel is not valid for normal array-exprs (see the
> spreadsheet I
> > > > > > > > > linked), such as what you can write in SQL
> > > > > > > > >
> > > > > > > > > Kernels like SortToIndices are a different type of
> function (in other
> > > > > > > > > words, "not a SQL function") and so if we choose to allow
> such a
> > > > > > > > > "non-SQL-like" functions in the expression evaluator then
> different
> > > > > > > > > logic must be used.
> > > > > > > >
> > > > > > > > Hmm, I think that maybe I'm misunderstanding at which level
> we're
> > > > > > > > talking here.  SortToIndices() may not be a "SQL function",
> but it looks
> > > > > > > > like an important basic block for a query engine (since,
> after all,
> > > > > > > > sorting results is an often used feature in SQL and other
> languages).
> > > > > > > > So it should be usable *inside* the expression engine, even
> though it's
> > > > > > > > not part of the exposed vocabulary, no?
> > > > > > >
> > > > > > > No, not as part of "expressions" as they are defined in the
> context of
> > > > > > > SQL engines.
> > > > > > >
> > > > > > > Sorting is generally handled by different data processing
> nodes from
> > > > > > > Projections, Aggregations / Hash Aggregations, Filters, and
> Joins.
> > > > > > > Projections and Filters use expressions, they do not sort.
> > > > > > >
> > > > > > > > Regards
> > > > > > > >
> > > > > > > > Antoine.
> > > > > > >
>

Reply via email to