Late to the party.

Thanks, Weston, for sharing your thoughts on Acero. We are a pretty heavy
Acero user right now and are trying to take part in Acero maintenance and
development. Internally, we use Acero for a time series streaming data
processing system.

I would +1 many of Weston's directions here, in particular making Acero
extensible / customizable. IMO Acero might not be the fastest "Arrow
SQL/TPC-H" engine, but the ability to customize it for ordered time series
is a huge/killer feature.
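
To make the "ordered time series" point concrete, here is a minimal
pure-Python sketch of an as-of join, the kind of ordered operation mentioned
in this thread. It is only an illustration, not Acero's asof join node or its
API: the `asof_join` name, the `(time, value)` row layout, and the
`tolerance` parameter are all assumptions made for the example.

```python
from bisect import bisect_right


def asof_join(left, right, tolerance):
    """Attach to each left row the most recent right value at or before it.

    Both inputs are lists of (time, value) tuples sorted by time
    (an assumption of this sketch). A right row matches only if it is
    no more than `tolerance` time units older than the left row.
    """
    right_times = [t for t, _ in right]
    joined = []
    for t, left_value in left:
        # Index of the last right row whose time is <= t, or -1 if none.
        i = bisect_right(right_times, t) - 1
        if i >= 0 and t - right_times[i] <= tolerance:
            joined.append((t, left_value, right[i][1]))
        else:
            joined.append((t, left_value, None))
    return joined


# Hypothetical sample data: trades matched against the latest quote.
trades = [(1, "t1"), (5, "t2"), (10, "t3")]
quotes = [(0, 100.0), (4, 101.0), (6, 102.0)]
joined = asof_join(trades, quotes, tolerance=2)
```

The last trade gets no match because the newest quote is older than the
tolerance allows, which is the behavior that makes this join useful for
irregularly sampled series.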

In addition to what Weston has already said, my other two cents is that I
think Acero would benefit from being separated from the Arrow core C++
library, similar to how Arrow Flight is. The main reason is that Arrow core,
being such a widely used library, benefits more from stability, while Acero,
being a relatively new and standalone component, benefits more from moving
fast and experimenting quickly. My colleague and I are working on
https://github.com/apache/arrow/issues/15280 to make this happen.

On Fri, Mar 10, 2023 at 5:59 AM Andrew Lamb <al...@influxdata.com> wrote:

> I don't know much about the Acero user base, but gathering some significant
> term users (e.g. Ballista, Urban Logiq, GreptimeDB, InfluxDB IOx, etc) has
> been very helpful for DataFusion. Not only do such users bring some amount
> of maintenance capacity, but perhaps more relevantly to your discussion
> they bring a focus to the project with their use cases.
>
> With so many possible tradeoffs (e.g. streaming vs larger batch execution
> as you mention above) having people to help focus the choice of project I
> think has served DataFusion well.
>
> If Acero has such users (or potential users) perhaps reaching out to them /
> soliciting their ideas of where they want to see the project go would be a
> valuable focusing exercise.
>
> Andrew
>
> On Thu, Mar 9, 2023 at 6:35 PM Aldrin <akmon...@ucsc.edu.invalid> wrote:
>
> > Thanks for sharing! There are a variety of things that I didn't know about
> > (such as ExecBatchBuilder) and it's interesting to hear about the
> > performance challenges.
> >
> > How much would future substrait work involve integration with Acero? I'm
> > curious how much more support of substrait is seen as valuable (should be
> > prioritized) or if additional support is going to be "as-needed". Note
> > that I have a minimal understanding of how "large" substrait is and what
> > proportion of it is already supported by Acero.
> >
> > Aldrin Montana
> > Computer Science PhD Student
> > UC Santa Cruz
> >
> >
> > On Thu, Mar 9, 2023 at 12:33 PM Antoine Pitrou <anto...@python.org> wrote:
> >
> > >
> > > Just a reminder for those following other implementations of Arrow, that
> > > Acero is the compute/execution engine subsystem baked into Arrow C++.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > Le 09/03/2023 à 21:20, Weston Pace a écrit :
> > > > We are getting closer to another release.  I am thinking about what to
> > > > work on in the next release.  I think it is a good time to have a
> > > > discussion about Acero in general.  This is possibly also of interest
> > > > to those working on pyarrow or r-arrow as these libraries rely on Acero
> > > > for various functionality.  Apache projects have no single owner and
> > > > what follows is only my own personal opinion and plans.  Still, I will
> > > > apologize in advance for any lingering hubris or outrageous
> > > > declarations of fact :)
> > > >
> > > > First, some background.  Since we started the project the landscape
> > > > has changed.  Most importantly, there are now more arrow-native
> > > > execution engines.  For example, datafusion, duckdb, velox, and I'm
> > > > sure there are probably more.  Substrait has also been created,
> > > > allowing users to hopefully switch between different execution engines
> > > > as different needs arise.  Some significant contributors to Acero have
> > > > taken a break or moved onto other projects and new contributors have
> > > > arrived with new interests and goals (For example, an asof join node
> > > > and more focus on ordered / streaming execution).
> > > >
> > > > I do not personally have the resources for bringing Acero's
> > > > performance to match that of some of the other execution engines.  I'm
> > > > also not aware of any significant contributors attempting to do so.  I
> > > > also think that having yet another engine racing to the top of the
> > > > TPC-H benchmarks is not the best thing we can be doing for our users.
> > > > To be clear, our performance is not "bad" but it is not "state of the
> > > > art".
> > > >
> > > > ## Some significant performance challenges for Acero:
> > > >
> > > >   1. Ideally an execution engine that wants to win TPC-H should operate
> > > > on L2 sized batches.  To risk stating the obvious: that is not very
> > > > large.  Typically less than 100k rows.  At that size of operation the
> > > > philosophy of "we are only doing this per-batch so we don't have to be
> > > > worried about performance" falls apart.  Significant pieces of Acero
> > > > are not built to operate effectively at this small of a batch size.
> > > > This is probably most evident in our expression evaluation and in
> > > > queries that have complex expressions invoking many functions.
> > > > >
> > > >   2. Our expression evaluation is missing a fair number of
> > > > optimizations.  The ability to use temporary vectors instead of
> > > > allocating new vectors between function calls.  Usage of selection
> > > > vectors to avoid materializing filter results.  General avoidance of
> > > > allocation and preference for thread local data.
> > > > >
> > > >   3. Writing a library of compute functions that is compact, able to run
> > > > in any architecture, and able to take full advantage of the underlying
> > > > hardware is an extremely difficult challenge and there are likely
> > > > things that could be improved in our kernel functions.
> > > > >
> > > >   4. Acero does no query optimization.  Hopefully Substrait optimizers
> > > > will emerge to fill this gap.  In the meantime, this remains a
> > > > significant gap when comparing Acero to most other execution engines.
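
As a rough illustration of the selection vector idea in item 2 above: the
filter returns row indices instead of building filtered copies of each
column, and later steps read only the selected rows. This is a pure-Python
sketch, not Acero code; the function names and sample data are hypothetical.

```python
def filter_to_selection(values, predicate):
    """Return the indices of rows that pass the predicate (the selection vector)."""
    return [i for i, v in enumerate(values) if predicate(v)]


def multiply_selected(a, b, selection):
    """Compute a[i] * b[i] only for the selected row indices."""
    return [a[i] * b[i] for i in selection]


# Hypothetical columns of one small batch.
price = [5, 20, 7, 30]
quantity = [1, 2, 3, 4]

# Filter on one column; no filtered copy of price or quantity is created.
selection = filter_to_selection(price, lambda p: p > 6)
revenue = multiply_selected(price, quantity, selection)
```

The saving comes from carrying one small index list through the expression
instead of allocating a filtered copy of every column it touches.
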
> > > >
> > > > I am not (personally) planning on addressing any of the above issues
> > > > (no time and little interest).  Furthermore, other execution engines
> > > > either already handle these things or they are investing significant
> > > > funds to make sure they can.  In fact, I would be in favor of
> > > > explicitly abandoning the morsel-batch model and focusing on larger
> > > > batch sizes in the spirit of simplicity.
> > > >
> > > > This does not mean that I want to abandon Acero.  Acero is valuable for
> > > > a number of users who don't need that last 20% of performance and would
> > > > rather not introduce a new library.  Acero has been a valuable building
> > > > block for those that are exploring unique execution models or whose
> > > > workloads don't cleanly fit into an SQL query.  Acero has been used
> > > > effectively for academic research.  Acero has been valuable for me
> > > > personally as a sort of "reference implementation" for a Substrait
> > > > consumer as well as being a reference engine for connectivity and
> > > > decentralization in general.
> > > >
> > > > ## My roadmap
> > > >
> > > > Over the next year I plan on transitioning more time into Substrait
> > > > work.  But this is less because I am abandoning Acero and more because
> > > > I would like to start wrapping Acero up.  In my mind, Acero as an
> > > > "extensible streaming execution engine" is very nearly "complete" (as
> > > > much as anything is ever complete).
> > > >
> > > > 1. One significant remaining challenge is getting some better tools in
> > > > place for reducing runtime memory usage.  This mostly equates to being
> > > > smarter about scanning (in particular how we scan large row groups) and
> > > > adding support for spilling to pipeline breakers (there is a promising
> > > > PR for this that I have not yet been able to get around to).  I would
> > > > like to find time to address these things over the next year.
> > > > >
> > > > 2. I would like Acero to be better documented and more extensible.  It
> > > > should be relatively simple (and hopefully as foolproof as possible)
> > > > for users to create their own extension nodes.  Perhaps we could even
> > > > support python extension nodes.  There has been some promising work
> > > > around Substrait extension nodes which I think could be generalized to
> > > > allow extension node developers to use Substrait without having to
> > > > create .proto files.
> > > > >
> > > > 3. Finally, pyarrow is a toolbox.  I would like to see some of the
> > > > internal compute utilities exposed as their own tools (and pyarrow
> > > > bindings added).  Significantly (though I don't think I'll get to all
> > > > of these):
> > > >
> > > >   * The ExecBatchBuilder is a useful accumulation tool.  It could be
> > > > used, for example, to load datasets into pandas that are almost as big
> > > > as RAM (today you typically need at least 2x memory to convert to
> > > > pandas).
> > > >   * The GroupingSegmenter could be used to support workflows like
> > > > "Group by X and then give me a pandas dataframe for each group".
> > > >   * Whatever utilities we develop for spilling could be useful as
> > > > temporary caches.
> > > >   * There is an entire row based encoding and hash table built in there
> > > > somewhere.
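
For readers unfamiliar with the "group by X and hand me each group" workflow
in the list above, one way it could look is sketched below in plain Python.
This is not the GroupingSegmenter API. It assumes the rows already arrive
ordered by the key, and `segment_by_key` and the dict-based rows are
hypothetical names chosen for the example.

```python
from itertools import groupby
from operator import itemgetter


def segment_by_key(rows, key):
    """Yield (key_value, rows_in_segment) for each run of consecutive rows
    that share the same key value.

    Only one segment is held in memory at a time, which is the appeal of
    this approach when the input is already ordered by the key.
    """
    for key_value, run in groupby(rows, key=itemgetter(key)):
        yield key_value, list(run)


# Hypothetical input, already ordered by "x".
rows = [
    {"x": "a", "v": 1},
    {"x": "a", "v": 2},
    {"x": "b", "v": 3},
]
segments = list(segment_by_key(rows, "x"))
```

Each yielded segment could then be handed to whatever builds the per-group
dataframe, without first collecting the whole dataset.
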
> > > >
> > > > There are also a few things that I would love to see added but I don't
> > > > expect to be able to get to it myself anytime soon.  If anyone is
> > > > interested feel free to reach out and I'd be happy to brainstorm
> > > > implementation.  Off the top of my head:
> > > >
> > > >   * Support for window functions (for those of you that are not SQL
> > > > insiders this means "functions that rely on row order", like
> > > > cumulative sum, rank, or lag)
> > > >     * We have most of the basic building blocks and so a relatively
> > > > naive implementation shouldn't be a huge stretch.
> > > >   * Support for a streaming merge join (e.g. if the keys are ordered on
> > > > both inputs you don't have to accumulate all of one input)
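
The streaming merge join in the last bullet can be sketched in a few lines of
plain Python. This is an illustration of the general technique under the
assumption that both inputs are sorted by key, not a proposal for how an
Acero node would be written; `merge_join` and the `(key, value)` row layout
are hypothetical.

```python
def merge_join(left, right):
    """Inner-join two iterables of (key, value) pairs sorted by key.

    Both inputs are consumed as streams. Only the run of right rows for
    the current key is buffered, never a whole input.
    """
    left_it, right_it = iter(left), iter(right)
    l = next(left_it, None)
    r = next(right_it, None)
    while l is not None and r is not None:
        if l[0] < r[0]:
            l = next(left_it, None)
        elif l[0] > r[0]:
            r = next(right_it, None)
        else:
            key = l[0]
            # Buffer the right-side values that share this key.
            run = []
            while r is not None and r[0] == key:
                run.append(r[1])
                r = next(right_it, None)
            # Pair every left row with this key against the buffered run.
            while l is not None and l[0] == key:
                for right_value in run:
                    yield key, l[1], right_value
                l = next(left_it, None)


# Hypothetical sorted inputs.
left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
result = list(merge_join(left, right))
```

Because the function is a generator, output rows become available as soon as
a key has been matched, which fits the ordered / streaming execution
discussed earlier in the thread.
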
> > > >
> > > > I welcome any input and acknowledge that any or all of this could
> > > > change completely in the next 3 months.
> > > >
> > >
> >
>
