Re: [DISCUSS] Acero roadmap / philosophy

Will Ayd Mon, 10 Apr 2023 16:36:53 -0700

I am still wrapping my head around some of the technologies so excuseany ignorance, but seeing as the OP mentioned the use case of /switching/between execution engines is there not a gap if the concern is moreabout /combining/ execution engines? AFAIU Substrait would allow me tosubmit different queries to DuckDB and Datafusion - if I wanted to takethese results back and combine them without considering the source theycame from is Acero not the right tool for the job?


On 3/14/23 11:50, Li Jin wrote:

Late to the party.


Thanks Weston for sharing the thoughts around Acero. We are actually a
pretty heavy Acero user right now and are trying to take part in Acero
maintenance and development. Internally we are using Acero for a time
series streaming data processing system.

I would +1 on many of Weston's directions here, in particular to make Acero
extensionable / customizable. IMO Acero might not be the fastest "Arrow
SQL/TPC-H" engine, but the ability to customize it for ordered time series
is a huge/kill feature.

In addition to what Weston has already said, my other two cents is that I
think Acero would benefit from a separation from the Arrow core C++
library, similar to how Arrow Flight is. The main reason is that Arrow core
being such a widely used library, it benefits more from being stable and
Acero being a relatively new and standalone component, benefits more from
fast moving / quick experiment. My colleague and I are working on
https://github.com/apache/arrow/issues/15280  to make this happen.





On Fri, Mar 10, 2023 at 5:59 AM Andrew Lamb<al...@influxdata.com>  wrote:

I don't know much about the Acero user base, but gathering some significant
term users (e.g. Ballista, Urban Logiq, GreptimeDB, InfluxDB IOx, etc) has
been very helpful for DataFusion. Not only do such users bring some amount
of maintenance capacity, but perhaps more relevantly to your discussion
they bring a focus to the project with their usecases.

With so many possible tradeoffs (e.g. streaming vs larger batch execution
as you mention above) having people to help focus the choice of project I
think has served DataFusion well.

If Acero has such users (or potential users) perhaps reaching out to them /
soliciting their ideas of where they want to see the project go would be a
valuable focusing exercise.

Andrew

On Thu, Mar 9, 2023 at 6:35 PM Aldrin<akmon...@ucsc.edu.invalid>  wrote:

Thanks for sharing! There are a variety of things that I didn't know

about

(such as ExecBatchBuilder) and it's interesting to hear about the
performance challenges.

How much would future substrait work involve integration with Acero? I'm
curious how much more support of substrait is seen as valuable (should be
prioritized) or
if additional support is going to be "as-needed". Note that I have a
minimal understanding of how "large" substrait is and what proportion of

it

is already supported by
Acero.

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz


On Thu, Mar 9, 2023 at 12:33 PM Antoine Pitrou<anto...@python.org>

wrote:

Just a reminder for those following other implementations of Arrow,

that

Acero is the compute/execution engine subsystem baked into Arrow C++.

Regards

Antoine.


Le 09/03/2023 à 21:20, Weston Pace a écrit :

We are getting closer to another release.  I am thinking about what

to

work

on in the next release.  I think it is a good time to have a

discussion

about Acero in general.  This is possibly also of interest to those

working

on pyarrow or r-arrow as these libraries rely on Acero for various
functionality.  Apache projects have no single owner and what follows

is

only my own personal opinion and plans.  Still, I will apologize in

advance

for any lingering hubris or outrageous declarations of fact :)

First, some background.  Since we started the project the landscape

has

changed.  Most importantly, there are now more arrow-native execution
engines.  For example, datafusion, duckdb, velox, and I'm sure there

are

probably more.  Substrait has also been created, allowing users to
hopefully switch between different execution engines as different

needs

arise.  Some significant contributors to Acero have taken a break or

moved

onto other projects and new contributors have arrived with new

interests

and goals (For example, an asof join node and more focus on ordered /
streaming execution).

I do not personally have the resources for bringing Acero's

performance

to

match that of some of the other execution engines.  I'm also not

aware

of

any significant contributors attempting to do so.  I also think that

having

yet another engine racing to the top of the TPC-H benchmarks is not

the

best thing we can be doing for our users.  To be clear, our

performance

is

not "bad" but it is not "state of the art".

## Some significant performance challenges for Acero:

   1. Ideally an execution engine that wants to win TPC-H should

operate

on

L2 sized batches.  To risk stating the obvious: that is not very

large.

Typically less than 100k rows.  At that size of operation the

philosophy

of

"we are only doing this per-batch so we don't have to be worried

about

performance" falls apart.  Significant pieces of Acero are not built

to

operate effectively at this small of a batch size.  This is probably

most

evident in our expression evaluation and in queries that have complex
expressions invoking many functions.

   2. Our expression evaluation is missing a fair number of

optimizations.

The ability to use temporary vectors instead of allocating new

vectors

between function calls.  Usage of selection vectors to avoid

materializing

filter results.  General avoidance of allocation and preference for

thread

local data.

   3. Writing a library of compute functions that is compact, able to

run

in

any architecture, and able to take full advantage of the underlying
hardware is an extremely difficult challenge and there are likely

things

that could be improved in our kernel functions.

   4. Acero does no query optimization.  Hopefully Substrait

optimizers

will

emerge to fill this gap.  In the meantime, this remains a significant

gap

when comparing Acero to most other execution engines.

I am not (personally) planning on addressing any of the above issues

(no

time and little interest).  Furthermore, other execution engines

either

already handle these things or they are investing significant funds

to

make

sure they can.  In fact, I would be in favor of explicitly abandoning

the

morsel-batch model and focusing on larger batch sizes in the spirit

of

simplicity.

This does not mean that I want to abandon Acero.  Acero is valuable

for a

number of users who don't need that last 20% of performance and would
rather not introduce a new library.  Acero has been a valuable

building

block for those that are exploring unique execution models or whose
workloads don't cleanly fit into an SQL query.  Acero has been used
effectively for academic research.  Acero has been valuable for me
personally as a sort of "reference implementation" for a Substrait

consumer

as well as being a reference engine for connectivity and

decentralization

in general.

## My roadmap

Over the next year I plan on transitioning more time into Substrait

work.

But this is less because I am abandoning Acero and more because I

would

like to start wrapping Acero up.  In my mind, Acero as an "extensible
streaming execution engine" is very nearly "complete" (as much as

anything

is ever complete).

1. One significant remaining challenge is getting some better tools

in

place for reducing runtime memory usage.  This mostly equates to

being

smarter about scanning (in particular how we scan large row groups)

and

adding support for spilling to pipeline breakers (there is a

promising

PR

for this that I have not yet been able to get around to).  I would

like

to

find time to address these things over the next year.

2. I would like Acero to be better documented and more extensible.

It

should be relatively simple (and hopefully as foolproof as possible)

for

users to create their own extension nodes.  Perhaps we could even

support

python extension nodes.  There has been some promising work around
Substrait extension nodes which I think could be generalized to allow
extension node developers to use Substrait without having to create

.proto

files.

3. Finally, pyarrow is a toolbox.  I would like to see some of the

internal

compute utilities exposed as their own tools (and pyarrow bindings

added).

Significantly (though I don't think I'll get to all of these):

   * The ExecBatchBuilder is a useful accumulation tool.  It could be

used,

for example, to load datasets into pandas that are almost as big as

RAM

(today you typically need at least 2x memory to convert to pandas).
   * The GroupingSegmenter could be used to support workflows like

"Group

by

X and then give me a pandas dataframe for each group".
   * Whatever utilities we develop for spilling could be useful as

temporary

caches.
   * There is an entire row based encoding and hash table built in

there

somewhere.

There are also a few things that I would love to see added but I

don't

expect to be able to get to it myself anytime soon.  If anyone is
interested feel free to reach out and I'd be happy to brainstorm
implementation.  Off the top of my head:

   * Support for window functions (for those of you that are not SQL

insiders

this means "functions that rely on row order", like cumulative sum,

rank,

or lag)
     * We have most of the basic building blocks and so a relatively

naive

implementation shouldn't be a huge stretch.
   * Support for a streaming merge join (e.g. if the keys are ordered

on

both

inputs you don't have to accumulate all of one input)

I welcome any input and acknowledge that any or all of this could

change

completely in the next 3 months.

Re: [DISCUSS] Acero roadmap / philosophy

Reply via email to