Thanks for sharing! There are a variety of things that I didn't know about (such as ExecBatchBuilder) and it's interesting to hear about the performance challenges.
How much would future substrait work involve integration with Acero? I'm curious how much more support of substrait is seen as valuable (should be prioritized) or if additional support is going to be "as-needed". Note that I have a minimal understanding of how "large" substrait is and what proportion of it is already supported by Acero. Aldrin Montana Computer Science PhD Student UC Santa Cruz On Thu, Mar 9, 2023 at 12:33 PM Antoine Pitrou <anto...@python.org> wrote: > > Just a reminder for those following other implementations of Arrow, that > Acero is the compute/execution engine subsystem baked into Arrow C++. > > Regards > > Antoine. > > > Le 09/03/2023 à 21:20, Weston Pace a écrit : > > We are getting closer to another release. I am thinking about what to > work > > on in the next release. I think it is a good time to have a discussion > > about Acero in general. This is possibly also of interest to those > working > > on pyarrow or r-arrow as these libraries rely on Acero for various > > functionality. Apache projects have no single owner and what follows is > > only my own personal opinion and plans. Still, I will apologize in > advance > > for any lingering hubris or outrageous declarations of fact :) > > > > First, some background. Since we started the project the landscape has > > changed. Most importantly, there are now more arrow-native execution > > engines. For example, datafusion, duckdb, velox, and I'm sure there are > > probably more. Substrait has also been created, allowing users to > > hopefully switch between different execution engines as different needs > > arise. Some significant contributors to Acero have taken a break or > moved > > onto other projects and new contributors have arrived with new interests > > and goals (For example, an asof join node and more focus on ordered / > > streaming execution). > > > > I do not personally have the resources for bringing Acero's performance > to > > match that of some of the other execution engines. I'm also not aware of > > any significant contributors attempting to do so. I also think that > having > > yet another engine racing to the top of the TPC-H benchmarks is not the > > best thing we can be doing for our users. To be clear, our performance > is > > not "bad" but it is not "state of the art". > > > > ## Some significant performance challenges for Acero: > > > > 1. Ideally an execution engine that wants to win TPC-H should operate > on > > L2 sized batches. To risk stating the obvious: that is not very large. > > Typically less than 100k rows. At that size of operation the philosophy > of > > "we are only doing this per-batch so we don't have to be worried about > > performance" falls apart. Significant pieces of Acero are not built to > > operate effectively at this small of a batch size. This is probably most > > evident in our expression evaluation and in queries that have complex > > expressions invoking many functions. > > > > 2. Our expression evaluation is missing a fair number of optimizations. > > The ability to use temporary vectors instead of allocating new vectors > > between function calls. Usage of selection vectors to avoid > materializing > > filter results. General avoidance of allocation and preference for > thread > > local data. > > > > 3. Writing a library of compute functions that is compact, able to run > in > > any architecture, and able to take full advantage of the underlying > > hardware is an extremely difficult challenge and there are likely things > > that could be improved in our kernel functions. > > > > 4. Acero does no query optimization. Hopefully Substrait optimizers > will > > emerge to fill this gap. In the meantime, this remains a significant gap > > when comparing Acero to most other execution engines. > > > > I am not (personally) planning on addressing any of the above issues (no > > time and little interest). Furthermore, other execution engines either > > already handle these things or they are investing significant funds to > make > > sure they can. In fact, I would be in favor of explicitly abandoning the > > morsel-batch model and focusing on larger batch sizes in the spirit of > > simplicity. > > > > This does not mean that I want to abandon Acero. Acero is valuable for a > > number of users who don't need that last 20% of performance and would > > rather not introduce a new library. Acero has been a valuable building > > block for those that are exploring unique execution models or whose > > workloads don't cleanly fit into an SQL query. Acero has been used > > effectively for academic research. Acero has been valuable for me > > personally as a sort of "reference implementation" for a Substrait > consumer > > as well as being a reference engine for connectivity and decentralization > > in general. > > > > ## My roadmap > > > > Over the next year I plan on transitioning more time into Substrait work. > > But this is less because I am abandoning Acero and more because I would > > like to start wrapping Acero up. In my mind, Acero as an "extensible > > streaming execution engine" is very nearly "complete" (as much as > anything > > is ever complete). > > > > 1. One significant remaining challenge is getting some better tools in > > place for reducing runtime memory usage. This mostly equates to being > > smarter about scanning (in particular how we scan large row groups) and > > adding support for spilling to pipeline breakers (there is a promising PR > > for this that I have not yet been able to get around to). I would like > to > > find time to address these things over the next year. > > > > 2. I would like Acero to be better documented and more extensible. It > > should be relatively simple (and hopefully as foolproof as possible) for > > users to create their own extension nodes. Perhaps we could even support > > python extension nodes. There has been some promising work around > > Substrait extension nodes which I think could be generalized to allow > > extension node developers to use Substrait without having to create > .proto > > files. > > > > 3. Finally, pyarrow is a toolbox. I would like to see some of the > internal > > compute utilities exposed as their own tools (and pyarrow bindings > added). > > Significantly (though I don't think I'll get to all of these): > > > > * The ExecBatchBuilder is a useful accumulation tool. It could be > used, > > for example, to load datasets into pandas that are almost as big as RAM > > (today you typically need at least 2x memory to convert to pandas). > > * The GroupingSegmenter could be used to support workflows like "Group > by > > X and then give me a pandas dataframe for each group". > > * Whatever utilities we develop for spilling could be useful as > temporary > > caches. > > * There is an entire row based encoding and hash table built in there > > somewhere. > > > > There are also a few things that I would love to see added but I don't > > expect to be able to get to it myself anytime soon. If anyone is > > interested feel free to reach out and I'd be happy to brainstorm > > implementation. Off the top of my head: > > > > * Support for window functions (for those of you that are not SQL > insiders > > this means "functions that rely on row order", like cumulative sum, rank, > > or lag) > > * We have most of the basic building blocks and so a relatively naive > > implementation shouldn't be a huge stretch. > > * Support for a streaming merge join (e.g. if the keys are ordered on > both > > inputs you don't have to accumulate all of one input) > > > > I welcome any input and acknowledge that any or all of this could change > > completely in the next 3 months. > > >