We are getting closer to another release and I am thinking about what to work on next, so I think it is a good time to have a discussion about Acero in general. This is likely also of interest to those working on pyarrow or r-arrow, as those libraries rely on Acero for various functionality. Apache projects have no single owner and what follows is only my own personal opinion and plans. Still, I will apologize in advance for any lingering hubris or outrageous declarations of fact :)
First, some background. Since we started the project the landscape has changed. Most importantly, there are now more Arrow-native execution engines: DataFusion, DuckDB, and Velox, for example, and I'm sure there are more. Substrait has also been created, which will hopefully allow users to switch between execution engines as different needs arise. Some significant contributors to Acero have taken a break or moved on to other projects, and new contributors have arrived with new interests and goals (for example, an asof join node and more focus on ordered / streaming execution).

I do not personally have the resources to bring Acero's performance up to match that of some of the other execution engines, and I'm not aware of any significant contributors attempting to do so. I also think that having yet another engine racing to the top of the TPC-H benchmarks is not the best thing we can be doing for our users. To be clear, our performance is not "bad", but it is not "state of the art".

## Some significant performance challenges for Acero

1. Ideally, an execution engine that wants to win TPC-H should operate on L2-sized batches. To risk stating the obvious: that is not very large, typically less than 100k rows. At that size the philosophy of "we are only doing this per-batch so we don't have to worry about performance" falls apart. Significant pieces of Acero are not built to operate effectively at such a small batch size. This is probably most evident in our expression evaluation and in queries with complex expressions invoking many functions.
2. Our expression evaluation is missing a fair number of optimizations: the ability to reuse temporary vectors instead of allocating new vectors between function calls, usage of selection vectors to avoid materializing filter results, and a general avoidance of allocation in favor of thread-local data.
3. Writing a library of compute functions that is compact, able to run on any architecture, and able to take full advantage of the underlying hardware is an extremely difficult challenge, and there are likely things that could be improved in our kernel functions.
4. Acero does no query optimization. Hopefully Substrait optimizers will emerge to fill this gap. In the meantime, this remains a significant gap when comparing Acero to most other execution engines.

I am not (personally) planning on addressing any of the above issues (no time and little interest). Furthermore, other execution engines either already handle these things or are investing significant funds to make sure they can. In fact, I would be in favor of explicitly abandoning the morsel-batch model and focusing on larger batch sizes in the spirit of simplicity.

This does not mean that I want to abandon Acero. Acero is valuable for a number of users who don't need that last 20% of performance and would rather not introduce a new library. It has been a valuable building block for those exploring unique execution models or whose workloads don't cleanly fit into an SQL query. It has been used effectively for academic research. And it has been valuable for me personally as a sort of "reference implementation" of a Substrait consumer, as well as a reference engine for connectivity and decentralization in general.

## My roadmap

Over the next year I plan on transitioning more of my time into Substrait work. But this is less because I am abandoning Acero and more because I would like to start wrapping Acero up. In my mind, Acero as an "extensible streaming execution engine" is very nearly "complete" (as much as anything is ever complete).

1. One significant remaining challenge is getting better tools in place for reducing runtime memory usage.
This mostly equates to being smarter about scanning (in particular, how we scan large row groups) and adding support for spilling in pipeline breakers (there is a promising PR for this that I have not yet been able to get around to). I would like to find time to address these things over the next year.

2. I would like Acero to be better documented and more extensible. It should be relatively simple (and as foolproof as possible) for users to create their own extension nodes. Perhaps we could even support Python extension nodes. There has been some promising work around Substrait extension nodes which I think could be generalized to let extension node developers use Substrait without having to create .proto files.

3. Finally, pyarrow is a toolbox. I would like to see some of the internal compute utilities exposed as their own tools (with pyarrow bindings added). Significantly (though I don't think I'll get to all of these):

* The ExecBatchBuilder is a useful accumulation tool. It could be used, for example, to load datasets into pandas that are almost as big as RAM (today you typically need at least 2x memory to convert to pandas).
* The GroupingSegmenter could be used to support workflows like "group by X and then give me a pandas dataframe for each group".
* Whatever utilities we develop for spilling could be useful as temporary caches.
* There is an entire row-based encoding and hash table built in there somewhere.

There are also a few things that I would love to see added but don't expect to get to myself anytime soon. If anyone is interested, feel free to reach out and I'd be happy to brainstorm implementation. Off the top of my head:

* Support for window functions (for those of you who are not SQL insiders, this means "functions that rely on row order", like cumulative sum, rank, or lag). We have most of the basic building blocks, so a relatively naive implementation shouldn't be a huge stretch.
* Support for a streaming merge join (e.g. if the keys are ordered on both inputs, you don't have to accumulate all of one input)

I welcome any input, and acknowledge that any or all of this could change completely in the next 3 months.
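For a flavor of why a naive window-function implementation is within reach once rows arrive in order, here is a minimal pure-Python sketch (not Acero code; the function names are made up for illustration). Order-based functions like cumulative sum and lag each reduce to a single forward pass over an ordered column:

```python
# Illustration only -- not Acero's API. Given an ordered column, these
# "window" functions need just one pass and O(1) extra state per row.

def cumulative_sum(values):
    """Running total over an ordered column."""
    total = 0
    out = []
    for v in values:
        total += v
        out.append(total)
    return out

def lag(values, offset=1, fill=None):
    """Shift an ordered column down by `offset`, padding with `fill`."""
    return [fill] * offset + list(values[: -offset or None])

print(cumulative_sum([1, 2, 3, 4]))  # [1, 3, 6, 10]
print(lag([1, 2, 3, 4]))             # [None, 1, 2, 3]
```

The streaming-execution work already in Acero is what makes the "rows arrive in order" precondition realistic.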
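To illustrate the streaming merge join idea, here is a minimal pure-Python sketch (not Acero code; it assumes both inputs are already sorted on the join key). Because both sides are ordered, a single forward pass over each side finds all matches, so neither input needs to be fully accumulated:

```python
# Illustration only -- not Acero's API. Inner-joins two (key, value)
# sequences that are pre-sorted by key, in one pass over each side.

def streaming_merge_join(left, right):
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Gather the run of equal keys on each side, then emit
            # their cross product (keys may repeat on either side).
            i0, j0 = i, j
            while i < len(left) and left[i][0] == lk:
                i += 1
            while j < len(right) and right[j][0] == lk:
                j += 1
            for _, lv in left[i0:i]:
                for _, rv in right[j0:j]:
                    out.append((lk, lv, rv))
    return out

print(streaming_merge_join([(1, "a"), (2, "b"), (2, "c")],
                           [(2, "x"), (3, "y")]))
# [(2, 'b', 'x'), (2, 'c', 'x')]
```

In an engine, "one pass over each side" is what lets the join run with memory proportional to the current run of equal keys rather than to a whole input.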