I started looking into this about 6 months ago but didn't follow the analysis through completely.
As high-level context, the columnar database literature (e.g. [1], though the results probably differ on more modern processors, as 16 years have passed) suggests that breaking data down into smaller chunks for operator / expression evaluation results in better aggregate throughput. It is also needed for certain kinds of intra-operator parallelism. For example, chunk sizes of 1000 to 32K values are not uncommon.

The desire to break larger chunks of data into smaller ones, either for better CPU-cache-friendliness or to parallelize execution by multithreading (e.g. using a work-stealing scheduler to use idle CPU cores), turns out to be at odds with the current formulation of arrow::compute::ExecBatch and the relationship between ExecBatch and functions in arrow::compute.

I had suspected that slicing an ExecBatch to produce smaller ExecBatches (e.g. to parallelize execution of an array expression) was too expensive, so I set out to quantify the overhead in the following benchmark: https://github.com/apache/arrow/pull/9280

To summarize these results, on my x86 laptop processor I'm seeing a cost of 80-100 nanoseconds for each call to `ArrayData::Slice` and the related overhead associated with the Datum/shared_ptr machinery in ExecBatch (I used Flamegraph to analyze what our ExecBatchIterator spends its time doing, and it's mostly spent in ArrayData::Slice). Slicing an ExecBatch with 10 fields 1000 times (e.g. breaking 1M elements into 1024-element chunks) would thus take ~900 microseconds (by comparison, performing scalar multiplication on a 1M-element array of doubles takes 350-400 microseconds on the same machine).

(It's possible I made mistakes in my analysis, so to make sure I'm not crying wolf, someone else should take a closer look at the benchmark formulation and results, too.)

I would conclude from this that the formulation of ExecBatch and the way that array kernels use ExecBatch are not very compatible with finer-grained splitting of batches for parallel execution (or to keep more data in cache during expression evaluation), since the overhead of the splitting would undermine any performance benefits. I strongly suggest, therefore, that we explore changes to ExecBatch to minimize this overhead, such that ExecBatch::Slice or ExecBatchIterator::Next becomes much less expensive.

I note also that a lesser portion of the runtime of this splitting/iteration is taken up by interactions with shared_ptr, so as part of this I would challenge whether a vector<Datum> (and the machinery of shared_ptr<ArrayData>) is always necessary, since many ExecBatch objects need not be so concerned with reference-counted memory ownership (so long as memory lifetime is being minded *somewhere*, possibly at a higher level, since ExecBatch is a non-public API).

I don't have the solution worked out for this, but the basic gist is:

* To be 10-100x more efficient, ExecBatch slicing cannot call ArrayData::Slice for every field like it does now
* Atomics associated with interacting with shared_ptr<ArrayData> / shared_ptr<Buffer> do add meaningful overhead
* The way that array kernels are currently implemented would need to shift to accommodate the changes needed to make ExecBatch lighter weight. One initial option is to move the "batch offset" to the top level of ExecBatch (to remove the need to copy ArrayData), but then quite a bit of code would need to be adapted to combine that offset with the ArrayData's own offsets to compute memory addresses (a rough sketch of what this could look like follows this list).
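To make the first bullet concrete, here is roughly what splitting an ExecBatch into chunks amounts to today. This is a simplified sketch of the pattern, not the actual ExecBatchIterator code, and `SliceIntoChunks` is a name I'm making up here (header paths may also differ by version). The point is that every array field of every output chunk pays for an ArrayData::Slice call, which allocates a fresh ArrayData and bumps the shared_ptr refcounts of its buffers:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

#include "arrow/array/data.h"    // arrow::ArrayData
#include "arrow/compute/exec.h"  // arrow::compute::ExecBatch
#include "arrow/datum.h"         // arrow::Datum

using arrow::Datum;
using arrow::compute::ExecBatch;

// Split `batch` into chunks of at most `chunk_length` rows. This mirrors the
// per-field slicing pattern whose cost was measured in the benchmark above.
std::vector<ExecBatch> SliceIntoChunks(const ExecBatch& batch, int64_t chunk_length) {
  std::vector<ExecBatch> chunks;
  for (int64_t pos = 0; pos < batch.length; pos += chunk_length) {
    const int64_t length = std::min(chunk_length, batch.length - pos);
    std::vector<Datum> values;
    values.reserve(batch.values.size());
    for (const Datum& value : batch.values) {
      if (value.is_scalar()) {
        // Scalars broadcast to any length; no slicing needed
        values.push_back(value);
      } else {
        // The expensive part: a new ArrayData is allocated for each field of
        // each chunk, and every buffer/child shared_ptr refcount is touched
        values.emplace_back(value.array()->Slice(pos, length));
      }
    }
    chunks.emplace_back(ExecBatch(std::move(values), length));
  }
  return chunks;
}
```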
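And to illustrate the "batch offset at the top level" idea from the last bullet, here is a purely hypothetical sketch (ExecBatchView and DoubleValues are not existing Arrow APIs, just names for the sake of discussion). Slicing becomes O(1) per batch rather than O(fields), with no allocation and no refcount traffic; the cost is that kernels must fold the batch-level offset into their address arithmetic:

```cpp
#include <cstdint>

#include "arrow/array/data.h"    // arrow::ArrayData
#include "arrow/compute/exec.h"  // arrow::compute::ExecBatch
#include "arrow/datum.h"         // arrow::Datum

using arrow::compute::ExecBatch;

// Hypothetical: a non-owning view of an ExecBatch with a single top-level
// offset, instead of per-field ArrayData copies. Memory lifetime would be
// minded by whoever owns the parent ExecBatch.
struct ExecBatchView {
  const ExecBatch* batch;  // borrowed, not reference-counted
  int64_t offset;          // logical row offset applied to every array field
  int64_t length;          // number of rows visible through this view

  // O(1) and allocation-free, versus one ArrayData::Slice per field today
  ExecBatchView Slice(int64_t off, int64_t len) const {
    return ExecBatchView{batch, offset + off, len};
  }
};

// Example of the combined offset arithmetic a kernel would then need, for a
// primitive double column (validity bitmaps omitted for brevity):
const double* DoubleValues(const ExecBatchView& view, int field_index) {
  const arrow::ArrayData& data = *view.batch->values[field_index].array();
  // GetValues<double>(1) already applies data.offset; the view-level offset
  // is added on top of it
  return data.GetValues<double>(1) + view.offset;
}
```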
If this memory address arithmetic has leaked into kernel implementations, this might be a good opportunity to unleak it. That wouldn't fix the shared_ptr/atomics overhead, so I'm open to ideas about how that could be addressed also.

Thanks,
Wes

[1]: http://cidrdb.org/cidr2005/papers/P19.pdf