I started looking into this about 6 months ago but didn't follow the
analysis through to completion.

As high-level context, the columnar database literature (e.g. [1],
though the results probably differ on more modern processors, as 16
years have passed) suggests that breaking data down into smaller
chunks for operator / expression evaluation yields better aggregate
throughput. It is also needed for certain kinds of intra-operator
parallelism. Chunk sizes of 1000 to 32K values, for example, are not
uncommon.

The desire to break larger chunks of data into smaller ones, whether
for better CPU-cache-friendliness or to parallelize execution across
threads (e.g. using a work-stealing scheduler to occupy idle CPU
cores), turns out to be at odds with the current formulation of
arrow::compute::ExecBatch and the relationship between ExecBatch and
functions in arrow::compute.

I had suspected that slicing an ExecBatch to produce smaller
ExecBatches (e.g. to parallelize evaluation of an array expression)
was too expensive, so I set out to quantify the overhead in the
following benchmark:

https://github.com/apache/arrow/pull/9280
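
For reference, the shape of the operation being measured is roughly
the following. This is a simplified sketch, not the exact benchmark or
library code (and the header location of ExecBatch may differ by
version): each output chunk requires one ArrayData::Slice call per
field, and each of those allocates a new ArrayData and copies buffer
shared_ptrs (atomic refcount bumps).

    #include "arrow/compute/exec.h"  // ExecBatch; path may vary by version
    #include "arrow/datum.h"

    using arrow::Datum;
    using arrow::compute::ExecBatch;

    // Simplified sketch of per-chunk slicing as it works today
    ExecBatch SliceBatch(const ExecBatch& batch, int64_t offset,
                         int64_t length) {
      std::vector<Datum> values;
      values.reserve(batch.values.size());
      for (const Datum& value : batch.values) {
        if (value.is_array()) {
          // Allocates a new ArrayData and copies its buffer shared_ptrs
          values.emplace_back(value.array()->Slice(offset, length));
        } else {
          values.emplace_back(value);  // scalars pass through unchanged
        }
      }
      return ExecBatch(std::move(values), length);
    }

The per-field cost is what gets multiplied by the number of fields and
the number of chunks in the arithmetic below.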

To summarize these results: on my x86 laptop processor, I see a cost
of 80-100 nanoseconds per call to `ArrayData::Slice`, plus related
overhead from the Datum/shared_ptr machinery in ExecBatch (I used a
flame graph to analyze what our ExecBatchIterator spends its time
doing, and it is mostly spent in ArrayData::Slice). Slicing an
ExecBatch with 10 fields 1000 times (e.g. breaking 1M elements into
1024-element chunks) would thus take 10 * 1000 * ~90 ns, or roughly
900 microseconds (by comparison, performing scalar multiplication on a
1M-element array of doubles takes 350-400 microseconds on the same
machine).

(it's possible I made mistakes in my analysis, so to make sure I'm not
crying wolf, someone else should take a closer look at the benchmark
formulation and results, too)

I would conclude from this that the formulation of ExecBatch and the
way that array kernels use ExecBatch are not very compatible with
finer-grained splitting of batches for parallel execution (or for
keeping more data in cache during expression evaluation), since the
overhead of the splitting would undermine any performance benefit. I
strongly suggest, therefore, that we explore changes to ExecBatch to
minimize this overhead, so that ExecBatch::Slice and
ExecBatchIterator::Next become much less expensive.

I note also that a smaller portion of the runtime of this
splitting/iteration is taken up by interactions with shared_ptr. As
part of this work, I would challenge whether a vector<Datum> (and the
machinery of shared_ptr<ArrayData>) is always necessary, since many
ExecBatch objects need not concern themselves with reference-counted
memory ownership, so long as memory lifetime is being minded
*somewhere*, possibly at a higher level (ExecBatch is a non-public
API).
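
To make that concrete, here is a hypothetical, non-owning formulation
(these types do not exist in Arrow; the names are made up for
illustration). Slicing such a view is pure integer arithmetic, with no
shared_ptr copies or atomic refcount traffic, assuming something
upstream keeps the underlying ArrayData and Buffers alive for the
duration of execution:

    #include <cstdint>
    #include <vector>

    namespace arrow { struct ArrayData; }  // borrowed below, never owned

    // Hypothetical non-owning view of one array-valued field
    struct ArrayValueSpan {
      const arrow::ArrayData* data;  // raw pointer: no refcount traffic
      int64_t offset;                // absolute offset into the buffers
      int64_t length;
    };

    // Hypothetical non-owning batch view; slicing rewrites integers only
    struct ExecBatchSpan {
      std::vector<ArrayValueSpan> values;
      int64_t length;

      ExecBatchSpan Slice(int64_t off, int64_t len) const {
        ExecBatchSpan out = *this;  // copies small POD spans, no shared_ptr
        out.length = len;
        for (ArrayValueSpan& v : out.values) {
          v.offset += off;
          v.length = len;
        }
        return out;
      }
    };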

I don't have the solution worked out for this, but the basic gist is:

* To be 10-100x more efficient, ExecBatch slicing cannot call
ArrayData::Slice for every field as it does now
* Atomics associated with interacting with shared_ptr<ArrayData> /
shared_ptr<Buffer> do add meaningful overhead
* The way that array kernels are currently implemented would need to
shift to accommodate the changes needed to make ExecBatch lighter
weight.

One initial option is to move the "batch offset" to the top level of
ExecBatch (to remove the need to copy ArrayData), but then quite a bit
of code would need to be adapted to combine that offset with the
ArrayData's offsets to compute memory addresses. If this memory
address arithmetic has leaked into kernel implementations, this might
be a good opportunity to unleak it. That wouldn't fix the
shared_ptr/atomics overhead, so I'm open to ideas about how that could
be addressed also.
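
As an illustration of that top-level-offset idea (again a hypothetical
sketch, not existing API; BatchChunk and OffsetBatchIterator are
made-up names), iteration could hand out (offset, length) windows over
a single un-sliced batch, and kernels would fold that offset into
their usual address arithmetic:

    #include <algorithm>
    #include <cstdint>

    #include "arrow/compute/exec.h"  // ExecBatch; path may vary by version

    // A chunk is just a window onto the parent batch
    struct BatchChunk {
      const arrow::compute::ExecBatch* batch;  // shared by all chunks
      int64_t offset;  // combined with each ArrayData's own offset
      int64_t length;
    };

    class OffsetBatchIterator {
     public:
      OffsetBatchIterator(const arrow::compute::ExecBatch* batch,
                          int64_t chunk_size)
          : batch_(batch), chunk_size_(chunk_size), position_(0) {}

      // No ArrayData::Slice and no new Datum per chunk
      bool Next(BatchChunk* out) {
        if (position_ >= batch_->length) return false;
        out->batch = batch_;
        out->offset = position_;
        out->length = std::min(chunk_size_, batch_->length - position_);
        position_ += out->length;
        return true;
      }

     private:
      const arrow::compute::ExecBatch* batch_;
      int64_t chunk_size_;
      int64_t position_;
    };

    // A kernel consuming a chunk would then compute addresses like
    //   const double* vals = arr.GetValues<double>(1) + chunk.offset;
    // i.e. add the chunk offset on top of ArrayData's own offset.

The cost of this approach is the adaptation work mentioned above:
every kernel has to honor the extra offset, and the shared_ptr/atomics
overhead in ExecBatch itself remains.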

Thanks,
Wes

[1]: http://cidrdb.org/cidr2005/papers/P19.pdf
