I don't think the presence of array-level offsets precludes the
presence of buffer-level offsets.  For example, in the C++
implementation we have both buffer offsets and array offsets.  Buffer
offsets are used mainly in the IPC layer, I think, when we are
constructing arrays from larger memory-mapped regions.  Array offsets
are used when we need to slice arrays.  At the C data interface a
buffer is just a void* and the responsibility for releasing it
remains with the producer, so the consumer doesn't need to know
whether the void* points into a larger range of memory or at a
fully self-contained allocation.
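
To make that concrete, here is the ArrowArray struct as the C data
interface defines it (reproduced from memory here, so treat the spec as
authoritative).  Note the single array-level offset and the opaque
buffer pointers, which carry no offsets of their own:

```cpp
#include <cstddef>
#include <cstdint>

// ArrowArray as defined by the Arrow C data interface.  There is
// exactly one logical offset, at the array level; each entry in
// `buffers` is an opaque pointer with no per-buffer offset.
struct ArrowArray {
  // Array data description
  int64_t length;      // logical length, in elements
  int64_t null_count;  // number of nulls, or -1 if not yet computed
  int64_t offset;      // logical offset into the buffers, in elements
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;            // opaque; ownership stays with producer
  struct ArrowArray** children;
  struct ArrowArray* dictionary;

  // Release callback: freeing the buffers remains the producer's job.
  void (*release)(struct ArrowArray*);
  void* private_data;
};
```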

> but my understanding
> is that this design choice affects the compute kernels of most
> implementations, since they all perform a copy to de-offset the sliced
> buffers on every operation over sliced arrays?

I'm not sure what you are describing here.  In the C++ impl I do not
believe we perform a copy to de-offset sliced buffers.  For example,
in your boolean scenario we would have an input Array with an offset,
and that Array's data would contain a buffer for the validity (which
could have its own independent offset).  Since the output array would
not have an offset, we would need to create a new buffer, but this
would be a zero-copy operation that references the original buffer
(buffers in the C++ impl can hold a shared pointer to a parent for
reference-counting purposes).  Now, I'm not sure whether we actually
perform this optimization, but it should be possible.
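
A minimal sketch of that parent-pointer scheme (this is illustrative
only, not the real arrow::Buffer API): a slice points into its parent's
memory and holds a shared_ptr to it, so slicing moves no bytes and the
allocation outlives every slice that references it.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Sketch of a reference-counted, zero-copy-sliceable buffer.
// Not the actual Arrow C++ class; names and layout are hypothetical.
class Buffer : public std::enable_shared_from_this<Buffer> {
 public:
  explicit Buffer(std::vector<uint8_t> data)
      : owned_(std::move(data)), data_(owned_.data()), size_(owned_.size()) {}

  // Zero-copy slice: points into the parent's memory, no bytes move.
  std::shared_ptr<Buffer> Slice(size_t offset, size_t length) {
    auto slice = std::shared_ptr<Buffer>(new Buffer());
    slice->parent_ = shared_from_this();  // keeps the parent alive
    slice->data_ = data_ + offset;
    slice->size_ = length;
    return slice;
  }

  const uint8_t* data() const { return data_; }
  size_t size() const { return size_; }

 private:
  Buffer() = default;
  std::vector<uint8_t> owned_;      // only set on the root buffer
  const uint8_t* data_ = nullptr;
  size_t size_ = 0;
  std::shared_ptr<Buffer> parent_;  // reference-counted ancestry
};
```

For value buffers this works for any element offset; for validity
bitmaps it is only a pure re-point when the slice offset is a multiple
of 8 bits, which is where the shifted-copy cost you describe comes in.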

All of that being said, I can also understand your motivation: we have
certainly had bugs in the C++ implementation in the past because
kernels forgot to account for the array offset.  I don't actually know
the reason we keep the offset around after a slice operation, but
maybe someone else has more history on that decision.
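
As an aside on the shifted copy you describe below, here is a sketch
(a hypothetical helper, not actual Arrow code) of why de-offsetting a
bitmap whose offset is not byte-aligned is a genuine per-bit copy:
every output byte has to be stitched together from two input bytes, so
the bitmap cannot simply be re-pointed.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: copy `length` bits starting at `bit_offset`
// into a fresh bitmap whose first valid bit is bit 0.  Arrow bitmaps
// are LSB-first, which this sketch follows.
std::vector<uint8_t> CopyBitmapShifted(const uint8_t* bits,
                                       size_t bit_offset, size_t length) {
  std::vector<uint8_t> out((length + 7) / 8, 0);
  for (size_t i = 0; i < length; ++i) {
    size_t src = bit_offset + i;
    bool set = (bits[src / 8] >> (src % 8)) & 1;  // read source bit
    if (set) out[i / 8] |= uint8_t(1) << (i % 8); // write shifted bit
  }
  return out;
}
```

A real kernel would of course process words at a time rather than
single bits, but the copy itself is unavoidable once offset % 8 != 0.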

On Tue, Oct 26, 2021 at 9:31 AM Jorge Cardoso Leitão
<jorgecarlei...@gmail.com> wrote:
>
> Hi,
>
> One aspect of the design of "arrow2" is that it deals with array slices
> differently from the rest of the implementations. Essentially, the offset
> is not stored in ArrayData, but on each individual Buffer. Some important
> consequences are:
>
> * people can work with buffers and bitmaps without having to drag the
> corresponding array offset with them (which is a common source of
> unsoundness in the official Rust implementation)
> * arrays can store buffers/bitmaps with independent offsets
> * it does not round-trip over the C data interface at zero cost, because the
> C data interface only allows a single offset per array, not per
> buffer/bitmap.
>
> I have been benchmarking the consequences of this design choice and reached
> the conclusion that storing the offset on a per buffer basis offers at
> least 15% improvement in compute (results vary on kernel and likely
> implementation).
>
> To understand why this is the case, consider comparing two boolean arrays
> (a, b), where "a" has been sliced and has a validity and "b" does not. In
> this case, we could compare the values of the arrays (taking into account
> "a"'s offset), and clone "a"'s validity. However, this does not work
> because the validity is "offsetted", while the result of the comparison of
> the values is not. Thus, we need to create a shifted copy of the validity.
> I measure 15% of the total compute time in my benchmarks being spent on
> creating this shifted copy.
>
> The root cause is that the C data interface declares an offset on the
> ArrayData, as opposed to an offset on each of the buffers contained on it.
> With an offset shared between buffers, we can't slice individual bitmap
> buffers, which forbids us from leveraging the optimization of simply
> cloning buffers instead of copying them.
>
> I wonder whether this was discussed previously, or whether the "single
> offset per array in the C data interface" design considered this performance
> implication.
>
> Atm the solution we adopted is to incur the penalty cost ("de-offsetting
> buffers") when passing offsetted arrays via the C data interface, since
> this way users benefit from faster compute kernels and only incur this cost
> when it is strictly needed for the C data interface. But my understanding
> is that this design choice affects the compute kernels of most
> implementations, since they all perform a copy to de-offset the sliced
> buffers on every operation over sliced arrays?
>
> Best,
> Jorge
