>
> To understand why this is the case, consider comparing two boolean arrays
> (a, b), where "a" has been sliced and has a validity and "b" does not. In
> this case, we could compare the values of the arrays (taking into account
> "a"'s offset), and clone "a"'s validity. However, this does not work
> because the validity is "offsetted", while the result of the comparison of
> the values is not. Thus, we need to create a shifted copy of the validity.


C++ has optimized routines that attempt to do this without copies [1].  It
has been a while since I looked, but I think in the worst case for offsets
it might fall back to bit-at-a-time operations, which I suppose might
actually be worse than the copy + shift.
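
To make that concrete, here is a rough standalone sketch (not the actual
routine from [1]; the name and layout are made up) of what the shifted copy
involves: a byte-aligned offset degenerates to a memcpy, while an unaligned
offset means every bit has to be re-addressed one way or another:

#include <cstdint>
#include <cstring>
#include <vector>

// Rough sketch (not the actual Arrow routine): copy `length` validity bits
// starting at bit `offset` of `src` into a fresh bitmap with no offset.
std::vector<uint8_t> CopyShiftedBitmap(const uint8_t* src, int64_t offset,
                                       int64_t length) {
  std::vector<uint8_t> out((length + 7) / 8, 0);
  if (offset % 8 == 0) {
    // Byte-aligned slice: a plain memcpy is enough.
    std::memcpy(out.data(), src + offset / 8, (length + 7) / 8);
  } else {
    // Unaligned slice: the simplest form re-addresses each bit; an optimized
    // version would shift whole words, but it still has to touch every bit.
    for (int64_t i = 0; i < length; ++i) {
      const int64_t j = offset + i;
      if (src[j / 8] & (1 << (j % 8))) {
        out[i / 8] |= static_cast<uint8_t>(1 << (i % 8));
      }
    }
  }
  return out;
}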


> Since the output array
> would not have an offset we would need to create a new buffer but this
> would be a zero copy operation that references the original buffer
> (buffers in the C++ impl can have a shared pointer to a parent for
> reference counting purposes).  Now, I'm not sure if we actually
> perform this optimization, but it should be possible.

This might be obvious, but it is a little more difficult than that: the
offset represents a bit offset for the validity buffer, so if you want it
to be zero copy you need to offset the value buffer as well (assuming the
offset isn't a multiple of 8).
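
As a concrete example (illustrative types and names only, not the real
API): for a boolean array both bitmaps share the element offset, and
dropping that offset without a copy only works when it translates into a
whole-byte advance that can be applied to both buffers:

#include <cstdint>
#include <optional>

// Illustrative only: a boolean array is two bitmaps (validity + values)
// that share a single element offset, measured in bits.
struct BoolArray {
  const uint8_t* validity;
  const uint8_t* values;
  int64_t offset;  // in elements == bits
  int64_t length;
};

// Try to produce an equivalent array with offset == 0 without copying.
// This only works when the offset is a whole number of bytes, because the
// same byte advance has to be applied to *both* bitmaps.
std::optional<BoolArray> DropOffsetZeroCopy(const BoolArray& in) {
  if (in.offset % 8 != 0) {
    return std::nullopt;  // would need a shifted copy of both bitmaps
  }
  const int64_t bytes = in.offset / 8;
  return BoolArray{in.validity + bytes, in.values + bytes, 0, in.length};
}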


> All of that being said, I can also understand your motivation, we have
> certainly had bugs in the C++ implementation in the past because
> kernels forgot to account for the array offset.

A big +1 to this; covering all the edge cases with slices is pretty
complicated (there was at least one long-standing bug related to this in
the 6.0 release).  I imagine there are potentially more lurking in the code
base.

> I don't actually know
> the reason we need to keep the offset around after a slice operation
> but maybe someone else has more history on that decision.


These aren't necessarily strong justifications (I believe Wes originally
designed the class hierarchy, so he would be the best person to comment):
1.   You would likely need a different member type (Bitmap) to track offset
+ validity buffer, instead of the raw validity buffer that is currently
stored in ArrayData.  I think the Bitmap class came sometime after the
original layout was defined.  So it might have simply been natural
evolution.
2.   Keeping the offset at the array level allows for lazy slicing of child
arrays for structs (see the sketch after this list).
3.   Slightly higher space overhead if every buffer carries its own slice
offset (this likely was not the reason).
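
To illustrate point 2, here is a very simplified sketch (made-up types, not
the actual ArrayData API) of why an array-level offset keeps struct slicing
lazy:

#include <cstdint>
#include <memory>
#include <vector>

// Very simplified stand-ins (not the real ArrayData), just to illustrate
// point 2 above.
struct SimpleBuffer {
  std::vector<uint8_t> data;  // raw bytes, no offset of its own
};

struct SimpleArrayData {
  std::vector<std::shared_ptr<SimpleBuffer>> buffers;
  std::vector<std::shared_ptr<SimpleArrayData>> children;  // struct fields
  int64_t offset = 0;  // in elements, shared by this array's buffers
  int64_t length = 0;
};

// Slicing only adjusts the top-level offset/length; children and buffers
// are shared untouched and the offset is applied lazily when they are read.
// If the offset lived on each buffer instead, this would have to walk the
// whole child tree and re-derive every buffer eagerly.
std::shared_ptr<SimpleArrayData> Slice(const SimpleArrayData& a,
                                       int64_t off, int64_t len) {
  auto out = std::make_shared<SimpleArrayData>(a);  // shallow copy
  out->offset = a.offset + off;
  out->length = len;
  return out;
}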

I remember there was a lot of discussion on whether to include the slice in
the C-API at all because of the complexity.  I don't recall anything about
individual buffer offsets, but it might be worth finding and looking over
that thread if there aren't enough answers provided here.

[1]
https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/arrow/util/bitmap_ops.cc






On Tue, Oct 26, 2021 at 6:27 PM Weston Pace <weston.p...@gmail.com> wrote:

> I don't think the presence of array-level offsets precludes the
> presence of buffer-level offsets.  For example, in the C++
> implementation we have both buffer offsets and array offsets.  Buffer
> offsets are used mainly in the IPC layer I think when we are
> constructing arrays from larger memory mapped regions.  Array offsets
> are used when we need to slice arrays.  At the C data interface a
> buffer is just void* and the responsibility for releasing the buffer
> remains with the producer so the consumer doesn't need to know that
> the void* is actually an offset into a larger range of memory or a
> fully self-contained allocation.
>
> > but my understanding
> > is that this design choice affects the compute kernels of most
> > implementations, since they all perform a copy to de-offset the sliced
> > buffers on every operation over sliced arrays?
>
> I'm not sure what you are describing here.  In the C++ impl I do not
> believe we perform a copy to de-offset sliced buffers.  For example,
> given your boolean scenario, then we would have an input Array with an
> offset and that Array's data would contain a buffer for the validity
> (which could have its own independent offset).  Since the output array
> would not have an offset we would need to create a new buffer but this
> would be a zero copy operation that references the original buffer
> (buffers in the C++ impl can have a shared pointer to a parent for
> reference counting purposes).  Now, I'm not sure if we actually
> perform this optimization, but it should be possible.
>
> All of that being said, I can also understand your motivation, we have
> certainly had bugs in the C++ implementation in the past because
> kernels forgot to account for the array offset.  I don't actually know
> the reason we need to keep the offset around after a slice operation
> but maybe someone else has more history on that decision.
>
> On Tue, Oct 26, 2021 at 9:31 AM Jorge Cardoso Leitão
> <jorgecarlei...@gmail.com> wrote:
> >
> > Hi,
> >
> > One aspect of the design of "arrow2" is that it deals with array slices
> > differently from the rest of the implementations. Essentially, the offset
> > is not stored in ArrayData, but on each individual Buffer. Some important
> > consequence are:
> >
> > * people can work with buffers and bitmaps without having to drag the
> > corresponding array offset with them (which are common source of
> > unsoundness in the official Rust implementation)
> > * arrays can store buffers/bitmaps with independent offsets
> > * it does not roundtrip over the c data interface at zero cost, because
> the
> > c data interface only allows a single offset per array, not per
> > buffer/bitmap.
> >
> > I have been benchmarking the consequences of this design choice and
> reached
> > the conclusion that storing the offset on a per buffer basis offers at
> > least 15% improvement in compute (results vary on kernel and likely
> > implementation).
> >
> > To understand why this is the case, consider comparing two boolean arrays
> > (a, b), where "a" has been sliced and has a validity and "b" does not. In
> > this case, we could compare the values of the arrays (taking into account
> > "a"'s offset), and clone "a"'s validity. However, this does not work
> > because the validity is "offsetted", while the result of the comparison
> of
> > the values is not. Thus, we need to create a shifted copy of the
> validity.
> > I measure 15% of the total compute time on my benches being done on
> > creating this shifted copy.
> >
> > The root cause is that the C data interface declares an offset on the
> > ArrayData, as opposed to an offset on each of the buffers contained on
> it.
> > With an offset shared between buffers, we can't slice individual bitmap
> > buffers, which forbids us from leveraging the optimization of simply
> > cloning buffers instead of copying them.
> >
> > I wonder whether this was discussed previously, or whether the "single
> > offset per array in the c data interface" considered this performance
> > implication.
> >
> > Atm the solution we adopted is to incur the penalty cost of
> ("de-offseting
> > buffers") when passing offsetted arrays via the c data interface, since
> > this way users benefit from faster compute kernels and only incur this
> cost
> > when it is strictly needed for the C data interface, but my understanding
> > is that this design choice affects the compute kernels of most
> > implementations, since they all perform a copy to de-offset the sliced
> > buffers on every operation over sliced arrays?
> >
> > Best,
> > Jorge
>
