I don't think the presence of array-level offsets precludes the presence of buffer-level offsets. For example, in the C++ implementation we have both buffer offsets and array offsets. Buffer offsets are used mainly in the IPC layer, I think, when we are constructing arrays from larger memory-mapped regions. Array offsets are used when we need to slice arrays. At the C data interface a buffer is just a void* and the responsibility for releasing it remains with the producer, so the consumer doesn't need to know whether that void* points into a larger range of memory or at a fully self-contained allocation.
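For reference (quoting the ABI from memory here, so check the spec for the authoritative definition), the struct a consumer receives over the C data interface looks roughly like this:

    struct ArrowArray {
      // Array data description
      int64_t length;
      int64_t null_count;
      int64_t offset;              // a single, array-level offset (in elements)
      int64_t n_buffers;
      int64_t n_children;
      const void** buffers;        // bare pointers; no per-buffer offset or length
      struct ArrowArray** children;
      struct ArrowArray* dictionary;

      // Release callback; the producer retains ownership of the buffers
      void (*release)(struct ArrowArray*);
      void* private_data;
    };

Each entry in buffers is opaque to the consumer: whether it points into a larger memory-mapped region or at a standalone allocation is hidden behind release/private_data.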
> but my understanding is that this design choice affects the compute kernels of most implementations, since they all perform a copy to de-offset the sliced buffers on every operation over sliced arrays?

I'm not sure what you are describing here. In the C++ impl I do not believe we perform a copy to de-offset sliced buffers. For example, given your boolean scenario, we would have an input Array with an offset, and that Array's data would contain a buffer for the validity (which could have its own independent offset). Since the output array would not have an offset we would need to create a new buffer, but this would be a zero-copy operation that references the original buffer (buffers in the C++ impl can hold a shared pointer to a parent for reference-counting purposes). Now, I'm not sure if we actually perform this optimization, but it should be possible; I've put a rough sketch of what I mean at the bottom of this mail.

All of that being said, I can also understand your motivation: we have certainly had bugs in the C++ implementation in the past because kernels forgot to account for the array offset. I don't actually know the reason we need to keep the offset around after a slice operation, but maybe someone else has more history on that decision.

On Tue, Oct 26, 2021 at 9:31 AM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
>
> Hi,
>
> One aspect of the design of "arrow2" is that it deals with array slices differently from the rest of the implementations. Essentially, the offset is not stored in ArrayData, but on each individual Buffer. Some important consequences are:
>
> * people can work with buffers and bitmaps without having to drag the corresponding array offset with them (which is a common source of unsoundness in the official Rust implementation)
> * arrays can store buffers/bitmaps with independent offsets
> * it does not roundtrip over the c data interface at zero cost, because the c data interface only allows a single offset per array, not per buffer/bitmap.
>
> I have been benchmarking the consequences of this design choice and reached the conclusion that storing the offset on a per-buffer basis offers at least a 15% improvement in compute (results vary by kernel and likely by implementation).
>
> To understand why this is the case, consider comparing two boolean arrays (a, b), where "a" has been sliced and has a validity and "b" does not. In this case, we could compare the values of the arrays (taking into account "a"'s offset) and clone "a"'s validity. However, this does not work because the validity is "offsetted", while the result of the comparison of the values is not. Thus, we need to create a shifted copy of the validity. I measure 15% of the total compute time in my benches being spent creating this shifted copy.
>
> The root cause is that the C data interface declares an offset on the ArrayData, as opposed to an offset on each of the buffers contained in it. With an offset shared between buffers, we can't slice individual bitmap buffers, which prevents us from leveraging the optimization of simply cloning buffers instead of copying them.
>
> I wonder whether this was discussed previously, or whether the "single offset per array in the c data interface" considered this performance implication.
>
> Atm the solution we adopted is to incur the penalty cost ("de-offsetting buffers") when passing offsetted arrays via the c data interface, since this way users benefit from faster compute kernels and only incur this cost when it is strictly needed for the C data interface, but my understanding is that this design choice affects the compute kernels of most implementations, since they all perform a copy to de-offset the sliced buffers on every operation over sliced arrays?
>
> Best,
> Jorge
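P.S. Here is the rough sketch I mentioned above. It is untested, the names are mine rather than anything from the actual kernels, and it only handles the case where the array offset happens to be a multiple of 8 so the bitmaps can be re-sliced at a byte boundary. It is only meant to show how SliceBuffer gives a zero-copy view that keeps the parent allocation alive through a shared_ptr:

    // Sketch only: produce an equivalent ArrayData with offset == 0 for a
    // boolean array, without copying any buffer contents.
    #include <arrow/api.h>

    #include <cassert>
    #include <cstdint>
    #include <memory>

    std::shared_ptr<arrow::ArrayData> DropOffset(
        const std::shared_ptr<arrow::ArrayData>& in) {
      const int64_t offset = in->offset;
      // Only the byte-aligned case is handled here; a real kernel would have
      // to fall back to copying/shifting the bitmaps when offset % 8 != 0.
      assert(offset % 8 == 0);

      // SliceBuffer does not copy: the returned Buffer holds a shared_ptr to
      // its parent, so the original allocation stays alive via refcounting.
      std::shared_ptr<arrow::Buffer> validity;
      if (in->buffers[0] != nullptr) {
        validity = arrow::SliceBuffer(in->buffers[0], offset / 8);
      }
      std::shared_ptr<arrow::Buffer> values;
      if (in->buffers[1] != nullptr) {
        values = arrow::SliceBuffer(in->buffers[1], offset / 8);
      }

      return arrow::ArrayData::Make(in->type, in->length, {validity, values},
                                    in->GetNullCount(), /*offset=*/0);
    }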