It's obviously preferable to be zero-copy, but it's certainly not mandatory, especially as the data being shared is assumed to be read-only in most use cases.
In that case we should probably remove the comment about alignment from the C interface specification and highlight that implementations may copy when needed. When I have suggested this in the past I have gotten pushback, implying that at least some contingent beyond myself feels FFI should be zero-copy.

Unaligned buffers typically occur when the IO reader achieves zero-copy from its data source (for example a memory-mapped file, a gRPC buffer, a Python memoryview...). Otherwise, if the buffer was allocated by, say, Arrow C++, it will be aligned to a 64-byte boundary.
I am aware, but arrow-rs has no way to know this. If we always copy by default, we are potentially copying data when a more enlightened reader could have avoided the copy. Yes, in this particular case we are just moving the copy from arrow-cpp to arrow-rs, so the difference is likely immaterial. An alternative example might be arrow-go, where at least one of the ways it was allocating memory wasn't producing aligned buffers [1].
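
To make the trade-off concrete, here is a minimal sketch of a "copy only when misaligned" import. It is not arrow-rs code: `import_i32` is a hypothetical helper, and it only handles natural alignment for a single primitive type (i32) rather than the 64-byte alignment mentioned above.

```rust
use std::borrow::Cow;
use std::mem::{align_of, size_of};

/// Hypothetical helper (not an arrow-rs API): view `bytes` as i32 values,
/// borrowing when the pointer is naturally aligned and copying otherwise.
/// Trailing bytes that do not fill a whole i32 are ignored.
pub fn import_i32(bytes: &[u8]) -> Cow<'_, [i32]> {
    let len = bytes.len() / size_of::<i32>();
    if (bytes.as_ptr() as usize) % align_of::<i32>() == 0 {
        // Zero-copy path.
        // SAFETY: the pointer is aligned for i32, `len` i32 values fit
        // inside `bytes`, and every bit pattern is a valid i32.
        Cow::Borrowed(unsafe {
            std::slice::from_raw_parts(bytes.as_ptr().cast::<i32>(), len)
        })
    } else {
        // Fallback path: one copy into a freshly allocated, aligned Vec.
        Cow::Owned(
            bytes
                .chunks_exact(size_of::<i32>())
                .map(|c| i32::from_ne_bytes([c[0], c[1], c[2], c[3]]))
                .collect(),
        )
    }
}
```

The check itself is cheap. The cost only appears when the producer handed over a misaligned buffer, which is the case where an aligning reader upstream could have avoided the copy.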

How would that work? You typically don't control how gRPC + protobuf allocate buffer data. Similarly, if your IPC stream is wrapped in an HTTP response, you don't control how the HTTP implementation laid out the response stream.
It would be complicated. The protobuf framing of Flight makes avoiding copies whilst preserving alignment very hard; however, as these are network protocols, this perhaps doesn't matter. A protocol that instead sent IPC streams over bare HTTP would likely be able to do this more easily.

[1]: https://github.com/apache/arrow-go/issues/282

On 27/03/2025 17:01, Antoine Pitrou wrote:

Hello,

On 27/03/2025 at 17:53, Raphael Taylor-Davies wrote:

The current ambiguity, however, makes it hard to set reasonable
defaults, as it isn't clear if FFI should be zero-copy and therefore
have alignment restrictions or not.

It's obviously preferable to be zero-copy, but it's certainly not mandatory, especially as the data being shared is assumed to be read-only in most use cases.

IMO it makes the most sense to
require at least natural alignment and push this to where the IO occurs,
i.e. the IPC reader, as that way it is in many cases possible to avoid
copying the data twice, and even if not enforced, unaligned buffers have
potential performance problems regardless.

I don't really understand what "copying the data twice" alludes to here.
Unaligned buffers typically occur when the IO reader achieves zero-copy from its data source (for example a memory-mapped file, a gRPC buffer, a Python memoryview...). Otherwise, if the buffer was allocated by, say, Arrow C++, it will be aligned to a 64-byte boundary.

So I don't understand in which circumstances you would get an additional data copy if you ensure alignment in arrow-rs.

I can understand
the desire to provide zero-copy, but then it should look to do this in a
way that preserves alignment.

How would that work? You typically don't control how gRPC + protobuf allocate buffer data. Similarly, if your IPC stream is wrapped in an HTTP response, you don't control how the HTTP implementation laid out the response stream.

Regards

Antoine.
