> It's obviously preferable to be zero-copy, but it's certainly not
> mandatory, especially as the data being shared is assumed to be
> read-only in most use cases.
In that case we should probably remove the comment about alignment from
the C interface specification and state explicitly that implementations
may copy when needed. When I have suggested this in the past I got
pushback, which implies that at least some people besides me feel FFI
should be zero-copy.
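
To make "copy when needed" concrete, here is a minimal Rust sketch using
only the standard library. The function name `i32_view` is hypothetical
and is not part of arrow-rs or the C data interface; it only illustrates
a consumer that borrows the bytes when they happen to be aligned and
falls back to a copy otherwise.

```rust
use std::borrow::Cow;
use std::mem::{align_of, size_of};

/// Hypothetical helper: view `bytes` as native-endian `i32` values.
/// Borrows (zero-copy) when the start address is aligned for `i32`,
/// otherwise copies the values into a freshly allocated, aligned `Vec`.
fn i32_view(bytes: &[u8]) -> Cow<'_, [i32]> {
    assert_eq!(
        bytes.len() % size_of::<i32>(),
        0,
        "length must be a multiple of 4"
    );
    let addr = bytes.as_ptr() as usize;
    if addr % align_of::<i32>() == 0 {
        // SAFETY: the pointer is aligned for i32, the range lies inside
        // the borrowed slice, every bit pattern is a valid i32, and the
        // returned lifetime is tied to `bytes`.
        let values = unsafe {
            std::slice::from_raw_parts(
                bytes.as_ptr() as *const i32,
                bytes.len() / size_of::<i32>(),
            )
        };
        Cow::Borrowed(values)
    } else {
        // Unaligned input: decode four bytes at a time into a new Vec.
        Cow::Owned(
            bytes
                .chunks_exact(4)
                .map(|c| i32::from_ne_bytes([c[0], c[1], c[2], c[3]]))
                .collect(),
        )
    }
}
```

The caller cannot tell from the signature whether a copy happened, which
is roughly the behaviour "implementations may copy when needed" would
permit.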
> Unaligned buffers typically occur when the IO reader achieves zero
> copy from its data source (for example a memory-mapped file, a gRPC
> buffer, a Python memoryview...). Otherwise, if the buffer was
> allocated by, say, Arrow C++, it will be aligned to a 64-byte boundary.
I am aware, but arrow-rs has no way to know this. If we always copy by
default, we potentially copy data when a more enlightened reader could
have avoided it. Yes, in this particular case we are just moving the
copy from arrow-cpp to arrow-rs, so the difference is likely
immaterial. An alternative example might be arrow-go, where at least
one of the ways it allocated memory did not produce aligned
buffers [1].
> How would that work? You typically don't control how gRPC + protobuf
> allocate buffer data. Similarly, if your IPC stream is wrapped in an
> HTTP response, you don't control how the HTTP implementation laid out
> the response stream.
It would be complicated: the protobuf framing of Flight makes it very
hard to avoid copying while preserving alignment. However, as these are
network protocols, this perhaps doesn't matter. A protocol that instead
sent IPC streams over bare HTTP would likely be able to do this more easily.
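
As a rough illustration of why framing matters: if a payload follows a
variable-length prefix inside an otherwise aligned receive buffer, its
alignment depends entirely on the prefix length. The sketch below is
plain arithmetic; `padding_for` and `payload_is_aligned` are
hypothetical names, and the prefix lengths are made up rather than
taken from protobuf's or HTTP's actual encoding.

```rust
/// Padding bytes needed so that `offset` becomes a multiple of `align`.
/// `align` must be a power of two.
fn padding_for(offset: usize, align: usize) -> usize {
    assert!(align.is_power_of_two());
    (align - (offset & (align - 1))) & (align - 1)
}

/// True if a payload that starts `prefix_len` bytes into a buffer whose
/// base address is `align`-aligned is itself `align`-aligned.
fn payload_is_aligned(prefix_len: usize, align: usize) -> bool {
    padding_for(prefix_len, align) == 0
}
```

A transport that controls its own framing can insert `padding_for(...)`
bytes before the payload; one that does not control the framing has no
such option and ends up with whatever offset the framing produces.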
[1]: https://github.com/apache/arrow-go/issues/282
On 27/03/2025 17:01, Antoine Pitrou wrote:
Hello,
Le 27/03/2025 à 17:53, Raphael Taylor-Davies a écrit :
> The current ambiguity, however, makes it hard to set reasonable
> defaults, as it isn't clear if FFI should be zero-copy and therefore
> have alignment restrictions or not.
It's obviously preferable to be zero-copy, but it's certainly not
mandatory, especially as the data being shared is assumed to be
read-only in most use cases.
> IMO it makes the most sense to
> require at least natural alignment and push this to where the IO occurs,
> i.e. the IPC reader, as that way it is in many cases possible to avoid
> copying the data twice, and even if not enforced unaligned buffers have
> potential performance problems regardless.
I don't really understand what "copying the data twice" alludes to here.
Unaligned buffers typically occur when the IO reader achieves zero
copy from its data source (for example a memory-mapped file, a gRPC
buffer, a Python memoryview...). Otherwise, if the buffer was
allocated by, say, Arrow C++, it will be aligned to a 64-byte boundary.
So I don't understand in which circumstances you would get an
additional data copy if you ensure alignment in arrow-rs.
> I can understand
> the desire to provide zero-copy, but then it should look to do this in a
> way that preserves alignment.
How would that work? You typically don't control how gRPC + protobuf
allocate buffer data. Similarly, if your IPC stream is wrapped in an
HTTP response, you don't control how the HTTP implementation laid out
the response stream.
Regards
Antoine.