On Tue, Aug 2, 2022 at 1:02 AM Antoine Pitrou <anto...@python.org> wrote:
>
>
> On 01/08/2022 at 19:13, Wes McKinney wrote:
> >
> > If we start placing restrictions on how the out-of-line string buffers
> > are managed and externalized, it risks undermining the zero-copy
> > interoperability benefits that we're trying to achieve with this.
>
> But embedded pointers in turn undermine zero-copy for IPC and Flight.
> And they probably make transferring data between CPU and GPU more
> difficult and more expensive (unless the embedded pointers happen to
> fall into a piece of the address space shared between CPU and GPU: which
> you cannot ensure if, say, you got those pointers from a third party
> through the C data interface).
>
> So the bottom line seems to be that embedded pointers enable zero-copy
> for specific producers, but undermine existing zero-copy qualities for
> everyone (and, to speak more broadly, ease of data movement).

If the proposal were for implementations to switch over to using these StringViews for all of their string data, then I would agree with you. But the proposal is for this memory layout to be available as an "opt-in" for applications where it's beneficial, and the hypothesis (to be supported with evidence, which requires doing some implementation work) is that these benefits outweigh the costs (additional serialization in some cross-language scenarios).

Currently, an Arrow receiver of this data must perform an expensive deserialization from the StringView representation for it to be considered valid Arrow, no matter what the intended use of the data is. Under the proposal we are, in a way, "deferring" that deserialization until the data is written out to IPC / Flight, or received by a transitive consumer over the C interface.

Similarly, for applications that can achieve performance improvements by using the StringViews (e.g. query engines), I would guess that the performance benefits outweigh the downstream serialization costs. For example, I believe that the performance gains achieved in the Filter (boolean selection) and Take (integer selection) operations alone will be greater than the cost of the StringView<->String transformations that may need to take place at application boundaries (where there is a receiver that does not benefit from the StringView representation).

Part of my goal for kicking off the implementation work is to be able to quantify and demonstrate both the benefits and the costs, so that we can make judgments based on real-world data. I'm of the "practicality beats purity" mindset on this; otherwise we introduce an unavoidable tension that will lead query engine projects to choose not to use Arrow as their columnar data representation.
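
To make the Filter/Take argument concrete, here is a minimal Python sketch under loud assumptions: it is a toy model, not Arrow's API and not the proposed memory layout. `OffsetStrings`, `ViewStrings`, `take_offsets`, `take_views`, and `views_to_offsets` are hypothetical names, and each view is modelled as a `(buffer index, start, length)` triple rather than an embedded pointer. It only illustrates where bytes get copied: Take on the offsets layout rebuilds the data buffer, Take on the view layout copies fixed-width entries, and the conversion at an application boundary is where the deferred copy happens.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class OffsetStrings:
    """Offsets-based layout: value i is data[offsets[i]:offsets[i + 1]]."""
    offsets: List[int]
    data: bytes

    def get(self, i: int) -> bytes:
        return self.data[self.offsets[i]:self.offsets[i + 1]]


@dataclass
class ViewStrings:
    """Toy view layout: each value is a (buffer index, start, length) triple."""
    views: List[Tuple[int, int, int]]
    buffers: List[bytes]

    def get(self, i: int) -> bytes:
        buf, start, length = self.views[i]
        return self.buffers[buf][start:start + length]


def take_offsets(arr, indices: Iterable[int]) -> OffsetStrings:
    """Integer selection into an offsets layout: every selected byte is copied."""
    offsets = [0]
    chunks = []
    for i in indices:
        value = arr.get(i)
        chunks.append(value)
        offsets.append(offsets[-1] + len(value))
    return OffsetStrings(offsets, b"".join(chunks))


def take_views(arr: ViewStrings, indices: Iterable[int]) -> ViewStrings:
    """Integer selection on views: only the small view entries are copied;
    the character buffers are shared with the input."""
    return ViewStrings([arr.views[i] for i in indices], arr.buffers)


def views_to_offsets(arr: ViewStrings) -> OffsetStrings:
    """Boundary conversion for a receiver that wants the offsets layout.
    This is the serialization cost that the view layout defers."""
    return take_offsets(arr, range(len(arr.views)))


# Three strings stored once, then selected without touching the character data.
src = OffsetStrings([0, 5, 9, 14], b"alphabetagamma")
as_views = ViewStrings([(0, 0, 5), (0, 5, 4), (0, 9, 5)], [src.data])
picked = take_views(as_views, [2, 0, 2])   # no string bytes copied here
exported = views_to_offsets(picked)        # bytes copied once, at the boundary
```

In this model the cost of `take_views` depends only on the number of selected rows, while `take_offsets` also scales with the total string bytes selected. Whether that difference outweighs the boundary conversion in real workloads is exactly what the implementation work is meant to measure.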

> In addition, the embedded pointers deviate from Arrow's representation
> philosophy, adding cognitive load for implementors who now have to
> account for the fact that buffers do not tell "everything about the
> data" but may refer to memory unknown to them. The discussions about how
> to support this in Go are a direct consequence of this deviation in
> philosophy.
>
> Overall, my opinion is that this is not a very good strategic choice for
> the project.
>
> Regards
>
> Antoine.