Re: [DISCUSS][Format] Starting to do some concrete work on the new "StringView" columnar data type

Wes McKinney Tue, 02 Aug 2022 11:49:34 -0700

On Tue, Aug 2, 2022 at 1:02 AM Antoine Pitrou <anto...@python.org> wrote:
>
>
> On 01/08/2022 at 19:13, Wes McKinney wrote:
> >
> > If we start placing restrictions on how the out-of-line string buffers
> > are managed and externalized, it risks undermining the zero-copy
> > interoperability benefits that we're trying to achieve with this.
>
> But embedded pointers in turn undermine zero-copy for IPC and Flight.
> And they probably make transferring data between CPU and GPU more
> difficult and more expensive (unless the embedded pointers happen to
> fall into a piece of the address space shared between CPU and GPU, which
> you cannot ensure if, say, you got those pointers from a third party
> through the C data interface).
>
> So the bottom line seems to be that embedded pointers enable zero-copy
> for specific producers, but undermine existing zero-copy qualities for
> everyone (and, to speak more broadly, ease of data movement).


If the proposal were for implementations to switch over to using these
StringViews for all of their string data, then I would agree with you.
But the proposal is for this memory layout to be available as an
"opt-in" for applications where it's beneficial — and the hypothesis
(to be supported with evidence, which requires doing some
implementation work) is that these benefits outweigh the costs
(additional serialization in some cross-language scenarios).

Currently, an Arrow producer of this data must perform an expensive
serialization from the StringView representation for it to be
considered valid Arrow, no matter what the intended use of the data
is. With the opt-in layout, we are in a way "deferring" that
serialization until the data is written out to IPC / Flight, or
received by a transitive consumer over the C interface.
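
To make the conversion step concrete, here is a rough Python sketch of
turning view-style strings into the offsets-plus-data layout that
Arrow's existing String type uses. It is only an illustration: the
(buffer_index, start, length) tuple is a hypothetical stand-in for a
view (real implementations embed a raw pointer or inline the bytes,
which Python cannot model directly), and views_to_string_layout is a
name made up for this example.

```python
from typing import List, Sequence, Tuple

# Hypothetical simplified view: (buffer_index, start, length).
# Assumption: stands in for the pointer-based view discussed in this thread.
View = Tuple[int, int, int]


def views_to_string_layout(
    views: Sequence[View], buffers: Sequence[bytes]
) -> Tuple[List[int], bytes]:
    """Copy every viewed string into one contiguous data buffer and
    build the matching offsets (n + 1 entries for n strings)."""
    offsets = [0]
    data = bytearray()
    for buf_idx, start, length in views:
        # Each string's bytes are copied: this is the per-value cost.
        data += buffers[buf_idx][start:start + length]
        offsets.append(len(data))
    return offsets, bytes(data)


if __name__ == "__main__":
    buffers = [b"hello world", b"arrow"]
    views = [(0, 0, 5), (1, 0, 5), (0, 6, 5)]
    print(views_to_string_layout(views, buffers))
```

The point of the sketch is that the work is proportional to the total
number of string bytes, and that it only has to happen when some
consumer actually needs the offsets-based layout.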

Similarly, for applications that can achieve performance improvements
(e.g. query engines) by using StringViews, I would guess that the
performance benefits outweigh the downstream serialization costs. For
example, I believe that the performance gains achieved in the Filter
(boolean selection) and Take (integer selection) operations alone will
be greater than the cost of the StringView<->String transformations
that may need to take place at application boundaries (where there is
a receiver that does not benefit from the StringView representation).
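
As a rough illustration of why selection might be cheaper on views,
the following Python sketch implements Take for both layouts. It is a
toy model under stated assumptions, not a benchmark and not Arrow's
actual kernels: take_string_layout and take_views are hypothetical
names, and a view is again modeled as a (buffer_index, start, length)
tuple rather than an embedded pointer.

```python
from typing import List, Sequence, Tuple

# Hypothetical simplified view: (buffer_index, start, length).
View = Tuple[int, int, int]


def take_string_layout(
    offsets: Sequence[int], data: bytes, indices: Sequence[int]
) -> Tuple[List[int], bytes]:
    """Take on the offsets-based layout: the selected bytes must be
    copied into a new data buffer and the offsets rebuilt."""
    out_offsets = [0]
    out_data = bytearray()
    for i in indices:
        out_data += data[offsets[i]:offsets[i + 1]]
        out_offsets.append(len(out_data))
    return out_offsets, bytes(out_data)


def take_views(views: Sequence[View], indices: Sequence[int]) -> List[View]:
    """Take on the view layout: only the fixed-size views are copied;
    the out-of-line string buffers are shared, not touched."""
    return [views[i] for i in indices]


if __name__ == "__main__":
    offsets, data = [0, 5, 10, 15], b"helloarrowworld"
    views = [(0, 0, 5), (0, 5, 5), (0, 10, 5)]
    print(take_string_layout(offsets, data, [2, 0, 2]))
    print(take_views(views, [2, 0, 2]))
```

In the first function the work scales with the selected string bytes;
in the second it scales only with the number of selected rows. Whether
that difference outweighs the conversion cost at the boundaries is
exactly what needs to be measured.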

Part of my goal for kicking off the implementation work is to be able
to quantify and demonstrate both the benefits and the costs, so that
we can make judgments based on real-world data. I'm of the
"practicality beats purity" mindset on this; otherwise we introduce an
unavoidable tension that will lead query engine projects to choose not
to use Arrow as their columnar data representation.

> In addition, the embedded pointers deviate from Arrow's representation
> philosophy, adding cognitive load for implementors who now have to
> account for the fact that buffers do not tell "everything about the
> data" but may refer to memory unknown to them. The discussions about how
> to support this in Go are a direct consequence of this deviation in
> philosophy.
>
> Overall, my opinion is that this is not a very good strategic choice for
> the project.
>
> Regards
>
> Antoine.
