Re: [DISCUSS][Format] Starting to do some concrete work on the new "StringView" columnar data type

Wes McKinney Fri, 05 Aug 2022 14:17:59 -0700

Hi Gosh, yes, you have it right. As the developer, you would be
responsible for managing the lifetime of the referenced memory (for
example, by wrapping any referenced data structures in an arrow::Buffer
subclass and attaching it to the underlying ArrayData object). A major
goal of this work is definitely to be able to construct string arrays
without having to do a bunch of allocation and string copying. This
would reduce intermediate copies en route to Parquet format, for
example.
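
As a rough sketch of that lifetime arrangement (plain standard C++, not
the Arrow API; ViewColumn, Source, and MakeViewColumn are hypothetical
names made up for illustration), views can point straight into strings
held in a std::unordered_map while a shared owner keeps the map alive:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <string_view>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a column of string views; not the Arrow API.
struct ViewColumn {
  // Fixed-width entries that point at bytes owned elsewhere.
  std::vector<std::string_view> views;
  // Keeps the referenced memory alive for as long as the column exists,
  // the role an arrow::Buffer subclass would play in Arrow C++.
  std::shared_ptr<const void> owner;
};

// Assumed source format: string data that already lives in a map.
using Source = std::unordered_map<int, std::string>;

// Builds a column over the map's strings without copying any string bytes.
ViewColumn MakeViewColumn(std::shared_ptr<const Source> source) {
  assert(source != nullptr);
  ViewColumn col;
  col.views.reserve(source->size());
  for (const auto& entry : *source) {
    col.views.emplace_back(entry.second.data(), entry.second.size());
  }
  // Transfer shared ownership last, after the views have been collected.
  col.owner = std::move(source);
  return col;
}
```

The map must not be modified while the column is alive, otherwise the
views could dangle; that is the developer's side of the bargain.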


On Wed, Aug 3, 2022 at 1:55 PM Gosh Arzumanyan <gosh...@gmail.com> wrote:
>
> Hi team!
>
> 2 cents (maybe less): if I get the idea right, the StringView data type
> might be very handy/optimal for cases where users already have string
> data available in some other format (e.g. std::unordered_map<key,string>,
> flat buffer structures, etc.) from which record batches are created and
> shipped over the wire. It seems that, at the very least, some
> intermediate copies can be skipped.
>
> Thanks,
> Gosh
>
> On Tue, Aug 2, 2022, 2:49 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > On Tue, Aug 2, 2022 at 1:02 AM Antoine Pitrou <anto...@python.org> wrote:
> > >
> > >
> > > Le 01/08/2022 à 19:13, Wes McKinney a écrit :
> > > >
> > > > If we start placing restrictions on how the out-of-line string buffers
> > > > are managed and externalized, it risks undermining the zero-copy
> > > > interoperability benefits that we're trying to achieve with this.
> > >
> > > But embedded pointers in turn undermine zero-copy for IPC and Flight.
> > > And they probably make transferring data between CPU and GPU more
> > > difficult and more expensive (unless the embedded pointers happen to
> > > fall into a piece of the address space shared between CPU and GPU, which
> > > you cannot ensure if, say, you got those pointers from a third party
> > > through the C data interface).
> > >
> > > So the bottom line seems to be that embedded pointers enable zero-copy
> > > for specific producers, but undermine existing zero-copy qualities for
> > > everyone (and, to speak more broadly, ease of data movement).
> >
> > If the proposal were for implementations to switch over to using these
> > StringViews for all of their string data, then I would agree with you.
> > But the proposal is for this memory layout to be available as an
> > "opt-in" for applications where it's beneficial — and the hypothesis
> > (to be supported with evidence, which requires doing some
> > implementation work) is that these benefits outweigh the costs
> > (additional serialization in some cross-language scenarios).
> >
> > Currently, an Arrow receiver of this data must perform an expensive
> > deserialization from the StringView representation for it to be
> > considered valid Arrow, no matter what the intended use of the data
> > is. In a way, with this proposal we are "deferring" the
> > deserialization until the data is written out to IPC / Flight, or
> > received by a transitive consumer over the C interface.
> >
> > Similarly, for applications that can achieve performance improvements
> > (e.g. query engines) by using StringViews, I would guess that the
> > performance benefits outweigh the downstream serialization costs. For
> > example, I believe that the performance gains achieved in the Filter
> > (boolean selection) and Take (integer selection) operations alone will
> > be greater than the cost of the StringView<->String transformations
> > that may need to take place at application boundaries (where there is
> > a receiver that does not benefit from the StringView representation).
> >
> > Part of my goal for kicking off the implementation work is to be able
> > to quantify and demonstrate both the benefits and the costs, so that
> > we can make judgments based on real-world data. I'm of the
> > "practicality beats purity" mindset on this, otherwise we introduce an
> > unavoidable tension that will lead query engine projects to choose not
> > to use Arrow as their columnar data representation.
> >
> > > In addition, the embedded pointers deviate from Arrow's representation
> > > philosophy, adding cognitive load for implementors who now have to
> > > account for the fact that buffers do not tell "everything about the
> > > data" but may refer to memory unknown to them. The discussions about how
> > > to support this in Go are a direct consequence of this deviation in
> > > philosophy.
> > >
> > > Overall, my opinion is that this is not a very good strategic choice for
> > > the project.
> > >
> > > Regards
> > >
> > > Antoine.
> >
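
To make the Take (integer selection) point from the quoted message a bit
more concrete, here is a minimal sketch in plain standard C++. It is not
Arrow code: OffsetStrings, TakeOffsets, and TakeViews are hypothetical
names, and std::string_view stands in for the proposed view entries. It
only shows which bytes each layout has to touch and says nothing about
measured performance.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Offsets-based layout (in the spirit of the existing String type): one
// contiguous data buffer plus offsets, with offsets.size() == values + 1.
struct OffsetStrings {
  std::vector<int32_t> offsets{0};
  std::string data;
};

// Take on the offsets layout copies the selected string bytes so that the
// output data buffer is contiguous again.
OffsetStrings TakeOffsets(const OffsetStrings& in,
                          const std::vector<int64_t>& indices) {
  OffsetStrings out;
  for (int64_t i : indices) {
    assert(i >= 0 && static_cast<std::size_t>(i) + 1 < in.offsets.size());
    const int32_t begin = in.offsets[i];
    const int32_t end = in.offsets[i + 1];
    out.data.append(in.data, begin, end - begin);
    out.offsets.push_back(static_cast<int32_t>(out.data.size()));
  }
  return out;
}

// Take on a view layout copies only the fixed-width views; the string
// bytes stay where they are and must be kept alive by the caller.
std::vector<std::string_view> TakeViews(
    const std::vector<std::string_view>& in,
    const std::vector<int64_t>& indices) {
  std::vector<std::string_view> out;
  out.reserve(indices.size());
  for (int64_t i : indices) {
    assert(i >= 0 && static_cast<std::size_t>(i) < in.size());
    out.push_back(in[i]);
  }
  return out;
}
```

In the view version the work per selected value is independent of the
string length, which is the intuition behind the expected gains; the
conversion back to the offsets layout is only paid at boundaries where a
receiver needs it.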
