Hi Gosh, yes, you have it right. As the developer, you would be responsible for managing the lifetime of the referenced memory (for example, by wrapping any referenced data structures in an arrow::Buffer subclass and attaching it to the underlying ArrayData object). A major goal of this is to be able to construct string arrays without doing a lot of allocation and string copying. This would reduce intermediate copies en route to Parquet format, for example.
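To make the lifetime point concrete, here is a minimal, self-contained sketch. It does not use the Arrow API: StringViewSlot, ViewColumn and MakeViewColumn are hypothetical names, the slot layout is simplified (it does not inline short strings), and a std::shared_ptr stands in for the arrow::Buffer subclass that would keep the source alive.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified view slot: length, a short prefix, and a raw
// pointer into memory owned by something else.
struct StringViewSlot {
  uint32_t size;
  char prefix[4];
  const char* data;
};

using SourceMap = std::unordered_map<int, std::string>;

// Hypothetical column: the views plus handles that keep the referenced
// memory alive (the role an arrow::Buffer subclass would play).
struct ViewColumn {
  std::vector<StringViewSlot> views;
  std::vector<std::shared_ptr<const void>> owners;

  std::string_view Get(std::size_t i) const {
    return std::string_view(views[i].data, views[i].size);
  }
};

// Build views over strings that already live in `source`; no character
// data is copied apart from the 4-byte prefix.
ViewColumn MakeViewColumn(std::shared_ptr<const SourceMap> source,
                          const std::vector<int>& keys) {
  ViewColumn col;
  col.views.reserve(keys.size());
  for (int key : keys) {
    const std::string& s = source->at(key);
    assert(s.size() <= UINT32_MAX);
    StringViewSlot slot{};
    slot.size = static_cast<uint32_t>(s.size());
    std::memcpy(slot.prefix, s.data(), std::min<std::size_t>(s.size(), 4));
    slot.data = s.data();
    col.views.push_back(slot);
  }
  // The column shares ownership of the map, so the pointers stay valid
  // even after the producer drops its own reference.
  col.owners.push_back(std::move(source));
  return col;
}
```

The map is held as const here because the views are only valid while the strings are neither modified nor erased; that constraint is exactly the part the developer has to enforce.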
On Wed, Aug 3, 2022 at 1:55 PM Gosh Arzumanyan <gosh...@gmail.com> wrote:
>
> Hi team!
>
> 2cents(maybe less): if I get the idea right, StringView data type might be
> very handy/optimal for cases where users already have string data in some
> other formats available (e.g. std::unordered_map<key,string>, flat
> buffer structures etc.) Off which record batches are created and shipped
> to the wire. Seems like at the very least some intermediate copies can be
> skipped.
>
> Thanks,
> Gosh
>
> On Tue, Aug 2, 2022, 2:49 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > On Tue, Aug 2, 2022 at 1:02 AM Antoine Pitrou <anto...@python.org> wrote:
> > >
> > >
> > > On 01/08/2022 at 19:13, Wes McKinney wrote:
> > > >
> > > > If we start placing restrictions on how the out-of-line string buffers
> > > > are managed and externalized, it risks undermining the zero-copy
> > > > interoperability benefits that we're trying to achieve with this.
> > >
> > > But embedded pointers in turn undermine zero-copy for IPC and Flight.
> > > And they probably make transferring data between CPU and GPU more
> > > difficult and more expensive (unless the embedded pointers happen to
> > > fall into a piece of the address space shared between CPU and GPU: which
> > > you cannot ensure if, say, you got those pointers from a third party
> > > through the C data interface).
> > >
> > > So the bottom line seems to be that embedded pointers enable zero-copy
> > > for specific producers, but undermine existing zero-copy qualities for
> > > everyone (and, to speak more broadly, ease of data movement).
> >
> > If the proposal were for implementations to switch over to using these
> > StringViews for all of their string data, then I would agree with you.
> > But the proposal is for this memory layout to be available as an
> > "opt-in" for applications where it's beneficial, and the hypothesis
> > (to be supported with evidence, which requires doing some
> > implementation work) is that these benefits outweigh the costs
> > (additional serialization in some cross-language scenarios).
> >
> > Currently, an Arrow receiver of this data must perform an expensive
> > deserialization from the StringView representation for it to be
> > considered valid Arrow, no matter what is the intended use of the
> > data. In a way, we are "deferring" the deserialization until the data
> > is written out to IPC / Flight, or received by a transitive consumer
> > over the C interface.
> >
> > Similarly, applications that can achieve performance improvements
> > (e.g. query engines) by using the StringViews, I would guess that the
> > performance benefits outweigh the downstream serialization costs. For
> > example, I believe that the performance gains achieved in the Filter
> > (boolean selection) and Take (integer selection) operations alone will
> > be greater than the StringView<->String transformations that may need
> > to take place at application boundaries (where there is a receiver
> > that does not benefit from the StringView representation).
> >
> > Part of my goal for kicking off the implementation work is to be able
> > to quantify and demonstrate both the benefits and the costs, so that
> > we can make judgments based on real world data. I'm of the
> > "practicality beats purity" mindset on this, otherwise we introduce an
> > unavoidable tension that will lead query engine projects to choose not
> > to use Arrow as their columnar data representation.
> >
> > > In addition, the embedded pointers deviate from Arrow's representation
> > > philosophy, adding cognitive load for implementors who now have to
> > > account for the fact that buffers do not tell "everything about the
> > > data" but may refer to memory unknown to them. The discussions about how
> > > to support this in Go are a direct consequence of this deviation in
> > > philosophy.
> > >
> > > Overall, my opinion is that this is not a very good strategic choice for
> > > the project.
> > >
> > > Regards
> > >
> > > Antoine.
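
On the Filter/Take point in the quoted message, the rough sketch below shows why Take is expected to be cheaper on a view layout. It is not Arrow code: OffsetStrings, TakeOffsets and TakeViews are hypothetical names, and std::string_view (a pointer plus a length) stands in for the proposed StringView slot. It only illustrates the shape of the work and measures nothing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Offset-based layout: n+1 offsets into one contiguous character buffer.
struct OffsetStrings {
  std::vector<int32_t> offsets{0};
  std::string data;

  void Append(std::string_view s) {
    data.append(s);
    offsets.push_back(static_cast<int32_t>(data.size()));
  }
  std::size_t size() const { return offsets.size() - 1; }
  std::string_view Get(std::size_t i) const {
    const auto begin = static_cast<std::size_t>(offsets[i]);
    const auto end = static_cast<std::size_t>(offsets[i + 1]);
    return std::string_view(data).substr(begin, end - begin);
  }
};

// Take on the offset layout: every selected string's bytes are copied
// and the offsets are rebuilt.
OffsetStrings TakeOffsets(const OffsetStrings& in,
                          const std::vector<int64_t>& indices) {
  OffsetStrings out;
  for (int64_t i : indices) {
    assert(i >= 0 && static_cast<std::size_t>(i) < in.size());
    out.Append(in.Get(static_cast<std::size_t>(i)));
  }
  return out;
}

// Take on the view layout: only fixed-size views are copied; the
// character data is shared with the input and must outlive the result.
std::vector<std::string_view> TakeViews(
    const std::vector<std::string_view>& in,
    const std::vector<int64_t>& indices) {
  std::vector<std::string_view> out;
  out.reserve(indices.size());
  for (int64_t i : indices) {
    assert(i >= 0 && static_cast<std::size_t>(i) < in.size());
    out.push_back(in[static_cast<std::size_t>(i)]);
  }
  return out;
}
```

The cost moves to the application boundary: turning the result of TakeViews back into the offset layout is the StringView->String conversion discussed above, and whether the trade pays off is the thing the implementation work is meant to measure.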