Re: [DISCUSS][C++] Raw pointer string views

Felipe Oliveira Carvalho Thu, 28 Sep 2023 11:13:55 -0700

My take here is that Ben did an excellent job of hiding the fact that C++
has two variations of the format, without leaking the pointer version
through the interfaces by which Arrow arrays are communicated to other
implementations.


As things stand right now, there is no zero-copy transfer of pointer-based
string views. Ben can give the final authoritative answer on this. The idea
of zero-copy transfers was discussed but decided against to avoid adding a
format to the spec that can't be implemented by languages that can't cast
arbitrary memory bytes to objects (the case for many languages that are not
C or C++).

Having established that the spec is not "polluted" by a format that only
systems-languages can implement, we can look at the constraint of keeping
implementations completely faithful to the spec:

Pros:
- The reference implementations serve as an alternative to the spec text
by being a one-to-one translation of the spec

Cons:
- Performance loss (it's hard to predict how many optimizations can be lost
by forcing an extra memory indirection when looping)
- Insensitivity to the ergonomics afforded by the language
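
To make the "extra memory indirection" concrete, here is a minimal sketch
of the two 16-byte view shapes for strings longer than 12 bytes (shorter
strings are stored inline in both variants and are left out). This is not
the Arrow C++ implementation: the struct and function names are
hypothetical, and the buffer list is modelled as a plain vector of
pointers purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>
#include <vector>

// Hypothetical names, for illustration only.

// Index/offset variant: the view names a data buffer and a position in it.
struct IndexOffsetView {
  int32_t length;
  char prefix[4];        // first four bytes of the string
  int32_t buffer_index;  // which data buffer holds the bytes
  int32_t offset;        // where the string starts in that buffer
};
static_assert(sizeof(IndexOffsetView) == 16, "views are 16 bytes");

// Raw-pointer variant: the view points at the bytes directly.
// It is also 16 bytes when pointers are 64 bits wide.
struct RawPointerView {
  int32_t length;
  char prefix[4];
  const char* data;
};

// Index/offset: the buffer list must be consulted before the bytes
// can be reached (this is the extra indirection).
inline std::string_view Get(const IndexOffsetView& v,
                            const std::vector<const char*>& buffers) {
  return {buffers[v.buffer_index] + v.offset,
          static_cast<std::size_t>(v.length)};
}

// Raw pointer: the view alone is enough.
inline std::string_view Get(const RawPointerView& v) {
  return {v.data, static_cast<std::size_t>(v.length)};
}

// Index/offset -> raw pointer ("swizzle"): no string bytes are copied,
// only the 16-byte view is rewritten.
inline RawPointerView Swizzle(const IndexOffsetView& v,
                              const std::vector<const char*>& buffers) {
  assert(v.length > 12);  // inline (short) views are not handled here
  RawPointerView out;
  out.length = v.length;
  std::memcpy(out.prefix, v.prefix, sizeof(out.prefix));
  out.data = buffers[v.buffer_index] + v.offset;
  return out;
}
```

Going the other way is the expensive direction: a raw pointer may refer to
memory outside any buffer the array owns, so the bytes generally have to
be copied into a buffer before an index and offset can be assigned.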

Variations are bound to happen any time a language doesn't afford good
usability without conversions every time the data is used. In JavaScript,
for instance, the use of UTF-16 is much more widespread than the use of
UTF-8. It would make sense for a JavaScript implementation to keep string
arrays in UTF-16 at rest.

Sometimes software specs are accompanied by two types of implementations:
the reference implementation that tries to be simple and didactic; and
implementations used in practice because they are allowed to deviate
internally, doing things in a more complicated way than the spec requires,
to achieve some practical advantage. Are all the implementations in the
apache/arrow repository of the first kind?

--
Felipe

On Thu, Sep 28, 2023 at 1:10 PM Andrew Lamb <[email protected]> wrote:

> > What this PR is creating is an "unofficial" Arrow format, with data
> types exposed in Arrow C++ that are not part of the Arrow standard, but
> are exposed as if they were.
>
> I agree with Antoine here. It seems a pretty clear-cut case: the C++
> implementation doesn't follow the spec, and thus we should either
> 1.  Update the standard to allow raw pointers
> 2.  fix the C++ implementation to not have them / treat them as though they
> were
>
> If the core usecase is "arrow has the same in memory format used by DuckDB
> and Velox, and those systems can't/won't change their implementations" it
> seems like the only path forward for that usecase is to adopt their model
> (raw pointers) directly. Maybe I am missing something
>
>
> Andrew
>
>
>
>
>
>
> On Thu, Sep 28, 2023 at 11:11 AM Raphael Taylor-Davies
> <[email protected]> wrote:
>
> > FWIW Rust wouldn't have issues using raw pointers, I can't speak for
> other
> > languages though. They would be more expensive to validate, but
> validation
> > is not going to be cheap regardless.
> >
> > I could definitely see a world where view types use pointers and IPC
> > coerces to/from the large non-view types. IPC has to copy the string data
> > regardless and re-encoding would avoid encoding masked data.
> >
> > The notion of supporting both is less of an exciting prospect... I'm also
> > not sure if it is too late to make changes at this stage.
> >
> > On 28 September 2023 15:26:57 BST, Wes McKinney <[email protected]>
> > wrote:
> > >hi all,
> > >
> > >I'm just catching up on this thread after having taken a look at the
> > format
> > >PRs, the C++ implementation PR, and this e-mail thread. So only my $0.02
> > >from having spent a great deal less time on this project than others.
> > >
> > >The original motivation I had for bringing up the idea of adding the
> > >StringView concept from DuckDB / Velox / UmbraDB to the Arrow in-memory
> > >format (though not necessarily the IPC format) was to provide a path for
> > >zero-copy interoperability in some cases with these systems when dealing
> > >with strings, and to enhance performance within Arrow-applications
> > (setting
> > >aside the external interop goal) in scenarios where being able to point
> to
> > >external memory spaces could avoid a copy-and-repack step. I think it's
> > >useful to have a zero-copy IPC-compatible string format (i.e. what was
> > >proposed and merged into Columnar.rst) that allows for out-of-order
> > >construction of arrays, reuse of memory (e.g. consider the case of
> > decoding
> > >dictionary-encoded Parquet data, not having to copy strings many times
> > >when rehydrating string arrays), and chunked allocation — all good
> things
> > >that the existing Arrow VarBinary layout does not provide for.
> > >
> > >For the in-memory side of things, I am somewhat more of Antoine's
> > >perspective that trying to have both in-memory variants (index+offset and
> > >raw pointers) creates a kind of uncanny valley situation that may confuse
> > users
> > >and cause other problems (especially if the raw pointer version is only
> > >found in the C++ library). The raw pointer version also cannot be
> > >validated, but I see validation as less of a requirement and more of a
> > >"nice to have" (I realize others see validation as more of a
> requirement).
> > >
> > >* I see the raw-pointer type as having more net utility (going back to
> > the
> > >original motivation), but I also see how it is problematic for some
> > non-C++
> > >implementations.
> > >* The index-offset version has intrinsic value over the existing "dense"
> > >varbinary layout (per some of the benefits above) but does not satisfy
> the
> > >external interoperability goal with systems that are becoming more
> popular
> > >month over month
> > >* Incoming data from external systems that use the raw pointer model
> have
> > >to be serialized (and perhaps repacked) to the index-offset model. This
> > >isn't ideal — going the other way (from index-offset to raw pointer) is
> > >just a pointer swizzle, comparatively inexpensive.
> > >
> > >So it seems like we have several paths available, none of them wholly
> > >satisfactory:
> > >
> > >1. Essentially what's in the existing PR — the raw pointer variant which
> > is
> > >"non-standard"
> > >2. Pick one and only one for in memory — I think the raw pointer version
> > is
> > >more useful given that swizzling from index-offset is pretty cheap. But
> > the
> > >raw pointer version can't be validated safely and is problematic for
> e.g.
> > >Rust. Picking the index-offset version means that the external ecosystem
> > of
> > >columnar engines won't be that much closer aligned to Arrow than they
> are
> > >now.
> > >3. Implement the raw pointer variant as an extension type in C++ / C
> ABI.
> > >This seems potentially useful but given that it would likely be
> disfavored
> > >for data originating from Arrow-land, there would be fewer scenarios
> where
> > >zero-copy interop for strings is achieved
> > >
> > >This is difficult and I don't know what the best answer is, but
> personally
> > >my inclination has been toward choices that are utilitarian and help
> with
> > >alignment and cohesion in the open source ecosystem.
> > >
> > >- Wes
> > >
> > >On Thu, Sep 28, 2023 at 5:20 AM Antoine Pitrou <[email protected]>
> > wrote:
> > >
> > >>
> > >> To make things clear, any of the factory functions listed below
> create a
> > >> type that maps exactly onto an Arrow columnar layout:
> > >>
> >
> https://arrow.apache.org/docs/dev/cpp/api/datatype.html#factory-functions
> > >>
> > >> For example, calling `arrow::dictionary` creates a dictionary type
> that
> > >> exactly represents the dictionary layout specified in
> > >>
> > >>
> >
> https://arrow.apache.org/docs/dev/format/Columnar.html#dictionary-encoded-layout
> > >>
> > >> Similarly, if you use any of the builders listed below, what you will
> > >> get at the end is data that complies with the Arrow columnar
> > specification:
> > >> https://arrow.apache.org/docs/dev/cpp/api/builder.html
> > >>
> > >> All the core Arrow C++ APIs create and process data which complies
> with
> > >> the Arrow specification, and which is interoperable with other Arrow
> > >> implementations.
> > >>
> > >> Conversely, non-Arrow data such as CSV or Parquet (or Python lists,
> > >> etc.) goes through dedicated converters. There is no ambiguity.
> > >>
> > >>
> > >> Creating top-level utilities that create non-Arrow data introduces
> > >> confusion and ambiguity as to what Arrow is. Users who haven't studied
> > >> the spec in detail - which is probably most users of Arrow
> > >> implementations - will call `arrow::string_view(raw_pointers=true)`
> and
> > >> might later discover that their data cannot be shared with other
> > >> implementations (or, if it can, there will be an unsuspected
> conversion
> > >> cost at the edge).
> > >>
> > >> It also creates a risk of introducing a parallel Arrow-like ecosystem
> > >> based on the superset of data layouts understood by Arrow C++. People
> > >> may feel encouraged to code for that ecosystem, pessimizing
> > >> interoperability with non-C++ runtimes.
> > >>
> > >> Which is why I think those APIs, however convenient, also go against
> the
> > >> overarching goals of the Arrow project.
> > >>
> > >>
> > >> If we want to keep such convenience APIs as part of Arrow C++, they
> > >> should be clearly flagged as being non-Arrow compliant.
> > >>
> > >> It could be by naming (e.g. `arrow::non_arrow_string_view()`) or by
> > >> specific namespacing (e.g. `non_arrow::raw_pointers_string_view()`).
> > >>
> > >> But they could also be provided by a distinct library.
> > >>
> > >> Regards
> > >>
> > >> Antoine.
> > >>
> > >>
> > >>
> > >> Le 28/09/2023 à 09:01, Antoine Pitrou a écrit :
> > >> >
> > >> > Hi Ben,
> > >> >
> > >> > Le 27/09/2023 à 23:25, Benjamin Kietzman a écrit :
> > >> >>
> > >> >> @Antoine
> > >> >>> What this PR is creating is an "unofficial" Arrow format, with
> data
> > >> >> types exposed in Arrow C++ that are not part of the Arrow standard,
> > but
> > >> >> are exposed as if they were.
> > >> >>
> > >> >> We already do this in every implementation of the arrow format I'm
> > >> >> aware of: it's more convenient to consider dictionary as a data
> type
> > >> >> even though the spec says that it is a field property.
> > >> >
> > >> > I'm not sure I understand your point. Dictionary encoding is part of
> > the
> > >> > Arrow spec, and considering it as a data type is an API choice that
> > does
> > >> > not violate the spec.
> > >> >
> > >> > Raw pointers in string views is just not an Arrow format.
> > >> >
> > >> > Regards
> > >> >
> > >> > Antoine.
> > >>
> >
>
