Re: [DISCUSS][C++] Raw pointer string views

Ian Cook Fri, 29 Sep 2023 11:11:05 -0700

I strongly agree with Ben's assertion that "the risk of a parallel
ecosystem… is more likely to be provoked by excluding a user's vital
use case [than by implementing support for an unofficial layout
variant]" in the C++ library. But there seems to be a consensus here
that there is a real risk of sowing confusion. Thank you Ben for your
readiness to consider the suggested approaches for reducing this risk.


I would also assert that another way to reduce this risk is to add
some prose to the relevant sections of the columnar format
specification doc to clearly explain that a raw pointers variant of
the layout, while not part of the official spec, may be implemented in
some Arrow libraries.

Ian

On Thu, Sep 28, 2023 at 2:14 PM Felipe Oliveira Carvalho
<felipe...@gmail.com> wrote:
>
> My take here is that Ben did an excellent job in hiding the fact that C++
> has two variations of the format without leaking the pointer version via
> the interfaces through which Arrow arrays are communicated to other
> implementations.
>
> As things stand right now, there is no zero-copy transfer of pointer-based
> string views. Ben can give the final authoritative answer on this. The idea
> of zero-copy transfers was discussed but decided against to avoid adding a
> format to the spec that can't be implemented by languages that can't cast
> arbitrary memory bytes to objects (the case for many languages that are not
> C or C++).
>
> Having established that the spec is not "polluted" by a format that only
> systems-languages can implement, we can look at the constraint of keeping
> implementations completely faithful to the spec:
>
> Pros:
>  - The reference implementations serve as an alternative to the spec text
> in being a one-to-one translation of the spec
>
> Cons:
> - Performance loss (it's hard to predict how many optimizations can be lost
> by forcing an extra memory indirection when looping)
> - Insensibility to the ergonomics afforded by the language
>
> Variations are bound to happen any time a language doesn't afford good
> usability without conversions every time the data is used. In JavaScript,
> for instance, the use of UTF-16 is much more widespread than the use of
> UTF-8. It would make sense for a JavaScript implementations to keep string
> arrays in UTF-16 at rest.
>
> Sometimes software specs are accompanied by two types of implementations:
> the reference implementation that tries to be simple and didactic; and
> implementations used in practice because they are allowed to deviate
> internally, doing things in a more complicated way than the spec requires,
> to achieve some practical advantage. Are all the implementations in the
> apache/arrow of the first kind?
>
> --
> Felipe
>
> On Thu, Sep 28, 2023 at 1:10 PM Andrew Lamb <al...@influxdata.com> wrote:
>
> > > What this PR is creating is an "unofficial" Arrow format, with data
> > types exposed in Arrow C++ that are not part of the Arrow standard, but
> > are exposed as if they were.
> >
> > I agree with Antoine here. It seems a pretty clear cut story of the C++
> > implementation doesn't follow the spec and thus we should either
> > 1.  Update the standard to allow raw pointers
> > 2.  fix the C++ implementation to not have them / treat them as though they
> > were
> >
> > If the core usecase is "arrow has the same in memory format used by DuckDB
> > and Velox, and those systems can't/won't change their implementations" it
> > seems like the only path forward for that usecase is to adopt their model
> > (raw pointers) directly. Maybe I am missing something
> >
> >
> > Andrew
> >
> >
> >
> >
> >
> >
> > On Thu, Sep 28, 2023 at 11:11 AM Raphael Taylor-Davies
> > <r.taylordav...@googlemail.com.invalid> wrote:
> >
> > > FWIW Rust wouldn't have issues using raw pointers, I can't speak for
> > other
> > > languages though. They would be more expensive to validate, but
> > validation
> > > is not going to be cheap regardless.
> > >
> > > I could definitely see a world where view types use pointers and IPC
> > > coerces to/from the large non-view types. IPC has to copy the string data
> > > regardless and re-encoding would avoid encoding masked data.
> > >
> > > The notion of supporting both is less of an exciting prospect... I'm also
> > > not sure if it is too late to make changes at this stage.
> > >
> > > On 28 September 2023 15:26:57 BST, Wes McKinney <wesmck...@gmail.com>
> > > wrote:
> > > >hi all,
> > > >
> > > >I'm just catching up on this thread after having taken a look at the
> > > format
> > > >PRs, the C++ implementation PR, and this e-mail thread. So only my $0.02
> > > >from having spent a great deal less time on this project than others.
> > > >
> > > >The original motivation I had for bringing up the idea of adding the
> > > >StringView concept from DuckDB / Velox / UmbraDB to the Arrow in-memory
> > > >format (though not necessarily the IPC format) was to provide a path for
> > > >zero-copy interoperability in some cases with these systems when dealing
> > > >with strings, and to enhance performance within Arrow-applications
> > > (setting
> > > >aside the external interop goal) in scenarios where being able to point
> > to
> > > >external memory spaces could avoid a copy-and-repack step. I think it's
> > > >useful to have an zero-copy IPC-compatible string format (i.e. what was
> > > >proposed and merged into Columnar.rst) for that allows for out-of-order
> > > >construction or arrays, reuse of memory (e.g. consider the case of
> > > decoding
> > > >dictionary encoding Parquet data — not having to copy strings many times
> > > >when rehydrating string arrays), and chunked allocation — all good
> > things
> > > >that the existing Arrow VarBinary layout does not provide for.
> > > >
> > > >For the in-memory side of things, I am somewhat more of Antoine's
> > > >perspective that trying to have both in-memory (index+offset and raw
> > > >pointers) creates a kind of uncanny valley situation that may confuse
> > > users
> > > >and cause other problems (especially if the raw pointer version is only
> > > >found in the C++ library). The raw pointer version also cannot be
> > > >validated, but I see validation as less of a requirement and more of a
> > > >"nice to have" (I realize others see validation as more of a
> > requirement).
> > > >
> > > >* I see the raw-pointer type has having more net utility (going back to
> > > the
> > > >original motivation), but I also see how it is problematic for some
> > > non-C++
> > > >implementations.
> > > >* The index-offset version is intrinsic value over the existing "dense"
> > > >varbinary layout (per some of the benefits above) but does not satisfy
> > the
> > > >external interoperability goal with systems that are becoming more
> > popular
> > > >month over month
> > > >* Incoming data from external systems that use the raw pointer model
> > have
> > > >to be serialized (and perhaps repacked) to the index-offset model. This
> > > >isn't ideal — going the other way (from index-offset to raw pointer) is
> > > >just a pointer swizzle, comparatively inexpensive.
> > > >
> > > >So it seems like we have several paths available, none of them wholly
> > > >satisfactory:
> > > >
> > > >1. Essentially what's in the existing PR — the raw pointer variant which
> > > is
> > > >"non-standard"
> > > >2. Pick one and only one for in memory — I think the raw pointer version
> > > is
> > > >more useful given that swizzling from index-offset is pretty cheap. But
> > > the
> > > >raw pointer version can't be validated safely and is problematic for
> > e.g.
> > > >Rust. Picking the index-offset version means that the external ecosystem
> > > of
> > > >columnar engines won't be that much closer aligned to Arrow than they
> > are
> > > >now.
> > > >3. Implement the raw pointer variant as an extension type in C++ / C
> > ABI.
> > > >This seems potentially useful but given that it would likely be
> > disfavored
> > > >for data originating from Arrow-land, there would be fewer scenarios
> > where
> > > >zero-copy interop for strings is achieved
> > > >
> > > >This is difficult and I don't know what the best answer is, but
> > personally
> > > >my inclination has been toward choices that are utilitarian and help
> > with
> > > >alignment and cohesion in the open source ecosystem.
> > > >
> > > >- Wes
> > > >
> > > >On Thu, Sep 28, 2023 at 5:20 AM Antoine Pitrou <anto...@python.org>
> > > wrote:
> > > >
> > > >>
> > > >> To make things clear, any of the factory functions listed below
> > create a
> > > >> type that maps exactly onto an Arrow columnar layout:
> > > >>
> > >
> > https://arrow.apache.org/docs/dev/cpp/api/datatype.html#factory-functions
> > > >>
> > > >> For example, calling `arrow::dictionary` creates a dictionary type
> > that
> > > >> exactly represents the dictionary layout specified in
> > > >>
> > > >>
> > >
> > https://arrow.apache.org/docs/dev/format/Columnar.html#dictionary-encoded-layout
> > > >>
> > > >> Similarly, if you use any of the builders listed below, what you will
> > > >> get at the end is data that complies with the Arrow columnar
> > > specification:
> > > >> https://arrow.apache.org/docs/dev/cpp/api/builder.html
> > > >>
> > > >> All the core Arrow C++ APIs create and process data which complies
> > with
> > > >> the Arrow specification, and which is interoperable with other Arrow
> > > >> implementations.
> > > >>
> > > >> Conversely, non-Arrow data such as CSV or Parquet (or Python lists,
> > > >> etc.) goes through dedicated converters. There is no ambiguity.
> > > >>
> > > >>
> > > >> Creating top-level utilities that create non-Arrow data introduces
> > > >> confusion and ambiguity as to what Arrow is. Users who haven't studied
> > > >> the spec in detail - which is probably most users of Arrow
> > > >> implementations - will call `arrow::string_view(raw_pointers=true)`
> > and
> > > >> might later discover that their data cannot be shared with other
> > > >> implementations (or, if it can, there will be an unsuspected
> > conversion
> > > >> cost at the edge).
> > > >>
> > > >> It also creates a risk of introducing a parallel Arrow-like ecosystem
> > > >> based on the superset of data layouts understood by Arrow C++. People
> > > >> may feel encouraged to code for that ecosystem, pessimizing
> > > >> interoperability with non-C++ runtimes.
> > > >>
> > > >> Which is why I think those APIs, however convenient, also go against
> > the
> > > >> overarching goals of the Arrow project.
> > > >>
> > > >>
> > > >> If we want to keep such convenience APIs as part of Arrow C++, they
> > > >> should be clearly flagged as being non-Arrow compliant.
> > > >>
> > > >> It could be by naming (e.g. `arrow::non_arrow_string_view()`) or by
> > > >> specific namespacing (e.g. `non_arrow::raw_pointers_string_view()`).
> > > >>
> > > >> But, they could be also be provided by a distinct library.
> > > >>
> > > >> Regards
> > > >>
> > > >> Antoine.
> > > >>
> > > >>
> > > >>
> > > >> Le 28/09/2023 à 09:01, Antoine Pitrou a écrit :
> > > >> >
> > > >> > Hi Ben,
> > > >> >
> > > >> > Le 27/09/2023 à 23:25, Benjamin Kietzman a écrit :
> > > >> >>
> > > >> >> @Antoine
> > > >> >>> What this PR is creating is an "unofficial" Arrow format, with
> > data
> > > >> >> types exposed in Arrow C++ that are not part of the Arrow standard,
> > > but
> > > >> >> are exposed as if they were.
> > > >> >>
> > > >> >> We already do this in every implementation of the arrow format I'm
> > > >> >> aware of: it's more convenient to consider dictionary as a data
> > type
> > > >> >> even though the spec says that it is a field property.
> > > >> >
> > > >> > I'm not sure I understand your point. Dictionary encoding is part of
> > > the
> > > >> > Arrow spec, and considering it as a data type is an API choice that
> > > does
> > > >> > not violate the spec.
> > > >> >
> > > >> > Raw pointers in string views is just not an Arrow format.
> > > >> >
> > > >> > Regards
> > > >> >
> > > >> > Antoine.
> > > >>
> > >
> >

Re: [DISCUSS][C++] Raw pointer string views

Reply via email to