Re: [DISCUSS][C++] Raw pointer string views

Weston Pace Fri, 06 Oct 2023 08:47:33 -0700

> I feel the broader question here is what is Arrow's intended use case -
> interchange or execution


The line between interchange and execution is not always clear.  For
example, I think we would like Arrow to be considered as a standard for UDF
libraries.

On Fri, Oct 6, 2023 at 7:34 AM Mark Raasveldt <m...@duckdblabs.com> wrote:

> For the index vs pointer question - DuckDB went with pointers as they are
> more flexible, and DuckDB was designed to consume data (and strings) from a
> wide variety of formats in a wide variety of languages. Pointers allow us
> to easily zero-copy from e.g. Python strings, R strings, Arrow strings,
> etc. The flip side of pointers is that ownership has to be handled very
> carefully. Our vector format is an execution-only format, and never leaves
> the internals of the engine. This greatly simplifies ownership as we are in
> complete control of what happens inside the engine. For an interchange
> format that is intended for handing data between engines, I can see this
> being more complicated and verification being more important.
>
> As for the actual change:
>
> From an interchange perspective from DuckDB's side - the proposed
> zero-copy integration would definitely speed up the conversion of DuckDB
> string vectors to Arrow string vectors. In a recent benchmark that we have
> performed we have found string conversion to Arrow vectors to be a
> bottleneck in certain workloads, although we have not sufficiently
> researched if this could be improved in other ways. It is possible this can
> be alleviated without requiring changes to Arrow.
>
> However - in general, a new string vector format is only useful if
> consumers also support the format. If the consumer immediately converts the
> strings back into the standard Arrow string representation then there is no
> benefit. The change will only move where the conversion happens (from
> inside DuckDB to inside the consumer). As such, this change is only useful
> if the broader Arrow ecosystem moves towards supporting the new string
> format.
>
> From an execution perspective from DuckDB's side - it is unlikely that we
> will switch to using Arrow as an internal format at this stage of the
> project. While this change increases Arrow's utility as an intermediate
> execution format, that is more relevant to projects that are currently
> using Arrow in this manner or are planning to use Arrow in this manner.
>
> I feel the broader question here is what is Arrow's intended use case -
> interchange or execution - as they are opposed in this discussion. This
> change improves Arrow's utility as an execution format at the expense of
> stability in the interchange format. From my perspective Arrow is more
> useful as an interchange format. When different tools communicate with each
> other a standard is required. An execution format is generally not exposed
> outside of the internals of the execution engine. Engines can do whatever
> they want here - and a standard is perhaps not as useful.
>
> On 2023/10/02 13:21:59 Andrew Lamb wrote:
> > > I don't think "we have to adjust the Arrow format so that existing
> > > internal representations become Arrow-compliant without any
> > > (re-)implementation effort" is a reasonable design principle.
> >
> > I agree with this statement from Antoine -- given the Arrow community has
> > standardized an addition to the format with StringView, I think it would
> > help to get some input from those at DuckDB and Velox on their perspective.
> >
> > Andrew
> >
> >
> >
> >
> > On Mon, Oct 2, 2023 at 9:17 AM Raphael Taylor-Davies
> > <r....@googlemail.com.invalid> wrote:
> >
> > > Oh I'm with you on it being a precedent we want to be very careful about
> > > setting, but if there isn't a meaningful performance difference, we may
> > > be able to sidestep that discussion entirely.
> > >
> > > On 02/10/2023 14:11, Antoine Pitrou wrote:
> > > >
> > > > > Even if performance were significantly better, I don't think it's a
> > > > > good enough reason to add these representations to Arrow. By
> > > > > construction, a standard cannot continuously chase the performance
> > > > > state of the art; it has to weigh the benefits of performance
> > > > > improvements against the increased cost for the ecosystem (for example
> > > > > the cost of adapting to frequent standard changes and a growing
> > > > > standard size).
> > > > >
> > > > > We have extension types which could reasonably be used for
> > > > > non-standard data types, especially the kind that are motivated by
> > > > > leading-edge performance research and innovation and come with unusual
> > > > > constraints (such as requiring trusting and dereferencing raw pointers
> > > > > embedded in data buffers). There could even be an argument for making
> > > > > some of them canonical extension types if there's enough anteriority
> > > > > in favor.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > Le 02/10/2023 à 15:00, Raphael Taylor-Davies a écrit :
> > > > >> I think what would really help would be some concrete numbers. Do we
> > > > >> have any numbers comparing the performance of the offset- and
> > > > >> pointer-based representations? If there isn't a significant
> > > > >> performance difference between them, would the systems that currently
> > > > >> use a pointer-based approach be willing to meet us in the middle and
> > > > >> switch to an offset-based encoding? This to me feels like it would be
> > > > >> the best outcome for the ecosystem as a whole.
> > > >>
> > > >> Kind Regards,
> > > >>
> > > >> Raphael
> > > >>
> > > >> On 02/10/2023 13:50, Antoine Pitrou wrote:
> > > >>>
> > > >>> Le 01/10/2023 à 16:21, Micah Kornfield a écrit :
> > > >>>>>
> > > > >>>>> I would also assert that another way to reduce this risk is to add
> > > > >>>>> some prose to the relevant sections of the columnar format
> > > > >>>>> specification doc to clearly explain that a raw pointers variant of
> > > > >>>>> the layout, while not part of the official spec, may be
> > > > >>>>> implemented in some Arrow libraries.
> > > >>>>
> > > >>>> I've lost a little context but on all the concerns of adding raw
> > > >>>> pointers
> > > >>>> as an official option to the spec.  But I see making raw-pointer
> > > >>>> variants
> > > >>>> the best path forward.
> > > >>>>
> > > >>>> Things captured from this thread or seem obvious at least to me:
> > > >>>> 1.  Divergence of IPC spec from in-memory/C-ABI spec?
> > > >>>> 2.  More parts of the spec to cover.
> > > > >>>> 3.  Incompatibility with some languages
> > > >>>> 4.  Validation (in my mind different use-cases require different
> > > >>>> levels of
> > > >>>> validation, so this is a little bit less of a concern in my mind).
> > > >>>>
> > > > >>>> I think the broader issue is how we think about compatibility with
> > > > >>>> other systems.  For instance, what happens if Velox and DuckDB start
> > > > >>>> adding new divergent memory layouts?  Are we expecting to add them to
> > > > >>>> the spec?
> > > >>>
> > > > >>> This is a slippery slope. The more Arrow has a policy of integrating
> > > > >>> existing practices simply because they exist, the more the Arrow
> > > > >>> format will become _à la carte_, with different implementations
> > > > >>> choosing to implement whatever they want to spend their engineering
> > > > >>> effort on (you can see this occur, in part, in the Parquet format with
> > > > >>> its many different encodings, compression algorithms and a 96-bit
> > > > >>> timestamp type).
> > > >>>
> > > > >>> We _have_ to think carefully about the middle- and long-term future
> > > > >>> of the format when adopting new features.
> > > >>>
> > > > >>> In this instance, we are doing a large part of the effort by adopting
> > > > >>> a string view format with variadic buffers, inlined prefixes and
> > > > >>> offset-based views into those buffers. But some implementations with
> > > > >>> historically different internal representations will have to share
> > > > >>> part of the effort to align with the newly standardized format.
> > > >>>
> > > >>> I don't think "we have to adjust the Arrow format so that existing
> > > >>> internal representations become Arrow-compliant without any
> > > >>> (re-)implementation effort" is a reasonable design principle.
> > > >>>
> > > >>> Regards
> > > >>>
> > > >>> Antoine.
> > >
> >
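
For readers following the index vs pointer question in the thread above, here
is a rough C++ sketch of the two 16-byte view layouts being compared. It is
only an illustration under stated assumptions: every name in it (View16,
MakeOffsetView, MakePointerView, IsValidOffsetView, ReadOffsetView,
ReadPointerView) is hypothetical and is not the API of Arrow, DuckDB or Velox.
It assumes a 4-byte length, strings of up to 12 bytes stored inline, and
otherwise a 4-byte prefix followed by either a (buffer index, offset) pair or
a 64-bit raw pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical names throughout; a sketch of the two layouts, not a library API.
constexpr std::size_t kInlineLimit = 12;  // strings up to 12 bytes are stored inline

struct View16 {
  unsigned char bytes[16];
};

static_assert(sizeof(const char*) == 8, "sketch assumes 64-bit pointers");

// Shared part: length in bytes 0-3, then the whole string (short case)
// or a 4-byte prefix in bytes 4-7 (long case).
inline View16 MakeCommon(std::string_view s) {
  View16 v{};
  const std::int32_t len = static_cast<std::int32_t>(s.size());
  std::memcpy(v.bytes, &len, 4);
  if (!s.empty()) {
    std::memcpy(v.bytes + 4, s.data(), s.size() <= kInlineLimit ? s.size() : 4);
  }
  return v;
}

inline std::int32_t Length(const View16& v) {
  std::int32_t len;
  std::memcpy(&len, v.bytes, 4);
  return len;
}

// Offset-based view: a long string is located by (buffer index, offset).
inline View16 MakeOffsetView(std::string_view s, std::int32_t buffer_index,
                             std::int32_t offset) {
  View16 v = MakeCommon(s);
  if (s.size() > kInlineLimit) {
    std::memcpy(v.bytes + 8, &buffer_index, 4);
    std::memcpy(v.bytes + 12, &offset, 4);
  }
  return v;
}

// Pointer-based view: a long string is located by its raw address, so any
// memory can be referenced without copying, but ownership is the caller's job.
inline View16 MakePointerView(std::string_view s) {
  View16 v = MakeCommon(s);
  if (s.size() > kInlineLimit) {
    const char* p = s.data();
    std::memcpy(v.bytes + 8, &p, sizeof(p));
  }
  return v;
}

// An offset view can be checked against the buffers it claims to reference.
inline bool IsValidOffsetView(const View16& v,
                              const std::vector<std::string>& buffers) {
  const std::int32_t len = Length(v);
  if (len < 0) return false;
  if (static_cast<std::size_t>(len) <= kInlineLimit) return true;
  std::int32_t idx, off;
  std::memcpy(&idx, v.bytes + 8, 4);
  std::memcpy(&off, v.bytes + 12, 4);
  if (idx < 0 || static_cast<std::size_t>(idx) >= buffers.size()) return false;
  if (off < 0) return false;
  return static_cast<std::size_t>(off) + static_cast<std::size_t>(len) <=
         buffers[static_cast<std::size_t>(idx)].size();
}

// The result borrows from `v` (short case) or from `buffers` (long case).
inline std::string_view ReadOffsetView(const View16& v,
                                       const std::vector<std::string>& buffers) {
  assert(IsValidOffsetView(v, buffers));
  const std::size_t len = static_cast<std::size_t>(Length(v));
  if (len <= kInlineLimit) return {reinterpret_cast<const char*>(v.bytes + 4), len};
  std::int32_t idx, off;
  std::memcpy(&idx, v.bytes + 8, 4);
  std::memcpy(&off, v.bytes + 12, 4);
  return std::string_view(buffers[static_cast<std::size_t>(idx)])
      .substr(static_cast<std::size_t>(off), len);
}

// A pointer view has nothing to validate against: the reader must trust it.
inline std::string_view ReadPointerView(const View16& v) {
  const std::size_t len = static_cast<std::size_t>(Length(v));
  if (len <= kInlineLimit) return {reinterpret_cast<const char*>(v.bytes + 4), len};
  const char* p;
  std::memcpy(&p, v.bytes + 8, sizeof(p));
  return {p, len};
}
```

The sketch is meant to make the trade-off concrete: the offset form can be
bounds-checked by a consumer that was handed the buffers, while the pointer
form can reference any memory without a copy but offers nothing to check the
address against.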
