Re: [DISCUSS][Format] Draft implementation of string view array format

Will Jones Thu, 15 Jun 2023 10:00:55 -0700

Hi Ben,

It's exciting to see this move along.


The buffers will be duplicated. If buffer duplication is becomes a concern,
> I'd prefer to handle
> that in the ipc writer. Then buffers which are duplicated could be detected
> by checking
> pointer identity and written only once.


Question: to be able to write buffer only once and reference in multiple
arrays, does that require a change to the IPC format? Or is sharing buffers
within the same batch already allowed in the IPC format?

Best,

Will Jones

On Thu, Jun 15, 2023 at 9:03 AM Benjamin Kietzman <[email protected]>
wrote:

> Hello again all,
>
> The PR [1] to add string view to the format and the C++ implementation is
> hovering around passing CI and has been undrafted. Furthermore, there is
> now also a PR [2] to add string view to the Go implementation. Code review
> is underway for each PR and I'd like to move toward a vote for acceptance-
> are there any other preliminaries which I've neglected?
>
> To reiterate the answers to some past questions:
> - Benchmarks are added in the C++ PR [1] to demonstrate the performance of
>   conversion between the various string formats. In addition, there are
>   some benchmarks which demonstrate the performance gains available with
>   the new format [3].
> - Adding string view to the C ABI is a natural follow up, but should be
>   handled independently. An issue has been added to track that
>   enhancement [4].
>
> Sincerely,
> Ben Kietzman
>
> [1] https://github.com/apache/arrow/pull/35628
> [2] https://github.com/apache/arrow/pull/35769
> [3] https://github.com/apache/arrow/pull/35628#issuecomment-1583218617
> [4] https://github.com/apache/arrow/issues/36099
>
> On Wed, May 17, 2023 at 12:53 PM Benjamin Kietzman <[email protected]>
> wrote:
>
> > @Jacob
> > > You mention benchmarks multiple times, are these results published
> > somewhere?
> >
> > I benchmarked the performance of raw pointer vs index offset views in my
> > PR to velox,
> > I do intend to port them to my arrow PR but I haven't gotten there yet.
> > Furthermore, it
> > seemed less urgent to me since coexistence of the two types in the c++
> > implementation
> > defers the question of how aggressively one should be preferred over the
> > other.
> >
> > @Dewey
> > > I don't see the C Data interface in the PR
> >
> > I have not addressed the C ABI in this PR. As you mention, it may be
> > useful to transmit
> > arrays with raw pointer views between implementations which allow them. I
> > can address
> > this in a follow up PR.
> >
> > @Will
> > > If I understand correctly, multiple arrays can reference the same
> buffers
> > > in memory, but once they are written to IPC their data buffers will be
> > > duplicated. Is that right?
> > The buffers will be duplicated. If buffer duplication is becomes a
> > concern, I'd prefer to handle
> > that in the ipc writer. Then buffers which are duplicated could be
> > detected by checking
> > pointer identity and written only once.
> >
> >
> > On Wed, May 17, 2023 at 12:07 AM Will Jones <[email protected]>
> > wrote:
> >
> >> Hello Ben,
> >>
> >> Thanks for your work on this. I think this will be an excellent addition
> >> to
> >> the format.
> >>
> >> If I understand correctly, multiple arrays can reference the same
> buffers
> >> in memory, but once they are written to IPC their data buffers will be
> >> duplicated. Is that right?
> >>
> >> Dictionary types have a special message so they can be reused across
> >> batches and even fields. Did we consider adding a similar message for
> >> string view buffers?
> >>
> >> One relevant use case I'm curious about is substring extraction. For
> >> example, if I have a column of URIs and I create columns where I've
> >> extracted substrings like the hostname, path, and a list column of query
> >> parameters, I'd like for those latter columns to be views into the URI
> >> buffers, rather than full copies.
> >>
> >> However, I've never touched the IPC read code paths, so it's quite
> >> possible
> >> I'm overlooking something obvious.
> >>
> >> Best,
> >>
> >> Will Jones
> >>
> >>
> >> On Tue, May 16, 2023 at 6:29 PM Dewey Dunnington
> >> <[email protected]> wrote:
> >>
> >> > Very cool!
> >> >
> >> > In addition to performance mentioned above, I could see this being
> >> > useful for the R bindings - we already have a global string pool and a
> >> > mechanism for keeping a vector of them alive.
> >> >
> >> > I don't see the C Data interface in the PR although I may have missed
> >> > it - is that a part of the proposal? It seems like it would be
> >> > possible to use raw pointers as long as they can be guaranteed to be
> >> > valid until the release callback is called?
> >> >
> >> > On Tue, May 16, 2023 at 8:43 PM Jacob Wujciak
> >> > <[email protected]> wrote:
> >> > >
> >> > > Hello Everyone,
> >> > > I think keeping interoperability with the large ecosystem is a very
> >> > > important goal for arrow so I am overall in favor of this proposal!
> >> > >
> >> > > You mention benchmarks multiple times, are these results published
> >> > > somewhere?
> >> > >
> >> > > Thanks
> >> > >
> >> > > On Tue, May 16, 2023 at 11:39 PM Benjamin Kietzman <
> >> [email protected]>
> >> > > wrote:
> >> > >
> >> > > > Hello all,
> >> > > >
> >> > > > As previously discussed on this list [1], an UmbraDB/DuckDB/Velox
> >> > > > compatible
> >> > > > "string view" type could bring several performance benefits to
> >> access
> >> > and
> >> > > > authoring of string data in the arrow format [2]. Additionally
> >> better
> >> > > > interoperability with engines already using this format could be
> >> > > > established.
> >> > > >
> >> > > > PR #0 [3] adds Utf8View and BinaryView types to the C++
> >> implementation
> >> > and
> >> > > > to
> >> > > > the IPC format. For the purposes of IPC raw pointers are not used.
> >> > Instead,
> >> > > > each view contains a pair of 32 bit unsigned integers which encode
> >> the
> >> > > > index of
> >> > > > a character buffer (string view arrays may consist of a variable
> >> > number of
> >> > > > such buffers) and the offset of a view's data within that buffer
> >> > > > respectively.
> >> > > > Benefits of this substitution include:
> >> > > > - This makes explicit the guarantee that lifetime of all character
> >> > data is
> >> > > > equal
> >> > > >   to that of the array which views it, which is critical for
> >> confident
> >> > > >   consumption across an interface boundary.
> >> > > > - As with other types in the arrow format, such arrays are
> >> > serializable and
> >> > > >   venue agnostic; directly usable in shared memory without
> >> > modification.
> >> > > > - Indices and offsets are easily validated.
> >> > > >
> >> > > > Accessing the data requires some trivial pointer arithmetic, but
> in
> >> > > > benchmarking
> >> > > > this had negligible impact on sequential access and only minor
> >> impact
> >> > on
> >> > > > random
> >> > > > access.
> >> > > >
> >> > > > In the C++ implementation, raw pointer string views are supported
> >> as an
> >> > > > extended
> >> > > > case of the Utf8View type: `utf8_view(/*has_raw_pointers=*/true)`.
> >> > > > Branching on
> >> > > > this access pattern bit at the data type level has negligible
> >> impact on
> >> > > > performance since the branch resides outside any hot loops.
> Utility
> >> > > > functions
> >> > > > are provided for efficient (potentially in-place) conversion
> between
> >> > raw
> >> > > > pointer
> >> > > > and index offset views. For example, the C++ implementation could
> >> zero
> >> > copy
> >> > > > a raw pointer array from Velox, filter it, then convert to
> >> > index/offset for
> >> > > > serialization. Other implementations may choose to accommodate or
> >> > eschew
> >> > > > raw
> >> > > > pointer views as their communities direct.
> >> > > >
> >> > > > Where desirous in a rigorously controlled context this still
> enables
> >> > > > construction
> >> > > > and safe consumption of string view arrays which reference memory
> >> not
> >> > > > directly bound to the lifetime of the array. I'm not sure when or
> >> if we
> >> > > > would
> >> > > > find it useful to have arrays like this; I do not introduce any in
> >> > [3]. I
> >> > > > mention
> >> > > > this possibility to highlight that if benchmarking demonstrates
> that
> >> > such
> >> > > > an
> >> > > > approach brings a significant performance benefit to some
> operation,
> >> > the
> >> > > > only
> >> > > > barrier to its adoption would be code review. Likewise if more
> >> > intensive
> >> > > > benchmarking determines that raw pointer views critically
> outperform
> >> > > > index/offset
> >> > > > views for real-world analytics tasks, prioritizing raw pointer
> >> string
> >> > views
> >> > > > for usage within the C++ implementation will be straightforward.
> >> > > >
> >> > > > See also the proposal to Velox that their string view vector be
> >> > refactored
> >> > > > in a similar vein [4].
> >> > > >
> >> > > > Sincerely,
> >> > > > Ben Kietzman
> >> > > >
> >> > > > [1]
> >> https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
> >> > > > [2] http://cidrdb.org/cidr2020/papers/p29-neumann-cidr20.pdf
> >> > > > [3] https://github.com/apache/arrow/pull/35628
> >> > > > [4] https://github.com/facebookincubator/velox/discussions/4362
> >> > > >
> >> >
> >>
> >
>

Re: [DISCUSS][Format] Draft implementation of string view array format

Reply via email to