Re: [DISCUSS][Format] Draft implementation of string view array format

Benjamin Kietzman Wed, 17 May 2023 09:53:36 -0700

@Jacob
> You mention benchmarks multiple times, are these results published
> somewhere?


I benchmarked raw pointer views against index/offset views in my PR to Velox.
I intend to port those benchmarks to my Arrow PR but haven't gotten there yet.
It also seemed less urgent to me, since the coexistence of the two types in
the C++ implementation defers the question of how aggressively one should be
preferred over the other.
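
To make the comparison concrete, here is a minimal C++ sketch of the two view
flavors and how each resolves to character data. The names (`IndexOffsetView`,
`RawPointerView`, `Resolve`, `IsValid`, `ToRawPointer`) are hypothetical and
simplified for illustration; they are not the types used in either PR, and
details such as inlined short strings are left out. `std::string` stands in
for a character buffer.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical, simplified index/offset view: a size plus a pair of 32-bit
// unsigned integers naming a character buffer and a position within it.
struct IndexOffsetView {
  std::uint32_t size;
  std::uint32_t buffer_index;
  std::uint32_t offset;
};

// Hypothetical raw pointer view: points directly at the character data.
struct RawPointerView {
  std::uint32_t size;
  const char* data;
};

// Indices and offsets can be checked against the buffers they refer to.
inline bool IsValid(const IndexOffsetView& v,
                    const std::vector<std::string>& buffers) {
  if (v.buffer_index >= buffers.size()) return false;
  const auto buffer_size = buffers[v.buffer_index].size();
  return v.offset <= buffer_size && v.size <= buffer_size - v.offset;
}

// Resolving an index/offset view costs one lookup and one pointer addition.
inline std::string_view Resolve(const IndexOffsetView& v,
                                const std::vector<std::string>& buffers) {
  assert(IsValid(v, buffers));
  return std::string_view(buffers[v.buffer_index].data() + v.offset, v.size);
}

// A raw pointer view needs no buffer list, but nothing ties it to a lifetime.
inline std::string_view Resolve(const RawPointerView& v) {
  return std::string_view(v.data, v.size);
}

// Conversion is the same arithmetic done once, ahead of time.
inline RawPointerView ToRawPointer(const IndexOffsetView& v,
                                   const std::vector<std::string>& buffers) {
  assert(IsValid(v, buffers));
  return RawPointerView{v.size, buffers[v.buffer_index].data() + v.offset};
}
```

The only per-element difference is the extra lookup and addition in the
index/offset `Resolve`, which is the cost the benchmarks are meant to measure.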

@Dewey
> I don't see the C Data interface in the PR

I have not addressed the C ABI in this PR. As you mention, it may be useful to
transmit arrays with raw pointer views between implementations which allow
them. I can address this in a follow-up PR.

@Will
> If I understand correctly, multiple arrays can reference the same buffers
> in memory, but once they are written to IPC their data buffers will be
> duplicated. Is that right?

The buffers will be duplicated. If buffer duplication becomes a concern, I'd
prefer to handle that in the IPC writer: duplicated buffers could then be
detected by checking pointer identity and written only once.
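
As a rough illustration of the pointer identity check, the sketch below
collects the distinct buffers referenced by several arrays and records, for
each array, which deduplicated slot each of its buffers maps to. Everything
here is hypothetical (`Buffer`, `DedupedBuffers`, `DeduplicateByIdentity`) and
is not part of the IPC writer; it covers only the detection step, not how
shared buffers would be represented in the IPC format.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Buffer = std::string;  // stand-in for a character buffer
using BufferVector = std::vector<std::shared_ptr<const Buffer>>;

struct DedupedBuffers {
  BufferVector unique;  // each distinct buffer, in first-seen order
  // For each input array: the slot in `unique` for each of its buffers.
  std::vector<std::vector<std::uint32_t>> remapped;
};

// Detect shared buffers by comparing addresses, never contents, so the check
// is O(1) per buffer regardless of buffer size.
inline DedupedBuffers DeduplicateByIdentity(
    const std::vector<BufferVector>& arrays) {
  DedupedBuffers out;
  std::unordered_map<const Buffer*, std::uint32_t> seen;
  for (const auto& buffers : arrays) {
    std::vector<std::uint32_t> indices;
    indices.reserve(buffers.size());
    for (const auto& buffer : buffers) {
      const auto next_slot = static_cast<std::uint32_t>(out.unique.size());
      auto [it, inserted] = seen.try_emplace(buffer.get(), next_slot);
      if (inserted) out.unique.push_back(buffer);  // first time seen: keep it
      indices.push_back(it->second);
    }
    out.remapped.push_back(std::move(indices));
  }
  return out;
}
```

Two buffers with equal contents but different addresses are treated as
distinct here, which keeps the check cheap at the cost of missing duplicates
that were copied rather than shared.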


On Wed, May 17, 2023 at 12:07 AM Will Jones <will.jones...@gmail.com> wrote:

> Hello Ben,
>
> Thanks for your work on this. I think this will be an excellent addition to
> the format.
>
> If I understand correctly, multiple arrays can reference the same buffers
> in memory, but once they are written to IPC their data buffers will be
> duplicated. Is that right?
>
> Dictionary types have a special message so they can be reused across
> batches and even fields. Did we consider adding a similar message for
> string view buffers?
>
> One relevant use case I'm curious about is substring extraction. For
> example, if I have a column of URIs and I create columns where I've
> extracted substrings like the hostname, path, and a list column of query
> parameters, I'd like for those latter columns to be views into the URI
> buffers, rather than full copies.
>
> However, I've never touched the IPC read code paths, so it's quite possible
> I'm overlooking something obvious.
>
> Best,
>
> Will Jones
>
>
> On Tue, May 16, 2023 at 6:29 PM Dewey Dunnington
> <de...@voltrondata.com.invalid> wrote:
>
> > Very cool!
> >
> > In addition to performance mentioned above, I could see this being
> > useful for the R bindings - we already have a global string pool and a
> > mechanism for keeping a vector of them alive.
> >
> > I don't see the C Data interface in the PR although I may have missed
> > it - is that a part of the proposal? It seems like it would be
> > possible to use raw pointers as long as they can be guaranteed to be
> > valid until the release callback is called?
> >
> > On Tue, May 16, 2023 at 8:43 PM Jacob Wujciak
> > <ja...@voltrondata.com.invalid> wrote:
> > >
> > > Hello Everyone,
> > > I think keeping interoperability with the large ecosystem is a very
> > > important goal for arrow so I am overall in favor of this proposal!
> > >
> > > You mention benchmarks multiple times, are these results published
> > > somewhere?
> > >
> > > Thanks
> > >
> > > On Tue, May 16, 2023 at 11:39 PM Benjamin Kietzman <bengil...@gmail.com> wrote:
> > >
> > > > Hello all,
> > > >
> > > > > As previously discussed on this list [1], an UmbraDB/DuckDB/Velox
> > > > > compatible "string view" type could bring several performance benefits
> > > > > to access and authoring of string data in the arrow format [2].
> > > > > Additionally, better interoperability with engines already using this
> > > > > format could be established.
> > > >
> > > > > PR #0 [3] adds Utf8View and BinaryView types to the C++ implementation
> > > > > and to the IPC format. For the purposes of IPC, raw pointers are not
> > > > > used. Instead, each view contains a pair of 32-bit unsigned integers
> > > > > which encode the index of a character buffer (string view arrays may
> > > > > consist of a variable number of such buffers) and the offset of a
> > > > > view's data within that buffer, respectively.
> > > > > Benefits of this substitution include:
> > > > > - This makes explicit the guarantee that the lifetime of all character
> > > > >   data is equal to that of the array which views it, which is critical
> > > > >   for confident consumption across an interface boundary.
> > > > > - As with other types in the arrow format, such arrays are serializable
> > > > >   and venue agnostic; directly usable in shared memory without
> > > > >   modification.
> > > > > - Indices and offsets are easily validated.
> > > >
> > > > > Accessing the data requires some trivial pointer arithmetic, but in
> > > > > benchmarking this had negligible impact on sequential access and only
> > > > > minor impact on random access.
> > > >
> > > > > In the C++ implementation, raw pointer string views are supported as an
> > > > > extended case of the Utf8View type: `utf8_view(/*has_raw_pointers=*/true)`.
> > > > > Branching on this access pattern bit at the data type level has
> > > > > negligible impact on performance since the branch resides outside any
> > > > > hot loops. Utility functions are provided for efficient (potentially
> > > > > in-place) conversion between raw pointer and index/offset views. For
> > > > > example, the C++ implementation could zero copy a raw pointer array
> > > > > from Velox, filter it, then convert to index/offset for serialization.
> > > > > Other implementations may choose to accommodate or eschew raw pointer
> > > > > views as their communities direct.
> > > >
> > > > > Where desired in a rigorously controlled context, this still enables
> > > > > construction and safe consumption of string view arrays which reference
> > > > > memory not directly bound to the lifetime of the array. I'm not sure
> > > > > when or if we would find it useful to have arrays like this; I do not
> > > > > introduce any in [3]. I mention this possibility to highlight that if
> > > > > benchmarking demonstrates that such an approach brings a significant
> > > > > performance benefit to some operation, the only barrier to its adoption
> > > > > would be code review. Likewise, if more intensive benchmarking
> > > > > determines that raw pointer views critically outperform index/offset
> > > > > views for real-world analytics tasks, prioritizing raw pointer string
> > > > > views for usage within the C++ implementation will be straightforward.
> > > >
> > > > > See also the proposal to Velox that their string view vector be
> > > > > refactored in a similar vein [4].
> > > >
> > > > Sincerely,
> > > > Ben Kietzman
> > > >
> > > > [1] https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
> > > > [2] http://cidrdb.org/cidr2020/papers/p29-neumann-cidr20.pdf
> > > > [3] https://github.com/apache/arrow/pull/35628
> > > > [4] https://github.com/facebookincubator/velox/discussions/4362
> > > >
> >
>
