Re: [DISCUSS][Format] Draft implementation of string view array format

Benjamin Kietzman Wed, 21 Jun 2023 05:43:56 -0700

 > Ben, at one point there was some discussion that this might be a c-data
> only type.  However, I believe that was based on the raw pointers
> representation.  What you've proposed here, if I understand correctly, is
> an index + offsets representation and it is suitable for IPC correct?
> (e.g. I see that you have changes and examples in the IPC reader/writer)


Yes, what I'm proposing here is an addition to the IPC format which
necessarily
does not reference raw pointers. This is also what's implemented in the
corresponding Go PR [1].

Since other engines which implement this data type *do* use raw pointers,
the c++ implementation can accept raw pointer string view arrays for
purposes
of validation and conversion to the IPC-compatible format.

[1] https://github.com/apache/arrow/pull/35769

On Tue, Jun 20, 2023 at 11:33 AM Weston Pace <[email protected]> wrote:

> Before I say anything else I'll say that I am in favor of this new layout.
> There is some existing literature on the idea (e.g. umbra) and your
> benchmarks show some nice improvements.
>
> Compared to some of the other layouts we've discussed recently (REE, list
> veiw) I do think this layout is more unique and fundamentally different.
> Perhaps most fundamentally different:
>
>  * This is the first layout where the number of buffers depends on the data
> and not the schema.  I think this is the most architecturally significant
> fact.  It does require a (backwards compatible) change to the IPC format
> itself, beyond just adding new type codes.  It also poses challenges in
> places where we've assumed there will be at most 3 buffers (e.g. in
> ArraySpan, though, as you have shown, we can work around this using a raw
> pointers representation internally in those spots).
>
> I think you've done some great work to integrate this well with Arrow-C++
> and I'm convinced it can work.
>
> I would be interested in hearing some input from the Rust community.
>
> Ben, at one point there was some discussion that this might be a c-data
> only type.  However, I believe that was based on the raw pointers
> representation.  What you've proposed here, if I understand correctly, is
> an index + offsets representation and it is suitable for IPC correct?
> (e.g. I see that you have changes and examples in the IPC reader/writer)
>
> On Mon, Jun 19, 2023 at 7:17 AM Benjamin Kietzman <[email protected]>
> wrote:
>
> > Hi Gang,
> >
> > I'm not sure what you mean, sorry if my answers are off base:
> >
> > Parquet's ByteArray will be unaffected by the addition of the string view
> > type;
> > all arrow strings (arrow::Type::STRING, arrow::Type::LARGE_STRING, and
> > with this patch arrow::Type::STRING_VIEW) are converted to ByteArrays
> > during serialization to parquet [1].
> >
> > If you mean that encoding of arrow::Type::STRING_VIEW will not be as fast
> > as encoding of equivalent arrow::Type::STRING, that's something I haven't
> > benchmarked so I can't answer definitively. I would expect it to be
> faster
> > than
> > first converting STRING_VIEW->STRING then encoding to parquet; direct
> > encoding avoids allocating and populating temporary buffers. Of course
> this
> > only applies to cases where you need to encode an array of STRING_VIEW to
> > parquet- encoding of STRING to parquet will be unaffected.
> >
> > Sincerely,
> > Ben
> >
> > [1]
> >
> >
> https://github.com/bkietz/arrow/blob/46cf7e67766f0646760acefa4d2d01cdfead2d5d/cpp/src/parquet/encoding.cc#L166-L179
> >
> > On Thu, Jun 15, 2023 at 10:34 PM Gang Wu <[email protected]> wrote:
> >
> > > Hi Ben,
> > >
> > > The posted benchmark [1] looks pretty good to me. However, I want to
> > > raise a possible issue from the perspective of parquet-cpp. Parquet-cpp
> > > uses a customized parquet::ByteArray type [2] for string/binary, I
> would
> > > expect some regression of conversions between parquet reader/writer
> > > and the proposed string view array, especially when some strings use
> > > short form and others use long form.
> > >
> > > [1]
> > >
> > >
> >
> https://github.com/apache/arrow/blob/41309de8dd91a9821873fc5f94339f0542ca0108/cpp/src/parquet/types.h#L575
> > > [2] https://github.com/apache/arrow/pull/35628#issuecomment-1583218617
> > >
> > > Best,
> > > Gang
> > >
> > > On Fri, Jun 16, 2023 at 3:58 AM Will Jones <[email protected]>
> > > wrote:
> > >
> > > > Cool. Thanks for doing that!
> > > >
> > > > On Thu, Jun 15, 2023 at 12:40 Benjamin Kietzman <[email protected]
> >
> > > > wrote:
> > > >
> > > > > I've added https://github.com/apache/arrow/issues/36112 to track
> > > > > deduplication of buffers on write.
> > > > > I don't think it would require modification of the IPC format.
> > > > >
> > > > > Ben
> > > > >
> > > > > On Thu, Jun 15, 2023 at 1:30 PM Matt Topol <[email protected]
> >
> > > > wrote:
> > > > >
> > > > > > Based on my understanding, in theory a buffer *could* be shared
> > > within
> > > > a
> > > > > > batch since the flatbuffers message just uses an offset and
> length
> > to
> > > > > > identify the buffers.
> > > > > >
> > > > > > That said, I don't believe any current implementation actually
> does
> > > > this
> > > > > or
> > > > > > takes advantage of this in any meaningful way.
> > > > > >
> > > > > > --Matt
> > > > > >
> > > > > > On Thu, Jun 15, 2023 at 1:00 PM Will Jones <
> > [email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Ben,
> > > > > > >
> > > > > > > It's exciting to see this move along.
> > > > > > >
> > > > > > > The buffers will be duplicated. If buffer duplication is
> becomes
> > a
> > > > > > concern,
> > > > > > > > I'd prefer to handle
> > > > > > > > that in the ipc writer. Then buffers which are duplicated
> could
> > > be
> > > > > > > detected
> > > > > > > > by checking
> > > > > > > > pointer identity and written only once.
> > > > > > >
> > > > > > >
> > > > > > > Question: to be able to write buffer only once and reference in
> > > > > multiple
> > > > > > > arrays, does that require a change to the IPC format? Or is
> > sharing
> > > > > > buffers
> > > > > > > within the same batch already allowed in the IPC format?
> > > > > > >
> > > > > > > Best,
> > > > > > >
> > > > > > > Will Jones
> > > > > > >
> > > > > > > On Thu, Jun 15, 2023 at 9:03 AM Benjamin Kietzman <
> > > > [email protected]
> > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hello again all,
> > > > > > > >
> > > > > > > > The PR [1] to add string view to the format and the C++
> > > > > implementation
> > > > > > is
> > > > > > > > hovering around passing CI and has been undrafted.
> Furthermore,
> > > > there
> > > > > > is
> > > > > > > > now also a PR [2] to add string view to the Go
> implementation.
> > > Code
> > > > > > > review
> > > > > > > > is underway for each PR and I'd like to move toward a vote
> for
> > > > > > > acceptance-
> > > > > > > > are there any other preliminaries which I've neglected?
> > > > > > > >
> > > > > > > > To reiterate the answers to some past questions:
> > > > > > > > - Benchmarks are added in the C++ PR [1] to demonstrate the
> > > > > performance
> > > > > > > of
> > > > > > > >   conversion between the various string formats. In addition,
> > > there
> > > > > are
> > > > > > > >   some benchmarks which demonstrate the performance gains
> > > available
> > > > > > with
> > > > > > > >   the new format [3].
> > > > > > > > - Adding string view to the C ABI is a natural follow up, but
> > > > should
> > > > > be
> > > > > > > >   handled independently. An issue has been added to track
> that
> > > > > > > >   enhancement [4].
> > > > > > > >
> > > > > > > > Sincerely,
> > > > > > > > Ben Kietzman
> > > > > > > >
> > > > > > > > [1] https://github.com/apache/arrow/pull/35628
> > > > > > > > [2] https://github.com/apache/arrow/pull/35769
> > > > > > > > [3]
> > > > > https://github.com/apache/arrow/pull/35628#issuecomment-1583218617
> > > > > > > > [4] https://github.com/apache/arrow/issues/36099
> > > > > > > >
> > > > > > > > On Wed, May 17, 2023 at 12:53 PM Benjamin Kietzman <
> > > > > > [email protected]>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > @Jacob
> > > > > > > > > > You mention benchmarks multiple times, are these results
> > > > > published
> > > > > > > > > somewhere?
> > > > > > > > >
> > > > > > > > > I benchmarked the performance of raw pointer vs index
> offset
> > > > views
> > > > > in
> > > > > > > my
> > > > > > > > > PR to velox,
> > > > > > > > > I do intend to port them to my arrow PR but I haven't
> gotten
> > > > there
> > > > > > yet.
> > > > > > > > > Furthermore, it
> > > > > > > > > seemed less urgent to me since coexistence of the two types
> > in
> > > > the
> > > > > > c++
> > > > > > > > > implementation
> > > > > > > > > defers the question of how aggressively one should be
> > preferred
> > > > > over
> > > > > > > the
> > > > > > > > > other.
> > > > > > > > >
> > > > > > > > > @Dewey
> > > > > > > > > > I don't see the C Data interface in the PR
> > > > > > > > >
> > > > > > > > > I have not addressed the C ABI in this PR. As you mention,
> it
> > > may
> > > > > be
> > > > > > > > > useful to transmit
> > > > > > > > > arrays with raw pointer views between implementations which
> > > allow
> > > > > > > them. I
> > > > > > > > > can address
> > > > > > > > > this in a follow up PR.
> > > > > > > > >
> > > > > > > > > @Will
> > > > > > > > > > If I understand correctly, multiple arrays can reference
> > the
> > > > same
> > > > > > > > buffers
> > > > > > > > > > in memory, but once they are written to IPC their data
> > > buffers
> > > > > will
> > > > > > > be
> > > > > > > > > > duplicated. Is that right?
> > > > > > > > > The buffers will be duplicated. If buffer duplication is
> > > becomes
> > > > a
> > > > > > > > > concern, I'd prefer to handle
> > > > > > > > > that in the ipc writer. Then buffers which are duplicated
> > could
> > > > be
> > > > > > > > > detected by checking
> > > > > > > > > pointer identity and written only once.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, May 17, 2023 at 12:07 AM Will Jones <
> > > > > [email protected]
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > >> Hello Ben,
> > > > > > > > >>
> > > > > > > > >> Thanks for your work on this. I think this will be an
> > > excellent
> > > > > > > addition
> > > > > > > > >> to
> > > > > > > > >> the format.
> > > > > > > > >>
> > > > > > > > >> If I understand correctly, multiple arrays can reference
> the
> > > > same
> > > > > > > > buffers
> > > > > > > > >> in memory, but once they are written to IPC their data
> > buffers
> > > > > will
> > > > > > be
> > > > > > > > >> duplicated. Is that right?
> > > > > > > > >>
> > > > > > > > >> Dictionary types have a special message so they can be
> > reused
> > > > > across
> > > > > > > > >> batches and even fields. Did we consider adding a similar
> > > > message
> > > > > > for
> > > > > > > > >> string view buffers?
> > > > > > > > >>
> > > > > > > > >> One relevant use case I'm curious about is substring
> > > extraction.
> > > > > For
> > > > > > > > >> example, if I have a column of URIs and I create columns
> > where
> > > > > I've
> > > > > > > > >> extracted substrings like the hostname, path, and a list
> > > column
> > > > of
> > > > > > > query
> > > > > > > > >> parameters, I'd like for those latter columns to be views
> > into
> > > > the
> > > > > > URI
> > > > > > > > >> buffers, rather than full copies.
> > > > > > > > >>
> > > > > > > > >> However, I've never touched the IPC read code paths, so
> it's
> > > > quite
> > > > > > > > >> possible
> > > > > > > > >> I'm overlooking something obvious.
> > > > > > > > >>
> > > > > > > > >> Best,
> > > > > > > > >>
> > > > > > > > >> Will Jones
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> On Tue, May 16, 2023 at 6:29 PM Dewey Dunnington
> > > > > > > > >> <[email protected]> wrote:
> > > > > > > > >>
> > > > > > > > >> > Very cool!
> > > > > > > > >> >
> > > > > > > > >> > In addition to performance mentioned above, I could see
> > this
> > > > > being
> > > > > > > > >> > useful for the R bindings - we already have a global
> > string
> > > > pool
> > > > > > > and a
> > > > > > > > >> > mechanism for keeping a vector of them alive.
> > > > > > > > >> >
> > > > > > > > >> > I don't see the C Data interface in the PR although I
> may
> > > have
> > > > > > > missed
> > > > > > > > >> > it - is that a part of the proposal? It seems like it
> > would
> > > be
> > > > > > > > >> > possible to use raw pointers as long as they can be
> > > guaranteed
> > > > > to
> > > > > > be
> > > > > > > > >> > valid until the release callback is called?
> > > > > > > > >> >
> > > > > > > > >> > On Tue, May 16, 2023 at 8:43 PM Jacob Wujciak
> > > > > > > > >> > <[email protected]> wrote:
> > > > > > > > >> > >
> > > > > > > > >> > > Hello Everyone,
> > > > > > > > >> > > I think keeping interoperability with the large
> > ecosystem
> > > > is a
> > > > > > > very
> > > > > > > > >> > > important goal for arrow so I am overall in favor of
> > this
> > > > > > > proposal!
> > > > > > > > >> > >
> > > > > > > > >> > > You mention benchmarks multiple times, are these
> results
> > > > > > published
> > > > > > > > >> > > somewhere?
> > > > > > > > >> > >
> > > > > > > > >> > > Thanks
> > > > > > > > >> > >
> > > > > > > > >> > > On Tue, May 16, 2023 at 11:39 PM Benjamin Kietzman <
> > > > > > > > >> [email protected]>
> > > > > > > > >> > > wrote:
> > > > > > > > >> > >
> > > > > > > > >> > > > Hello all,
> > > > > > > > >> > > >
> > > > > > > > >> > > > As previously discussed on this list [1], an
> > > > > > > UmbraDB/DuckDB/Velox
> > > > > > > > >> > > > compatible
> > > > > > > > >> > > > "string view" type could bring several performance
> > > > benefits
> > > > > to
> > > > > > > > >> access
> > > > > > > > >> > and
> > > > > > > > >> > > > authoring of string data in the arrow format [2].
> > > > > Additionally
> > > > > > > > >> better
> > > > > > > > >> > > > interoperability with engines already using this
> > format
> > > > > could
> > > > > > be
> > > > > > > > >> > > > established.
> > > > > > > > >> > > >
> > > > > > > > >> > > > PR #0 [3] adds Utf8View and BinaryView types to the
> > C++
> > > > > > > > >> implementation
> > > > > > > > >> > and
> > > > > > > > >> > > > to
> > > > > > > > >> > > > the IPC format. For the purposes of IPC raw pointers
> > are
> > > > not
> > > > > > > used.
> > > > > > > > >> > Instead,
> > > > > > > > >> > > > each view contains a pair of 32 bit unsigned
> integers
> > > > which
> > > > > > > encode
> > > > > > > > >> the
> > > > > > > > >> > > > index of
> > > > > > > > >> > > > a character buffer (string view arrays may consist
> of
> > a
> > > > > > variable
> > > > > > > > >> > number of
> > > > > > > > >> > > > such buffers) and the offset of a view's data within
> > > that
> > > > > > buffer
> > > > > > > > >> > > > respectively.
> > > > > > > > >> > > > Benefits of this substitution include:
> > > > > > > > >> > > > - This makes explicit the guarantee that lifetime of
> > all
> > > > > > > character
> > > > > > > > >> > data is
> > > > > > > > >> > > > equal
> > > > > > > > >> > > >   to that of the array which views it, which is
> > critical
> > > > for
> > > > > > > > >> confident
> > > > > > > > >> > > >   consumption across an interface boundary.
> > > > > > > > >> > > > - As with other types in the arrow format, such
> arrays
> > > are
> > > > > > > > >> > serializable and
> > > > > > > > >> > > >   venue agnostic; directly usable in shared memory
> > > without
> > > > > > > > >> > modification.
> > > > > > > > >> > > > - Indices and offsets are easily validated.
> > > > > > > > >> > > >
> > > > > > > > >> > > > Accessing the data requires some trivial pointer
> > > > arithmetic,
> > > > > > but
> > > > > > > > in
> > > > > > > > >> > > > benchmarking
> > > > > > > > >> > > > this had negligible impact on sequential access and
> > only
> > > > > minor
> > > > > > > > >> impact
> > > > > > > > >> > on
> > > > > > > > >> > > > random
> > > > > > > > >> > > > access.
> > > > > > > > >> > > >
> > > > > > > > >> > > > In the C++ implementation, raw pointer string views
> > are
> > > > > > > supported
> > > > > > > > >> as an
> > > > > > > > >> > > > extended
> > > > > > > > >> > > > case of the Utf8View type:
> > > > > > > `utf8_view(/*has_raw_pointers=*/true)`.
> > > > > > > > >> > > > Branching on
> > > > > > > > >> > > > this access pattern bit at the data type level has
> > > > > negligible
> > > > > > > > >> impact on
> > > > > > > > >> > > > performance since the branch resides outside any hot
> > > > loops.
> > > > > > > > Utility
> > > > > > > > >> > > > functions
> > > > > > > > >> > > > are provided for efficient (potentially in-place)
> > > > conversion
> > > > > > > > between
> > > > > > > > >> > raw
> > > > > > > > >> > > > pointer
> > > > > > > > >> > > > and index offset views. For example, the C++
> > > > implementation
> > > > > > > could
> > > > > > > > >> zero
> > > > > > > > >> > copy
> > > > > > > > >> > > > a raw pointer array from Velox, filter it, then
> > convert
> > > to
> > > > > > > > >> > index/offset for
> > > > > > > > >> > > > serialization. Other implementations may choose to
> > > > > accommodate
> > > > > > > or
> > > > > > > > >> > eschew
> > > > > > > > >> > > > raw
> > > > > > > > >> > > > pointer views as their communities direct.
> > > > > > > > >> > > >
> > > > > > > > >> > > > Where desirous in a rigorously controlled context
> this
> > > > still
> > > > > > > > enables
> > > > > > > > >> > > > construction
> > > > > > > > >> > > > and safe consumption of string view arrays which
> > > reference
> > > > > > > memory
> > > > > > > > >> not
> > > > > > > > >> > > > directly bound to the lifetime of the array. I'm not
> > > sure
> > > > > when
> > > > > > > or
> > > > > > > > >> if we
> > > > > > > > >> > > > would
> > > > > > > > >> > > > find it useful to have arrays like this; I do not
> > > > introduce
> > > > > > any
> > > > > > > in
> > > > > > > > >> > [3]. I
> > > > > > > > >> > > > mention
> > > > > > > > >> > > > this possibility to highlight that if benchmarking
> > > > > > demonstrates
> > > > > > > > that
> > > > > > > > >> > such
> > > > > > > > >> > > > an
> > > > > > > > >> > > > approach brings a significant performance benefit to
> > > some
> > > > > > > > operation,
> > > > > > > > >> > the
> > > > > > > > >> > > > only
> > > > > > > > >> > > > barrier to its adoption would be code review.
> Likewise
> > > if
> > > > > more
> > > > > > > > >> > intensive
> > > > > > > > >> > > > benchmarking determines that raw pointer views
> > > critically
> > > > > > > > outperform
> > > > > > > > >> > > > index/offset
> > > > > > > > >> > > > views for real-world analytics tasks, prioritizing
> raw
> > > > > pointer
> > > > > > > > >> string
> > > > > > > > >> > views
> > > > > > > > >> > > > for usage within the C++ implementation will be
> > > > > > straightforward.
> > > > > > > > >> > > >
> > > > > > > > >> > > > See also the proposal to Velox that their string
> view
> > > > vector
> > > > > > be
> > > > > > > > >> > refactored
> > > > > > > > >> > > > in a similar vein [4].
> > > > > > > > >> > > >
> > > > > > > > >> > > > Sincerely,
> > > > > > > > >> > > > Ben Kietzman
> > > > > > > > >> > > >
> > > > > > > > >> > > > [1]
> > > > > > > > >>
> > > > https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
> > > > > > > > >> > > > [2]
> > > > > http://cidrdb.org/cidr2020/papers/p29-neumann-cidr20.pdf
> > > > > > > > >> > > > [3] https://github.com/apache/arrow/pull/35628
> > > > > > > > >> > > > [4]
> > > > > > https://github.com/facebookincubator/velox/discussions/4362
> > > > > > > > >> > > >
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS][Format] Draft implementation of string view array format

Reply via email to