Re: [DISCUSS][Format] Draft implementation of string view array format

Antoine Pitrou Wed, 21 Jun 2023 05:55:39 -0700

I hope implementations don't start exposing non-standard datatypes overthe C Data Interface (apart from extension types, of course). I wouldalso be wary of exposing non-standard datatypes in the official ArrowC++ implementation.


Regards

Antoine.


Le 21/06/2023 à 14:27, Benjamin Kietzman a écrit :

  > Ben, at one point there was some discussion that this might be a c-data

only type.  However, I believe that was based on the raw pointers
representation.  What you've proposed here, if I understand correctly, is
an index + offsets representation and it is suitable for IPC correct?
(e.g. I see that you have changes and examples in the IPC reader/writer)


Yes, what I'm proposing here is an addition to the IPC format which
necessarily
does not reference raw pointers. This is also what's implemented in the
corresponding Go PR [1].

Since other engines which implement this data type *do* use raw pointers,
the c++ implementation can accept raw pointer string view arrays for
purposes
of validation and conversion to the IPC-compatible format.

[1] https://github.com/apache/arrow/pull/35769

On Tue, Jun 20, 2023 at 11:33 AM Weston Pace <[email protected]> wrote:

Before I say anything else I'll say that I am in favor of this new layout.
There is some existing literature on the idea (e.g. umbra) and your
benchmarks show some nice improvements.

Compared to some of the other layouts we've discussed recently (REE, list
veiw) I do think this layout is more unique and fundamentally different.
Perhaps most fundamentally different:

  * This is the first layout where the number of buffers depends on the data
and not the schema.  I think this is the most architecturally significant
fact.  It does require a (backwards compatible) change to the IPC format
itself, beyond just adding new type codes.  It also poses challenges in
places where we've assumed there will be at most 3 buffers (e.g. in
ArraySpan, though, as you have shown, we can work around this using a raw
pointers representation internally in those spots).

I think you've done some great work to integrate this well with Arrow-C++
and I'm convinced it can work.

I would be interested in hearing some input from the Rust community.

Ben, at one point there was some discussion that this might be a c-data
only type.  However, I believe that was based on the raw pointers
representation.  What you've proposed here, if I understand correctly, is
an index + offsets representation and it is suitable for IPC correct?
(e.g. I see that you have changes and examples in the IPC reader/writer)

On Mon, Jun 19, 2023 at 7:17 AM Benjamin Kietzman <[email protected]>
wrote:

Hi Gang,

I'm not sure what you mean, sorry if my answers are off base:

Parquet's ByteArray will be unaffected by the addition of the string view
type;
all arrow strings (arrow::Type::STRING, arrow::Type::LARGE_STRING, and
with this patch arrow::Type::STRING_VIEW) are converted to ByteArrays
during serialization to parquet [1].

If you mean that encoding of arrow::Type::STRING_VIEW will not be as fast
as encoding of equivalent arrow::Type::STRING, that's something I haven't
benchmarked so I can't answer definitively. I would expect it to be

faster

than
first converting STRING_VIEW->STRING then encoding to parquet; direct
encoding avoids allocating and populating temporary buffers. Of course

this

only applies to cases where you need to encode an array of STRING_VIEW to
parquet- encoding of STRING to parquet will be unaffected.

Sincerely,
Ben

[1]

https://github.com/bkietz/arrow/blob/46cf7e67766f0646760acefa4d2d01cdfead2d5d/cpp/src/parquet/encoding.cc#L166-L179


On Thu, Jun 15, 2023 at 10:34 PM Gang Wu <[email protected]> wrote:

Hi Ben,

The posted benchmark [1] looks pretty good to me. However, I want to
raise a possible issue from the perspective of parquet-cpp. Parquet-cpp
uses a customized parquet::ByteArray type [2] for string/binary, I

would

expect some regression of conversions between parquet reader/writer
and the proposed string view array, especially when some strings use
short form and others use long form.

[1]

https://github.com/apache/arrow/blob/41309de8dd91a9821873fc5f94339f0542ca0108/cpp/src/parquet/types.h#L575

[2] https://github.com/apache/arrow/pull/35628#issuecomment-1583218617

Best,
Gang

On Fri, Jun 16, 2023 at 3:58 AM Will Jones <[email protected]>
wrote:

Cool. Thanks for doing that!

On Thu, Jun 15, 2023 at 12:40 Benjamin Kietzman <[email protected]

wrote:

I've added https://github.com/apache/arrow/issues/36112 to track
deduplication of buffers on write.
I don't think it would require modification of the IPC format.

Ben

On Thu, Jun 15, 2023 at 1:30 PM Matt Topol <[email protected]

wrote:

Based on my understanding, in theory a buffer *could* be shared

within

batch since the flatbuffers message just uses an offset and

length

to

identify the buffers.

That said, I don't believe any current implementation actually

does

this

or

takes advantage of this in any meaningful way.

--Matt

On Thu, Jun 15, 2023 at 1:00 PM Will Jones <

[email protected]>

wrote:

Hi Ben,

It's exciting to see this move along.

The buffers will be duplicated. If buffer duplication is

becomes

concern,

I'd prefer to handle
that in the ipc writer. Then buffers which are duplicated

could

be

detected

by checking
pointer identity and written only once.



Question: to be able to write buffer only once and reference in

multiple

arrays, does that require a change to the IPC format? Or is

sharing

buffers

within the same batch already allowed in the IPC format?

Best,

Will Jones

On Thu, Jun 15, 2023 at 9:03 AM Benjamin Kietzman <

[email protected]

wrote:

Hello again all,

The PR [1] to add string view to the format and the C++

implementation

is

hovering around passing CI and has been undrafted.

Furthermore,

there

is

now also a PR [2] to add string view to the Go

implementation.

Code

review

is underway for each PR and I'd like to move toward a vote

for

acceptance-

are there any other preliminaries which I've neglected?

To reiterate the answers to some past questions:
- Benchmarks are added in the C++ PR [1] to demonstrate the

performance

of

   conversion between the various string formats. In addition,

there

are

   some benchmarks which demonstrate the performance gains

available

with

   the new format [3].
- Adding string view to the C ABI is a natural follow up, but

should

be

   handled independently. An issue has been added to track

that

   enhancement [4].

Sincerely,
Ben Kietzman

[1] https://github.com/apache/arrow/pull/35628
[2] https://github.com/apache/arrow/pull/35769
[3]

https://github.com/apache/arrow/pull/35628#issuecomment-1583218617

[4] https://github.com/apache/arrow/issues/36099

On Wed, May 17, 2023 at 12:53 PM Benjamin Kietzman <

[email protected]>

wrote:

@Jacob

You mention benchmarks multiple times, are these results

published

somewhere?

I benchmarked the performance of raw pointer vs index

offset

views

in

my

PR to velox,
I do intend to port them to my arrow PR but I haven't

gotten

there

yet.

Furthermore, it
seemed less urgent to me since coexistence of the two types

in

the

c++

implementation
defers the question of how aggressively one should be

preferred

over

the

other.

@Dewey

I don't see the C Data interface in the PR


I have not addressed the C ABI in this PR. As you mention,

it

may

be

useful to transmit
arrays with raw pointer views between implementations which

allow

them. I

can address
this in a follow up PR.

@Will

If I understand correctly, multiple arrays can reference

the

same

buffers

in memory, but once they are written to IPC their data

buffers

will

be

duplicated. Is that right?

The buffers will be duplicated. If buffer duplication is

becomes

concern, I'd prefer to handle
that in the ipc writer. Then buffers which are duplicated

could

be

detected by checking
pointer identity and written only once.


On Wed, May 17, 2023 at 12:07 AM Will Jones <

[email protected]

wrote:

Hello Ben,

Thanks for your work on this. I think this will be an

excellent

addition

to
the format.

If I understand correctly, multiple arrays can reference

the

same

buffers

in memory, but once they are written to IPC their data

buffers

will

be

duplicated. Is that right?

Dictionary types have a special message so they can be

reused

across

batches and even fields. Did we consider adding a similar

message

for

string view buffers?

One relevant use case I'm curious about is substring

extraction.

For

example, if I have a column of URIs and I create columns

where

I've

extracted substrings like the hostname, path, and a list

column

of

query

parameters, I'd like for those latter columns to be views

into

the

URI

buffers, rather than full copies.

However, I've never touched the IPC read code paths, so

it's

quite

possible
I'm overlooking something obvious.

Best,

Will Jones


On Tue, May 16, 2023 at 6:29 PM Dewey Dunnington
<[email protected]> wrote:

Very cool!

In addition to performance mentioned above, I could see

this

being

useful for the R bindings - we already have a global

string

pool

and a

mechanism for keeping a vector of them alive.

I don't see the C Data interface in the PR although I

may

have

missed

it - is that a part of the proposal? It seems like it

would

be

possible to use raw pointers as long as they can be

guaranteed

to

be

valid until the release callback is called?

On Tue, May 16, 2023 at 8:43 PM Jacob Wujciak
<[email protected]> wrote:


Hello Everyone,
I think keeping interoperability with the large

ecosystem

is a

very

important goal for arrow so I am overall in favor of

this

proposal!


You mention benchmarks multiple times, are these

results

published

somewhere?

Thanks

On Tue, May 16, 2023 at 11:39 PM Benjamin Kietzman <

[email protected]>

wrote:

Hello all,

As previously discussed on this list [1], an

UmbraDB/DuckDB/Velox

compatible
"string view" type could bring several performance

benefits

to

access

and

authoring of string data in the arrow format [2].

Additionally

better

interoperability with engines already using this

format

could

be

established.

PR #0 [3] adds Utf8View and BinaryView types to the

C++

implementation

and

to
the IPC format. For the purposes of IPC raw pointers

are

not

used.

Instead,

each view contains a pair of 32 bit unsigned

integers

which

encode

the

index of
a character buffer (string view arrays may consist

of

variable

number of

such buffers) and the offset of a view's data within

that

buffer

respectively.
Benefits of this substitution include:
- This makes explicit the guarantee that lifetime of

all

character

data is

equal
   to that of the array which views it, which is

critical

for

confident

   consumption across an interface boundary.
- As with other types in the arrow format, such

arrays

are

serializable and

   venue agnostic; directly usable in shared memory

without

modification.

- Indices and offsets are easily validated.

Accessing the data requires some trivial pointer

arithmetic,

but

in

benchmarking
this had negligible impact on sequential access and

only

minor

impact

on

random
access.

In the C++ implementation, raw pointer string views

are

supported

as an

extended
case of the Utf8View type:

`utf8_view(/*has_raw_pointers=*/true)`.

Branching on
this access pattern bit at the data type level has

negligible

impact on

performance since the branch resides outside any hot

loops.

Utility

functions
are provided for efficient (potentially in-place)

conversion

between

raw

pointer
and index offset views. For example, the C++

implementation

could

zero

copy

a raw pointer array from Velox, filter it, then

convert

to

index/offset for

serialization. Other implementations may choose to

accommodate

or

eschew

raw
pointer views as their communities direct.

Where desirous in a rigorously controlled context

this

still

enables

construction
and safe consumption of string view arrays which

reference

memory

not

directly bound to the lifetime of the array. I'm not

sure

when

or

if we

would
find it useful to have arrays like this; I do not

introduce

any

in

[3]. I

mention
this possibility to highlight that if benchmarking

demonstrates

that

such

an
approach brings a significant performance benefit to

some

operation,

the

only
barrier to its adoption would be code review.

Likewise

if

more

intensive

benchmarking determines that raw pointer views

critically

outperform

index/offset
views for real-world analytics tasks, prioritizing

raw

pointer

string

views

for usage within the C++ implementation will be

straightforward.


See also the proposal to Velox that their string

view

vector

be

refactored

in a similar vein [4].

Sincerely,
Ben Kietzman

[1]

https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq

[2]

http://cidrdb.org/cidr2020/papers/p29-neumann-cidr20.pdf

[3] https://github.com/apache/arrow/pull/35628
[4]

https://github.com/facebookincubator/velox/discussions/4362

Re: [DISCUSS][Format] Draft implementation of string view array format

Reply via email to