> I would be interested in hearing some input from the Rust community.

 A couple of thoughts:

The variable number of buffers would definitely pose some challenges for the 
Rust implementation, the closest thing we currently have is possibly 
UnionArray, but even then the number of buffers is still determined statically 
by the DataType. I therefore also wonder about the possibility of always having 
a single backing buffer that stores the character data, including potentially a 
copy of the prefix. This would also avoid forcing a branch on access, which I 
would have expected to hurt performance for some kernels quite significantly.

Whilst not really a concern for Rust, which supports unsigned types, it does 
seem inconsistent to use unsigned types where the rest of the format encourages 
the use of signed offsets, etc...

It isn't clearly specified whether a null should have a valid set of offsets, 
etc... I think it is an important property of the current array layouts that, 
with exception to dictionaries, the data in null slots is arbitrary, i.e. can 
take any value, but not undefined. This allows for separate handling of the 
null mask and values, which can be important for some kernels and APIs.

More an observation than an issue, but UTF-8 validation for StringArray can be 
done very efficiently by first verifying the entire buffer, and then verifying 
the offsets correspond to the start of a UTF-8 codepoint. This same approach 
would not be possible for StringView, which would need to verify individual 
values and would therefore be significantly more expensive. As it is UB for a 
Rust string to contain non-UTF-8 data, this validation is perhaps more 
important for Rust than for other languages.

I presume that StringView will behave similarly to dictionaries in that the 
selection kernels will not recompute the underlying value buffers. I think this 
is fine, but it is perhaps worth noting this has caused confusion in the past, 
as people somewhat reasonably expect an array post-selection to have memory 
usage reflecting the smaller selection. This is then especially noticeable if 
the data is written out to IPC, and still contains data that was supposedly 
filtered out. My 2 cents is that explicit selection vectors are a less 
surprising way to defer selection than baking it into the array, but I also 
don't have any workloads where this is the major bottleneck so can't speak 
authoritatively here.

Which leads on to my major concern with this proposal, that it adds complexity 
and cognitive load to the specification and implementations, whilst not 
meaningfully improving the performance of the operators that I commonly 
encounter as performance bottlenecks, which are multi-column sorts and 
aggregations, or the expensive string operations such as matching or parsing. 
If we didn't already have a string representation I would be more onboard, but 
as it stands I'm definitely on the fence, especially given selection 
performance can be improved in less intrusive ways using dictionaries or 
selection vectors.

Kind Regards,

Raphael Taylor-Davies

On 02/07/2023 11:46, Andrew Lamb wrote:

 * This is the first layout where the number of buffers depends on the 

data 

and not the schema. I think this is the most architecturally significant fact. 
I 

 I have spent some time reading the initial proposal -- thank you for that. I 
now understand what Weston was saying about the "variable numbers of buffers". 
I wonder if you considered restricting such arrays to a single buffer (so as to 
make them more similar to other arrow array types that have a fixed number of 
buffers)? On Tue, Jun 20, 2023 at 11:33 AM Weston Pace <weston.p...@gmail.com> 
<mailto:weston.p...@gmail.com> wrote: 

Before I say anything else I'll say that I am in favor of this new layout. 
There is some existing literature on the idea (e.g. umbra) and your benchmarks 
show some nice improvements. Compared to some of the other layouts we've 
discussed recently (REE, list veiw) I do think this layout is more unique and 
fundamentally different. Perhaps most fundamentally different: * This is the 
first layout where the number of buffers depends on the data and not the 
schema. I think this is the most architecturally significant fact. It does 
require a (backwards compatible) change to the IPC format itself, beyond just 
adding new type codes. It also poses challenges in places where we've assumed 
there will be at most 3 buffers (e.g. in ArraySpan, though, as you have shown, 
we can work around this using a raw pointers representation internally in those 
spots). I think you've done some great work to integrate this well with 
Arrow-C++ and I'm convinced it can work. I would be interested in hearing some 
input from the Rust community. Ben, at one point there was some discussion that 
this might be a c-data only type. However, I believe that was based on the raw 
pointers representation. What you've proposed here, if I understand correctly, 
is an index + offsets representation and it is suitable for IPC correct? (e.g. 
I see that you have changes and examples in the IPC reader/writer) On Mon, Jun 
19, 2023 at 7:17 AM Benjamin Kietzman <bengil...@gmail.com> 
<mailto:bengil...@gmail.com> wrote: 

Hi Gang, I'm not sure what you mean, sorry if my answers are off base: 
Parquet's ByteArray will be unaffected by the addition of the string view type; 
all arrow strings (arrow::Type::STRING, arrow::Type::LARGE_STRING, and with 
this patch arrow::Type::STRING_VIEW) are converted to ByteArrays during 
serialization to parquet [1]. If you mean that encoding of 
arrow::Type::STRING_VIEW will not be as fast as encoding of equivalent 
arrow::Type::STRING, that's something I haven't benchmarked so I can't answer 
definitively. I would expect it to be 

faster 

than first converting STRING_VIEW->STRING then encoding to parquet; direct 
encoding avoids allocating and populating temporary buffers. Of course 

this 

only applies to cases where you need to encode an array of STRING_VIEW to 
parquet- encoding of STRING to parquet will be unaffected. Sincerely, Ben [1] 

https://github.com/bkietz/arrow/blob/46cf7e67766f0646760acefa4d2d01cdfead2d5d/cpp/src/parquet/encoding.cc#L166-L179
 

 On Thu, Jun 15, 2023 at 10:34 PM Gang Wu <ust...@gmail.com> 
<mailto:ust...@gmail.com> wrote: 

Hi Ben, The posted benchmark [1] looks pretty good to me. However, I want to 
raise a possible issue from the perspective of parquet-cpp. Parquet-cpp uses a 
customized parquet::ByteArray type [2] for string/binary, I 

would 

expect some regression of conversions between parquet reader/writer and the 
proposed string view array, especially when some strings use short form and 
others use long form. [1] 

https://github.com/apache/arrow/blob/41309de8dd91a9821873fc5f94339f0542ca0108/cpp/src/parquet/types.h#L575
 

[2] https://github.com/apache/arrow/pull/35628#issuecomment-1583218617 Best, 
Gang On Fri, Jun 16, 2023 at 3:58 AM Will Jones <will.jones...@gmail.com> 
<mailto:will.jones...@gmail.com> wrote: 

Cool. Thanks for doing that! On Thu, Jun 15, 2023 at 12:40 Benjamin Kietzman 
<bengil...@gmail.com <mailto:bengil...@gmail.com> 

wrote: 

I've added https://github.com/apache/arrow/issues/36112 to track deduplication 
of buffers on write. I don't think it would require modification of the IPC 
format. Ben On Thu, Jun 15, 2023 at 1:30 PM Matt Topol <zotthewiz...@gmail.com 
<mailto:zotthewiz...@gmail.com> 

wrote: 

Based on my understanding, in theory a buffer *could* be shared 

within 

a 

batch since the flatbuffers message just uses an offset and 

length 

to 

identify the buffers. That said, I don't believe any current implementation 
actually 

does 

this 

or 

takes advantage of this in any meaningful way. --Matt On Thu, Jun 15, 2023 at 
1:00 PM Will Jones < 

will.jones...@gmail.com <mailto:will.jones...@gmail.com>> 

wrote: 

Hi Ben, It's exciting to see this move along. The buffers will be duplicated. 
If buffer duplication is 

becomes 

a 

concern, 

I'd prefer to handle that in the ipc writer. Then buffers which are duplicated 

could 

be 

detected 

by checking pointer identity and written only once. 

 Question: to be able to write buffer only once and reference in 

multiple 

arrays, does that require a change to the IPC format? Or is 

sharing 

buffers 

within the same batch already allowed in the IPC format? Best, Will Jones On 
Thu, Jun 15, 2023 at 9:03 AM Benjamin Kietzman < 

bengil...@gmail.com <mailto:bengil...@gmail.com> 

wrote: 

Hello again all, The PR [1] to add string view to the format and the C++ 

implementation 

is 

hovering around passing CI and has been undrafted. 

Furthermore, 

there 

is 

now also a PR [2] to add string view to the Go 

implementation. 

Code 

review 

is underway for each PR and I'd like to move toward a vote 

for 

acceptance- 

are there any other preliminaries which I've neglected? To reiterate the 
answers to some past questions: - Benchmarks are added in the C++ PR [1] to 
demonstrate the 

performance 

of 

 conversion between the various string formats. In addition, 

there 

are 

 some benchmarks which demonstrate the performance gains 

available 

with 

 the new format [3]. - Adding string view to the C ABI is a natural follow up, 
but 

should 

be 

 handled independently. An issue has been added to track 

that 

 enhancement [4]. Sincerely, Ben Kietzman [1] 
https://github.com/apache/arrow/pull/35628 [2] 
https://github.com/apache/arrow/pull/35769 [3] 

https://github.com/apache/arrow/pull/35628#issuecomment-1583218617 

[4] https://github.com/apache/arrow/issues/36099 On Wed, May 17, 2023 at 12:53 
PM Benjamin Kietzman < 

bengil...@gmail.com <mailto:bengil...@gmail.com>> 

wrote: 

@Jacob 

You mention benchmarks multiple times, are these results 

published 

somewhere? I benchmarked the performance of raw pointer vs index 

offset 

views 

in 

my 

PR to velox, I do intend to port them to my arrow PR but I haven't 

gotten 

there 

yet. 

Furthermore, it seemed less urgent to me since coexistence of the two types 

in 

the 

c++ 

implementation defers the question of how aggressively one should be 

preferred 

over 

the 

other. @Dewey 

I don't see the C Data interface in the PR 

 I have not addressed the C ABI in this PR. As you mention, 

it 

may 

be 

useful to transmit arrays with raw pointer views between implementations which 

allow 

them. I 

can address this in a follow up PR. @Will 

If I understand correctly, multiple arrays can reference 

the 

same 

buffers 

in memory, but once they are written to IPC their data 

Reply via email to