I would be interested in hearing some input from the Rust community.
A couple of thoughts:
The variable number of buffers would definitely pose some challenges for
the Rust implementation, the closest thing we currently have is possibly
UnionArray, but even then the number of buffers is still determined
statically by the DataType. I therefore also wonder about the possibility
of always having a single backing buffer that stores the character data,
including potentially a copy of the prefix. This would also avoid forcing a
branch on access, which I would have expected to hurt performance for some
kernels quite significantly.
Whilst not really a concern for Rust, which supports unsigned types, it
does seem inconsistent to use unsigned types where the rest of the format
encourages the use of signed offsets, etc...
It isn't clearly specified whether a null should have a valid set of
offsets, etc... I think it is an important property of the current array
layouts that, with exception to dictionaries, the data in null slots is
arbitrary, i.e. can take any value, but not undefined. This allows for
separate handling of the null mask and values, which can be important for
some kernels and APIs.
More an observation than an issue, but UTF-8 validation for StringArray
can be done very efficiently by first verifying the entire buffer, and then
verifying the offsets correspond to the start of a UTF-8 codepoint. This
same approach would not be possible for StringView, which would need to
verify individual values and would therefore be significantly more
expensive. As it is UB for a Rust string to contain non-UTF-8 data, this
validation is perhaps more important for Rust than for other languages.
I presume that StringView will behave similarly to dictionaries in that
the selection kernels will not recompute the underlying value buffers. I
think this is fine, but it is perhaps worth noting this has caused
confusion in the past, as people somewhat reasonably expect an array
post-selection to have memory usage reflecting the smaller selection. This
is then especially noticeable if the data is written out to IPC, and still
contains data that was supposedly filtered out. My 2 cents is that explicit
selection vectors are a less surprising way to defer selection than baking
it into the array, but I also don't have any workloads where this is the
major bottleneck so can't speak authoritatively here.
Which leads on to my major concern with this proposal, that it adds
complexity and cognitive load to the specification and implementations,
whilst not meaningfully improving the performance of the operators that I
commonly encounter as performance bottlenecks, which are multi-column sorts
and aggregations, or the expensive string operations such as matching or
parsing. If we didn't already have a string representation I would be more
onboard, but as it stands I'm definitely on the fence, especially given
selection performance can be improved in less intrusive ways using
dictionaries or selection vectors.
Kind Regards,
Raphael Taylor-Davies
On 02/07/2023 11:46, Andrew Lamb wrote:
* This is the first layout where the number of buffers depends on the
data
and not the schema. I think this is the most architecturally significant
fact. I
I have spent some time reading the initial proposal -- thank you for
that. I now understand what Weston was saying about the "variable numbers
of buffers". I wonder if you considered restricting such arrays to a single
buffer (so as to make them more similar to other arrow array types that
have a fixed number of buffers)? On Tue, Jun 20, 2023 at 11:33 AM Weston
Pace<weston.p...@gmail.com> <mailto:weston.p...@gmail.com> wrote:
Before I say anything else I'll say that I am in favor of this new layout.
There is some existing literature on the idea (e.g. umbra) and your
benchmarks show some nice improvements. Compared to some of the other
layouts we've discussed recently (REE, list veiw) I do think this layout is
more unique and fundamentally different. Perhaps most fundamentally
different: * This is the first layout where the number of buffers depends
on the data and not the schema. I think this is the most architecturally
significant fact. It does require a (backwards compatible) change to the
IPC format itself, beyond just adding new type codes. It also poses
challenges in places where we've assumed there will be at most 3 buffers
(e.g. in ArraySpan, though, as you have shown, we can work around this
using a raw pointers representation internally in those spots). I think
you've done some great work to integrate this well with Arrow-C++ and I'm
convinced it can work. I would be interested in hearing some input from the
Rust community. Ben, at one point there was some discussion that this might
be a c-data only type. However, I believe that was based on the raw
pointers representation. What you've proposed here, if I understand
correctly, is an index + offsets representation and it is suitable for IPC
correct? (e.g. I see that you have changes and examples in the IPC
reader/writer) On Mon, Jun 19, 2023 at 7:17 AM Benjamin Kietzman <
bengil...@gmail.com> <mailto:bengil...@gmail.com> wrote:
Hi Gang, I'm not sure what you mean, sorry if my answers are off base:
Parquet's ByteArray will be unaffected by the addition of the string view
type; all arrow strings (arrow::Type::STRING,arrow::Type::LARGE_STRING,
and with this patcharrow::Type::STRING_VIEW) are converted to ByteArrays
during serialization to parquet [1]. If you mean that encoding of
arrow::Type::STRING_VIEW will not be as fast as encoding of equivalent
arrow::Type::STRING, that's something I haven't benchmarked so I can't
answer definitively. I would expect it to be
faster
than first converting STRING_VIEW->STRING then encoding to parquet; direct
encoding avoids allocating and populating temporary buffers. Of course
this
only applies to cases where you need to encode an array of STRING_VIEW to
parquet- encoding of STRING to parquet will be unaffected. Sincerely, Ben
[1]
https://github.com/bkietz/arrow/blob/46cf7e67766f0646760acefa4d2d01cdfead2d5d/cpp/src/parquet/encoding.cc#L166-L179
On Thu, Jun 15, 2023 at 10:34 PM Gang Wu<ust...@gmail.com> <mailto:
ust...@gmail.com> wrote:
Hi Ben, The posted benchmark [1] looks pretty good to me. However, I want
to raise a possible issue from the perspective of parquet-cpp. Parquet-cpp
uses a customizedparquet::ByteArray type [2] for string/binary, I
would
expect some regression of conversions between parquet reader/writer and
the proposed string view array, especially when some strings use short form
and others use long form. [1]
https://github.com/apache/arrow/blob/41309de8dd91a9821873fc5f94339f0542ca0108/cpp/src/parquet/types.h#L575
[2]https://github.com/apache/arrow/pull/35628#issuecomment-1583218617
Best, Gang On Fri, Jun 16, 2023 at 3:58 AM Will Jones <
will.jones...@gmail.com> <mailto:will.jones...@gmail.com> wrote:
Cool. Thanks for doing that! On Thu, Jun 15, 2023 at 12:40 Benjamin
Kietzman <bengil...@gmail.com <mailto:bengil...@gmail.com>
wrote:
I've addedhttps://github.com/apache/arrow/issues/36112 to track
deduplication of buffers on write. I don't think it would require
modification of the IPC format. Ben On Thu, Jun 15, 2023 at 1:30 PM Matt
Topol <zotthewiz...@gmail.com <mailto:zotthewiz...@gmail.com>
wrote:
Based on my understanding, in theory a buffer *could* be shared
within
a
batch since the flatbuffers message just uses an offset and
length
to
identify the buffers. That said, I don't believe any current
implementation actually
does
this
or
takes advantage of this in any meaningful way. --Matt On Thu, Jun 15, 2023
at 1:00 PM Will Jones <
will.jones...@gmail.com <mailto:will.jones...@gmail.com>>
wrote:
Hi Ben, It's exciting to see this move along. The buffers will be
duplicated. If buffer duplication is
becomes
a
concern,
I'd prefer to handle that in the ipc writer. Then buffers which are
duplicated
could
be
detected
by checking pointer identity and written only once.
Question: to be able to write buffer only once and reference in
multiple
arrays, does that require a change to the IPC format? Or is
sharing
buffers
within the same batch already allowed in the IPC format? Best, Will Jones
On Thu, Jun 15, 2023 at 9:03 AM Benjamin Kietzman <
bengil...@gmail.com <mailto:bengil...@gmail.com>
wrote:
Hello again all, The PR [1] to add string view to the format and the C++
implementation
is
hovering around passing CI and has been undrafted.
Furthermore,
there
is
now also a PR [2] to add string view to the Go
implementation.
Code
review
is underway for each PR and I'd like to move toward a vote
for
acceptance-
are there any other preliminaries which I've neglected? To reiterate the
answers to some past questions: - Benchmarks are added in the C++ PR [1] to
demonstrate the
performance
of
conversion between the various string formats. In addition,
there
are
some benchmarks which demonstrate the performance gains
available
with
the new format [3]. - Adding string view to the C ABI is a natural follow
up, but
should
be
handled independently. An issue has been added to track
that
enhancement [4]. Sincerely, Ben Kietzman [1]
https://github.com/apache/arrow/pull/35628 [2]
https://github.com/apache/arrow/pull/35769 [3]
https://github.com/apache/arrow/pull/35628#issuecomment-1583218617
[4]https://github.com/apache/arrow/issues/36099 On Wed, May 17, 2023 at
12:53 PM Benjamin Kietzman <
bengil...@gmail.com <mailto:bengil...@gmail.com>>
wrote:
@Jacob
You mention benchmarks multiple times, are these results
published
somewhere? I benchmarked the performance of raw pointer vs index
offset
views
in
my
PR to velox, I do intend to port them to my arrow PR but I haven't
gotten
there
yet.
Furthermore, it seemed less urgent to me since coexistence of the two
types
in
the
c++
implementation defers the question of how aggressively one should be
preferred
over
the
other. @Dewey
I don't see the C Data interface in the PR
I have not addressed the C ABI in this PR. As you mention,
it
may
be
useful to transmit arrays with raw pointer views between implementations
which
allow
them. I
can address this in a follow up PR. @Will
If I understand correctly, multiple arrays can reference
the
same
buffers
in memory, but once they are written to IPC their data