I hope implementations don't start exposing non-standard datatypes over
the C Data Interface (apart from extension types, of course). I would
also be wary of exposing non-standard datatypes in the official Arrow
C++ implementation.
Regards
Antoine.
Le 21/06/2023 à 14:27, Benjamin Kietzman a écrit :
> Ben, at one point there was some discussion that this might be a c-data
only type. However, I believe that was based on the raw pointers
representation. What you've proposed here, if I understand correctly, is
an index + offsets representation and it is suitable for IPC correct?
(e.g. I see that you have changes and examples in the IPC reader/writer)
Yes, what I'm proposing here is an addition to the IPC format which
necessarily
does not reference raw pointers. This is also what's implemented in the
corresponding Go PR [1].
Since other engines which implement this data type *do* use raw pointers,
the c++ implementation can accept raw pointer string view arrays for
purposes
of validation and conversion to the IPC-compatible format.
[1] https://github.com/apache/arrow/pull/35769
On Tue, Jun 20, 2023 at 11:33 AM Weston Pace <weston.p...@gmail.com> wrote:
Before I say anything else I'll say that I am in favor of this new layout.
There is some existing literature on the idea (e.g. umbra) and your
benchmarks show some nice improvements.
Compared to some of the other layouts we've discussed recently (REE, list
veiw) I do think this layout is more unique and fundamentally different.
Perhaps most fundamentally different:
* This is the first layout where the number of buffers depends on the data
and not the schema. I think this is the most architecturally significant
fact. It does require a (backwards compatible) change to the IPC format
itself, beyond just adding new type codes. It also poses challenges in
places where we've assumed there will be at most 3 buffers (e.g. in
ArraySpan, though, as you have shown, we can work around this using a raw
pointers representation internally in those spots).
I think you've done some great work to integrate this well with Arrow-C++
and I'm convinced it can work.
I would be interested in hearing some input from the Rust community.
Ben, at one point there was some discussion that this might be a c-data
only type. However, I believe that was based on the raw pointers
representation. What you've proposed here, if I understand correctly, is
an index + offsets representation and it is suitable for IPC correct?
(e.g. I see that you have changes and examples in the IPC reader/writer)
On Mon, Jun 19, 2023 at 7:17 AM Benjamin Kietzman <bengil...@gmail.com>
wrote:
Hi Gang,
I'm not sure what you mean, sorry if my answers are off base:
Parquet's ByteArray will be unaffected by the addition of the string view
type;
all arrow strings (arrow::Type::STRING, arrow::Type::LARGE_STRING, and
with this patch arrow::Type::STRING_VIEW) are converted to ByteArrays
during serialization to parquet [1].
If you mean that encoding of arrow::Type::STRING_VIEW will not be as fast
as encoding of equivalent arrow::Type::STRING, that's something I haven't
benchmarked so I can't answer definitively. I would expect it to be
faster
than
first converting STRING_VIEW->STRING then encoding to parquet; direct
encoding avoids allocating and populating temporary buffers. Of course
this
only applies to cases where you need to encode an array of STRING_VIEW to
parquet- encoding of STRING to parquet will be unaffected.
Sincerely,
Ben
[1]
https://github.com/bkietz/arrow/blob/46cf7e67766f0646760acefa4d2d01cdfead2d5d/cpp/src/parquet/encoding.cc#L166-L179
On Thu, Jun 15, 2023 at 10:34 PM Gang Wu <ust...@gmail.com> wrote:
Hi Ben,
The posted benchmark [1] looks pretty good to me. However, I want to
raise a possible issue from the perspective of parquet-cpp. Parquet-cpp
uses a customized parquet::ByteArray type [2] for string/binary, I
would
expect some regression of conversions between parquet reader/writer
and the proposed string view array, especially when some strings use
short form and others use long form.
[1]
https://github.com/apache/arrow/blob/41309de8dd91a9821873fc5f94339f0542ca0108/cpp/src/parquet/types.h#L575
[2] https://github.com/apache/arrow/pull/35628#issuecomment-1583218617
Best,
Gang
On Fri, Jun 16, 2023 at 3:58 AM Will Jones <will.jones...@gmail.com>
wrote:
Cool. Thanks for doing that!
On Thu, Jun 15, 2023 at 12:40 Benjamin Kietzman <bengil...@gmail.com
wrote:
I've added https://github.com/apache/arrow/issues/36112 to track
deduplication of buffers on write.
I don't think it would require modification of the IPC format.
Ben
On Thu, Jun 15, 2023 at 1:30 PM Matt Topol <zotthewiz...@gmail.com
wrote:
Based on my understanding, in theory a buffer *could* be shared
within
a
batch since the flatbuffers message just uses an offset and
length
to
identify the buffers.
That said, I don't believe any current implementation actually
does
this
or
takes advantage of this in any meaningful way.
--Matt
On Thu, Jun 15, 2023 at 1:00 PM Will Jones <
will.jones...@gmail.com>
wrote:
Hi Ben,
It's exciting to see this move along.
The buffers will be duplicated. If buffer duplication is
becomes
a
concern,
I'd prefer to handle
that in the ipc writer. Then buffers which are duplicated
could
be
detected
by checking
pointer identity and written only once.
Question: to be able to write buffer only once and reference in
multiple
arrays, does that require a change to the IPC format? Or is
sharing
buffers
within the same batch already allowed in the IPC format?
Best,
Will Jones
On Thu, Jun 15, 2023 at 9:03 AM Benjamin Kietzman <
bengil...@gmail.com
wrote:
Hello again all,
The PR [1] to add string view to the format and the C++
implementation
is
hovering around passing CI and has been undrafted.
Furthermore,
there
is
now also a PR [2] to add string view to the Go
implementation.
Code
review
is underway for each PR and I'd like to move toward a vote
for
acceptance-
are there any other preliminaries which I've neglected?
To reiterate the answers to some past questions:
- Benchmarks are added in the C++ PR [1] to demonstrate the
performance
of
conversion between the various string formats. In addition,
there
are
some benchmarks which demonstrate the performance gains
available
with
the new format [3].
- Adding string view to the C ABI is a natural follow up, but
should
be
handled independently. An issue has been added to track
that
enhancement [4].
Sincerely,
Ben Kietzman
[1] https://github.com/apache/arrow/pull/35628
[2] https://github.com/apache/arrow/pull/35769
[3]
https://github.com/apache/arrow/pull/35628#issuecomment-1583218617
[4] https://github.com/apache/arrow/issues/36099
On Wed, May 17, 2023 at 12:53 PM Benjamin Kietzman <
bengil...@gmail.com>
wrote:
@Jacob
You mention benchmarks multiple times, are these results
published
somewhere?
I benchmarked the performance of raw pointer vs index
offset
views
in
my
PR to velox,
I do intend to port them to my arrow PR but I haven't
gotten
there
yet.
Furthermore, it
seemed less urgent to me since coexistence of the two types
in
the
c++
implementation
defers the question of how aggressively one should be
preferred
over
the
other.
@Dewey
I don't see the C Data interface in the PR
I have not addressed the C ABI in this PR. As you mention,
it
may
be
useful to transmit
arrays with raw pointer views between implementations which
allow
them. I
can address
this in a follow up PR.
@Will
If I understand correctly, multiple arrays can reference
the
same
buffers
in memory, but once they are written to IPC their data
buffers
will
be
duplicated. Is that right?
The buffers will be duplicated. If buffer duplication is
becomes
a
concern, I'd prefer to handle
that in the ipc writer. Then buffers which are duplicated
could
be
detected by checking
pointer identity and written only once.
On Wed, May 17, 2023 at 12:07 AM Will Jones <
will.jones...@gmail.com
wrote:
Hello Ben,
Thanks for your work on this. I think this will be an
excellent
addition
to
the format.
If I understand correctly, multiple arrays can reference
the
same
buffers
in memory, but once they are written to IPC their data
buffers
will
be
duplicated. Is that right?
Dictionary types have a special message so they can be
reused
across
batches and even fields. Did we consider adding a similar
message
for
string view buffers?
One relevant use case I'm curious about is substring
extraction.
For
example, if I have a column of URIs and I create columns
where
I've
extracted substrings like the hostname, path, and a list
column
of
query
parameters, I'd like for those latter columns to be views
into
the
URI
buffers, rather than full copies.
However, I've never touched the IPC read code paths, so
it's
quite
possible
I'm overlooking something obvious.
Best,
Will Jones
On Tue, May 16, 2023 at 6:29 PM Dewey Dunnington
<de...@voltrondata.com.invalid> wrote:
Very cool!
In addition to performance mentioned above, I could see
this
being
useful for the R bindings - we already have a global
string
pool
and a
mechanism for keeping a vector of them alive.
I don't see the C Data interface in the PR although I
may
have
missed
it - is that a part of the proposal? It seems like it
would
be
possible to use raw pointers as long as they can be
guaranteed
to
be
valid until the release callback is called?
On Tue, May 16, 2023 at 8:43 PM Jacob Wujciak
<ja...@voltrondata.com.invalid> wrote:
Hello Everyone,
I think keeping interoperability with the large
ecosystem
is a
very
important goal for arrow so I am overall in favor of
this
proposal!
You mention benchmarks multiple times, are these
results
published
somewhere?
Thanks
On Tue, May 16, 2023 at 11:39 PM Benjamin Kietzman <
bengil...@gmail.com>
wrote:
Hello all,
As previously discussed on this list [1], an
UmbraDB/DuckDB/Velox
compatible
"string view" type could bring several performance
benefits
to
access
and
authoring of string data in the arrow format [2].
Additionally
better
interoperability with engines already using this
format
could
be
established.
PR #0 [3] adds Utf8View and BinaryView types to the
C++
implementation
and
to
the IPC format. For the purposes of IPC raw pointers
are
not
used.
Instead,
each view contains a pair of 32 bit unsigned
integers
which
encode
the
index of
a character buffer (string view arrays may consist
of
a
variable
number of
such buffers) and the offset of a view's data within
that
buffer
respectively.
Benefits of this substitution include:
- This makes explicit the guarantee that lifetime of
all
character
data is
equal
to that of the array which views it, which is
critical
for
confident
consumption across an interface boundary.
- As with other types in the arrow format, such
arrays
are
serializable and
venue agnostic; directly usable in shared memory
without
modification.
- Indices and offsets are easily validated.
Accessing the data requires some trivial pointer
arithmetic,
but
in
benchmarking
this had negligible impact on sequential access and
only
minor
impact
on
random
access.
In the C++ implementation, raw pointer string views
are
supported
as an
extended
case of the Utf8View type:
`utf8_view(/*has_raw_pointers=*/true)`.
Branching on
this access pattern bit at the data type level has
negligible
impact on
performance since the branch resides outside any hot
loops.
Utility
functions
are provided for efficient (potentially in-place)
conversion
between
raw
pointer
and index offset views. For example, the C++
implementation
could
zero
copy
a raw pointer array from Velox, filter it, then
convert
to
index/offset for
serialization. Other implementations may choose to
accommodate
or
eschew
raw
pointer views as their communities direct.
Where desirous in a rigorously controlled context
this
still
enables
construction
and safe consumption of string view arrays which
reference
memory
not
directly bound to the lifetime of the array. I'm not
sure
when
or
if we
would
find it useful to have arrays like this; I do not
introduce
any
in
[3]. I
mention
this possibility to highlight that if benchmarking
demonstrates
that
such
an
approach brings a significant performance benefit to
some
operation,
the
only
barrier to its adoption would be code review.
Likewise
if
more
intensive
benchmarking determines that raw pointer views
critically
outperform
index/offset
views for real-world analytics tasks, prioritizing
raw
pointer
string
views
for usage within the C++ implementation will be
straightforward.
See also the proposal to Velox that their string
view
vector
be
refactored
in a similar vein [4].
Sincerely,
Ben Kietzman
[1]
https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
[2]
http://cidrdb.org/cidr2020/papers/p29-neumann-cidr20.pdf
[3] https://github.com/apache/arrow/pull/35628
[4]
https://github.com/facebookincubator/velox/discussions/4362