Hi,
Is the motivation here to avoid DuckDB and Velox having to duplicate the
conversion logic from pointer-based to offset-based, or to allow
arrow-cpp to operate directly on pointer-based arrays?
If it is the former, I personally wouldn't have thought the conversion
logic sufficiently complex to really warrant this?
If it is the latter, I wonder if you have some benchmark numbers for
converting between and operating on the differing representations? In
the absence of a strong performance case, it's hard in my opinion to
justify adding what will be an arrow-cpp specific extension that isn't
part of the standard, with all the potential for confusion and
interoperability challenges that entails.
Kind Regards,
Raphael
On 26/09/2023 21:34, Benjamin Kietzman wrote:
Hello all,
In the PR to add support for Utf8View to the c++ implementation,
I've taken the approach of allowing raw pointer views [1] alongside the
index/offset views described in the spec [2]. This was done to ease
communication with other engines such as DuckDB and Velox whose native
string representation is the raw pointer view. In order to be usable
as a utility for writing IPC files and other operations on arrow
formatted data, it is useful for the library to be able to directly
import raw pointer arrays even when immediately converting these to
the index/offset representation.
However there has been objection in review [3] since the raw pointer
representation is not part of the official format. Since data visitation
utilities are generic, IMHO this hybrid approach does not add
significantly to the complexity of the C++ library, and I feel the
aforementioned interoperability is a high priority when adding this
feature to the C++ library. It's worth noting that this interoperability
has been a stated goal of the Utf8Type since its original proposal [4]
and throughout the discussion of its adoption [5].
Sincerely,
Ben Kietzman
[1]:
https://github.com/apache/arrow/pull/37792/files#diff-814ac6f43345f7d2f33e9249a1abf092c8078c62ec44cd782c49b676b94ec302R731-R752
[2]:
https://github.com/apache/arrow/blob/9d6d501/docs/source/format/Columnar.rst#L369-L379
[3]: https://github.com/apache/arrow/pull/37792#discussion_r1336010665
[4]: https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
[5]: https://lists.apache.org/thread/8mofy7khfvy3g1m9pmjshbty3cmvb4w4