Hi,

Is the motivation here to avoid DuckDB and Velox having to duplicate the conversion logic from pointer-based to offset-based, or to allow arrow-cpp to operate directly on pointer-based arrays?

If it is the former, I wouldn't have thought the conversion logic complex enough to warrant this.

If it is the latter, I wonder if you have some benchmark numbers for converting between and operating on the differing representations? In the absence of a strong performance case, it's hard in my opinion to justify adding what will be an arrow-cpp specific extension that isn't part of the standard, with all the potential for confusion and interoperability challenges that entails.

Kind Regards,

Raphael

On 26/09/2023 21:34, Benjamin Kietzman wrote:
Hello all,

In the PR to add support for Utf8View to the C++ implementation,
I've taken the approach of allowing raw pointer views [1] alongside the
index/offset views described in the spec [2]. This was done to ease
communication with other engines such as DuckDB and Velox whose native
string representation is the raw pointer view. In order to be usable
as a utility for writing IPC files and other operations on arrow
formatted data, it is useful for the library to be able to directly
import raw pointer arrays even when immediately converting these to
the index/offset representation.
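
For readers unfamiliar with the two layouts, here is a minimal sketch of
the conversion being discussed. It is not code from the PR: the names
PointerView, OffsetView, MakePointerView, ToOffsetViews and ReadView are
hypothetical, all long strings are copied into a single data buffer
(buffer index 0), and a real implementation would also need to handle
validity bitmaps and multiple buffers. Strings of 12 bytes or fewer are
stored inline in both layouts, so only longer strings need rewriting.

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

constexpr int32_t kInlineSize = 12;  // strings up to 12 bytes live inside the view

// Hypothetical raw pointer view: size, 4-byte prefix, pointer to the bytes.
union PointerView {
  struct { int32_t size; char data[12]; } inlined;
  struct { int32_t size; char prefix[4]; const char* ptr; } ref;
};

// Hypothetical index/offset view: size, 4-byte prefix, buffer index, offset.
union OffsetView {
  struct { int32_t size; char data[12]; } inlined;
  struct { int32_t size; char prefix[4]; int32_t buffer_index; int32_t offset; } ref;
};

static_assert(sizeof(PointerView) == 16, "views are assumed to be 16 bytes");
static_assert(sizeof(OffsetView) == 16, "views are assumed to be 16 bytes");

// Build a pointer view over bytes owned by the caller (illustration only).
PointerView MakePointerView(const char* s, int32_t n) {
  PointerView v;
  std::memset(&v, 0, sizeof(v));
  if (n <= kInlineSize) {
    v.inlined.size = n;
    std::memcpy(v.inlined.data, s, static_cast<size_t>(n));
  } else {
    v.ref.size = n;
    std::memcpy(v.ref.prefix, s, 4);
    v.ref.ptr = s;
  }
  return v;
}

// Convert pointer views to index/offset views, appending the bytes of
// every non-inline string to `buffer`, which plays the role of data buffer 0.
std::vector<OffsetView> ToOffsetViews(const std::vector<PointerView>& in,
                                      std::string* buffer) {
  std::vector<OffsetView> out;
  out.reserve(in.size());
  for (const PointerView& v : in) {
    OffsetView o;
    if (v.inlined.size <= kInlineSize) {
      // Short strings have the same bytes in both layouts.
      std::memcpy(&o, &v, sizeof(o));
    } else {
      o.ref.size = v.ref.size;
      std::memcpy(o.ref.prefix, v.ref.prefix, 4);
      o.ref.buffer_index = 0;
      o.ref.offset = static_cast<int32_t>(buffer->size());
      buffer->append(v.ref.ptr, static_cast<size_t>(v.ref.size));
    }
    out.push_back(o);
  }
  return out;
}

// Read a string back out of an index/offset view.
std::string ReadView(const OffsetView& v, const std::string& buffer) {
  if (v.inlined.size <= kInlineSize) {
    return std::string(v.inlined.data, static_cast<size_t>(v.inlined.size));
  }
  return buffer.substr(static_cast<size_t>(v.ref.offset),
                       static_cast<size_t>(v.ref.size));
}
```

In this sketch the per-element work is a 16-byte copy for short strings
and one append for long ones. The trade-off under discussion is whether
that step belongs inside the library or in each engine that exports
pointer-based views.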

However there has been objection in review [3] since the raw pointer
representation is not part of the official format. Since data visitation
utilities are generic, IMHO this hybrid approach does not add
significantly to the complexity of the C++ library, and I feel the
aforementioned interoperability is a high priority when adding this
feature to the C++ library. It's worth noting that this interoperability
has been a stated goal of the Utf8Type since its original proposal [4]
and throughout the discussion of its adoption [5].

Sincerely,
Ben Kietzman

[1]:
https://github.com/apache/arrow/pull/37792/files#diff-814ac6f43345f7d2f33e9249a1abf092c8078c62ec44cd782c49b676b94ec302R731-R752
[2]:
https://github.com/apache/arrow/blob/9d6d501/docs/source/format/Columnar.rst#L369-L379
[3]: https://github.com/apache/arrow/pull/37792#discussion_r1336010665
[4]: https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
[5]: https://lists.apache.org/thread/8mofy7khfvy3g1m9pmjshbty3cmvb4w4
